Why don't CPUs have a big L1 cache shared between all cores, to get around cache-coherence problems, with bigger bandwidth between cores to boost message-passing performance? Wouldn't losing a few CPU cycles on L1 be worth it if we got 3x-5x the throughput?

I was implementing direct-mapped, LRU, and n-way set-associative caches in C++ and found that a direct-mapped cache is inherently parallelizable; it was easy to reach 2.4 billion lookups per second. Even the slower n-way set-associative cache surpassed 70 million lookups per second. If software can do this, why not hardware?

Writing to a variable in thread 1 and reading it from thread 2 within just the L1 latency, instead of locking for ~1000 nanoseconds or fiddling with atomics. Just writing and reading. Imagine the good old concurrent queue doing 5 billion pushes and pops per second, or concurrent tree access for a game scene graph. Concurrent anything would be blazingly scalable, if only the L1 cache had enough sets (if it's n-way set associative) or tags (if it's direct mapped), like 64 or 128.
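Even with a shared L1, the two ends of that imagined queue would still have to agree on head and tail positions somehow. For reference, here is what the software version of "just writing and reading" looks like today: a minimal single-producer/single-consumer ring sketched with C++ atomics (class name and capacity are illustrative):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: the producer only writes tail_, the consumer only
// writes head_, so the only cross-core traffic is one acquire load per side.
template <typename T, size_t N = 1024>
class SpscQueue {
    std::array<T, N> buf_;
    std::atomic<size_t> head_{0};  // next slot to read  (consumer-owned)
    std::atomic<size_t> tail_{0};  // next slot to write (producer-owned)
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish to consumer
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }
};
```

The acquire/release pair is precisely the hand-off that a shared L1 would make cheap: today each of those loads can cost a cache-line transfer between cores.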

Also, when not multithreading, all that bandwidth would be available to a single thread for some really good handling of async I/O for web sites, or maybe just to boost out-of-order execution. I really wouldn't care if Excel loaded a bit late; it's not a Pentium II after all. There must be a way of hiding the extra latency behind something else, like doing more L1 accesses at a time or working on more instructions at a time, maybe not as massively parallel as a GPU.

If it's not possible or cheap, then why don't they add hardware pipes that connect cores directly? Something like an assembly instruction that sends data straight to another core. GPUs have shared (local) memory that can pass data between all pipelines in a multiprocessor. GPU designers even optimize atomic functions in hardware so that many atomics run in parallel. Even if just atomics worked like on a GPU, lock-free concurrent data structures would get a decent boost.
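The parallel-atomics trick GPUs do in hardware (combining many increments into one memory transaction) can be emulated in software today. A sketch of the idea with C++ atomics, where each thread accumulates privately and publishes one read-modify-write instead of bouncing the shared cache line on every increment (all names here are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> counter{0};  // the shared, contended location

// Software analogue of GPU atomic combining: do n private increments, then
// touch the shared counter exactly once.
void add_many(int n) {
    long local = 0;
    for (int i = 0; i < n; ++i) local += 1;               // no coherence traffic
    counter.fetch_add(local, std::memory_order_relaxed);  // one shared RMW
}
```

One `fetch_add` per thread instead of one per increment is often the difference between an atomic that scales and one that serializes the machine.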

What about stacking cores in the third dimension, just above the L1, to shorten the path? Would it work? Maybe it's not "just a few more cycles" as I guessed. But is it possible, even at a higher price per chip?

What about putting carbon nanotubes (or open-ended tubes/pipes) between stacked cores and pumping some cooling gas or liquid through them?

What about routing power wirelessly? Is it possible to set up a standing wave / resonance between the stacks and feed the transistors with electromagnetic waves?

If lining up the stacks is a problem, can we carve the stacks out of a single crystalline structure that somehow works as transistors, capacitors, etc., with just some extra atoms for capacitance? This may have gotten too far into fantasy, but buying a computer (IBM personal c

... keep reading on reddit →

πŸ‘︎ 85
πŸ’¬︎
πŸ‘€︎ u/tugrul_ddr
πŸ“…︎ Oct 15 2021
🚨︎ report
Cache Coherence and Cache Policy

I'm searching the privileged and unprivileged specs for material on cache coherence and which cache policy RISC-V implements, but I have one question:

- Does RISC-V (as an ISA) define any cache coherence policy at all?

I read about PMAs and how they define memory regions as cacheable, but I'm not sure whether that defines a coherence scheme or is just a base structure for a hardware/software cache coherence implementation (i.e., snooping/directory).

πŸ‘︎ 6
πŸ’¬︎
πŸ‘€︎ u/akatekitos
πŸ“…︎ Mar 24 2021
🚨︎ report
Confused about cache coherence

Is cache coherence a problem with processor registers? (Or is it only an issue between their caches and memory?) If so, how do systems deal with cache coherence for the registers?

Thanks!

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Leanador
πŸ“…︎ Feb 10 2021
🚨︎ report
(Free Book) - A Primer on Memory Consistency and Cache Coherence, Second Edition morganclaypool.com/doi/ab…
πŸ‘︎ 32
πŸ’¬︎
πŸ‘€︎ u/EngrToday
πŸ“…︎ Aug 26 2020
🚨︎ report
Cache coherence and DMA on RISC-V

There's been a bit of discussion recently about RISC-V and its performance.

There was a thread over on /r/programming where /u/memgrid discussed some design deficiencies in base RISC-V regarding DMA and cache coherence.

https://old.reddit.com/r/programming/comments/isgpw9/arm_ukbased_chip_designer_sold_to_us_firm_nvidia/g582e73/

Questions:

  1. Is this actually a problem in practice?

  2. Is there a proposal in the works for a standardized extension that addresses the issue?

  3. What else is being done in this area?

I'd welcome any papers, videos or other information regarding how this is handled in current and next-generation RISC-V core implementations.

Edit: Provided more context in the linked thread.

πŸ‘︎ 9
πŸ’¬︎
πŸ‘€︎ u/ansible
πŸ“…︎ Sep 16 2020
🚨︎ report
First new cache-coherence mechanism in 30 years news.mit.edu/2015/first-n…
πŸ‘︎ 17
πŸ’¬︎
πŸ‘€︎ u/CrankyBear
πŸ“…︎ Sep 10 2015
🚨︎ report
BagriDB: a document database built on top of a distributed cache solution like Hazelcast or Coherence github.com/dsukhoroslov/b…
πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/gar_den
πŸ“…︎ May 01 2017
🚨︎ report
[Phoronix] Linux 5.10 To Support AMD SME Hardware-Enforced Cache Coherency phoronix.com/scan.php?pag…
πŸ‘︎ 67
πŸ’¬︎
πŸ‘€︎ u/InvincibleBird
πŸ“…︎ Sep 18 2020
🚨︎ report
Linux 5.10 To Support AMD SME Hardware-Enforced Cache Coherency phoronix.com/scan.php?pag…
πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/megamanxtreme
πŸ“…︎ Sep 21 2020
🚨︎ report
Brace yourselves, cache coherency is coming

https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA

> With the Linux 5.2 kernel an AMD-supplied change by AMDGPU developer Christian KΓΆnig allows for supporting peer-to-peer DMA between any devices on AMD Zen systems.

Preparing for future Zen 3 datacenter APUs with heterogeneous CPU+GPU clusters.

πŸ‘︎ 29
πŸ’¬︎
πŸ‘€︎ u/franky9876
πŸ“…︎ May 21 2019
🚨︎ report
What are some good resources (books/courses) that cover concepts of cache, multi-cores and coherency?

I tried looking around. Most courses, like Georgia Tech's HPCA, only cover a broad overview of coherency within their general computer architecture material. ARM does not have many open resources or tutorials on this either.

Are there any specific books or resources (preferably ones that capture modern architectures) that cover it in detail?

πŸ‘︎ 12
πŸ’¬︎
πŸ‘€︎ u/anonymou5guy
πŸ“…︎ Mar 06 2020
🚨︎ report
I make zero progress on real projects because I'm easily overwhelmed by the multitude of ways I can potentially make things work, obsessing over cache coherency and performance even though I know it won't make enough difference to matter. How can I overcome this kind of destructive perfectionism?

I see the flaws in everything I do and I refactor endlessly, each time with a higher expectation. I know it's unhealthy but I actually don't know how to get my mind to rest knowing that whatever I'm working on isn't the best possible solution to whatever insignificant problem I happen to be working on.

Sometimes I think that what I need is a heuristic other than performance to constrain myself with, but I have a hard time doing that because there isn't anything that provides the kind of objective constraint that performance does.

I'd love to hear what you all have to say - I can't be the only one that's struggled with this issue.

πŸ‘︎ 20
πŸ’¬︎
πŸ“…︎ Dec 14 2018
🚨︎ report
Performance of writes to write-combined memory (uncached, host-coherent) vs cached memory?

In my example, I have memory which is host-coherent and uncached (and thus write-combined, to still provide fast CPU writes, though not reads). I'm mainly curious whether write-combined memory can achieve the same performance as ordinary writes to cached memory.

Write combining with uncached memory is, in theory, supposed to be nearly as fast as writing to cached memory, but I'm curious whether folks have noticed some overhead with write combining that we wouldn't see if we just cached the memory. I'm assuming that anything which triggers a flush of the WC buffer before its lines are full would obviously be bad for performance.

Thanks.

πŸ‘︎ 15
πŸ’¬︎
πŸ‘€︎ u/evader9992
πŸ“…︎ Feb 01 2020
🚨︎ report
[Wikichip] Intel Stratix 10 DX Adds PCIe Gen 4.0, Cache Coherency: UPI As Stopgap fuse.wikichip.org/news/26…
πŸ‘︎ 15
πŸ’¬︎
πŸ‘€︎ u/yummycandy2
πŸ“…︎ Sep 20 2019
🚨︎ report
Cache Coherent Memory Fabric Online Meetup - April 16

RISC-V has enabled many new open system architectures. OmniXtend is an Ethernet-based cache-coherent memory fabric which leverages TileLink from RISC-V. This April 16 online event provides an overview and details. https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/269617034/

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/RISC-V_Marketing
πŸ“…︎ Apr 07 2020
🚨︎ report
"Anyone else start thinking of Rust's mutable/immutable borrow system when reading the [cache coherency] MESI algorithm?" news.ycombinator.com/item…
πŸ‘︎ 8
πŸ’¬︎
πŸ‘€︎ u/umop_aplsdn
πŸ“…︎ Nov 21 2019
🚨︎ report
Newly Discovered Variants Of Meltdown/Spectre Exploit Cache Coherency Across Cores tomshardware.com/news/new…
πŸ‘︎ 48
πŸ’¬︎
πŸ“…︎ Feb 16 2018
🚨︎ report
What is the official translation of the term "cache coherency"?

Hi! I'm interested in the proper, professionally accepted translation. The term relates to computer processor architecture. Thanks!

πŸ‘︎ 8
πŸ’¬︎
πŸ‘€︎ u/littuelPrincess
πŸ“…︎ Oct 23 2017
🚨︎ report
Please explain: cache coherency in reading vs writing

I use the term "vs" because a quick look at the cache coherence Wikipedia page revealed this:

> Write Propagation : Changes to the data in any cache must be propagated to other copies(of that cache line) in the peer caches.

I knew that reading from cache didn't require additional trips to main memory unless the data got updated (unless I'm wrong).

Bottom-line question: when operating on an SSBO in a compute shader, and you know that a particular item in the buffer will be operated on by one and only one compute invocation, is it more cache-friendly to:

  1. Create a copy of that structure, write to it, and then write the modified structure back to memory.
  2. Access each piece of data one by one?

My reading of the Wikipedia page is that the latter would require a cache-SSBO synchronization on every write to every member, while the former would require a cache synchronization only once (when the data is written back to the SSBO).
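The two options can be sketched in plain C++ rather than GLSL (a rough analogy, not shader code; the `Particle` struct and the buffer layout are hypothetical stand-ins for the SSBO element):

```cpp
struct Particle { float pos[3]; float vel[3]; float mass; };

// Option 1 from the question: work on a local (register/stack) copy and write
// the shared buffer back in one burst at the end.
void update_copy(Particle* shared, int i) {
    Particle p = shared[i];  // one read of the whole struct
    p.pos[0] += p.vel[0];
    p.pos[1] += p.vel[1];
    p.pos[2] += p.vel[2];
    shared[i] = p;           // one write of the whole struct
}

// Option 2: read-modify-write each member directly in the shared buffer.
void update_in_place(Particle* shared, int i) {
    shared[i].pos[0] += shared[i].vel[0];
    shared[i].pos[1] += shared[i].vel[1];
    shared[i].pos[2] += shared[i].vel[2];
}
```

One caveat, hedged: when only one invocation touches the element and nothing else shares its cache line, both variants hit the same line(s), so the per-member writes of option 2 don't each cost a separate coherence transaction; the big difference only appears when the line is contended.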

πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/amdreallyfast
πŸ“…︎ Jan 14 2017
🚨︎ report
Why Is My Perfectly Good Shellcode Not Working? Cache Coherency on MIPS and ARM blog.senr.io/blog/why-is-…
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/qznc_bot
πŸ“…︎ Feb 06 2019
🚨︎ report
Cache coherency primer, at The Ryg Blog fgiesen.wordpress.com/201…
πŸ‘︎ 91
πŸ’¬︎
πŸ‘€︎ u/joebaf
πŸ“…︎ Jul 07 2014
🚨︎ report
Diving into Cache Coherency and its performance implications psy-lob-saw.blogspot.com/…
👍 15 · u/nitsanw · Sep 23 2013
Busy waiting and the Go Memory Model
package main

var a string
var done bool

func setup() {
	a = "hello world"
	done = true
}

func main() {
	go setup()

	// Busy-wait on the flag set by setup.
	for !done {
	}

	print(a)
}

Apologies, because this is someone raising their hand over something that is probably mere misinterpretation.

https://go.dev/ref/mem

At the end of The Go Memory Model there is this claim about the above code: "there is no guarantee that the write to 'done' ever will be observed by main, since there are no synchronization events between the two threads. The loop in main is not guaranteed to finish."

In what scenario would it not finish?

The code is an obvious code smell for not using sync primitives, but I can't justify the claim about the loop, based on these assumptions:

  1. Observability: the for-condition observes the volatile variable 'done' on each iteration.
  2. Scheduling: 'setup' eventually runs, and 'done' eventually becomes true.

Therefore, regardless of the order in which the 'setup' and 'main' goroutines are scheduled, the loop exits and 'print(a)' executes.

That being said, the question contributes little beyond directing others to review the link, so thanks a ton :P. For some reason I found the author's other examples to be perfect candidates for what slithers into a code base, especially tests, and I want to understand in case I missed something.

πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/niceyeti
πŸ“…︎ Jan 25 2022
🚨︎ report
Cache coherency primer fgiesen.wordpress.com/201…
πŸ‘︎ 27
πŸ’¬︎
πŸ‘€︎ u/personman
πŸ“…︎ Oct 20 2016
🚨︎ report
How is HPCA useful for devops work?

My manager said the company is willing to pay for my HPCA course if I can justify how it's useful to my job. I do general devops/CI/CD work: writing scripts for safe code deploys, testing code deploys, maintaining and building out our CI agents, and releasing our mobile apps. We use AWS, Terraform, Ansible, Docker, Kubernetes, Puppet, and Python/Ruby/Bash.

Any suggestions for how I can justify the company paying for it? It would be a dream not to have to pay out of pocket. Every OMSCS course I've taken has been tremendously helpful to my work, but this one is less directly related, and I haven't actually taken it yet, so I only have a vague idea of what I'll be learning.

πŸ‘︎ 6
πŸ’¬︎
πŸ‘€︎ u/abrbbb
πŸ“…︎ Jan 13 2022
🚨︎ report
NUMA Deep Dive Part 3: Cache Coherency - frankdenneman.nl frankdenneman.nl/2016/07/…
πŸ‘︎ 12
πŸ’¬︎
πŸ‘€︎ u/frankdenneman
πŸ“…︎ Jul 11 2016
🚨︎ report
Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems (MICRO-48 Best Paper Award) dl.acm.org/citation.cfm?d…
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/mttd
πŸ“…︎ Jan 20 2016
🚨︎ report
Cache coherency primer fgiesen.wordpress.com/201…
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/lukego
πŸ“…︎ Jan 30 2016
🚨︎ report
Programmers who obsess about cache coherency and pipelining produce more solid, more robust systems in a smaller amount of time news.ycombinator.com/item…
πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/laghgal
πŸ“…︎ Jul 19 2014
🚨︎ report
Requesting suggestions for languages, libraries, and architectures for parallel (and sometimes non parallel) numerical and scientific computations

A lot of the work I'm interested in will be built mostly from scratch by myself, provided there is fair support for numerical types (like complex numbers) and high-precision numerical operations (if not, I'll be happy to write those routines as well). Many of my areas of interest are computationally demanding (Python code chokes on large enough datasets) but often parallelizable, and I'm looking for guidance on implementing them. I love math and physics, especially domains that involve rigorous analysis, ranging from physical/mathematical concepts like turbulence, topology, wave optics, electromagnetism, and quantum physics to computational concepts like cryptography and information theory. I also love signal processing, especially relating to random and sparse signals. These require a decent amount of precision while simultaneously being fast.

I wish to be able to run the code on low-power and high-power manycore or SIMD processors, with the sequential parts running on a general-purpose processor or a highly pipelined FPGA. Energy efficiency is one of my key targets along with speed (many scenarios are energy-constrained), even if that requires longer, customized code. Another interest, though not my primary goal, is implementing redundancy using parallelism (including different compression/storage methods, e.g. RAID). I would like some control over the different memory allocations (hierarchies of caches and scratchpad memories) and, if possible, some of the caching schemes, while still being usable across multiple architectures. If possible, I'd also like options to optimize for bursts and broadcasting, and for the presence or absence of hardware lockstep, depending on hardware support (using switches to select different routines for different hardware when it comes to memory copy, allocation, and caching).

Sorry for the long and open-ended question; I realized it would be hard to come to a decision without a holistic picture of the whole domain, at least to a fair depth. I'm looking for suggestions in both hardware and software.

My primary concern is software, including but not limited to languages, compilers, and directives. Not being from a software programming background, I find it hard to search for the proper areas (and keywords). To share my current understanding of the software landscape: I have so far explored CUDA, OpenCL, Fortran, OpenMP, Open MPI, OpenACC, and Julia

... keep reading on reddit →

πŸ‘︎ 8
πŸ’¬︎
πŸ‘€︎ u/manueljenkin
πŸ“…︎ Jan 16 2022
🚨︎ report
Infinity Architecture CPU-GPU Coherence tomshardware.com/news/amd…
πŸ‘︎ 31
πŸ’¬︎
πŸ‘€︎ u/CoffeeAndKnives
πŸ“…︎ Nov 09 2021
🚨︎ report
SERIOUS: This subreddit needs to understand what a "dad joke" really means.

I don't want to step on anybody's toes here, but the amount of non-dad jokes here in this subreddit really annoys me. First of all, dad jokes CAN be NSFW, it clearly says so in the sub rules. Secondly, it doesn't automatically make it a dad joke if it's from a conversation between you and your child. Most importantly, the jokes that your CHILDREN tell YOU are not dad jokes. The point of a dad joke is that it's so cheesy only a dad who's trying to be funny would make such a joke. That's it. They are stupid plays on words, lame puns and so on. There has to be a clever pun or wordplay for it to be considered a dad joke.

Again, to all the fellow dads, I apologise if I'm sounding too harsh. But I just needed to get it off my chest.

πŸ‘︎ 17k
πŸ’¬︎
πŸ‘€︎ u/anywhereiroa
πŸ“…︎ Jan 15 2022
🚨︎ report
