A list of puns related to "Cache coherence"
I was implementing direct-mapped, LRU, and n-way set-associative caches in C++ and found that the direct-mapped cache is inherently parallelizable and can easily reach 2.4 billion lookups per second. Even the slower n-way set-associative cache surpassed 70 million lookups per second. If software can do this, why not hardware?
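To make the "inherently parallelizable" point concrete, here is a minimal Go sketch of a direct-mapped lookup (the original implementation was in C++; the 64-byte line size and 128 sets are just assumed for illustration). Each address maps to exactly one slot, so a lookup is a shift, a mask, and one tag compare, with no replacement policy to serialize on.

package main

import "fmt"

const (
	lineBits = 6           // assume 64-byte cache lines
	numSets  = 128         // power of two, so the set index is a simple mask
	setMask  = numSets - 1
)

// line models one cache line's metadata: a tag plus a valid bit (payload omitted).
type line struct {
	tag   uint64
	valid bool
}

// cache is a direct-mapped cache: each address maps to exactly one slot.
type cache struct {
	sets [numSets]line
}

// lookup reports whether addr hits: one shift, one mask, one compare.
func (c *cache) lookup(addr uint64) bool {
	idx := (addr >> lineBits) & setMask
	l := &c.sets[idx]
	return l.valid && l.tag == addr>>lineBits
}

// insert fills the slot for addr, evicting whatever was there (no LRU bookkeeping).
func (c *cache) insert(addr uint64) {
	idx := (addr >> lineBits) & setMask
	c.sets[idx] = line{tag: addr >> lineBits, valid: true}
}

func main() {
	var c cache
	c.insert(0x1f40)
	fmt.Println(c.lookup(0x1f40), c.lookup(0x2000)) // true false
}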
Writing to a variable in thread 1 and reading it from thread 2 within just L1 latency, instead of locking for 1000 nanoseconds or fiddling with atomics. Just writing and reading. Imagine the good old concurrent queue doing 5 billion pushes and pops per second, or concurrent tree access for a game scene graph. Concurrent anything would be blazingly scalable, if only the L1 cache had enough sets (if it's n-way set-associative) or tags (if it's direct-mapped), say 64 or 128.
Also, when not multithreading, all that bandwidth would be available to a single thread for some really good async I/O handling for web sites, or maybe just to boost the out-of-order execution capabilities? I really wouldn't care if Excel loads a bit late; it's not a Pentium 2 after all. There must be a way of hiding the extra latency behind something else, like doing more L1 accesses at a time or working on more instructions at a time, maybe not as massively parallel as a GPU.
If it's neither possible nor cheap, then why don't they add hardware pipes that connect cores directly? Something like calling an assembly instruction to send data straight to another core. GPUs have shared (local) memory that can share data between all pipelines in a multiprocessor, and GPU designers even optimize atomic functions in hardware so that many atomics run in parallel. Even if only atomics worked the way they do on a GPU, lockless concurrent data structures would get a decent boost.
What about stacking cores in the third dimension, just above L1, to shorten the path? Would it work? Maybe it's not "just a few more cycles" as I guessed, but is it possible even at a higher price per chip?
What about putting carbon nanotubes (or open-ended tubes/pipes) between stacked cores and pumping some cooling gas or liquid through them?
What about routing power wirelessly? Is it possible to set up a standing wave / resonance between the stacks and feed the transistors with EM waves?
If lining up the stacks is a problem, can we carve the stacks out of a single crystalline structure that somehow works as transistors, capacitors, etc., with just some extra atoms for capacitance and so on? This may have gone too far into fantasy, but buying a computer (IBM personal c…
I'm searching the privileged and unprivileged specs for what they say about cache coherence and which cache policy RISC-V implements, but I have one question:
- Does RISC-V (as an ISA) define any cache coherence policy at all?
I read about PMAs and how they define memory regions as cacheable, but I'm not sure whether that defines a coherence scheme or is just a base structure for a hardware/software cache coherence implementation (e.g., snooping or directory-based).
Is cache coherence a problem with processor registers? (Or is it only an issue between their caches and memory?) If so, how do systems deal with cache coherence for the registers?
Thanks!
There's been a bit of discussion about RISC-V recently, and its performance and such.
There was a thread over on /r/programming where /u/memgrid was discussing some design deficiencies in the base RISC-V regarding DMA and cache coherence.
https://old.reddit.com/r/programming/comments/isgpw9/arm_ukbased_chip_designer_sold_to_us_firm_nvidia/g582e73/
Questions:
Is this actually a problem in practice?
Is there a proposal in the works for a standardized extension that addresses the issue?
What else is being done in this area?
I'd welcome any papers, videos or other information regarding how this is handled in current and next-generation RISC-V core implementations.
Edit: Providing more context for the linked thread:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA
> With the Linux 5.2 kernel an AMD-supplied change by AMDGPU developer Christian KΓΆnig allows for supporting peer-to-peer DMA between any devices on AMD Zen systems.
Preparing for future Zen 3 datacenter APUs with heterogeneous CPU+GPU clusters.
I tried looking around. Most courses, like Georgia Tech's HPCA, only give a broad overview of coherency within their general computer architecture material. ARM doesn't have many open resources/tutorials on this either.
Are there any specific books or resources (preferably ones that capture modern architectures) that cover it in detail?
I see the flaws in everything I do and I refactor endlessly, each time with a higher expectation. I know it's unhealthy but I actually don't know how to get my mind to rest knowing that whatever I'm working on isn't the best possible solution to whatever insignificant problem I happen to be working on.
Sometimes I think that what I need is a heuristic other than performance to constrain myself with, but I have a hard time doing that because there isn't anything that provides the kind of objective constraint that performance does.
I'd love to hear what you all have to say - I can't be the only one that's struggled with this issue.
In my example, I have memory which is host coherent and uncached (and thus write-combined, to still provide fast CPU writes but not reads). I'm mainly curious whether write-combined memory can achieve the same performance as ordinary writes to cached memory.
Write combining with uncached memory is, in theory, supposed to be nearly as fast as writes to cached memory would be, but I'm curious whether folks have noticed that there's still some overhead with write combining that we wouldn't see if we just cached the memory. I'm assuming that anything which triggers a flush of the WC memory before the cache lines can be filled would obviously be bad for performance.
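Not an answer, but the mitigation I usually see is to make writes to WC memory large, sequential, and write-only, so each combining buffer fills completely before it drains; scattered or read-modify-write access is what hurts. Below is a rough Go sketch of the access pattern only (the dst slice is just a stand-in for a mapped host-visible allocation, not real write-combined memory):

package main

// fillSequential streams src into dst front to back in one pass and never
// reads dst, so on write-combined memory each combining buffer would fill
// completely before being flushed.
func fillSequential(dst, src []byte) {
	copy(dst, src)
}

// fillStrided is the pattern to avoid: strided writes touch many lines only
// partially, forcing the combining buffers to flush before they are full.
func fillStrided(dst, src []byte, stride int) {
	for start := 0; start < stride; start++ {
		for i := start; i < len(src) && i < len(dst); i += stride {
			dst[i] = src[i]
		}
	}
}

func main() {
	src := make([]byte, 1<<20)
	dst := make([]byte, 1<<20) // stand-in for a mapped write-combined buffer
	fillSequential(dst, src)
	fillStrided(dst, src, 4096)
}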
Thanks.
RISC-V has enabled many new open system architectures. OmniXtend is an Ethernet-based cache-coherent memory fabric which leverages TileLink from the RISC-V ecosystem. This April 16 online event provides an overview and details. https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/269617034/
Hi, I'm looking for the correct, professionally accepted translation of this term. It relates to computer processor architecture. Thanks!
I use the term "vs" because a quick look at the cache coherency Wikipedia page revealed this:
> Write Propagation: Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
I understood that reading from cache didn't require additional trips to main memory unless that data got updated (unless I'm wrong).
Bottom Line Question: When operating on an SSBO in a compute shader, and you know that a particular item in the buffer will be operated on by one compute invocation and only one invocation, is it more cache friendly to:
My reading of the Wikipedia page is that the latter would require a cache-SSBO synchronization on every write to every member, while the former would require a cache synchronization only once (when the data is written back to the SSBO).
var a string
var done bool

func setup() {
	a = "hello world"
	done = true
}

func main() {
	go setup()
	for !done {
	}
	print(a)
}
Apologies in advance, because this is probably just someone raising their hand over a simple misinterpretation.
https://go.dev/ref/mem
At the end of The Go Memory Model there is this claim about the above code: "there is no guarantee that the write to 'done' ever will be observed by main, since there are no synchronization events between the two threads. The loop in main is not guaranteed to finish."
In what scenario would it not finish?
The code is an obvious code smell for not using sync primitives, but I can't justify the claim about the loop, based on these assumptions:
Therefore, regardless of the order in which the 'setup' and 'main' goroutines are scheduled, the loop exits and 'print(a)' executes.
That being said, the question contributes little value except insofar as it directs others to review the link, so thanks a ton :P. For some reason I found the author's other examples to be perfect candidates for what slithers into a code base, especially tests, and I want to understand in case I missed something.
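For what it's worth, one scenario the memory model allows: with no synchronization event, the compiler (or CPU) may treat 'done' as unchanging inside the loop, for example by hoisting the load out of it, so main can legally spin forever. Here is a minimal sketch of how the example is usually made well-defined, signalling through a channel instead of a plain bool (one possible fix, not necessarily what the document intends):

package main

var a string
var done = make(chan struct{})

func setup() {
	a = "hello world"
	close(done) // closing the channel is a synchronization event
}

func main() {
	go setup()
	<-done   // this receive happens after the close in setup
	print(a) // guaranteed to observe the write to a
}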
My manager said the company is willing to pay for my HPCA course if I can justify how it's useful to my job. I do general DevOps/CI/CD work: writing scripts for safe code deploys, testing code deploys, maintaining and building out our CI agents, and releasing our mobile apps. We use AWS, Terraform, Ansible, Docker, Kubernetes, Puppet, and Python/Ruby/Bash.
Any suggestions for how I can justify the company paying for it? It would be a dream to not have to pay out of pocket. Every OMSCS course I've taken has been tremendously helpful to my work but this is less directly related, and I haven't actually taken the course yet so I only have a vague idea of what I'll be learning.
A lot of the work I am interested in will be mostly built from scratch by myself, provided there is fair support for numerical types (like complex numbers) and high-precision numerical operations (if not, I'll be happy to write those routines as well). Many of my areas of interest are computationally demanding (Python code chokes for large enough datasets) but are often parallelizable, and I am looking for guidance on implementing them.

I love math and physics, especially domains that involve rigorous analysis, ranging from physical/mathematical concepts like turbulence, topology, wave optics, electromagnetism and quantum physics to computational concepts like cryptography and information theory. I also love signal processing, especially relating to random and sparse signals. These require a decent amount of precision while simultaneously being fast enough. I wish to be able to run the code on low-power and high-power manycore or SIMD processors, with the sequential parts being run on a general-purpose processor or a highly pipelined FPGA. Energy efficiency is one of my key targets along with speed (many scenarios are energy constrained), even if it requires longer and more customized code.

Another area of interest, while not my primary goal, is to implement redundancy using parallelism (including different compression/storage methods, e.g. RAID). I would like to have some control over the different memory allocations (hierarchies of caches and scratchpad memories) and, if possible, some of the caching schemes, while still being usable across multiple architectures. If possible, I'd also like options to optimize for bursts and broadcasting, and for the presence or absence of hardware lockstep, depending upon hardware support (using switches to run different routines for different hardware when it comes to memory copy and allocation, basically caching). Sorry for the long and open-ended question; I realized it would be hard to really come to a decision without getting a holistic picture of the whole domain, at least to a fair level of depth. I am looking for suggestions in both hardware and software.
My primary concern is software, including but not limited to languages, compilers and directives. Not being from a software programming background, I find it hard to search for the proper areas (and keywords). I would like to share my current understanding of the scenario in terms of software: I have currently explored CUDA, OpenCL, Fortran, OpenMP, Open MPI, OpenACC, and Julia…