Can one CPU core observe others' modification immediately? [duplicate] - multithreading

Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer that buffers write requests and can serve reordered read requests from it. The descriptions I've read say that write requests leave the write buffer and are issued toward the cache hierarchy in FIFO order, i.e. the same as program order.
I am curious about:
To serve the write requests issued from the write buffer, does the L1 cache controller handle them, finish the cache coherence for them, and insert the data into L1 cache in the same order they were issued?

Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.
So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. The cache always maintains coherence (that's the point of MESI); it doesn't get out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.
(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
I think you know some of this, but let me start from the basics.
L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).
Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:
It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.
Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.
All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
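To make "exclusive ownership before modification" concrete, here is a toy C++ sketch of the write path in a MESI-style controller. The names and structure are my own illustration, not any real hardware's design:

#include <cstdint>

// Toy model of one cache line's coherence state (illustrative only).
enum class MesiState { Modified, Exclusive, Shared, Invalid };

struct ToyLine {
    MesiState state = MesiState::Invalid;
    uint64_t  data  = 0;
};

// Stand-in for the RFO (Read For Ownership) broadcast: in real hardware this
// invalidates every other cache's copy of the line before it returns.
void issue_rfo_and_invalidate_others(ToyLine&) {}

// A store may only modify the line once this core owns it exclusively.
void commit_store(ToyLine& line, uint64_t value) {
    if (line.state == MesiState::Shared || line.state == MesiState::Invalid) {
        issue_rfo_and_invalidate_others(line);  // can stall a long time under contention
        line.state = MesiState::Exclusive;
    }
    line.data  = value;                // the modification itself is one local write
    line.state = MesiState::Modified;  // now the only valid copy in the system
}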
Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.
Also note that modern Intel designs (before Skylake-AVX512) use a large shared inclusive L3 cache as a backstop for cache coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what).
Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and are thus Invalid in L3. See this paper for more details of a simplified version of what Intel does.
Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.
Back to the actual question:
Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".
The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.
The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.
The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
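To illustrate the one case that does need a fence, here is the classic StoreLoad litmus test in portable C++ (my own sketch, not from the question). With seq_cst, which compilers implement on x86 with a full barrier or an xchg store, the two threads cannot both read 0; with plain release stores and acquire loads, they can:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread a([] {
        x.store(1, std::memory_order_seq_cst);  // with release instead, r1 == r2 == 0 is allowed
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join(); b.join();
    assert(r1 == 1 || r2 == 1);  // both-zero is impossible under seq_cst
}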
On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)
The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.

Yes, in a model like x86-TSO, stores are likely committed to L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores, if they all target the same cache line) to L1 before moving on.1
However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:
And the reason I am confused about this is that the cache controller could handle the requests in a non-blocking way. But, to conform to TSO and make sure data becomes globally visible on a multi-core system, should the cache controller follow the store ordering? Because if there are two variables A and B being updated sequentially on core 1, and core 2 gets the updated B from core 1, then core 2 must also be able to see the updated A. To achieve this, I think the private cache hierarchy on core 1 has to finish the cache coherence for variables A and B in order and make them globally visible in order. Am I right?
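(For reference, the guarantee described in that comment is the classic message-passing pattern. A minimal C++ sketch, with memory orders chosen by me rather than taken from the question; on x86-TSO the hardware provides this ordering even for plain stores, and the release/acquire is there to stop the compiler reordering:)

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};

int main() {
    std::thread core1([] {
        A.store(1, std::memory_order_relaxed);
        B.store(1, std::memory_order_release);  // commits to L1d after A on x86-TSO
    });
    std::thread core2([] {
        while (B.load(std::memory_order_acquire) == 0) {}  // wait for the updated B
        assert(A.load(std::memory_order_relaxed) == 1);    // then the updated A must be visible
    });
    core1.join(); core2.join();
}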
The good news is that even though the store buffer might commit only the oldest store to L1 in an ordered way, it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line into the E state in the local core even before a store becomes the first in line to commit to L1.
This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is L1 store misses that really matter anyway: store hits in L1 can commit rapidly, at least 1 per cycle, so committing a bunch of hits doesn't help much; getting MLP (memory-level parallelism) on store misses is very important, especially for scattered stores the prefetcher can't deal with.
Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.
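A rough sketch of that kind of test (buffer size, count, and stride are my own illustrative choices): if each scattered store miss were serviced one at a time, the cost per store would approach full DRAM latency, so a much lower number implies several misses in flight at once.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr size_t kLines = 1 << 22;             // ~256 MiB of lines, far bigger than LLC
    std::vector<uint64_t> buf(kLines * 8);         // 64 bytes per line = 8 x uint64_t

    std::mt19937_64 rng(42);
    std::vector<size_t> idx(1 << 20);
    for (auto& i : idx) i = (rng() % kLines) * 8;  // random line-granular offsets

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i : idx) buf[i] = 1;               // a long series of scattered store misses
    auto t1 = std::chrono::steady_clock::now();

    double ns_per_store =
        std::chrono::duration<double, std::nano>(t1 - t0).count() / idx.size();
    // Well below ~80-100 ns per store means multiple store misses were outstanding at once.
    printf("%.1f ns per scattered store\n", ns_per_store);
}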
Still, nothing is perfect. There is some evidence that ordering concerns cause weird performance hiccups when stores are missing L1, even if they hit in L2.
1 It is certainly possible for a design to commit stores out of order as long as it maintains the illusion of in-order commit, e.g. by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.

Related

How to put data in L2 cache with A72 Core?

I have an array of data that looks like this:
uint32_t data[128]; //Could be more than L1D Cache size
In order to do computation on it, I want to put the data as close as possible to my computing unit, i.e. in the L2 cache.
My target runs a Linux kernel and some additional apps.
I know that I can get access to a certain area of memory with mmap, and I have successfully done that in a part of my available memory shared between cores.
How to do the same thing but in L2 Cache area ?
I've read part of the gcc documentation and the AArch64 instruction set reference, but cannot figure out how to achieve this.
How to do the same thing but in L2 Cache area ?
Your hardware doesn't support that.
In general, the ARMv8 architecture doesn't make any guarantees about the contents of caches and does not provide any means to explicitly manipulate or query them - it only makes guarantees and provides tools for dealing with coherency.
Specifically, from section D4.4.1 "General behavior of the caches" of the spec:
[...] the architecture cannot guarantee whether:
• A memory location present in the cache remains in the cache.
• A memory location not present in the cache is brought into the cache.
Instead, the following principles apply to the behavior of caches:
• The architecture has a concept of an entry locked down in the cache.
How lockdown is achieved is IMPLEMENTATION DEFINED, and lockdown might
not be supported by:
— A particular implementation.
— Some memory attributes.
• An unlocked entry in a cache might not remain in that cache. The
architecture does not guarantee that an unlocked cache entry remains in
the cache or remains incoherent with the rest of memory. Software must
not assume that an unlocked item that remains in the cache remains dirty.
• A locked entry in a cache is guaranteed to remain in that cache. The
architecture does not guarantee that a locked cache entry remains
incoherent with the rest of memory, that is, it might not remain dirty.
[...]
• Any memory location is not guaranteed to remain incoherent with the rest of memory.
So basically you want cache lockdown. Consulting the manual of your CPU though:
• The Cortex-A72 processor does not support TLB or cache lockdown.
So you can't put something in cache on purpose. Now, you might be able to tell whether something has been cached by trying to observe side effects. The two common side effects of caches are latency and coherency. So you could try and time access times or modify the contents of DRAM and check whether you see that change in your cached mapping... but that's still a terrible idea.
For one, both of these are destructive operations, meaning they will change the property you're measuring, by measuring it. And for another, just because you observe them once does not mean you can rely on that happening.
Bottom line: you cannot guarantee that something is held in any particular cache by the time you use it.
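(For completeness, a latency-probe sketch of the sort described above, subject to all the caveats already given; it only hints that the data was in some level of cache on that particular run:)

#include <chrono>
#include <cstdint>
#include <cstdio>

static uint32_t data[128];

// Time one pass over the array; a faster second pass suggests, but does not
// guarantee, that the data stayed cached between the two passes.
static uint64_t time_pass() {
    volatile uint32_t sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t v : data) sink = sink + v;   // the loads under test
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    printf("first pass : %llu ns\n", (unsigned long long)time_pass());
    printf("second pass: %llu ns\n", (unsigned long long)time_pass());
}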
A cache is not a place where data should be stored; it's just... a cache. :)
I mean, your processor decides which data it should cache and where (L1/L2/L3), and the logic depends on the CPU implementation.
If you wanted to, you could try to work out the algorithm for placing and replacing data in the cache and play with it (without guarantees, of course) by using dedicated instructions to prefetch your data and then maintaining the cache with non-caching instructions for the rest of your program.
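A sketch of that idea using the GCC/Clang prefetch builtin (which maps to a PRFM hint on AArch64); it is only a hint, with no guarantee the lines stay resident:

#include <cstdint>

uint32_t data[128];

uint64_t sum_with_prefetch() {
    // Hint the lines into cache ahead of use; one 64 B line holds 16 uint32_t.
    for (int i = 0; i < 128; i += 16)
        __builtin_prefetch(&data[i], /*rw=*/0, /*locality=*/3);

    uint64_t sum = 0;
    for (uint32_t v : data) sum += v;  // hopefully these loads now hit in cache
    return sum;
}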
Maybe for modern ARM there are easier ways; I'm speaking from an x86/x64 perspective, but my whole point is: are you really sure that you need this?
CPUs are smart enough to cache the data they need, and they do it better and better year by year.
I'd recommend using any profiler that can show you cache misses, to be sure that your data isn't already in the cache.
If it isn't, the first thing to optimize is the algorithm. Try to figure out why there was a cache miss: maybe you should load less data per loop iteration by using temporary variables, for example, or even unroll the loop manually to control where and what is being accessed (see the sketch below).
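For example (an illustrative sketch, not a recipe): accumulate into locals so the hot values live in registers instead of being re-read from memory, and process a few elements per iteration so the access pattern is explicit:

#include <cstdint>

uint32_t data[128];

uint64_t sum_unrolled() {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // temporaries stay in registers
    for (int i = 0; i < 128; i += 4) {          // manually unrolled by 4
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}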

What does it even mean to make an image visible to writes (or available to reads)?

An availability operation can be compared to a cache flush - cache contents are released to main memory.
Similarly, visibility operation can be compared to a cache invalidation - cache consumes contents of main memory.
(it doesn't have to be a 1:1 hw mapping, but you get the idea)
It seems nonsensical to perform a visibility operation before a write (since we're about to override whatever is in our imaginary or no-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
I saw code, which includes memory writes in dstAccessMask and/or memory reads in srcAccessMask. What's the point?
It seems nonsensical to perform a visibility operation before a write (since we're about to override whatever is in our imaginary or no-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
So which not-so-imaginary cache are you talking about? And how are you enforcing coherency with other not-so-imaginary caches in the system? GPUs have many levels of parallel caching, some of which are coherent in hardware and some of which are not. What you get is entirely implementation-defined behavior ...
memory writes in dstAccessMask
A missing invalidation before a write can cause write-after-write hazards on the GPU (you can end up with mismatched parallel copies of the data in two different caches). You can also get issues if the HOST modified the data, which is why HOST is a possible stage.
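For reference, a sketch of what such a barrier usually looks like with the Vulkan C API; the stage/layout choices here are my own illustrative example of a write-after-write case (two transfer writes to the same image), not taken from the question:

#include <vulkan/vulkan.h>

// Barrier between two transfer writes to the same image: the second write shows
// up in dstAccessMask, which is the case being asked about.
void barrier_between_writes(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;  // make the first write available
    barrier.dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;  // make it visible to the next write
    barrier.oldLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,  // srcStageMask
                         VK_PIPELINE_STAGE_TRANSFER_BIT,  // dstStageMask
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}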
memory reads in srcAccessMask
Can't think of a case where you'd need memory operations after a read-only usage.

What happens when different CPU cores write to the same RAM address without synchronization?

Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment of time (plus-minus eta), and without using any interlocked instructions or memory barriers. What happens in this case and what value will be written to the main RAM? The first one wins? The last one wins? Undetermined behavior?
x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two different caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.
The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.
Loads that appear after both the stores in the global order will see the value stored by whichever store went second.
(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do 2nd would be the long-term effect.)
We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned so everything in Why is integer assignment on a naturally aligned variable atomic on x86? applies.
Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).
This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned by not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.
If we cared about the ordering of multiple changes to two different lines, then memory ordering and memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.
When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only microarchitectural1.
So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.
x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).
Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)
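A sketch of the situation in portable C++ (using relaxed atomics so the race itself is well-defined at the language level; with plain non-atomic stores this would be a data race in C++ even though the hardware still picks a winner):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint8_t> shared_byte{0};

int main() {
    std::thread a([] { shared_byte.store(1, std::memory_order_relaxed); });
    std::thread b([] { shared_byte.store(2, std::memory_order_relaxed); });
    a.join(); b.join();
    // Whichever store the hardware ordered second "wins"; every core observes
    // the same single modification order for this byte's cache line.
    printf("final value: %d\n", shared_byte.load(std::memory_order_relaxed));
}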
BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
IDK if anything similar happens when two threads are both just storing without loading, but probably not because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retire from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch.)
Footnotes:
1. Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.
They will wind up being sequenced, likely between the L1 caches. One write will come first and the other will come second. Whichever one comes second will be the result that subsequent reads will see.

What will be used for data exchange between threads that are executing on one core with HT?

Hyper-Threading Technology is a form of simultaneous multithreading
technology introduced by Intel.
These resources include the execution engine, caches, and system bus
interface; the sharing of resources allows two logical processors to
work with each other more efficiently, and allows a stalled logical
processor to borrow resources from the other one.
In an Intel CPU with Hyper-Threading, one CPU core (with several ALUs) can execute instructions from 2 threads in the same clock cycle. And both threads share the store buffer, the L1/L2 caches, and the system bus.
But if two threads execute simultaneously on one core, thread 1 stores an atomic value and thread 2 loads this value, what will be used for the exchange: the shared store buffer, the shared L1/L2 cache, or the L3 cache as usual?
And what happens if both threads are from the same process (the same virtual address space), versus from two different processes (different virtual address spaces)?
Sandy Bridge Intel CPU - cache L1:
32 KB - cache size
64 B - cache line size
512 - lines (512 = 32 KB / 64 B)
8-way
64 - number of sets (64 = 512 lines / 8 ways)
6 bits [11:6] - of the virtual address (the index), which select the current set (the tag is the remaining upper address bits)
4 K - addresses that are equal modulo 4 K (virtual address / 4 K) compete for the same set (32 KB / 8-way)
low 12 bits - significant for determining the current set number
4 KB - standard page size
low 12 bits - the same in virtual and physical addresses for each address
I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)
Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.
Update: See this Q&A for some experimental testing of throughput and latency.
A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.
Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. it retires. But the store buffer also de-couples retirement from the ROB (the ReOrder Buffer in the out-of-order core) vs. commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.
Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
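A sketch of pinning the two threads to sibling logical CPUs with Linux pthread affinity; the CPU numbers are an assumption about one particular machine (check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for yours), and the test body is left as a stub where something like Preshing's store/load loop would go:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_SET / pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to one logical CPU (Linux-specific).
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Placeholder for the actual reordering test body.
static void reordering_test_body() {}

int main() {
    // Assumes logical CPUs 0 and 4 are HT siblings of one physical core.
    std::thread t1([] { pin_to_cpu(0); reordering_test_body(); });
    std::thread t2([] { pin_to_cpu(4); reordering_test_body(); });
    t1.join(); t2.join();
}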
An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs do typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)
Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.
I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
movNT (Non-Temporal) weakly-ordered stores bypass the cache, and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill-buffer. Then probably it's the same as an movnt store followed by a load inside a single thread. (i.e. a round-trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
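A sketch of the intrinsics involved (SSE2); the sfence is what makes the NT stores globally visible before a normal release-store of a flag that tells the other thread the data is ready:

#include <emmintrin.h>   // _mm_stream_si32, _mm_sfence
#include <atomic>

int payload[16];
std::atomic<bool> ready{false};

void producer() {
    for (int i = 0; i < 16; ++i)
        _mm_stream_si32(&payload[i], i);   // NT store: bypasses cache, goes via fill buffers
    _mm_sfence();                          // order the NT stores before the flag store
    ready.store(true, std::memory_order_release);
}

int main() { producer(); }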
L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.
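A quick sketch of the geometry for the 32 KiB, 8-way, 64 B-line L1d described in the question: the set index comes entirely from address bits [11:6], which sit inside the 4 KiB page offset, so virtual and physical indexing agree.

#include <cstdint>
#include <cstdio>

// 32 KiB / 8 ways / 64 B lines => 64 sets, indexed by address bits [11:6].
constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kSets      = 64;

uint64_t l1d_set(uint64_t addr) { return (addr / kLineBytes) % kSets; }   // bits [11:6]
uint64_t l1d_tag(uint64_t addr) { return addr / (kLineBytes * kSets); }   // bits [63:12]

int main() {
    uint64_t va = 0x7ffd1234ab48;   // arbitrary example address
    // Bits [11:0] are identical in the virtual and physical address (4 KiB pages),
    // so the set can be selected in parallel with the TLB translating bits [63:12].
    printf("set %llu, page offset 0x%llx\n",
           (unsigned long long)l1d_set(va), (unsigned long long)(va & 0xfff));
}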
Since L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's 2 threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively shared by both hyperthreads without risk of false-positive cache hits. Unlike the dTLB, which needs to tag its entries with a core ID.
A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.
Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.
There is a "shared mode":
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the
logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache
can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this
reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology
It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.
The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.
It would be possible in theory to build an OoO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leave the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.
Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.
Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.
Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
As @BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because it would let the SMT sibling(s) see your store before it becomes globally visible to other cores.
TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.

MESI cache protocol

I was reading about the MESI snooping cache coherence protocol, which I guess is the protocol that is used in modern multicore x86 processors (please correct me if I'm wrong). Now that article says this at one place.
A cache that holds a line in the Modified state must snoop (intercept) all
attempted reads (from all of the other caches in the system) of the
corresponding main memory location and insert the data that it holds. This is
typically done by forcing the read to back off (i.e. retry later), then writing
the data to main memory and changing the cache line to the Shared state.
Now what I don't understand is why the data needs to be written to main memory. Can't cache coherence just keep the contents of the caches synchronized without going to memory (unless the cache line is truly evicted, of course)? I mean, if one core is constantly reading and the other constantly writing, why not keep the data in cache and keep updating it there? Why incur the cost of writing back to main memory?
In other words, can't the cores reading the data, directly read from the cache of the writing core and modify their cache accordingly?
Now what I don't understand is why the data needs to be written to main memory. Can't cache coherence just keep the contents of the caches synchronized without going to memory (unless the cache line is truly evicted, of course)?
This does happen.
My laptop has an Intel Core i5, which looks like this:
M
N
S
L3U
L2U L2U
L1D L1D
L1I L1I
P P
L L L L
M = Main memory
N = NUMA node
S = Socket
L3U = Level 3 Unified
L2U = Level 2 Unified
L1D = Level 1 Data
L1I = Level 1 Instruction
P = Processor
L = Logical core
When two logical cores are operating on the same data, they don't move out to main memory; they exchange over the L1 and L2 caches. Likewise, when cores in the two processors are working, they exchange data over the L3 cache. Main memory isn't used unless eviction occurs.
But a simpler CPU could indeed be less clever about things.
The MESI protocol doesn't allow more than one cache to keep the same cache line in the Modified state. So, if a modified cache line needs to be read by another processor's cache, it must first be written to main memory and then read, so that both processors' caches now share that line (Shared state).
Because caches typically aren't able to write directly into each other (as this would take more bandwidth).
what I don't understand is why the data needs to be written to main memory
Let's say processor A has that modified line. Now processor B is trying to read that same cache line (modified by A) from main memory. Since the content in main memory is now stale (because A modified it), A is snooping all other read attempts for that line. So in order to allow processor B (and others) to read that line, A has to write it back to main memory.
I mean, if one core is constantly reading and the other constantly writing, why not keep the data in cache and keep updating it there?
You are right, and this is what is usually done. But here, that's not the case. Someone else (processor B in our example) is trying to read. So A has to write the line back and set its state to Shared, because both A and B are sharing the cache line now.
So actually I don't think the reading cache has to go to main memory.
In MESI, when a processor requests a block modified by one of its peers, it issues a read miss on the bus (or whatever the interconnect medium is), which is broadcast to every processor.
The processor which holds the block in the "Modified" state catches the request and issues a copy-back on the bus - carrying the block ID and the value - while changing its own copy's state to "Shared". The copy-back is received by the requesting processor, which writes the block into its local cache and tags it as "Shared".
Whether the copy-back issued by the owning processor also goes to main memory depends on the implementation.
edit: note that the MOESI protocol adds an "Owned" state that is very similar to "Shared", and allows a processor holding a copy of a block in the Owned state to copy the value back on the bus if it catches a broadcast write/read miss for this block.
