About LR/SC in two soft threads on same hart

About LR/SC in two soft threads on same hart - riscv

If I have two soft threads on the same hart, Thread #1 first execute an LR instruction, then #2 execute an LR instruction with the same address, and finally #1 SC to that address. Will this SC succeed? Which LR will be paired(#1 or #2) if successful?
The aq/rl bit is not set and all address of LR/SC are same.

Two threads running on the same hart means that there has to be a context switch.
Reading RISC-V Unprivileged ISA V20191213, I found this recommendation on page 50:
A store-conditional instruction to a scratch word of memory should be used to forcibly invalidate any existing load reservation:
during a preemptive context switch, and
if necessary when changing virtual to physical address mappings, such as when migrating pages that might contain an active reservation.
The commit message from the commit that introduced this text elaborates:
Commit 170f3c5 clarified that reservations can be cleared with an SC to
a dummy memory location. As discussed
<170f3c5#commitcomment-29386537>,
this patch makes it clear that the reservation "should" be cleared in
this way during a preemptive context switch.
Anyone writing preemptive context switch code should be forcibly
clearing load reservations. Although other mechanisms might be used,
this is the standard way of doing it (hence "should" rather than
"must").
The Linux kernel does this in entry.S:
/*
* The current load reservation is effectively part of the processor's
* state, in the sense that load reservations cannot be shared between
* different hart contexts. We can't actually save and restore a load
* reservation, so instead here we clear any existing reservation --
* it's always legal for implementations to clear load reservations at
* any point (as long as the forward progress guarantee is kept, but
* we'll ignore that here).
*
* Dangling load reservations can be the result of taking a trap in the
* middle of an LR/SC sequence, but can also be the result of a taken
* forward branch around an SC -- which is how we implement CAS. As a
* result we need to clear reservations between the last CAS and the
* jump back to the new context. While it is unlikely the store
* completes, implementations are allowed to expand reservations to be
* arbitrarily large.
*/
REG_L a2, PT_EPC(sp)
REG_SC x0, a2, PT_EPC(sp)
This code does a regular load of some memory, then uses an SC to write the same value back to the same location. The SC clears any reservations, and it doesn't matter if it succeeds or not.
Anyway, in preemptive multitasking systems, a program can expect the operating system kernel to invalidate any reservations during context switches. In such a system, if you're switched out in the middle of your LR/SC sequence, your SC instruction will always fail.
In cooperative multitasking systems (where your program is only switched out when it chooses to yield), your SC instruction could succeed -- but only if you yield in the middle of your LR/SC sequence. And the other thread does the same.
We would have a code sequence like this:
Thread A Thread B
----------------------
LR
yield
LR
yield
SC
yield
SC
As I understand the spec, A's SC will succeed, while B's SC will fail. The SC instruction will always pair with the most recent LR instruction, so the successful SC from thread A will pair with the LR from thread B.
But this code is nonsense. The spec even used to contain recommendations against it:
Cooperative user-level context switches might not cause a load reservation
to be yielded, so user-level threads should generally avoid voluntary
context switches in the middle of an LR/SC sequence.

Related

How to reproduce and test the function of memory barrier in X86 Linux [duplicate]

I've been studying the memory model and saw this (quote from https://research.swtch.com/hwmm):
Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?
// Thread 1 // Thread 2
x = 1 y = 1
r1 = y r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
Fact 1: This is the store buffer litmus test mentioned in many articles. They all say that both r1 and r2 being zero could happen on TSO because of the existence of the store buffer. They seem to assume that all the stores and loads are executed in order, and yet the result is both r1 and r2 being zero. This later concludes that "store/load reordering could happen", as a "consequence of the store buffer's existence".
Fact 2: However we know that OoO execution could also reorder the store and the load in both threads. In this sense, regardless of the store buffer, this reordering could result in both r1 and r2 being zero, as long as all four instructions retire without seeing each other's invalidation to x or y. And this seems to me that "store/load reordering could happen", just because "they are executed out of order". (I might be very wrong about this since this is the best I know of speculation and OoO execution.)
I wonder how these two facts converge (assuming I happen to be right about both): Is store buffer or OoO execution the reason for "store/load reordering", or both are?
Alternatively speaking: Say I somehow observed this litmus test on an x86 machine, was it because of the store buffer, or OoO execution? Or is it even possible to know which?
EDIT: Actually my major confusion is the unclear causality among the following points from various literatures:
OoO execution can cause the memory reordering;
Store/load reordering is caused by the store buffer and demonstrated by a litmus test (and thus named as "store buffer");
Some program having the exact same instructions as the store buffer litmus test is used as an observable OoO execution example, just as this article https://preshing.com/20120515/memory-reordering-caught-in-the-act does.
1 + 2 seems to imply that the store buffer is the cause, and OoO execution is the consequence. 3 + 1 seems to imply that OoO execution is the cause, and memory reordering is the consequence. I can no more tell which causes which. And it is that litmus test sitting in the middle of this mystery.

It makes some sense to call StoreLoad reordering an effect of the store buffer because the way to prevent it is with mfence or a locked instruction that drains the store buffer before later loads are allowed to read from cache. Merely serializing execution (with lfence) would not be sufficient, because the store buffer still exists. Note that even sfence ; lfence isn't sufficient.
Also I assume P5 Pentium (in-order dual-issue) has a store buffer, so SMP systems based on it could have this effect, in which case it would definitely be due to the store buffer. IDK how thoroughly the x86 memory model was documented in the early days before PPro even existed, but any naming of litmus tests done before that might well reflect in-order assumptions. (And naming after might include still-existing in-order systems.)
You can't tell which effect caused StoreLoad reordering. It's possible on a real x86 CPU (with a store buffer) for a later load to execute before the store has even written its address and data to the store buffer.
And yes, executing a store just means writing to the store buffer; it can't commit from the SB to L1d cache and become visible to other cores until after the store retires from the ROB (and thus is known to be non-speculative).
(Retirement happens in-order to support "precise exceptions". Otherwise, chaos ensues and discovering a mis-predict might mean rolling back the state of other cores, i.e. a design that's not sane. Can a speculatively executed CPU branch contain opcodes that access RAM? explains why a store buffer is necessary for OoO exec in general.)
I can't think of any detectable side-effect of the load uop executing before the store-data and/or store-address uops, or before the store retires, rather than after the store retires but before it commits to L1d cache.
You could force the latter case by putting an lfence between the store and the load, so the reordering is definitely caused by the store buffer. (A stronger barrier like mfence, a locked instruction, or a serializing instruction like cpuid, will all block the reordering entirely by draining the store buffer before the later load can execute. As an implementation detail, before it can even issue.)
A normal out of order exec treats all instructions as speculative, only becoming non-speculative when they retire from the ROB, which is done in program order to support precise exceptions. (See Out-of-order execution vs. speculative execution for a more in-depth exploration of that idea, in the context of Intel's Meltdown vulnerability.)
A hypothetical design with OoO exec but no store buffer would be possible. It would perform terribly, with each store having to wait for all previous instructions to be definitively known to not fault or otherwise be mispredicted / mis-speculated before the store can be allowed to execute.
This is not quite the same thing as saying that they need to have already executed, though (e.g. just executing the store-address uop of an earlier store would be enough to know it's non-faulting, or for a load doing the TLB/page-table checks will tell you it's non-faulting even if the data hasn't arrived yet). However, every branch instruction would need to be already executed (and known-correct), as would every ALU instruction like div that can.
Such a CPU also doesn't need to stop later loads from running before stores. A speculative load has no architectural effect / visibility, so it's ok if other cores see a share-request for a cache line which was the result of a mis-speculation. (On a memory region whose semantics allow that, such as normal WB write-back cacheable memory). That's why HW prefetching and speculative execution work in normal CPUs.
The memory model even allows StoreLoad ordering, so we're not speculating on memory ordering, only on the store (and other intervening instructions) not faulting. Which again is fine; speculative loads are always fine, it's speculative stores that we must not let other cores see. (So we can't do them at all if we don't have a store buffer or some other mechanism.)
(Fun fact: real x86 CPUs do speculate on memory ordering by doing loads out of order with each other, depending on addresses being ready or not, and on cache hit/miss. This can lead to memory order mis-speculation "machine clears" aka pipeline nukes (machine_clears.memory_ordering perf event) if another core wrote to a cache line between when it was actually read and the earliest the memory model said we could. Or even if we guess wrong about whether a load is going to reload something stored recently or not; memory disambiguation when addresses aren't ready yet involves dynamic prediction so you can provoke machine_clears.memory_ordering with single-threaded code.)
Out-of-order exec in P6 didn't introduce any new kinds of memory re-ordering because that could have broken existing multi-threaded binaries. (At that time mostly just OS kernels, I'd guess!) That's why early loads have to be speculative if done at all. x86's main reason for existence it backwards compat; back then it wasn't the performance king.
Re: why this litmus test exists at all, if that's what you mean?
Obviously to highlight something that can happen on x86.
Is StoreLoad reordering important? Usually it's not a problem; acquire / release synchronization is sufficient for most inter-thread communication about a buffer being ready to read, or more generally a lock-free queue. Or to implement mutexes. ISO C++ only guarantees that mutexes lock / unlock are acquire and release operations, not seq_cst.
It's pretty rare that an algorithm depends on draining the store buffer before a later load.
Say I somehow observed this litmus test on an x86 machine,
Fully working program that verifies that this reordering is possible in real life on real x86 CPUs: https://preshing.com/20120515/memory-reordering-caught-in-the-act/. (The rest of Preshing's articles on memory ordering are also excellent. Great for getting a conceptual understanding of inter-thread communication via lockless operations.)

What happens with CPU context when Goroutines are switching?

If I correctly understand how goroutines work on top of system threads - they run from queue one by one. But does it mean that every goroutine loads\unloads it's context to CPU? If yes what's difference between system threads and goroutines?
The most significant problem is time-cost of context-switching. Is it correct?
What mechanism lays under detecting which data was requested by which goroutine? For example: I am sending request to DB from goroutine A and doesn't wait for response and at the same time occurred switch to a next goroutine. How system understands that a request came from A and not from B or C?

Goroutines, memory and OS threads
Go has a segmented stack that grows as needed. Go runtime does the scheduling, not the OS. The runtime multiplexes the goroutines onto a relatively small number of real OS threads.
Goroutines switch cost
Goroutines are scheduled cooperatively and when a switch occurs, only 3 registers need to be saved/restored - Program Counter, Stack Pointer, and DX. From the OS's perspective Go program behaves as an event-driven program.
Goroutines and CPU
You cannot directly control the number of threads that the runtime will create. It is possible to set the number of processor cores used by the program by setting the variable GOMAXPROCS with a call of runtime.GOMAXPROCS(n).
Program Counter
and a completely different story
In computing, a program is a specific set of ordered operations for a computer to perform. An instruction is an order given to a computer processor by a program. Within a computer, an address is a specific location in memory or storage. A program counter register is one of a small set of data holding places that the processor uses.
This is a different story of how programs work and communicate with each other and it doesn't directly relate to a goroutine topic.
Sources:
http://blog.nindalf.com/how-goroutines-work/
https://gobyexample.com/goroutines
http://tleyden.github.io/blog/2014/10/30/goroutines-vs-threads/
http://whatis.techtarget.com/definition/program-counter

Gs, Ms, Ps
A "G" is simply a goroutine. It's represented by type g. When a goroutine exits, its g object is returned to a pool of free gs and can later be reused for some other goroutine.
An "M" is an OS thread that can be executing user Go code, runtime code, a system call, or be idle. It's represented by type m. There can be any number of Ms at a time since any number of threads may be blocked in system calls.
Finally, a "P" represents the resources required to execute user Go code, such as scheduler and memory allocator state. It's represented by type p. There are exactly GOMAXPROCS Ps. A P can be thought of like a CPU in the OS scheduler and the contents of the p type like per-CPU state. This is a good place to put state that needs to be sharded for efficiency, but doesn't need to be per-thread or per-goroutine.
The scheduler's job is to match up a G (the code to execute), an M (where to execute it), and a P (the rights and resources to execute it). When an M stops executing user Go code, for example by entering a system call, it returns its P to the idle P pool. In order to resume executing user Go code, for example on return from a system call, it must acquire a P from the idle pool.
All g, m, and p objects are heap allocated, but are never freed, so their memory remains type stable. As a result, the runtime can avoid write barriers in the depths of the scheduler.
User stacks and system stacks
Every non-dead G has a user stack associated with it, which is what user Go code executes on. User stacks start small (e.g., 2K) and grow or shrink dynamically.
Every M has a system stack associated with it (also known as the M's "g0" stack because it's implemented as a stub G) and, on Unix platforms, a signal stack (also known as the M's "gsignal" stack). System and signal stacks cannot grow, but are large enough to execute runtime and cgo code (8K in a pure Go binary; system-allocated in a cgo binary).
Runtime code often temporarily switches to the system stack using systemstack, mcall, or asmcgocall to perform tasks that must not be preempted, that must not grow the user stack, or that switch user goroutines. Code running on the system stack is implicitly non-preemptible and the garbage collector does not scan system stacks. While running on the system stack, the current user stack is not used for execution.
Ref: https://github.com/golang/go/blob/master/src/runtime/HACKING.md

Out-of-order execution and reordering: can I see what after barrier before the barrier?

According to wikipedia: A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Usually, articles talking about something like (I will use monitors instead of membars):
class ReadWriteExample {
int A = 0;
int Another = 0;
//thread1 runs this method
void writer () {
lock monitor1; //a new value will be stored
A = 10; //stores 10 to memory location A
unlock monitor1; //a new value is ready for reader to read
Another = 20; //#see my question
}
//thread2 runs this method
void reader () {
lock monitor1; //a new value will be read
assert A == 10; //loads from memory location A
print Another //#see my question
unlock monitor1;//a new value was just read
}
}
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
I.e. by definition operations issued prior to barrier can't be pushed down by compiler, but is it possible that operations issued after barrier would be occasionally seen before barrier? (just a probability)
Thanks

My answer below only addresses Java's memory model. The answer really can't be made for all languages as each may define the rules differently.
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
Your answer seems to be "Is it possible for the store of A = 20, be re-ordered above the unlock monitor?"
The answer is yes, it can be. If you look at the JSR 166 Cookbook, the first grid shown explains how re-orderings work.
In your writer case the first operation would be MonitorExit the second operation would be NormalStore. The grid explains, yes this sequence is permitted to be re-ordered.
This is known as Roach Motel ordering, that is, memory accesses can be moved into a synchronized block but cannot be moved out
What about another language? Well, this question is too broad to answer all questions as each may define the rules differently. If this is the case you would need to refine your question.

In Java there is the concept of happens-before. You can read all the details about it on in the Java Specification. A Java compiler or runtime engine can re-order code but it must abide by the happens-before rules. These rules are important for a Java developer that wants to have detailed control on how their code is re-ordered. I myself have been burnt by re-ordering code, turns out I was referencing the same object via two different variables and the runtime engine re-ordered my code not realizing that the operations were on the same object. If I had either a happens-before (between the two operations) or used the same variable, then the re-ordering would not have occurred.
Specifically:
It follows from the above definitions that:
An unlock on a monitor happens-before every subsequent lock on that monitor.
A write to a volatile field (§8.3.1.4) happens-before every subsequent
read of that field.
A call to start() on a thread happens-before any actions in the
started thread.
All actions in a thread happen-before any other thread successfully
returns from a join() on that thread.
The default initialization of any object happens-before any other
actions (other than default-writes) of a program.

Short answer - yes. This is very compiler and CPU architecture dependent. You have here the definition of a Race Condition. The scheduling Quantum won't end mid-instruction (can't have two writes to same location). However - the quantum could end between instructions - plus how they are executed out-of-order in the pipeline is architecture dependent (outside of the monitor block).
Now comes the "it depends" complications. The CPU guarantees little (see race condition). You might also look at NUMA (ccNUMA) - it is a method to scale CPU & Memory access by grouping CPUs (Nodes) with local RAM and a group owner - plus a special bus between Nodes.
The monitor doesn't prevent the other thread from running. It only prevents it from entering the code between the monitors. Therefore when the Writer exits the monitor-section it is free to execute the next statement - regardless of the other thread being inside the monitor. Monitors are gates that block access. Also - the quantum could interrupt the second thread after the A== statement - allowing Another to change value. Again - the quantum won't interrupt mid-instruction. Always think of threads executing in perfect parallel.
How do you apply this? I'm a bit out of date (sorry, C#/Java these days) with current Intel processors - and how their Pipelines work (hyperthreading etc). Years ago I worked with a processor called MIPS - and it had (through compiler instruction ordering) the ability to execute instructions that occurred serially AFTER a Branch instruction (Delay Slot). On this CPU/Compiler combination - YES - what you describe could happen. If Intel offers the same - then yes - it could happen. Esp with the NUMA (both Intel & AMD have this, I'm most familiar with AMD implementation).
My point - if threads were running across NUMA nodes - and access was to the common memory location then it could occur. Of course the OS tries hard to schedule operations within the same node.
You might be able to simulate this. I know C++ on MS allows access to NUMA technology (I've played with it). See if you can allocate memory across two nodes (placing A on one, and Another on the other). Schedule the threads to run on specific Nodes.
What happens in this model is that there are two pathways to RAM. I suppose this isn't what you had in mind - probably only a single path/Node model. In which case I go back to the MIPS model I described above.
I assumed a processor that interrupts - there are others that have a Yield model.

Memory Barrer and Visibility - x64

I have read Intel document about memory orderings on x64: http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf .They says that locked instructions cause full barriers which makes processors to see e.g. updates in specified order. But there is nothing about visibility caused by barriers. Does barriers cause that other processors will see updates of variables immediately or maybe updates will propagate to other processors only in specified order but with not specified time?
E.g.
Thread1:
flag = true;
MemoryBarrier();
Thread 2:
MemoryBarrier();
tmp = flag;
Does thread 2 will always flag=true if Thread 1 will execute its code before Thread 2?

The barriers guarantee that other processors will see updates in the specified order, but not when that happens.
Which brings the follow-up question, how do you define "immediately" in a multiprocessor system [1], or how do you ensure that Thread 1 executes before Thread 2? In this case, one answer would be that Thread 1 uses an atomic instruction such as xchg to do the store to the flag variable, and then Thread 2 spins on the flag, and proceeds when it notices that the value changes (due to the way the x86 memory model works, Thread 2 can spin using normal load instructions, it is sufficient that the store is done with an atomic)
[1] One can think of it in terms of relativistic physics, each observer (thread) sees events through its own "light cone". Hence one must abandon concepts such as a single universal time for all observers.

Does the linux scheduler needs to be context switched?

I have a general question about the linux scheduler and some other similar kernel system calls.
Is the linux scheduler considered a "process" and every call to the scheduler requires a context switch like its just another process?
Say we have a clock tick which interrupts the current running user mode process, and we now have to call the scheduler. Does the call to the scheduler itself provokes a context switch? Does the scheduler has its own set of registers and U-area and whatnot which it has to restore at every call?
And the said question applies to many other system calls. Do kernel processes behave like regular processes in regard to context switching, the only difference is that they have more permissions and access to the cpu?
I ask this because context switch overhead is expensive. And it sounds odd that calling the scheduler itself provokes a context switch to restore the scheduler state, and after that the scheduler calls another process to run and again another context switch.

That's a very good question, and the answer to it would be "yes" except for the fact that the hardware is aware of the concept of an OS and task scheduler.
In the hardware, you'll find registers that are restricted to "supervisor" mode. Without going into too much detail about the internal CPU architecture, there's a copy of the basic program execution registers for "user mode" and "supervisor mode," the latter of which can only be accessed by the OS itself (via a flag in a control register that the kernel sets which says whether or not the kernel or a user mode application is currently running).
So the "context switch" you speak of is the process of swapping/resetting the user mode registers (instruction register, stack pointer register, etc.) etc. but the system registers don't need to be swapped out because they're stored apart from the user ones.
For instance, the user mode stack in x86 is USP - A7, whereas the supervisor mode stack is SSP - A7. So the kernel itself (which contains the task scheduler) would use the supervisor mode stack and other supervisor mode registers to run itself, setting the supervisor mode flag to 1 when it's running, then perform a context switch on the user mode hardware to swap between apps and setting the supervisor mode flag to 0.
But prior to the idea of OSes and task scheduling, if you wanted to do a multitasking system then you'd have had to use the basic concept that you outlined in your question: use a hardware interrupt to call the task scheduler every x cycles, then swap out the app for the task scheduler, then swap in the new app. But in most cases the timer interrupt would be your actual task scheduler itself and it would have been heavily optimized to make it less of a context switch and more of a simple interrupt handler routine.

Actually you can check the code for the schedule() function in kernel/sched.c. It is admirably well-written and should answer most of your question.
But bottom-line is that the Linux scheduler is invoked by calling schedule(), which does the job using the context of its caller. Thus there is no dedicated "scheduler" process. This would make things more difficult actually - if the scheduler was a process, it would also have to schedule itself!
When schedule() is invoked explicitly, it just switches the contexts of the caller thread A with the one of the selected runnable thread B such as it will return into B (by restoring register values and stack pointers, the return address of schedule() will become the one of B instead of A).

Here is an attempt at a simple description of what goes on during the dispatcher call:
The program that currently has context is running on the processor. Registers, program counter, flags, stack base, etc are all appropriate for this program; with the possible exception of an operating-system-native "reserved register" or some such, nothing about the program knows anything about the dispatcher.
The timed interrupt for dispatcher function is triggered. The only thing that happens at this point (in the vanilla architecture case) is that the program counter jumps immediately to whatever the PC address in the BIOS interrupt is listed as. This begins execution of the dispatcher's "dispatch" subroutine; everything else is left untouched, so the dispatcher sees the registers, stack, etc of the program that was previously executing.
The dispatcher (like all programs) has a set of instructions that operate on the current register set. These instructions are written in such a way that they know that the previously executing application has left all of its state behind. The first few instructions in the dispatcher will store this state in memory somewhere.
The dispatcher determines what the next program to have the cpu should be, takes all of its previously stored state and fills registers with it.
The dispatcher jumps to the appropriate PC counter as listed in the task that now has its full context established on the cpu.
To (over)simplify in summary; the dispatcher doesn't need registers, all it does is write the current cpu state to a predetermined memory location, load another processes' cpu state from a predetermined memory location, and jumps to where that process left off.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string