How is a memory barrier in the Linux kernel used?

There is an illustration in the kernel source, Documentation/memory-barriers.txt, like this:
CPU 1                           CPU 2
=======================         =======================
        { B = 7; X = 9; Y = 8; C = &Y }
STORE A = 1
STORE B = 2
<write barrier>
STORE C = &B                    LOAD X
STORE D = 4                     LOAD C (gets &B)
                                LOAD *C (reads B)
Without intervention, CPU 2 may perceive the events on CPU 1 in some
effectively random order, despite the write barrier issued by CPU 1:
+-------+ : : : :
| | +------+ +-------+ | Sequence of update
| |------>| B=2 |----- --->| Y->8 | | of perception on
| | : +------+ \ +-------+ | CPU 2
| CPU 1 | : | A=1 | \ --->| C->&Y | V
| | +------+ | +-------+
| | wwwwwwwwwwwwwwww | : :
| | +------+ | : :
| | : | C=&B |--- | : : +-------+
| | : +------+ \ | +-------+ | |
| |------>| D=4 | ----------->| C->&B |------>| |
| | +------+ | +-------+ | |
+-------+ : : | : : | |
| : : | |
| : : | CPU 2 |
| +-------+ | |
Apparently incorrect ---> | | B->7 |------>| |
perception of B (!) | +-------+ | |
| : : | |
| +-------+ | |
The load of X holds ---> \ | X->9 |------>| |
up the maintenance \ +-------+ | |
of coherence of B ----->| B->2 | +-------+
+-------+
: :
I don't understand. Since we have a write barrier, any store must take effect by the time C = &B is executed, which means B would already equal 2. For CPU 2, B should have been 2 when it got the value of C, which is &B, so why would it perceive B as 7? I am really confused.

The key missing point is the mistaken assumption that for the sequence:
LOAD C (gets &B)
LOAD *C (reads B)
the first load has to precede the second load. A weakly ordered architecture can act "as if" the following happened:
LOAD B (reads B)
LOAD C (reads &B)
if( C!=&B )
LOAD *C
else
Congratulate self on having already loaded *C
The speculative "LOAD B" can happen, for example, because B was on the same cache line as some other variable of earlier interest, or because hardware prefetching grabbed it.

From the section of the document titled "WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?":
There is no guarantee that any of the memory accesses specified before a
memory barrier will be complete by the completion of a memory barrier
instruction; the barrier can be considered to draw a line in that CPU's
access queue that accesses of the appropriate type may not cross.
and
There is no guarantee that a CPU will see the correct order of effects
from a second CPU's accesses, even if the second CPU uses a memory
barrier, unless the first CPU also uses a matching memory barrier (see
the subsection on "SMP Barrier Pairing").
What memory barriers do (in a very simplified way, of course) is make sure that neither the compiler nor the in-CPU hardware performs any clever attempts at reordering load (or store) operations across the barrier, and that the CPU correctly perceives changes to the memory made by other parts of the system. This is necessary when the loads (or stores) carry additional meaning, like locking a lock before accessing whatever it is we're locking. In this case, letting the compiler/CPU make the accesses more efficient by reordering them is hazardous to the correct operation of our program.
When reading this document we need to keep two things in mind:
That a load means transmitting a value from memory (or cache) to a CPU register.
That unless the CPUs share the cache (or have no cache at all), it is possible for their cache systems to be momentarily out of sync.
Fact #2 is one of the reasons why one CPU can perceive the data differently from another. Cache systems are designed to provide good performance and coherence in the general case, but they might need some help in specific cases like the ones illustrated in the document.
In general, as the document suggests, barriers in systems involving more than one CPU should be paired to force the system to synchronize the perception of both (or all participating) CPUs. Picture a situation in which one CPU completes loads or stores and the main memory is updated, but the new data has yet to be transmitted to the second CPU's cache, resulting in a lack of coherence across both CPUs.
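As a sketch of this pairing idea, here is a userspace analogue written with C11 atomics rather than the kernel's smp_wmb()/smp_rmb() (the function names are mine, and release/acquire is a stronger, portable stand-in for the barrier pair):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Shared data, mirroring the document's example: the writer publishes
   B and then C; the reader must not see the new C with the old B. */
static int B = 7;
static _Atomic(int *) C = NULL;

/* CPU 1 side: STORE B = 2; <write barrier>; STORE C = &B.
   The release store plays the role of the write barrier. */
void writer(void) {
    B = 2;
    atomic_store_explicit(&C, &B, memory_order_release);
}

/* CPU 2 side: the acquire load pairs with the release store, so a
   reader that observes C == &B is guaranteed to also observe B == 2. */
int reader(void) {
    int *p = atomic_load_explicit(&C, memory_order_acquire);
    return p ? *p : -1;   /* -1: pointer not yet published */
}
```

Without the acquire on the reader side, the pairing is broken and the "B->7" outcome from the diagram becomes possible again on weakly ordered hardware.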
I hope this helps. I'd suggest reading memory-barriers.txt again with this in mind and particularly the section titled "THE EFFECTS OF THE CPU CACHE".


Optimally traverse a DAG with weighted vertices in parallel

There is a graph where vertices represent pieces of code and edges represent dependencies between them. Additionally, each vertex has two numbers: how many threads the corresponding piece of code can use (1, 2, ..., or "as many as there are cores"), and how much time it is estimated to take if it gets that many threads (compared to others - for example, 1, 0.1 or 10). The idea is to run the pieces of code minding their dependencies in parallel, giving them such numbers of threads that the total execution time is the smallest.
Is there some existing algorithm which would do that or which I could use as a base?
So far I was thinking as follows. For example, we have 8 threads total (so NT = 8T) and the following graph.
+----------------+ +----------------+
+-+ A: 0.2x, 1T +----+ | F: 0.1x, 1T |
| +---+------------+ | +---+------------+
| | | |
| +---v------------+ | +---v------------+
| | B: 0.1x, 2T +-+ | | G: 0.3x, NT +-+
| +----------------+ | | +----------------+ |
| | | |
| +----------------+ | | +----------------+ |
+-> C: 0.4x, 1T | | +----> H: 0.1x, 1T | |
+--+-------------+ | +--+-------------+ |
+----+ | | |
| +----------------+ | +--v-------------+ |
| | D: 0.1x, 1T <-+ | J: 1.5x, 4T <-+
| +--+-------------+ +-------+--------+
| | |
| +--v-------------+ |
+-> E: 1.0x, 4T +------------+ |
+----------------+ | |
+--v----v--------+
+ I: 0.01x, 1T |
+----------------+
At task I we have 2 dependencies, E and J. As J dependencies, we have F-G and A-H. For E, A-C and A-B-D. To get to J, we need 0.3x on A-H and 0.4x on F-G, but G needs many threads for that. We could first run A and F in parallel (each with a single thread). Then we would run G with 7 threads and as A finishes, H with 1 thread. However there's also the E branch. Ideally, we would like it to be ready 0.5 later than J. In this case, it's quite easy because the longest path to E when we have already processed A takes 0.4 using one thread, and the other path takes less than that and uses just 2 threads - so we can run these calculations when J is running. But if, say, D took 0.6x, we would probably need to run it in parallel with G as well.
So I think I could start with the sink vertex and balance the weights of subgraphs on which it depends. But given these "N-thread" tasks, it's not particularly clear how. And considering that the x-numbers are just estimates, it would be good if it could make adjustments if particular tasks took more or less time than anticipated.
You can model this problem as a job shop scheduling problem (flexible job shop problem in particular, where the machines are processors, and the jobs are slices of programs to be run).
First, you have to modify your DAG a bit, in order to transform it into another DAG, which is the disjunctive graph representing your problem.
This transformation is very simple. For any node i, t, nb_t representing job i, which needs t seconds to be performed with 1 thread and can be parallelized across nb_t threads, do the following:
Replace i, t, nb_t by nb_t vertices i_1, t/nb_t, ..., i_(nb_t), t/nb_t. For each incoming/outgoing edge of the node i, create an incoming/outgoing edge from/to all the newly created nodes. Basically, we just split each job that can be parallelized into smaller jobs that can be handled by several processors (machines) simultaneously.
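As a concrete illustration of the splitting step, here is a small C sketch (the struct job type and field names are my own invention, not from any scheduling library):

```c
#include <stdio.h>

/* Hypothetical job record: t is the time with one thread,
   nb_t the number of threads the job can use. */
struct job { const char *name; double t; int nb_t; };

/* Duration of each of the nb_t sub-jobs after the split. */
double sub_duration(const struct job *j) { return j->t / j->nb_t; }

/* The transformation itself: replace job i by nb_t sub-jobs
   i_1 .. i_nb_t, each taking t/nb_t. */
void split_job(const struct job *j) {
    for (int k = 1; k <= j->nb_t; k++)
        printf("%s_%d: %.2f\n", j->name, k, sub_duration(j));
}
```

For the node "E: 1.0x, 4T" from the example graph, this yields four sub-jobs E_1..E_4 of 0.25 each, which the scheduler may then place on different processors.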
You then have your disjunctive graph, which is the input to the job shop problem.
Then all you need to do is solve this well-known problem; there are different options available.
I would advise using a MILP solver, but from the small search I just did, it seems like many meta-heuristics can also tackle the problem (simulated annealing, genetic programming, ...).

Flipped switch statements?

Suppose you have 10 boolean variables of which only one can be true at a time, and each time any one is 'switched on', all others must be 'turned off'. One of the problems that immediately arises is:
How can you quickly test which variable is true without necessarily
having to linearly check all the variable states each time?
For this, I was thinking if it was possible to have something like:
switch(true)
{
case boolean1:
//do stuff
...
//other variables
}
This looks like a bad way of testing for 10 different states of an object, but I think there are cases where this kind of feature may prove useful, and I would like to know if there's any programming language that supports it.
There isn't a language feature that offers this behavior. But as an alternative, you could use the Command Pattern, in conjunction with a Priority Queue. This assumes that you would be able to prioritize what checks should be done.
Traditionally, when you have such radio button boolean values you use an integer to represent them:
+------------+---------+--------------------+
| BINARY | DECIMAL | BINARY-LOGARITHMIC |
+------------+---------+--------------------+
| 0000000001 | 1 | 0 |
| 0000000010 | 2 | 1 |
| 0000000100 | 4 | 2 |
| 0000001000 | 8 | 3 |
| 0000010000 | 16 | 4 |
| 0000100000 | 32 | 5 |
| 0001000000 | 64 | 6 |
| 0010000000 | 128 | 7 |
| 0100000000 | 256 | 8 |
| 1000000000 | 512 | 9 |
+------------+---------+--------------------+
Let's call the variable holding the binary-logarithmic value flag. We can quickly jump to some code based on the flag by indexing a random access array of functions:
var functions = [ function0
, function1
, function2
, function3
, function4
, function5
, function6
, function7
, function8
, function9
];
functions[flag](); // quick jump
However, you will have to pay for the function call overhead.
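If the variable instead holds the power-of-two value from the DECIMAL column, it can be converted to the 0-9 index in one step; a sketch in C using a GCC/Clang builtin (the function name is mine):

```c
/* Map a one-hot flag (1, 2, 4, ..., 512) to its array index (0..9).
   __builtin_ctz counts trailing zero bits and typically compiles to a
   single instruction; note it is undefined for flag == 0. */
static int flag_index(unsigned flag) {
    return __builtin_ctz(flag);
}
```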

Manage multiple signal speed in a Gnu-Radio flow graph

I am currently working on Z-Wave protocol.
With my HackRF One and scapy-radio I try to sniff the communications between two devices.
However devices can transmit at different speeds :
9.6 kbps
40 kbps
100 kbps
As I can only decode communications at 40 kbps, I imagine my graph is unable to manage other speeds.
Some information about Z-Wave communications:
Frequency (EU): 868.4 MHz
Modulation: GFSK
And my GRC graph :
So my question is: how do I modify the graph to decode and sniff the 9.6 and 100 kbps signals too?
As an easy workaround, I would suggest taking the input stream from the HackRF and connecting it to 3 different decoders, each one with the desired parameters. Then each Packet sink block will publish messages to the same Socket PDU block.
I am not familiar with Z-Wave, but if the 3 different data rates share the same spectrum bandwidth, then there is no more work for you and you are done.
But if they do not, which I believe is true in your case, you need some extra steps.
First of all, you have to sample the time domain signal at the maximum sampling rate required by Z-Wave. For example, if for the 3 different data rates the spectrum bandwidths are 4, 2 and 1 MHz, you have to sample at 4e6 samples/s. Then you perform SRC (sample rate conversion), also known as re-sampling, for each of the different streams. So for the second rate you may want to re-sample your input stream from 4e6 samples/s to 2e6 samples/s.
Then you connect the re-sampled streams to the corresponding decoding procedures:
+---------------+
|Rest blocks 0 |
+---------------------------------> |
| | |
| +---------------+
|
+------------+ +--------------+ +---------------+
| | | | |Rest blocks 1 |
| Source +----------> Resampler 1+-------------> |
| | | | | |
+------------+ +--------------+ +---------------+
|
| +--------------+ +---------------+
| | | |Rest blocks 2 |
+-----> Resampler 2+--------------> |
| | | |
+--------------+ +---------------+
GNU Radio already ships with some resamplers, you can start using the Rational Resampler block.
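The interpolation/decimation pair for each Rational Resampler block is just the rate ratio reduced to lowest terms; a small sketch of that arithmetic (the function names are mine):

```c
/* Greatest common divisor, used to reduce the rate ratio. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Smallest interpolation/decimation pair that converts in_rate
   samples/s to out_rate samples/s. */
void resampler_ratio(unsigned in_rate, unsigned out_rate,
                     unsigned *interp, unsigned *decim) {
    unsigned g = gcd(in_rate, out_rate);
    *interp = out_rate / g;   /* e.g. 4e6 -> 2e6 reduces to 1/2 */
    *decim  = in_rate / g;
}
```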

Game Boy: What is the purpose of instructions that don't modify anything (e.g. AND A)?

I've been working on a Game Boy emulator, and I've noticed that there are certain opcodes that would never change any values, such as LD A, A, LD B, B, etc., and also AND A. The first ones obviously don't change anything, as they load the value of a register into the same register, and since the AND is performed with the A register itself, AND A will always return A. Is there any purpose for these operations, or are they essentially the same as NOP?
As Jeffrey Bosboom and Hans Passant pointed out in their comments, the reason is simplicity. More specifically, hardware simplicity.
LD r,r' instructions copy the content of the source register (r') to the destination register (r). LD r,r' opcodes follow this form:
----------------------------------------
BIT    | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
----------------------------------------
OPCODE | 0 | 1 |     r     |    r'     |
----------------------------------------
Destination and source registers can assume these values:
-----------
| BIT | REG |
-----------
| 111 | A |
-----------
| 000 | B |
-----------
| 001 | C |
-----------
| 010 | D |
-----------
| 011 | E |
-----------
| 100 | H |
-----------
| 101 | L |
-----------
In order to implement these instructions in hardware we just need a multiplexer that receives bits 0-2 to select the source register and another multiplexer that receives bits 3-5 to select the destination register.
If you want to verify whether bits 0-2 and bits 3-5 are pointing to the same register, you have to add more logic to the CPU. And as we all know, resources were more limited in the 80s :P
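To make the field layout concrete, here is a small C sketch extracting the two register fields from an LD r,r' opcode and checking whether they match (function names are mine):

```c
/* Extract the destination (bits 3-5) and source (bits 0-2) fields of
   a 01dddsss LD r,r' opcode. */
static unsigned ld_dst(unsigned char op) { return (op >> 3) & 7u; }
static unsigned ld_src(unsigned char op) { return op & 7u; }

/* True when the instruction copies a register onto itself,
   i.e. behaves like a NOP. */
static int ld_is_nop(unsigned char op) { return ld_dst(op) == ld_src(op); }
```

For example, LD A,A encodes as 01 111 111 = 0x7F, where both fields are 111 (A), so it copies A onto itself.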
Please note that loading instructions such as LD A,A, LD B,B, LD C,C, LD D,D, LD E,E, LD H,H, and LD L,L behave like NOP. However, AND A and OR A do NOT behave like NOP, since they affect the flag register, and their execution might change the internal machine state.
Instructions like AND A may appear to be NOPs, but they can also change the processor flags and be used for testing the value of a register.
Be sure to check the instruction set documentation carefully for such side effects.
There is actually a purpose to the AND A (as well as OR A) instruction: it sets the Z flag when A is zero and clears it otherwise. So both AND A and OR A are frequently used for this purpose.
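A minimal emulator-style sketch of that flag behaviour (simplified: only the Z flag is shown; on the real hardware AND also sets H and clears N and C):

```c
#define FLAG_Z 0x80u   /* zero flag bit of the Game Boy F register */

/* Sketch of AND A: the value of A is unchanged, but the Z flag is
   recomputed from it, which is the whole point of the instruction. */
unsigned char and_a(unsigned char a, unsigned char *f) {
    unsigned char r = a & a;       /* r == a, value unchanged */
    *f = (r == 0) ? FLAG_Z : 0;    /* Z set iff A == 0 */
    return r;
}
```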

How is a stack belonging to a thread different from a stack of a process

Can anybody please tell me what the difference is between the two types of stacks?
If I look at /proc/<pid>/maps and /proc/<pid>/task/<tid>/maps I see the same map. Is there a way we can see the stack belonging to a thread exclusively (I mean, not the stack of the main process thread), or is there any gdb command to find out a thread-specific stack?
Is there a way we can see the stack belonging to thread exclusively
There is no such thing: all the threads share the entire address space, so the stack doesn't "belong exclusively" to any given thread. In fact, you can take an address of a local variable, and pass that address to a different thread, which can then read or write values to it.
What I believe you are asking is "how to tell which memory region in /proc/<pid>/maps is thread X currently using as its stack?". If that's the question, you can print $sp to find out current stack pointer for the thread you are interested in, and then find a region in /proc/<pid>/maps that overlaps $sp.
You can list all threads using info threads and switch to a specific thread using thread <id>.
You can type thread apply all info registers to print the current registers of all threads, or, for instance, thread apply all bt to print backtraces for all threads.
@EmployedRussian
There is no such thing: all the threads share the entire address space, so the stack
doesn't "belong exclusively" to any given thread. In fact, you can take an address of a
local variable, and pass that address to a different thread, which can then read or write
values to it.
What I believe you are asking is "how to tell which memory region in /proc/<pid>/maps is
thread X currently using as its stack?". If that's the question, you can print $sp to
find out current stack pointer for the thread you are interested in, and then find a
region in /proc/<pid>/maps that overlaps $sp.
Right, they share the entire address space, and it's also true that the threads have stacks of their own, but this still does not explain how the stack of one thread differs from that of another thread or of the main process thread. I mean, is this the way we can visualize it:
+--------+ stack vma start
| +--+ |
| +--+ <------- stack of process
| +--+ |
| +--+ |
| : : |
| |
| |
| +--+ |
| +--+ <------- stack of thread1
| +--+ |
| +--+ |
| : : |
| |
| |
| +--+ |
| +--+ |
| +--+ <------ stack of thread2
| +--+ |
| : : |
: :
: :
+--------+ stack vma end
(maybe I am completely wrong about this, but it is just an attempt to clarify things)
Regarding passing an address (of a local variable): when you pass an address, you can read or write to that memory location; that's an inherent property of pointers.
Just for the sake of completeness, I am posting here whatever I could understand.
The diagram posted above is wrong and should be modified this way:
Process address Space:
+----------------------------------------------------+
| |
: :
: :
| |
| +--------+ thread2 stack vma start |
| | +--+ | |
| | +--+ | |
| | +--+ | |
| | +--+ | | stack grows downwards |
| | : : | | |
| : : V |
| : : |
| +--------+ thread2 stack vma ends |
| |
| |
| +--------+ thread1 stack vma start |
| | +--+ | |
| | +--+ | |
| | +--+ | |
| | +--+ | | stack grows downwards |
| | : : | | |
| : : V |
| : : |
| +--------+ thread1 stack vma ends |
| |
| |
| +--------+ Process stack vma start |
| | +--+ | |
| | +--+ | |
| | +--+ | |
| | +--+ | | stack grows downwards |
| | : : | | |
| : : V |
: : : :
: +--------+ Process stack vma ends :
: :
+----------------------------------------------------+
The threads get their separate stacks from mmap'd memory. Here I am talking about the POSIX threads implementation (NPTL) in glibc. For better reference, consult the function allocate_stack() in
nptl in glibc.
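You can also ask glibc directly where a thread's stack was placed; a sketch using the glibc-specific pthread_getattr_np extension (function names beyond the pthread API are mine):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

/* glibc-specific: query the stack base and size that NPTL's
   allocate_stack() handed to the calling thread. */
static void *show_stack(void *unused) {
    pthread_attr_t attr;
    void *addr;
    size_t size;
    (void)unused;
    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        pthread_attr_getstack(&attr, &addr, &size);
        printf("stack base %p, size %zu bytes\n", addr, size);
        pthread_attr_destroy(&attr);
    }
    return NULL;
}

/* Spawn a thread and report its stack; returns 0 on success. */
int report_thread_stack(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, show_stack, NULL) != 0)
        return -1;
    return pthread_join(t, NULL);
}
```

The reported region should match one of the anonymous mappings visible in /proc/<pid>/maps.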
