Understanding "A software combining tree barrier with optimized wakeup" algorithm - multithreading

I am reading this pseudo code for a barrier synchronization algorithm from this paper, and I could not fully understand it.
The goal of the code is to create a barrier for multiple threads (threads can't pass the barrier unless all threads have completed) using something called "software combining tree" (not sure what it means)
Here is the pseudo code (though I encourage you to look at the article as well)
type node = record
k : integer // fan-in of this node
count : integer // initialized to k
locksense : Boolean // initially false
parent : ^node // pointer to parent node; nil if root
shared nodes : array [0..P-1] of node
// each element of nodes allocated in a different memory module or cache line
processor private sense : Boolean := true
processor private mynode : ^node // my group's leaf in the combining tree
procedure combining_barrier
combining_barrier_aux (mynode) // join the barrier
sense := not sense // for next barrier
procedure combining_barrier_aux (nodepointer : ^node)
with nodepointer^ do
if fetch_and_decrement (&count) = 1 // last one to reach this node
if parent != nil
combining_barrier_aux (parent)
count := k // prepare for next barrier
locksense := not locksense // release waiting processors
repeat until locksense = sense
I understand that it implies building a binary tree but I didn't understand a few things.
Is P the number of threads?
What is k? What is "fan-in of this node"
The article mentions that threads are organized as groups on the leaves of the tree, what groups?
Is there exactly one node for each thread?
How do I get "my group's leaf in the combining tree"?

Is P the number of threads?
YES
What is k? What is "fan-in of this node".
This is called the ary-ness of the tree being constructed. This is a binary tree and hence k = 2. The idea is that for a smaller number of processors, the binary tree will suffice. But as the number of processors increases, then the levels in the tree will grow a lot. This is balanced by increasing the ary-ness by increasing the value of k. This will essentially enable more than two processors to be part of a leaf or a group. As the system scales to thousands of processors with interconnect, this may be important. The downside to this is the increased contention. As more processors become part of the same tree, they will be spinning on the same variable.
The article mentions that threads are organized as groups on the leaves of the tree, what groups?
The threads are organized into groups equal to number of leaves. The number of leaves is essentially the number of threads divided by the ary-ness or k. So for a set of 8 threads, with k=2, the number of leaves will be 4. This allows the threads in a group to spin on a single variable and also to distribute the spinning across multiple variables, rather than a single shared variable as in the basic centralized barrier algorithm.
Is there exactly one node for each thread?
The answer is NO. Of course there are at least as many nodes as the leaves. For a 8-thread problem, there will be at 4 nodes for the 4 leaves. Now, after this flat level, the "winners"(thread that comes last) will climb up to its parent in a recursive manner. The number of levels will be log P to base k. For each parent, there will be a node, eventually climbing up to the true root or the parent.
for e.g. for the 8 thread problem, there will be 4 + 2 + 1 = 7 nodes.
How do I get "my group's leaf in the combining tree"?
This is a little tricky part. There is a formula based on modulus and some integer division. However, I have not seen a publicly available implementation. sorry, I can't reveal what I have seen only in a class, as that may not be appropriate. May be a google search or someone else can fill in this.

Just a guess, not an real answer
The nodes probably constitute a complete binary tree with P/2 leaves where P is number of threads. Two threads are assigned to each leaf.
The threads at each leaf meet up, and decide which will be "active" and which will be "passive". The passive thread waits while the active thread combines both of their inputs and moves up to the parent node to meet up with the active thread from the sibling leaf. Then those two elect an active member, and the process repeats, moving up the tree until one thread is elected "active" at the root.
The active thread at the root does the final computation, producing a result, and then
Then the whole thing unwinds, and all of the passive threads are notified of the result, and released to do whatever it is they're going to do.

Related

What is thread synchronization and how does it differ form atomicity?

Atomicity can be achieved with machine level instructions such as compare and swap (CS).
It could also be achieved with the use of a mutex/lock for a large blocks of code with the OS providing help on it.
On the other hand we also have the concept of memory model. Some machines could have a relaxed model like Arm which could re-order load/stores on a single thread, and some have a more strict model like x86.
I want to confirm my understanding of the term synchronization. Is it pretty much the
promise of both atomicity and the memory model? i.e only using atomic ops on a thread doesn't necessary make it synchronized with other threads?
Something atomic is indivisible. Things that are synchronized are happening together in time.
Atomicity
I like to think of this like having a data structure representing a 2-dimensional point with x, y coordinates. For my purposes, in order for my data to be considered "valid" it must always be a point along the x = y line. x and y must always be the same.
Suppose that initially I have a point { x = 10, y = 10 } and I want to update my data structure so that it represents the point {x = 20, y = 20}. And suppose that the implementation of the update operation is basically these two separate steps:
x = 20
y = 20
If my implementation writes x and y separately like that, then some other thread could potentially observe my point data structure data after step 1 but before step 2. If it is allowed to read the value of the point after I change x but before I change y then that other observer might observe the value {x = 20, y = 10}.
In fact there are three values that could be observed
{x = 10, y = 10} (the original value) [VALID]
{x = 20, y = 10} (x is modified but y is not yet modified) [INVALID x != y]
{x = 20, y = 20} (both x and y are modified) [VALID]
I need a way of updating the two values together so that it is impossible for an outside observer observe {x = 20, y = 10}.
I don't really care when the other observer looks at the value of my point. It is fine it it observes { x = 10, y = 10 } and it is also fine if it observes { x = 20, y = 20 }. Both have the property of x == y, which makes them valid in my scenario.
Simplest atomic operation
The most simple atomic operation is a test and set of a single bit. This operation atomically reads a value of a bit and overwrites it with a 1, returning the state of the bit we overwrote. But we are offered the guarantee that if our operation has concluded then we have the value that we overwrote and any other observer will observe a 1. If many agents attempt this operation simultaneously, only one agent will return 0, and the others will all return 1. Even if it's two CPU's writing on the exact same clock tick, something in the electronics will guarantee that the operation is concluded logically atomically according to our rules.
That's it to logical atomicity. That's all atomic means. It means you have the capability of performing an uninterrupted update with valid data before and after the update and the data cannot be observed by another observer in any intermediate state it may take on during the update. It may be a single bit or it may be an entire database.
x86 Example
A good example of something that can be done on x86 atomically is the 32-bit interlocked increment.
Here a 32-bit (4-byte) value must be incremented by 1. This could potentially need to modify all 4 bytes for this to work correctly. If the value is to be modified from 0x000000FF to 0x00000100, it's important that the 0x00 becomes a 0x00 and the 0xFF becomes a 0x00 atomically. Otherwise I risk observing the value 0x00000000 (if the LSB is modified first) or 0x000001FF (if the MSB is modified first).
The hardware guarantees that we can test and modify 4 bytes at a time to achieve this. The CPU and memory provide a mechanism by which this operation can be performed even if there are other CPUs sharing the same memory. The CPU can assert a lock condition that prevents other CPUs from interfering with this interlocked operation.
Synchronization
Synchronization just talks about how things happen together in time. In the context you propose, it's about the order in which various sections of our program get executed and the order in which various components of our system change state. Without synchronization, we risk corruption (entering an invalid, semantically meaningless or incorrect state of execution of our program or its data)
Let's say we want to have an interlocked increment of a 64-bit number. Let's suppose that the hardware does not offer a way to atomically change 64-bits at a time. We will have to accomplish what we want with more complex data structure that means that even when just reading we can't simply read the most-significant 32 bits and the least-significant 32 bits of our 64-bit number separately. We'd risk observing one part of our 64-bit value changing separately from the other half. It means that we must adhere to some kind of protocol when reading (or writing) this 64-bit value.
To implement this, we need an atomic test and set bit operation and a clear bit operation. (FYI, technically, what we need are two operations commonly referred to as P and V in computer science, but let's keep it simple.) Before reading or writing our data, we perform an atomic test-and-set operation on a single (shared) bit (commonly referred to as a "lock"). If we read a zero, then we know we are the only one that saw a zero and everyone else must have seen a 1. If we see a 1, then we assume someone else is using our shared data, and therefore we have no choice but to just try again. So we loop and keep testing and setting the bit until we observe it as a 0. (This is called a spin lock, and is the best we can do without getting help from the operating system's scheduler.)
When we eventually see a 0, then we can safely read both 32-bit parts of our 64-bit value individually. Or, if we're writing, we can safely write both 32-bit parts of our 64-bit value individually. Once both halves have been read or written, we clear the bit back to 0, permitting access by someone else.
Any such combination of cleverness and use of atomic operations to avoid corruption in this manner constitutes synchronization because we are governing the order in which certain sections of our program can run. And we can achieve synchronization of any complexity and of any amount of data so long as we have access to some kind of atomic data.
Once we have created a program that uses a lock to share a data structure in a conflict-free way, we could also refer to that data structure as being logically atomic. C++ provides a std::atomic to achieve this, for example.
Remember that synchronization at this level (with a lock) is achieved by adhering to a protocol (protecting your data with a lock). Other forms of synchronization, such as what happens when two CPUs try to access the same memory on the same clock tick, are resolved in hardware by the CPUs and the motherboard, memory, controllers, etc. But fundamentally something similar is happening, just at the motherboard level.

Difference between mutexes and memory coherence?

I know about memory coherence protocols for multi-core architectures. MSI for example allows at most one core to hold a cache line in M state with both read and write access enabled. S state allows multiple sharers of the same line to only read the data. I state allows no access to the currently acquired cache line. MESI extends that by adding an E state which allows only one sharer to read, allowing an easier transition to M state if there are no other sharers.
from what I wrote above, I understand that when we write this line of code as part of multi-threaded (pthreads) program:
// temp_sum is a thread local variable
// sum is a global shared variable
sum = sum + temp_sum;
It should allow one thread to access sum in M state invalidating all other sharers, then when another thread reaches the same line it will request M invalidating again the current sharers and so on. But in fact this doesn't happen unless I add a mutex:
pthread_mutex_lock(&locksum);
// temp_sum is a thread local variable
// sum is a global shared variable
sum = sum + temp_sum;
pthread_mutex_unlock(&locksum);
This is the only way to have this work correctly. Now why do we have to supply these mutexes? why isn't this handled by memory coherence directly? why do we need mutexes or atomic instructions?
Your line of code sum = sum + temp_sum; although it may seem trivially simple in C, it is not an atomic operation. It loads the value of sum from memory into a register, performs arithmetic on it (adding the value of temp_sum), then writes the result back to memory (wherever sum is stored).
Even though only one thread can read or write sum from memory at a time, there is still an opportunity for a synchronization problem. A second thread could modify sum in memory while the first is manipulating the value in a register. Then the first thread will write what it thinks is the updated value (the result of arithmetic) back to memory, overwriting whatever the second put there. It is this transitional location in a register that introduces the issue. There is more to the notion of "the value of a variable" than whatever currently resides in memory.
For example, suppose sum is initially 4. Two threads want to add 1 to it. The first thread loads the 4 from memory into a register, and adds 1 to make 5. But before this first thread can store the result back to memory, a second thread loads the 4, adds 1, and writes a 5 back to memory. The first thread then continues and stores its result (5) back to the same memory location. Both threads are convinced that they have done their duty and correctly updated the sum. The problem is that sum is 5 and not 6 as it should be.
The mutex ensures that only one thread will load, modify, and store sum at a time. Any second thread will have to wait (be blocked) until the first has finished.

Parallel processing - Connected Data

Problem
Summary: Parallely apply a function F to each element of an array where F is NOT thread safe.
I have a set of elements E to process, lets say a queue of them.
I want to process all these elements in parallel using the same function f( E ).
Now, ideally I could call a map based parallel pattern, but the problem has the following constraints.
Each element contains a pair of 2 objects.( E = (A,B) )
Two elements may share an object. ( E1 = (A1,B1); E2 = (A1, B2) )
The function f cannot process two elements that share an object. so E1 and E2 cannot be processing in parallel.
What is the right way of doing this?
My thoughts are like so,
trivial thought: Keep a set of active As and Bs, and start processing an Element only when no other thread is already using A OR B.
So, when you give the element to a thread you add the As and Bs to the active set.
Pick the first element, if its elements are not in the active set spawn a new thread , otherwise push it to the back of the queue of elements.
Do this till the queue is empty.
Will this cause a deadlock ? Ideally when a processing is over some elements will become available right?
2.-The other thought is to make a graph of these connected objects.
Each node represents an object (A / B) . Each element is an edge connecting A & B, and then somehow process the data such that we know the elements are never overlapping.
Questions
How can we achieve this best?
Is there a standard pattern to do this ?
Is there a problem with these approaches?
Not necessary, but if you could tell the TBB methods to use, that'll be great.
The "best" approach depends on a lot of factors here:
How many elements "E" do you have and how much work is needed for f(E). --> Check if it's really worth it to work the elements in parallel (if you need a lot of locking and don't have much work to do, you'll probably slow down the process by working in parallel)
Is there any possibility to change the design that can make f(E) multi-threading safe?
How many elements "A" and "B" are there? Is there any logic to which elements "E" share specific versions of A and B? --> If you can sort the elements E into separate lists where each A and B only appears in a single list, then you can process these lists parallel without any further locking.
If there are many different A's and B's and you don't share too many of them, you may want to do a trivial approach where you just lock each "A" and "B" when entering and wait until you get the lock.
Whenever you do "lock and wait" with multiple locks it's very important that you always take the locks in the same order (e.g. always A first and B second) because otherwise you may run into deadlocks. This locking order needs to be observed everywhere (a single place in the whole application that uses a different order can cause a deadlock)
Edit: Also if you do "try lock" you need to ensure that the order is always the same. Otherwise you can cause a lifelock:
thread 1 locks A
thread 2 locks B
thread 1 tries to lock B and fails
thread 2 tries to lock A and fails
thread 1 releases lock A
thread 2 releases lock B
Goto 1 and repeat...
Chances that this actually happens "endless" are relatively slim, but it should be avoided anyway
Edit 2: principally I guess I'd just split E(Ax, Bx) into different lists based on Ax (e.g one list for all E's that share the same A). Then process these lists in parallel with locking of "B" (there you can still "TryLock" and continue if the required B is already used.

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
DWORD usage;
bool delete;
// here add your data
_data() { usage=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when item is started to be used somwhere then
// start use of data[i]
data[i].cnt++;
after is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

can i easily write a program to make use of Intel's Quad core or i7 chip if only 1 thread is used?

I wonder if in my program I have only 1 thread, can I write it so that the Quad core or i7 can actually make use of the different cores? Usually when i write programs on a Quad core computer, the CPU usage will only go to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (the programs i wrote usually is Ruby, Python, or PHP, so they may not be so much optimized).
Update: what if i write it in C or C++ instead, and
for (i = 0; i < 100000000; i++) {
a = i * 2;
b = i + 1;
if (a == ... || b == ...) { ... }
}
and then use the highest level of optimization with the compiler. can the compiler make the multiplication happen on one core, and the addition happen on a different core, and therefore make 2 cores work at the same time? isn't that a fairly easy optimization to use 2 cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPU's (be they real or virtual)... execution of one thread is inherently bound to one CPU as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, then there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, but all other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows is the CPU utilization by all processes running at the time, not only for one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
b = i + 1;
}
However, what becomes tricky is if there needs to be a time when the results from the two separate threads need to be evaluated, such as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
// manipulate "a" and "b"
if (a == ... || b == ...) { ... }
}
This would require that the a and b values which reside in separate threads (which are executing on separate processors) to be looked up, which is a serious headache.
There is no real good guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition probably will take different amount of times to execute), and that means that one thread may need to wait for another for the i values to get in sync before comparing the a and b that corresponds to the dependent value i. Or, do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing states between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort, however, as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
Couple of rules of thumbs when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculates a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as they are no states to share or handle between calculations, even if there were multiple values of x that needed to be calculated, even those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
runInThread(f(t))
foreach t in x:
runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be very good thing for concurrent programming.
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to, why can't compilers figure out parts that can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but there is an article on automatic parallization at Wikipedia which may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier, otherwise the processor cores will execute all code parallelly, regardless of compiler had done what kind of optimization. That only requires that the compiler is not a very "stupid" one. It means that the hardware has the capability itself, not software. So threaded programming or OpenMP is not necessary in such cases though they will help on improving parallel computing. Note here doesn't mean Hyper-threading, just normal multi-core processor functionalities.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classical example which could be executed by multi-core/multi-channel IMC platforms (e.g. Intel Nehalem family such as Core i7) parallelly, no extra software optimization would be needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
*(buffer + i) = *(buffer0 + i);
*(buffer + 64 + i) = *(buffer1 + i);
*(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1 Core i7 has a triple-channel IMC, its bus width is 192 bits, 64 bits per channel; and memory address space is interleaved among the channels on a per cache-line basis. cache-line length is 64 bytes. so basicly buffer0 is on channel 0, buffer1 will be on channel and buffer2 on channel 2; while for buffer[192], it was interleaved among 3 channels evently, 64 per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's multi-channel MC burst w/ maximum throughput. While in my following description, I'll only say 64 bytes per channel, say w/ BL x8 (Burst Length 8, 8 x 8 = 64 bytes = cache-line) per channel.
2 buffer0..2 and buffer are continuous in the memory space (on a specific page both virtually and physically, stack memroy). when run, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache-lines in total. so after start the execution of above "for(){}" code, accessing memory is not necessary at all because all data are in the cache, L3 cache, a non-core part, which is shared by all cores. We'll not talk about L1/2 here. In this case every core could pick the data up and then compute them independently, the only requirement is that the OS supports MP and stealing task is allowed, say runtime scheduling and affinities sharing.
3 there're no any dependencies among buffer0, 1, 2 and buffer, so there're no execution stall or barriers. e.g. execute *(buffer + 64 + i) = *(buffer1 + i) doesn't need to wait the execution of *(buffer + i) = *(buffer0 + i) for done.
Though, the most important and difficult point is "stealing task, runtime scheduling and affinities sharing", that's because for a give task, there's only one task exection context and it should be shared by all cores to perform parallel execution. Anyone if could understand this point, s/he is among the top experts in the world. I'm looking for such an expert to cowork on my open source project and be responsible for parallel computing and latest HPC architectures related works.
Note in above example code, you also could use some SIMD instructions such as movntdq/a which will bypass processor cache and write memory directly. It's a very good idea too when perform software level optimization, though accessing memory is extremely expensive, for example, accessing cache (L1) may need just only 1 cycle, but accessing memory needs 142 cycles on former x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time - according to some rules to balance the load etc. So you will see only 25% usage overall and the all four cores working - but only one at once.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-2499999, the next 2500000-4999999, and so on. Set all four of them off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
...
}
to say that this for will run in parallel.
This is one easy way to parallelize something, but at some time you will have to understand how parallel programs execute and will be exposed to parallel programming bugs.
If you want to parallel the choice of the "i"s that evaluate to "true" your statement if (a == ... || b == ...) then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
where (i % 2 == 1 && i >= 10) //your condition
select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
//not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
if (i > 10) //your condition here
DoWork(i); //Thread-safe operation
});
Since you are talking about 'task manager', you appear to be running on Windows. However, if you are running a webserver on there (for Ruby or PHP with fcgi or Apache pre-forking, ant to a lesser extent other Apache workers), with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then, no, no significant advantage will come from that - you're only ruinning one thing at a time, other than OS-driven background processes.

Resources