Hybrid parallelization: is it possible to have only the 'my_id == 0' process execute OpenMP calls?

I'm very new to hybrid parallel coding, so I'm wondering whether this kind of concept is possible, and whether it will hurt the efficiency of the parallelization.
Let's say that I need routine A and routine B. A is quite difficult to parallelize with MPI, while B is relatively straightforward to parallelize with MPI. Since I want this code to be scalable to some extent, I'm going to exploit as much MPI parallelization as I can.
I understand the concepts of thread and process only very roughly; I suppose the total number of threads to be n_threads x n_process.
program Hybrid
use MPI
use OMP_LIB
call MPI_INIT ( ierr )
* call MPI_COMM_SIZE ( MPI_COMM_WORLD, n_process, ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
...
* call omp_set_num_threads ( n_threads )
...
call MPI_FINALIZE ( ierr )
end program
So in the above example, the total number of threads becomes n_threads x n_process, as I understand it (I'm not sure whether I'm using the term 'total threads' properly, though). The asterisks (*) are just there to make it easy to find n_threads and n_process.
My serial version of the code looks like this:
program Serial
do i = 1, time_steps
call A
call B
enddo
end program
Both A and B need a global view of the array.
The MPI-parallelized B routine, B_MPI, starts with an MPI_ScatterV to distribute the global information to the other processes and ends with an MPI_GatherV to recover the global view; only the 'my_id == 0' process holds this global view.
While I want A to be parallelized with OpenMP, I don't want to activate too many threads, so I want only the 'my_id == 0' process to make OpenMP calls and fork OpenMP threads, like below.
program
use MPI
use OMP_LIB
call MPI_INIT ( ierr )
* call MPI_COMM_SIZE ( MPI_COMM_WORLD, n_process, ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
...
do i = 1, time_steps
if (my_id == 0) then
* call A_OMP ! 'call omp_set_num_threads ( n_threads ) ' inside the 'A_OMP'
endif
call B_MPI ! Starts with 'MPI_ScatterV', ends with 'MPI_GatherV'
enddo
end program
So in this way, by invoking OpenMP calls only inside 'my_id == 0', I want the total number of threads to be n_process + n_threads, rather than n_process x n_threads.
To be honest, I'm not very sure how threads are forked when they're mixed with MPI. I want to make sure that the above picture is possible, and whether it will be efficient.
Thank you for reading this question.

There shouldn't be any issue with this in principle, although note that using your terminology total_threads = n_threads + n_process - 1. If you had 10 physical CPU-cores and wanted to use 4 OpenMP threads, then you would have to launch 7 MPI processes. MPI rank 0 would only create 3 additional threads in the parallel region - the other thread, OpenMP thread 0, is just the original MPI process.
The issue in practice is that to get this to work efficiently you'll need to make sure that the 7 MPI processes and 3 additional OpenMP threads are scheduled appropriately at runtime, i.e. bound to the appropriate CPU cores. This might require a combination of both MPI and OpenMP affinity settings that will need to work together appropriately. Of course, you can never efficiently use more OpenMP threads than you have cores in a node as all the threads must run on the same physical computer as MPI rank 0.

Related

How to control multi-threads synchronization in Perl

I have an array with the ASCII codes for [A-Z, a-z], like so: my @alphabet = (65..90,97..122);
So the main thread's job is to check each character from the alphabet and return a string if a condition is true.
Simple example:
my @output = ();
for my $ascii (@alphabet) {
threads->new(sub { return chr($ascii); });
}
I want to run a thread for every ASCII code, then put the letter returned by each thread function into the array in the correct order.
So in our case the array @output should be dynamic and contain [A-Z, a-z] after all threads finish their jobs.
How do I check that all threads are done, and keep the order?

You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take much longer than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Create one thread per letter and keep the thread objects in creation order.
my @threads = map { my $ascii = $_; threads->new(sub { return chr($ascii); }) } @alphabet;
# Join in the same order, so the results come back ordered.
my @results = map { $_->join } @threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return value, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, @results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
The overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
The thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.

Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my @alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } @alphabet;
print "\n";
The results are returned in order, as soon as they become available.

I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii (@alphabet) {
$wu->async(sub{ return chr($ascii); });
}
@output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii (@alphabet) {
$wu->queue(sub{ return chr($ascii); });
}
@output = $wu->waitall();

Set number of threads for each section in openmp?

I wonder if it is possible to set the number of threads for each section in an OpenMP parallel sections construct, i.e.:
real*8 :: x
real*4 :: y
integer*8 :: ii
integer*4 :: jj
x = 0.0d0
y = 0.0
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION NUM_THREADS(3)
do ii=1,100000000000
x=x+(cos(sin(tan(ii*1.0d0))))**(x/ii)
end do
!$OMP SECTION NUM_THREADS(1)
do jj=1,10000
x=x+exp(jj*0.001)
end do
!$OMP END SECTIONS
!$OMP END PARALLEL
This code does not work with ifort 16.0 but I just wonder if there is something else one could do..?
EDIT: I get an error (during compilation) when I try to set the number of threads per section... I would like to specify a different number of threads per section.
EDIT 2: The error message, reported twice, is
error #5082: Syntax error, found 'NUM_THREADS' when expecting one of "<"END-OF-STATEMENT">" ;
at the two !$OMP SECTION NUM_THREADS(i) statements.
Pardon the hasty writ..

Now that you've told us what we needed to know it's blindingly obvious what the problem is ...
... the num_threads clause is applicable only to the parallel directive.
It is not possible to, in a straightforward fashion, allocate m out of n threads to one section and the remaining n-m threads to another. You can probably hack something together to achieve that effect but it would be going against the grain of OpenMP programming.

What you are trying to do is against the philosophy of OpenMP, where you are not supposed to have full control over threads. You can, however, use a hack: a combination of OpenMP and pthread. That is, OMP PARALLEL blocks which contain pthread statements (pthread will give you full control over which threads will be used in the OMP block). In the past, I was experimenting with that, and although I didn't try exactly what you want to do, I managed to get some interesting results verifying that an OpenMP+pthread combination is possible. Besides, some compilers (like gfortran) implement OpenMP via pthreads behind the scenes.
Of course you need to write Fortran bindings for the pthread statements you will use, but that's not much of a problem. The real problem is that such an approach is problematic by definition. It mixes two radically different parallelization models and it is a hack, so you are on your own. I wouldn't go that way in a serious application but, with enough trial-and-error, it is a way to do what you are trying to do.

Threads making MPI calls in a Hybrid MPI/OpenMP

I have found an issue in my hybrid MPI/OpenMP code that is reproduced
in its simplest form in the code below. I am using 2 threads
per MPI rank. These two threads are then used in an OpenMP "sections" construct
to do several computations; one of these consists of making an "mpi_allreduce" call on two different vectors A and B, whose results
are stored in W and WW. The problem is that every time I run the program
I end up with a different output. My guess is that the MPI calls are
overlapping and the reduced arrays W and WW get mixed up even though they
have different names, but I am not sure. Any comment on how to overcome
this issue is welcome.
Details:
The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code,
but I have also tried serial and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90, and to
run it I use:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
use mpi
use omp_lib
IMPLICIT NONE
INTEGER nthreads, tid
Integer Provided,mpi_err,myid,nproc
CHARACTER(MPI_MAX_PROCESSOR_NAME):: hostname
INTEGER :: nhostchars
integer :: i
real*8 :: A(1000), B(1000), W(1000),WW(1000)
provided=0
!Initialize MPI context
call mpi_init_thread(MPI_THREAD_MULTIPLE,provided,mpi_err)
CALL mpi_comm_rank(mpi_comm_world,myid,mpi_err)
CALL mpi_comm_size(mpi_comm_world,nproc,mpi_err)
CALL mpi_get_processor_name(hostname,nhostchars,mpi_err)
!Initialize arrays
A=1.0
B=2.0
!Check if MPI_THREAD_MULTIPLE is available
if (provided >= MPI_THREAD_MULTIPLE) then
write(6,*) ' mpi_thread_multiple provided',myid
else
write(6,*) ' not mpi_thread_multiple provided',myid
endif
!$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
!$omp sections
!$omp section
call mpi_allreduce(A,W,1000,mpi_double_precision,mpi_sum,mpi_comm_world,mpi_err)
!$omp section
call mpi_allreduce(B,WW,1000,mpi_double_precision,mpi_sum,mpi_comm_world,mpi_err)
!$omp end sections
!$OMP END PARALLEL
write(6,*) 'W',(w(i),i=1,10)
write(6,*) 'WW',(ww(i),i=1,10)
CALL mpi_finalize(mpi_err)
END

The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: same communicator. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)
!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This program simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy is used in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate contexts and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation, therefore the newly created communicators should be used for as long as possible before being freed.

MPI_Bcast using threads (OpenMP) in MPI

The MPI standard 3.0 says in Section 5.13 that
Finally, in multithreaded implementations, one can have more than one,
concurrently executing, collective communication call at a process. In
these situations, it is the user’s responsibility to ensure that the
same communicator is not used concurrently by two different collective
communication calls at the same process.
I wrote the following program, which compiles but does NOT execute correctly and dumps core:
void main(int argc, char *argv[])
{
int required = MPI_THREAD_MULTIPLE, provided, rank, size, threadID, threadProcRank ;
MPI_Comm comm = MPI_COMM_WORLD ;
MPI_Init_thread(&argc, &argv, required, &provided);
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);
int buffer1[10000] = {0} ;
int buffer2[10000] = {0} ;
#pragma omp parallel private(threadID,threadProcRank) shared(comm, buffer1)
{
threadID = omp_get_thread_num();
MPI_Comm_rank(comm, &threadProcRank);
printf("\nMy thread ID is %d and I am in process ranked %d", threadID, threadProcRank);
if(threadID == 0)
MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
if (threadID == 1)
MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
}
MPI_Finalize();
}
My question is: Two threads in each process having thread ID 0 and thread ID 1 post a broadcast call which can be taken as a MPI_Send() in the root process ( i.e. process 0). I am interpreting it as two loops of MPI_Send() where the destination is the remaining processes. The destination processes also post MPI_Bcast() in thread ID 0 and thread ID 1. These can be taken as two MPI_Recv()'s posted by each process in the two threads. Since the MPI_Bcast() are identical - there should be no matching problems in receiving the messages sent by Process 0 (the root). But still the program does not work. Why ? Is it because of the possibility that messages might get mixed up on different/same collectives on the same communicator ? And since MPI (mpich2) sees the possibility of this, it just does not allow two collectives on the same communicator pending at the same time ?

First of all, you are not checking the value of provided where the MPI implementation returns the actually provided thread support level. The standard allows for this level to be lower than the requested one and a correct MPI application would rather do something like:
MPI_Init_thread(&argc, &argv, required, &provided);
if (provided < required)
{
printf("Error: MPI does not provide the required thread support\n");
MPI_Abort(MPI_COMM_WORLD, 1);
exit(1);
}
Second, this line of code is redundant:
MPI_Comm_rank(comm, &threadProcRank);
Threads in MPI do not have separate ranks - only processes have ranks. There was a proposal to bring so-called endpoints into MPI 3.0, which would have allowed a single process to have more than one rank and to bind them to different threads, but it didn't make it into the final version of the standard.
Third, you are using the same buffer variable in both collectives. I guess your intention was to use buffer1 in the call in thread 0 and buffer2 in the call in thread 1. Also MPI_INTEGER is the datatype that corresponds to INTEGER in Fortran. For the C int type the corresponding MPI datatype is MPI_INT.
Fourth, the interpretation of MPI_BCAST as a loop of MPI_SEND and the corresponding MPI_RECV is just that - an interpretation. In reality the implementation is much different - see here. For example, with smaller messages where the initial network setup latency is much higher than the physical data transmission time, binary and binomial trees are used in order to minimise the latency of the collective. Larger messages are usually broken into many segments and then a pipeline is used to pass the segments from the root rank to all the others. Even in the tree distribution case the messages could still be segmented.
The catch is that in practice each collective operation is implemented using messages with the same tag, usually with negative tag values (these are not allowed to be used by the application programmer). That means that both MPI_Bcast calls in your case would use the same tags to transmit their messages and since the ranks would be the same and the communicator is the same, the messages would get all mixed up. Therefore the requirement for doing concurrent collectives only on separate communicators.
There are two possible reasons why your program crashes. The first is that the MPI library does not provide MPI_THREAD_MULTIPLE. The second applies if the message is split into two unevenly sized chunks, e.g. a larger first part and a smaller second part: the interference between the two collective calls could cause the second thread to receive a large first chunk directed to the first thread while waiting for the second, smaller chunk. The result would be message truncation, and the abort MPI error handler would get called. This usually does not result in segfaults and core dumps, so I would suppose that your MPICH2 is simply not compiled as thread-safe.
This is not MPICH2-specific. Open MPI and other implementations are also prone to the same limitations.

Can I easily write a program that makes use of Intel's quad-core or i7 chip if only 1 thread is used?

I wonder: if my program has only 1 thread, can I write it so that a quad-core or i7 CPU can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage will only go to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (The programs I write are usually in Ruby, Python, or PHP, so they may not be very optimized.)
Update: what if I write it in C or C++ instead, like this,
for (i = 0; i < 100000000; i++) {
a = i * 2;
b = i + 1;
if (a == ... || b == ...) { ... }
}
and then use the compiler's highest optimization level? Can the compiler make the multiplication happen on one core and the addition happen on a different core, and therefore make 2 cores work at the same time? Isn't that a fairly easy optimization to use 2 cores?

No. You need to use threads to execute multiple paths concurrently on multiple CPUs (be they real or virtual)... execution of one thread is inherently bound to one CPU as this maintains the "happens before" relationship between statements, which is central to how programs work.

First, unless multiple threads are created in the program, then there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, but all other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows are the CPU utilization by all processes running at the time, not only for one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
b = i + 1;
}
However, what becomes tricky is if there needs to be a time when the results from the two separate threads need to be evaluated, such as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
// manipulate "a" and "b"
if (a == ... || b == ...) { ... }
}
This would require the a and b values, which reside in separate threads (which are executing on separate processors), to be looked up, which is a serious headache.
There is no real guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition will probably take different amounts of time to execute), and that means that one thread may need to wait for the other so the i values get in sync before comparing the a and b that correspond to the same value of i. Or do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing state between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort; as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculate a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there is no state to share or handle between calculations, even if there were multiple values of x that needed to be calculated, even those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
runInThread(f(t))
foreach t in x:
runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be a very good thing for concurrent programming.
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to: why can't compilers figure out which parts can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but there is an article on automatic parallelization at Wikipedia which may have some information.

I know Intel chips very well.
In your code, "if (a == ... || b == ...)" is a barrier; otherwise the processor cores will execute all the code in parallel, regardless of what kind of optimization the compiler has done. That only requires that the compiler is not a very "stupid" one. It means that the capability lies in the hardware itself, not in the software. So threaded programming or OpenMP is not necessary in such cases, though they will help improve parallel computing. Note that this does not mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classic example which could be executed in parallel by multi-core/multi-channel IMC platforms (e.g. the Intel Nehalem family, such as Core i7), with no extra software optimization needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
*(buffer + i) = *(buffer0 + i);
*(buffer + 64 + i) = *(buffer1 + i);
*(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1 Core i7 has a triple-channel IMC; its bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis. The cache-line length is 64 bytes, so basically buffer0 is on channel 0, buffer1 is on channel 1 and buffer2 is on channel 2, while buffer[192] is interleaved evenly among the 3 channels, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's a multi-channel MC burst with maximum throughput. In the following description I'll only say 64 bytes per channel, i.e. with BL x8 (Burst Length 8, 8 x 8 = 64 bytes = one cache line) per channel.
2 buffer0..2 and buffer are contiguous in the memory space (on a specific page both virtually and physically, in stack memory). When run, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache lines in total. So once execution of the above "for(){}" code starts, accessing memory is not necessary at all because all the data are in the cache, the L3 cache, a non-core part which is shared by all cores. We won't talk about L1/L2 here. In this case every core could pick the data up and then compute on them independently; the only requirement is that the OS supports MP and that task stealing is allowed, i.e. runtime scheduling and affinity sharing.
3 There are no dependencies among buffer0, 1, 2 and buffer, so there are no execution stalls or barriers. E.g. executing *(buffer + 64 + i) = *(buffer1 + i) doesn't need to wait for the execution of *(buffer + i) = *(buffer0 + i) to finish.
Though, the most important and difficult point is "task stealing, runtime scheduling and affinity sharing"; that's because for a given task there's only one task execution context, and it should be shared by all cores to perform parallel execution. Anyone who can understand this point is among the top experts in the world. I'm looking for such an expert to work with me on my open source project and be responsible for parallel computing and work related to the latest HPC architectures.
Note that in the above example code you could also use some SIMD instructions such as movntdq/a, which bypass the processor cache and write to memory directly. It's a very good idea too when performing software-level optimization, though accessing memory is extremely expensive: for example, accessing the cache (L1) may need just 1 cycle, but accessing memory needs 142 cycles on former x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.

Implicit parallelism is probably what you are looking for.

If your application code is single-threaded, multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.

A single-threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time, according to some rules to balance the load etc. So you will see only 25% usage overall with all four cores doing some work, but only one at a time.

The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-24999999, the next 25000000-49999999, and so on. Set all four of them off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.

With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
...
}
to say that this for will run in parallel.
This is one easy way to parallelize something, but at some point you will have to understand how parallel programs execute, and you will be exposed to parallel programming bugs.

If you want to parallelize the selection of the "i"s for which your statement if (a == ... || b == ...) evaluates to "true", then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
where (i % 2 == 1 && i >= 10) //your condition
select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
//not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
if (i > 10) //your condition here
DoWork(i); //Thread-safe operation
});

Since you are talking about 'task manager', you appear to be running on Windows. However, if you are running a webserver on there (for Ruby or PHP with fcgi or Apache pre-forking, and to a lesser extent other Apache workers), with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then, no, no significant advantage will come from that - you're only running one thing at a time, other than OS-driven background processes.
