multithreading multiple short tasks in C++ 11 slows down the process?


I'm not very experienced with multithreading. I have a facial landmark detector that detects 68 landmarks around the facial components. For every landmark, the HoG features around it need to be extracted and appended to the previous landmarks' features to create one giant vector before passing it to the regressor.
Currently, all the features are extracted serially, one after another, and I'm trying to extract them in parallel to speed up the process.
Extracting the features around all the landmarks serially takes about 2.5 ms on my system. When I parallelize it using 68 threads, it takes about 8.5 ms. So it actually slows the process down, and I'm guessing this is because of the time it takes to create the threads.
The following is the original serial code:
for(int i = 0; i < 68; i++){ // for each landmark
fx = shape[i]; // x position
fy = shape[i + 68]; // y position
extract_features(image, fx, fy, &features[i]);
}
This is what I have done to parallelize it:
std::vector<std::thread> threads;
for(int i = 0; i < 68; i++){ // for each landmark
    fx = shape[i]; // x position
    fy = shape[i + 68]; // y position
    threads.emplace_back(
        // i, fx and fy change every iteration, so capture them by value;
        // image and features are captured by reference
        [&, i, fx, fy] () { extract_features(image, fx, fy, &features[i]); }
    );
}
for(int x = 0; x < 68; x++)
    threads[x].join();
I must be doing something wrong that slows the process down instead of speeding it up. My best guess is that creating a thread the way I'm doing it takes more time than the task itself. If that's the case, is there a way I can create the threads ahead of time and just run them in the for loop?
I would very much appreciate your help in guiding me toward the right approach for this project.
Thanks,
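If thread start-up cost is indeed the bottleneck, one common remedy is to start only a few threads (roughly one per core) and give each a contiguous chunk of the 68 landmarks. The sketch below is a minimal illustration under that assumption, not a drop-in fix: run_chunked is a hypothetical helper, and the task argument stands in for the call to extract_features.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs task(i) for every i in [0, count) using at most
// max_threads worker threads, each handling one contiguous chunk of indices.
template <typename Task>
void run_chunked(std::size_t count, unsigned max_threads, Task task) {
    unsigned n = std::max(1u, max_threads);  // hardware_concurrency() may be 0
    n = static_cast<unsigned>(std::min<std::size_t>(n, count ? count : 1));
    std::size_t chunk = (count + n - 1) / n;  // ceiling division
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(count, begin + chunk);
        if (begin >= end) break;  // nothing left for this worker
        workers.emplace_back([begin, end, &task] {
            for (std::size_t i = begin; i < end; ++i) task(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

For the loop above, a call could look like run_chunked(68, std::thread::hardware_concurrency(), [&](std::size_t i) { extract_features(image, shape[i], shape[i + 68], &features[i]); }). Whether this beats the 2.5 ms serial time would still have to be measured, since even a handful of thread launches may cost a noticeable fraction of that.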

Related

Is there a more efficient way of texturing a circle?

I'm trying to create a randomly generated "planet" (circle), and I want the areas of water, land and foliage to be decided by Perlin noise, or something similar. Currently I have this (pseudo)code:
for (int radius = 0; radius < circleRadius; radius++) {
for (float theta = 0; theta < TWO_PI; theta += 0.1) {
float x = radius * cosine(theta);
float y = radius * sine(theta);
int colour = whateverFunctionIMake(x, y);
setPixel(x, y, colour);
}
}
Not only does this not work (there are "gaps" in the circle because of precision issues), it's incredibly slow. Even if I increase the resolution by changing the increment to 0.01, it still has missing pixels and is even slower: I get 10 fps on my mediocre computer using Java (not the best choice, I know) with an increment of 0.01. This is certainly not acceptable for a game.
How might I achieve a similar result whilst being much less computationally expensive?
Thanks in advance.

Why not use:
(x-x0)^2 + (y-y0)^2 <= r^2
so simply:
int x0=?,y0=?,r=?; // your planet position and size
int x,y,xx,rr,col;
for (rr=r*r,x=-r;x<=r;x++)
for (xx=x*x,y=-r;y<=r;y++)
if (xx+(y*y)<=rr)
{
col = whateverFunctionIMake(x, y);
setPixel(x0+x, y0+y, col);
}
all on integers, no floating or slow operations, no gaps ... Do not forget to use randseed for the coloring function ...
[Edit1] some more stuff
Now, if you want speed, then you need direct pixel access (on most platforms Pixels, SetPixel, PutPixels, etc. are slow because they perform a lot of extra work such as range checking and color conversions). If you have direct pixel access, or render into your own array or image, you need to add clipping against the screen (so you do not have to check whether each pixel is inside the screen) to avoid access violations when your circle overlaps the screen edge.
As mentioned in the comments you can get rid of the x*x and y*y inside loop using previous value (as both x,y are only incrementing). For more info about it see:
32bit SQRT in 16T without multiplication
the math is like this:
(x+1)^2 = (x+1)*(x+1) = x^2 + 2x + 1
So instead of xx = x*x we just do xx+=x+x+1 if x has not been incremented yet, or xx+=x+x-1 if x has already been incremented.
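The recurrence can be checked in isolation with a small sketch like the following. The function name incremental_squares is made up for illustration; it produces x*x for a range of x using one multiplication up front and only additions inside the loop.

```cpp
#include <vector>

// Returns x*x for x = first .. last, using only additions inside the loop.
std::vector<long> incremental_squares(int first, int last) {
    std::vector<long> squares;
    long xx = static_cast<long>(first) * first;  // one multiplication up front
    for (int x = first; x <= last; ++x) {
        squares.push_back(xx);
        xx += x + x + 1;  // (x+1)^2 = x^2 + 2x + 1, with x not yet incremented
    }
    return squares;
}
```

Starting from a negative x works too, which is what the circle loop needs because x runs from -r to +r.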
When put all together I got this:
void circle(int x,int y,int r,DWORD c)
{
// my Pixel access
int **Pixels=Main->pyx; // Pixels[y][x]
int xs=Main->xs; // resolution
int ys=Main->ys;
// circle
int sx,sy,sx0,sx1,sy0,sy1; // [screen]
int cx,cy,cx0, cy0 ; // [circle]
int rr=r*r,cxx,cyy,cxx0,cyy0; // [circle^2]
// BBOX + screen clip
sx0=x-r; if (sx0>=xs) return; if (sx0< 0) sx0=0;
sy0=y-r; if (sy0>=ys) return; if (sy0< 0) sy0=0;
sx1=x+r; if (sx1< 0) return; if (sx1>=xs) sx1=xs-1;
sy1=y+r; if (sy1< 0) return; if (sy1>=ys) sy1=ys-1;
cx0=sx0-x; cxx0=cx0*cx0;
cy0=sy0-y; cyy0=cy0*cy0;
// render
for (cxx=cxx0,cx=cx0,sx=sx0;sx<=sx1;sx++,cxx+=cx,cx++,cxx+=cx)
for (cyy=cyy0,cy=cy0,sy=sy0;sy<=sy1;sy++,cyy+=cy,cy++,cyy+=cy)
if (cxx+cyy<=rr)
Pixels[sy][sx]=c;
}
This renders a circle with a radius of 512 px in ~35 ms, so about 23.5 Mpx/s fill rate on my setup (AMD A8-5500 3.2GHz, Win7 64bit, single thread, VCL/GDI 32bit app coded in BDS2006 C++). Just change the direct pixel access to the style/API you use.
[Edit2]
To measure speed on x86/x64 you can use the RDTSC asm instruction. Here is some ancient C++ code I used ages ago (in a 32bit environment without native 64bit support):
double _rdtsc()
{
LARGE_INTEGER x; // unsigned 64bit integer variable from windows.h I think
DWORD l,h; // standard unsigned 32 bit variables
asm {
rdtsc
mov l,eax
mov h,edx
}
x.LowPart=l;
x.HighPart=h;
return double(x.QuadPart);
}
It returns the number of clock cycles your CPU has counted since power-up. Beware that you should account for overflows, as on fast machines a 32bit counter overflows within seconds. Also, each core has a separate counter, so set the affinity to a single CPU. On CPUs with a variable clock speed, warm up the CPU with some computation before measuring. To convert to time, just divide by the CPU clock frequency, which you can obtain like this:
t0=_rdtsc();
sleep(250); // wait 250 ms, a quarter of a second
t1=_rdtsc();
fcpu = (t1-t0)*4; // cycles per second
and the measurement:
t0=_rdtsc();
// measured stuff
t1=_rdtsc();
time = (t1-t0)/fcpu; // seconds
If t1<t0 you overflowed, and you need to add a constant to the result or measure again. Also, the measured process must take less than the overflow period. To enhance precision ignore OS granularity. For more info see:
Measuring Cache Latencies
Cache size estimation on your system? setting affinity example
Negative clock cycle measurements with back-to-back rdtsc?

ways to express concurrency without thread

I am wondering about how concurrency can be expressed without an explicit thread object, not the implementation, which probably would use threads or thread pools, but the language design related issues.
Q1: I wonder what would be lost if there was no thread object, what couldn't be done in such a language?
Q2: I also wonder about how this would be expressed, what ways were proposed or implemented as alternatives or complements to threads?

One possibility is the MPI programming model (GPUs follow a similar model).
Let's say you have the following code:
for(int i=0; i < 100; i++) {
    work(i);
}
The "normal" thread-based way would be to split the iteration range into multiple subsets, something like this:
Thread-1:
for(int i=0; i < 50; i++) {
work(i);
}
Thread-2:
for(int i=50; i < 100; i++) {
work(i);
}
However, in MPI/GPU programming you do something different.
The idea is that every core executes the same (GPU) or at least a similar (MPI) program. The difference is that each core uses a different ID, which changes the behavior of the code.
MPI-style (not exact MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size; // assumes 100 is divisible by size
for (int i = rank * subset; i < (rank+1)*subset; i++) {
//each core will use a different range for i
work(i);
}
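The range computation above drops iterations when the total is not divisible by the number of cores. A minimal sketch of one way to spread the remainder is shown below; block_range is a hypothetical helper name, and rank and size play the same roles as in the snippet above.

```cpp
#include <utility>

// Half-open range [begin, end) of iterations owned by `rank` out of `size`
// workers when `total` iterations are split as evenly as possible.
std::pair<int, int> block_range(int total, int rank, int size) {
    int base = total / size;   // every worker gets at least this many
    int extra = total % size;  // the first `extra` workers get one more
    int begin = rank * base + (rank < extra ? rank : extra);
    int end = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}
```

With 100 iterations on 3 cores this gives the ranges [0, 34), [34, 67) and [67, 100), so every iteration is covered exactly once.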
The next big thing is communication. With threads, you normally have to handle all of the synchronization manually. MPI is message-based, meaning that it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but on a cluster system (many cores connected by a network) it works excellently. This is not limited to supercomputers (which use essentially only MPI-style programming): in recent years a new type of core architecture (manycore) has been developed. These chips have a local so-called network-on-chip, so each core can send and receive messages without the synchronization problem.
MPI offers not only simple messages but also higher-level constructs that automatically scatter and gather data across the cores.
Example: (again not MPI-syntax)
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { //master-core only
fill_with_stuff(data);
}
scatter(0, data); //core-0 will send the data-content to all other cores
result = work(rank, data); // every core works on the same data
gather(0,result,results); //get all local results and store them in
//the results-array of core-0
Another solution is the OpenMP library.
Here you declare parallel blocks; the threading part is handled by the library itself.
Example:
//this will split the for-loop automatically across 4 threads
#pragma omp parallel for num_threads(4)
for(int i=0; i < 100; i++) {
work(i);
}
The big advantage is that it's fast to write; that's it.
You may get better performance by writing the threads on your own,
but it takes a lot more time and knowledge about synchronization.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread perform this operation, I would like to have n threads (with n = length) and, at the end, have 1 thread compute the total sum.
In pseudocode, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.

To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}

I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else passed to another kernel call that does this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank switching issues by reading from local memory essentially sequentially. ** I would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N

If you have support for OpenCL C 2.0 features, you can use the new work_group_reduce_add() function for sum reduction inside a single work-group.

A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
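To see why the "m" and "n" bookkeeping handles odd sizes, the host loop can be emulated serially on the CPU. The sketch below is only an illustration of the index arithmetic, not OpenCL code: fold_step is a stand-in for one launch of foldKernel, and fold_sum mirrors the host loop.

```cpp
#include <cstddef>
#include <vector>

// Serial stand-in for one foldKernel launch: each "work-item" gid adds the
// element `offset` positions above it into its own slot.
void fold_step(std::vector<float>& v, std::size_t work_items, std::size_t offset) {
    for (std::size_t gid = 0; gid < work_items; ++gid) v[gid] += v[gid + offset];
}

// Mirrors the host loop: halve the active size until one element remains.
float fold_sum(std::vector<float> v) {
    if (v.empty()) return 0.0f;
    std::size_t t = v.size();
    while (t > 1) {
        std::size_t m = t / 2;        // number of work-items launched
        std::size_t n = (t + 1) / 2;  // offset, and size of the next pass
        fold_step(v, m, n);
        t = n;
    }
    return v[0];
}
```

When t is odd, the middle element at index m is neither read nor written in that pass and simply survives into the next one. Because work-items write indices below m and read indices from n upward (with n >= m), no launch reads a slot that another work-item in the same launch writes.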

Why FFTW on Windows is faster than on Linux?

I wrote two identical programs in Linux and Windows using the fftw libraries (fftw3.a, fftw3.lib), and compute the duration of the fftwf_execute(m_wfpFFTplan) statement (16-fft).
For 10000 runs:
On Linux: average time is 0.9
On Windows: average time is 0.12
I am confused as to why this is nine times faster on Windows than on Linux.
Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Both OSes (Windows XP 32-bit and Linux openSUSE 11.4 32-bit) are installed on the same machine.
I downloaded the fftw.lib (for Windows) from the internet and don't know how it was configured. I once built FFTW on Linux with this config:
./configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
and the resulting lib was four times faster than the one built with the default config (0.4 ms).

16 FFT is very small. What you will find is FFTs smaller than say 64 will be hard coded assembler with no loops to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, even 64 or 32bit words.
What happens when you run a test of FFT sizes from 16 -> 1048576 in powers of 2? I say this as a particular hard-coded asm routine on Linux might not be the best optimized for your machine, whereas you might have been lucky on the Windows implementation for that particular size. A comparison of all sizes in this range will give you a better indication of the Linux vs. Windows performance.
Have you calibrated FFTW? When first run FFTW guesses the fastest implementation per machine, however if you have special instruction sets, or a particular sized cache or other processor features then these can have a dramatic effect on execution speed. As a result performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly computing the plans and saving the FFTW "Wisdom" file generated. The saved calibration data (this is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.
// Calibration FFTW
void DSP::forceCalibration(void)
{
// Try to import FFTw Wisdom for fast plan creation
FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");
// If wisdom does not exist, ask user to calibrate
if (fftw_wisdom == 0)
{
int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\
"Would you like to perform a one-time calibration?\n\n"\
"Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
"\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
"\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
MB_YESNO | MB_ICONSTOP, 0);
if (iStatus2 == IDYES)
{
// Perform calibration for all powers of 2 from 8 to 4194304
// (most heavily used FFTs - for signal processing)
AfxMessageBox("About to perform calibration.\n"\
"Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
"Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
"\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
startTimer();
// Create a whole load of FFTw Plans (wisdom accumulates automatically)
for (int i = 8; i <= 4194304; i *= 2)
{
// Create new buffers and fill
DSP::cFFTin = new fftw_complex[i];
DSP::cFFTout = new fftw_complex[i];
DSP::fconv_FULL_Real_FFT_rdat = new double[i];
DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
for(int j = 0; j < i; j++)
{
DSP::fconv_FULL_Real_FFT_rdat[j] = j;
DSP::cFFTin[j][0] = j;
DSP::cFFTin[j][1] = j;
DSP::cFFTout[j][0] = 0.0;
DSP::cFFTout[j][1] = 0.0;
}
// Create a plan for complex FFT.
// Use the measure flag to get the best possible FFT for this size
// FFTw "remembers" which FFTs were the fastest during this test.
// at the end of the test, the results are saved to disk and re-used
// upon every initialisation of the DSP Library
DSP::pCF = fftw_plan_dft_1d
(i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real forward FFT
DSP::pCF = fftw_plan_dft_r2c_1d
(i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real inverse FFT
DSP::pCF = fftw_plan_dft_c2r_1d
(i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Destroy the buffers. Repeat for each size
delete [] DSP::cFFTin;
delete [] DSP::cFFTout;
delete [] DSP::fconv_FULL_Real_FFT_rdat;
delete [] DSP::fconv_FULL_Real_FFT_cdat;
}
double time = stopTimer();
char * strOutput;
strOutput = (char*) malloc (100);
sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
"Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
"to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
AfxMessageBox(strOutput);
isCalibrated = 1;
// Save accumulated wisdom
char * strWisdom = fftw_export_wisdom_to_string();
FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
fprintf(fftw_wisdomsave, "%s", strWisdom);
fclose(fftw_wisdomsave);
DSP::pCF = NULL;
DSP::cFFTin = NULL;
DSP::cFFTout = NULL;
fconv_FULL_Real_FFT_cdat = NULL;
fconv_FULL_Real_FFT_rdat = NULL;
free(strOutput);
}
}
else
{
// obtain file size.
fseek (fftw_wisdom , 0 , SEEK_END);
long lSize = ftell (fftw_wisdom);
rewind (fftw_wisdom);
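// allocate memory to contain the whole file plus a terminating NUL.
char * strWisdom = (char*) malloc (lSize + 1);
// copy the file into the buffer and NUL-terminate it,
// because fftw_import_wisdom_from_string expects a C string.
size_t nRead = fread (strWisdom,1,lSize,fftw_wisdom);
strWisdom[nRead] = '\0';
// import the buffer to fftw wisdom
fftw_import_wisdom_from_string(strWisdom);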
fclose(fftw_wisdom);
free(strWisdom);
isCalibrated = 1;
return;
}
}
The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:
DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
FFTW_FORWARD, FFTW_MEASURE);
Finally, every benchmark should create the FFT plan once, outside the timed section, and time only the execute call, run from code compiled in release mode with optimizations on and detached from the debugger. Run the execute call in a loop with many thousands (or even millions) of iterations and take the average run time as the result. As you probably know, the planning stage takes a significant amount of time, and execute is designed to be called many times with a single plan.
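A minimal sketch of such a timing loop is given below. It uses std::chrono from the C++ standard library rather than anything FFTW-specific; average_microseconds is a hypothetical helper, and the callable passed to it is assumed to wrap a single execute call on a plan that was created beforehand.

```cpp
#include <chrono>

// Average wall-clock time per call, in microseconds, over `iterations` calls.
template <typename Fn>
double average_microseconds(Fn&& fn, int iterations) {
    using clock = std::chrono::steady_clock;
    fn();  // warm-up call, excluded from the measurement
    auto start = clock::now();
    for (int i = 0; i < iterations; ++i) fn();
    auto stop = clock::now();
    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Assuming a plan variable such as the question's m_wfpFFTplan already exists, a call might look like average_microseconds([&] { fftwf_execute(m_wfpFFTplan); }, 1000000). For a transform as small as 16 points the loop overhead itself may be visible in the result, so a large iteration count helps.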

Dividing loop iterations among threads

I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.
for (int i1 = 0; i1 < N; i1++)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question
It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?
Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered
just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically
What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?
By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it)
EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:
int bin_index(int i1, int i2, int i3, int i4) {
// w, d, h are constant floats
float x1 = i1 * w / N, x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
float l = sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
float th = acos(h / l);
// th_max is a constant float (previously computed as a function of w, d, h)
return (int)(th / th_max);
}
(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)

The first approach is simple. It is also sufficient if you expect that the load will be balanced evenly over the threads. In some cases, especially if the complexity of bin_index is very dependent on the parameter values, one of the threads could end up with a much heavier task than the rest. Remember: the task is finished when the last thread finishes.
The second approach is a bit more complicated, but balances the load more evenly if the tasks are fine-grained enough (the number of tasks is much larger than the number of threads).
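A minimal C++11 sketch of the second (dynamic) approach follows. Note that it swaps the mutex-protected variable from the question for a std::atomic counter, which serves the same purpose here; run_dynamic is a hypothetical name, and task(i) stands for one iteration of the outermost loop.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Runs task(i) for i in [0, num_tasks). Each worker repeatedly claims the
// next unclaimed index, so faster workers simply end up taking more tasks.
template <typename Task>
void run_dynamic(int num_tasks, unsigned nthreads, Task task) {
    std::atomic<int> next(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < (nthreads ? nthreads : 1u); ++t) {
        workers.emplace_back([&] {
            // fetch_add returns the previous value, so each index is handed
            // out to exactly one worker.
            for (int i = next.fetch_add(1); i < num_tasks; i = next.fetch_add(1))
                task(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

The counter is touched once per task, so the overhead stays small as long as each task does a meaningful amount of work, such as one full iteration of the outermost loop.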
Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.
Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.
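The per-thread histogram idea could look roughly like this in C++11. It is a sketch under simplifying assumptions: toy_bin is a made-up stand-in for bin_index, only two loop levels are shown, and the work is assigned statically by i1 % nthreads as in the first approach.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's bin_index: any pure function works.
inline std::size_t toy_bin(int i1, int i2, std::size_t bins) {
    return static_cast<std::size_t>(i1 + i2) % bins;
}

std::vector<long> parallel_histogram(int n, std::size_t bins, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    // One private histogram per thread, so no locking is needed while counting.
    std::vector<std::vector<long>> local(nthreads, std::vector<long>(bins, 0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Static assignment: thread t takes every i1 with i1 % nthreads == t.
            for (int i1 = static_cast<int>(t); i1 < n; i1 += static_cast<int>(nthreads))
                for (int i2 = 0; i2 < n; ++i2)
                    local[t][toy_bin(i1, i2, bins)] += 1;
        });
    }
    for (auto& w : workers) w.join();
    // Combine the private histograms once all threads have finished.
    std::vector<long> total(bins, 0);
    for (const auto& h : local)
        for (std::size_t b = 0; b < bins; ++b) total[b] += h[b];
    return total;
}
```

The merge step costs one pass over nthreads * bins counters, which is negligible next to a billion increments.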

The first approach is enough. No need for complication here. If you start playing with mutexes you risk introducing hard-to-detect errors.
Don't start complicating things unless you really see that you need to. Synchronization issues (especially in the case of many threads instead of many processes) can be really painful.

As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.
You could probably also use Intel's Threading Building Blocks library.

If you have never coded a multithreaded application, I urge you to begin with OpenMP:
the library is now included in gcc by default
it is very easy to use
In your example, you should only have to add these pragmas:
#pragma omp parallel for shared(histogram)
for (int i1 = 0; i1 < N; i1++)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++) {
                // make each update of the shared histogram atomic
                #pragma omp atomic
                histogram[bin_index(i1, i2, i3, i4)] += 1;
            }
With the first pragma, the compiler adds the instructions to create the threads, launch them, and divide the iterations of the outermost loop among them; the atomic pragma protects each update of the shared histogram. There are a lot of options, but well-chosen pragmas do all the work for you. Basically, the simplicity depends on the data dependencies.
Of course, the result will probably not be as fast as if you had coded it all by hand. But if you don't have a load balancing problem, you may approach a 2x speed-up. Actually, this only writes into an array with no spatial dependency in it.

I would do something like this:
void HistogramThread(int i1, Action<int[]> HandleResults)
{
int[] histogram = new int[HistogramSize];
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1;
HandleResults(histogram);
}
int[] CalculateHistogram()
{
int[] histogram = new int[HistogramSize];
ThreadPool pool; // I don't know syntax off the top of my head
for (int i1=0; i1<N; i1++)
{
pool.AddNewThread(HistogramThread, i1, delegate(int[] h)
{
lock (histogram)
{
for (int i=0; i<HistogramSize; i++)
histogram[i] += h[i];
}
});
}
pool.WaitForAllThreadsToFinish();
return histogram;
}
This way you don't need to share any memory, until the end.

If you ever do it in .NET, use the Parallel Extensions.

If you want to write multithreaded number crunching code (and you are going to be doing a lot of it in the future) I would suggest you take a look at using a functional language like OCaml or Haskell.
Due to the lack of side effects and lack of shared state in functional languages (well, mostly) making your code run across multiple threads is a LOT easier. Plus, you'll probably find that you end up with a lot less code.

I agree with Sharptooth that your first approach seems like the only plausible one.
Your single threaded app is continuously assigning to memory. To get any speedup, your several threads would need to also be continuously assigning to memory. If only one thread is assigning at a time, you would get no speedup at all. So if your assignments are guarded, the whole exercise would fail.
This would be a dangerous approach, since you would be assigning to shared memory without a guard. But it seems to be worth the danger (if a 2x speedup matters). If you can be sure that all the values of bin_index(i1, i2, i3, i4) are different in your division of the loop, then it should work, since the array assignments would be to different locations in your shared memory. Still, one should always look long and hard at approaches like this.
I assume you would also produce a test routine to compare the results of the two versions.
Edit:
Looking at your bin_index(i1, i2, i3, i4), I suspect your process could not be parallelized without considerable effort.
The only way to divide up the work of calculation in your loop is, again, to be sure that your threads will not access the same areas in memory. However, it looks like bin_index(i1, i2, i3, i4) will likely repeat values quite often. You might divide up the iteration into the cases where bin_index is higher than a cutoff and where it is lower than a cutoff. Or you could divide it arbitrarily and see whether the increment is implemented atomically. But any complex threading approach looks unlikely to provide an improvement if you only have two cores to work with to start with.
