TBB (Threading Building Blocks) strange behaviour - multithreading

My question: why is my program freezing if I use "read only" const_accessors?
It seems to lock up. From the API description it seems to be OK to have one accessor and multiple const_accessors (one writer, many readers). Maybe somebody can tell me a different story.
The goal I am trying to achieve is to use this concurrent hash map and make it available to 10-200 threads so that they can look up and add/delete information. If you have a better solution than the one I am currently using, you are also welcome to post alternatives.
tbb::size_t hashInitSize = 1200000;
concurrent_hash_map<long int,const char*> hashmap(hashInitSize);
cout << hashmap.bucket_count() << std::endl;
long int l = 200;
long int c = 201;
concurrent_hash_map<long int,const char*>::accessor o;
concurrent_hash_map<long int,const char*>::const_accessor t;
concurrent_hash_map<long int,const char*>::const_accessor h;
cout << "Trying to find 200 " << hashmap.find(t,200) << std::endl; // t would hold a read lock on key 200 if it were found
hashmap.insert(o,l);
o->second = "testother"; // o still holds a write lock on key 200 until it is destroyed or released
TBB Community Tutorial Guide Page 43 describes the concept of accessors

From the TBB reference manual:
An accessor acts as a smart pointer to a pair in a concurrent_hash_map. It holds an implicit lock on a pair until the instance is destroyed or method release is called on the accessor.
Accessors acquire a lock when they are used. Multiple accessors can exist at the same time. However, if a program uses multiple accessors concurrently and does not release the locks, it is likely to deadlock.
To avoid deadlock, release the lock on a hash map entry once you are done accessing it.


C++98: is volatile needed for access to shared global objects in a multithreaded application

Sadly I'm stuck with C++98, which I'm using in an embedded application.
My question is: I have a multithreaded application, with various global shared variables (evil, I know).
I do protect every access to them using mutexes. Do I also need to declare these global variables as volatile, in order to prevent the compiler from optimizing accesses to them?
Searching online it seems that volatile is absolutely useless for multithreading, but a lot of articles are related to C++11, which did introduce a memory model which recognizes threads, but I'm in C++98 land.
I also found some resources that indicate that volatile is instead useful in my case, such as this Barr Group's article.
Let me emphasize the fact that I don't want to get rid of the mutexes at all, or try lock free programming. The mutexes are absolutely staying, I just want to understand if the volatile keyword is needed.
Do I also need to declare these global variables as volatile, in order to prevent the compiler from optimizing accesses to them?
No. And if you did, you would still be in trouble, because volatile is not sufficient -- things other than the compiler (such as the CPU, store buffers, and memory controllers) can also reorder or optimize accesses.
As I'm sure you've read elsewhere, volatile has no defined multi-threading semantics in C++98. So unless it does in your particular threading standard (which you don't specify), then it's completely useless to you.
Presumably, your code uses mutexes properly. No optimization is allowed to break code that only relies on guarantees provided by the relevant standards or implementation. So if you're using the mutexes correctly, then your code is guaranteed to work.
What does the keyword volatile mean?
It forces your program to always read/write the variable from/to memory (like a cache miss every time): CPU->L1->L2->L3->Bus->Memory and Memory->Bus->L3->L2->L1->CPU.
This actually slows your program down, so you should not use it unless you really need it.
You may know that the compiler can perform optimizations, but these are designed not to change your program's logic.
Example 1:
int d = 5;
int b = 10;
for (int i = 0; i < 1e9; i++) {
    cout << i;
    b++;
}
cout << d; // d may be materialized only here, or not created at all: the compiler may just emit cout << 5;
// b appears unused afterwards, so its increments may be removed entirely
Example 2:
int a = 0;
for (int i = 0; i < 10; i++) {
    a++;
    sleep(1000);
}
cout << a;
// this code could be compiled as if it were:
sleep(10000);
cout << 10;
The volatile keyword prevents a variable from such optimizations; i.e. the vars a and b above will actually be created and incremented, and every access will go all the way to memory.
Here is one of the most common use cases for volatile:
You have a third-party sensor or scoreboard attached to a variable's address, and its value is always read and shown on the scoreboard. In this case you need volatile to prevent compiler optimizations on it, so your scoreboard can show the correct value.
I guess now you can decide whether you need volatile or not

A thread_guard Equivalent To lock_guard / unique_lock

The standard library provides a mutex class, with the ability to manually lock and unlock it:
std::mutex m;
m.lock();
// ...
m.unlock();
However, the library apparently also recognizes that a common case is just to lock the mutex at some point, and unlock it when leaving a block. For this it provides std::lock_guard and std::unique_lock:
std::mutex m;
std::lock_guard<std::mutex> lock(m);
// ...
// Automatic unlock
I think a fairly common pattern for threads, is to create one (either as a stack variable, or a member), then join it before destructing it:
std::thread t(foo);
// ...
t.join();
It seems easy to write a thread_guard, which would take a thread (or a sequence of threads), and would just call join on its own destruction:
std::thread t(foo);
thread_guard<std::thread> g(t);
// ...
// Join automatically
Is there a standard-library class like it?
If not, is there some reason to avoid this?
This issue is discussed in Scott Meyers' book "Effective Modern C++" (Item 37: make std::threads unjoinable on all paths).
The problem is that any single default behavior (detach or join) would cause hard-to-find errors if you forget that an implicit operation happens. So the actual behavior on destruction is to call std::terminate if the thread was not explicitly joined or detached, and no "guard" class exists in the standard library for the same reason.
If you always want to join, it is safe to write such a class yourself. But when someone uses it and wants to detach, they can forget that the destructor will implicitly join. That is the risk in writing such a class.
As an alternative you can use a scope_guard from Boost or the folly library (which I personally prefer) and declare your intention explicitly at the beginning; it will be executed on scope exit. Or you can write a policy-based "guard" class where you have to state explicitly what should happen on destruction.

QtConcurrent threading is slow!! What am I doing wrong?

Why is my QtConcurrent::run() call just as slow as calling the member function directly on the object?
(Ex: QtConcurrent::run(&db, &DBConnect::loadPhoneNumbers) is just as slow as calling db.loadPhoneNumbers())
Read below for further explanation.
I've been trying to create a thread via QtConcurrent::run to help speed up data being sent to a SQL database table. I am taking a member variable which is a QMap and iterating through it to send each key+value to the database.
Member function for the QtConcurrent::run() call:
void DBConnect::loadPhoneNumbers()
{
//m_phoneNumbers is a private QMap member variable in DBConnect
qDebug() << "\t[!] Items to send: " << m_phoneNumbers.size();
QSqlQuery query;
qDebug() << "\t[!] Using loadphonenumbers thread: " << QThread::currentThread();
qDebug() << "\t[!] Ideal Num of Threads: " << QThread::idealThreadCount();
bool isLoaded = false;
QMap<QString,QString>::const_iterator tmp = m_phoneNumbers.constBegin();
while(tmp != m_phoneNumbers.constEnd())
{
isLoaded = query.exec(QString("INSERT INTO "+m_mtable+" VALUES('%1','%2')").arg(tmp.key()).arg(tmp.value()));
if(isLoaded == false)
{
qDebug() << "\r\r[X] ERROR: Couldn't load number " << tmp.key() << " into table " << m_mtable;
qDebug() << query.lastError().text();
}
tmp++;
}
}
main.cpp section that calls the thread
DBConnect db("QODBC", myINI.getSQLServer(),C_DBASE,myINI.getMTable(), myINI.getBTable());
db.startConnect();
//...more code here
qDebug() << "\n[*] Using main thread: " << QThread::currentThread() << endl;
//....two qtconcurrent::run() threads started and finished here (not shown)
qDebug() << "\n[*] Sending numbers to Database...";
QFuture<void> dbFuture = QtConcurrent::run(&db, &DBConnect::loadPhoneNumbers);
dbFuture.waitForFinished();
My understanding of the situation
From my understanding, this thread should run in a pool of threads separate from the main thread. What I am seeing is not the case (note there are 2 other QtConcurrent::run() calls before this one for the database, all left to finish before continuing to the database call).
Now I thought about using QtConcurrent::map() / mapped(), but couldn't get it to work properly with a QMap. (Couldn't find any examples to help with that either, but that is beside the matter... just an FYI in case someone asks why I didn't use them.)
I have been doing some "debug" work to find out what's happening, and in my tests I use QThread::currentThread() to find which thread I am currently calling from. This is what happens for the various threads in my program. (All QtConcurrent::run() calls are made in main.cpp, FYI... not sure if that makes a difference.)
Check what is main thread: on QThread(0x5d2cd0)
Run thread 1: on QThread(0x5dd238, name = "Thread (pooled)")
Run thread 2: on QThread(0x5d2cd0)
Run thread 3 (loadPhoneNumbers function): on QThread(0x5d2cd0)
As seen above, other than the first qtconcurrent::run() call, everything else is on the main thread (o.O)
Questions:
From my understanding, all my threads (all qtconcurrent::run) should be on their own thread (only first one is). Is that true or am I missing something?
Second, is my loadPhoneNumbers() member function thread safe? (Since I am not altering anything, from what I can see)
Biggest question:
Why is my loadPhoneNumbers() qtconcurrent::run call just as slow as if I just called the member function? (ex: db.loadPhoneNumbers() is just as slow as the qtconcurrent::run() version)
Any help is much appreciated!
Threads don't magically speed things up, they just make it so you can continue doing other stuff while it's happening in the background. When you call waitForFinished(), your main thread won't continue until the load phone numbers thread is finished, essentially negating that advantage. Depending on the implementation, that may be why your currentThread() is showing the same as main, because the wait is already happening.
Probably more significant in terms of speed would be to build a single query that inserts all the values in the list, rather than a separate query for each value.
According to QtSql documentation:
A connection can only be used from within the thread that created it.
Moving connections between threads or creating queries from a
different thread is not supported.
It works anyway because ODBC itself supports multithreaded access to a single ODBC handle. But since you are only using one connection, all queries are probably serialized by ODBC as if there was only a single thread (see for example what Oracle's ODBC driver does).
waitForFinished() calls a private function stealRunnable() that, as its name implies, takes a not-yet-started task from the QFuture queue and runs it in the current thread.

What does threadsafe mean?

Recently I tried to access a textbox from a thread (other than the UI thread) and an exception was thrown. It said something about the "code not being thread safe", and so I ended up writing a delegate (a sample from MSDN helped) and calling it instead.
But even so, I didn't quite understand why all the extra code was necessary.
Update:
Will I run into any serious problems if I check
Controls.CheckForIllegalCrossThread..blah =true
Eric Lippert has a nice blog post entitled What is this thing you call "thread safe"? about the definition of thread safety as found on Wikipedia.
Three important things extracted from the links:
“A piece of code is thread-safe if it functions correctly during
simultaneous execution by multiple threads.”
“In particular, it must satisfy the need for multiple threads to
access the same shared data, …”
“…and the need for a shared piece of data to be accessed by only one
thread at any given time.”
Definitely worth a read!
In the simplest of terms threadsafe means that it is safe to be accessed from multiple threads. When you are using multiple threads in a program and they are each attempting to access a common data structure or location in memory several bad things can happen. So, you add some extra code to prevent those bad things. For example, if two people were writing the same document at the same time, the second person to save will overwrite the work of the first person. To make it thread safe then, you have to force person 2 to wait for person 1 to complete their task before allowing person 2 to edit the document.
Wikipedia has an article on Thread Safety.
This definitions page (you have to skip an ad - sorry) defines it thus:
In computer programming, thread-safe describes a program portion or routine that can be called from multiple programming threads without unwanted interaction between the threads.
A thread is an execution path of a program. A single-threaded program has only one thread, so this problem doesn't arise. Virtually all GUI programs have multiple execution paths and hence threads - there are at least two: one for processing the display of the GUI and handling user input, and at least one other for actually performing the operations of the program.
This is done so that the UI is still responsive while the program is working by offloading any long running process to any non-UI threads. These threads may be created once and exist for the lifetime of the program, or just get created when needed and destroyed when they've finished.
As these threads will often need to perform common actions - disk i/o, outputting results to the screen etc. - these parts of the code will need to be written in such a way that they can handle being called from multiple threads, often at the same time. This will involve things like:
Working on copies of data
Adding locks around the critical code
Opening files in the appropriate mode - so if reading, don't open the file for write as well.
Coping with not having access to resources because they're locked by other threads/processes.
Simply, thread-safe means that a method or class instance can be used by multiple threads at the same time without any problems occurring.
Consider the following method:
private int myInt = 0;
public int AddOne()
{
int tmp = myInt;
tmp = tmp + 1;
myInt = tmp;
return tmp;
}
Now thread A and thread B both would like to execute AddOne(). But A starts first and reads the value of myInt (0) into tmp. Now, for some reason, the scheduler decides to halt thread A and defer execution to thread B. Thread B now also reads the value of myInt (still 0) into its own variable tmp. Thread B finishes the entire method, so in the end myInt = 1, and 1 is returned. Now it's thread A's turn again. Thread A continues and adds 1 to tmp (tmp was 0 for thread A), and then saves this value in myInt. myInt is again 1.
So in this case the method AddOne() was called two times, but because the method was not implemented in a thread-safe way the value of myInt is not 2, as expected, but 1 because the second thread read the variable myInt before the first thread finished updating it.
Creating thread-safe methods is very hard in non-trivial cases, and there are quite a few techniques. In Java you can mark a method as synchronized, which means that only one thread can execute that method at a given time; the other threads wait in line. This makes a method thread-safe, but if there is a lot of work to be done in the method, it wastes a lot of time. Another technique is to mark only a small part of a method as synchronized, by creating a lock or semaphore and locking this small part (usually called the critical section). There are even some methods that are implemented as lock-free thread-safe, meaning they are built in such a way that multiple threads can race through them at the same time without ever causing problems; this can be the case when a method only executes one atomic call. Atomic calls are calls that can't be interrupted and can only be done by one thread at a time.
A real-world example for the layman:
Suppose you have a bank account with internet and mobile banking, and your account has only $10.
You perform a balance transfer to another account using mobile banking, and in the meantime you do online shopping using the same bank account.
If this bank account is not thread-safe, the bank might allow both transactions at the same time and let you spend the same $10 twice.
Threadsafe means that an object's state cannot be corrupted when multiple threads try to access the object simultaneously.
You can get more explanation from the book "Java Concurrency in Practice":
A class is thread‐safe if it behaves correctly when accessed from multiple threads, regardless of the scheduling or interleaving of the execution of those threads by the runtime environment, and with no additional synchronization or other coordination on the part of the calling code.
A module is thread-safe if it guarantees it can maintain its invariants in the face of multi-threaded and concurrence use.
Here, a module can be a data-structure, class, object, method/procedure or function. Basically scoped piece of code and related data.
The guarantee can potentially be limited to certain environments such as a specific CPU architecture, but must hold for those environments. If there is no explicit delimitation of environments, then it is usually taken to imply that it holds for all environments in which the code can be compiled and executed.
Thread-unsafe modules may function correctly under multi-threaded and concurrent use, but this is often down to luck and coincidence rather than careful design. Even if a module does not break for you in one environment, it may break when moved to other environments.
Multi-threading bugs are often hard to debug. Some of them only happen occasionally, while others manifest aggressively - this too, can be environment specific. They can manifest as subtly wrong results, or deadlocks. They can mess up data-structures in unpredictable ways, and cause other seemingly impossible bugs to appear in other remote parts of the code. It can be very application specific, so it is hard to give a general description.
Thread safety: a thread-safe program protects its data from memory consistency errors. In a highly multi-threaded program, a thread-safe program does not cause any side effects from multiple read/write operations by multiple threads on the same objects. Different threads can share and modify object data without consistency errors.
You can achieve thread safety by using advanced concurrency API. This documentation page provides good programming constructs to achieve thread safety.
Lock Objects support locking idioms that simplify many concurrent applications.
Executors define a high-level API for launching and managing threads. Executor implementations provided by java.util.concurrent provide thread pool management suitable for large-scale applications.
Concurrent Collections make it easier to manage large collections of data, and can greatly reduce the need for synchronization.
Atomic Variables have features that minimize synchronization and help avoid memory consistency errors.
ThreadLocalRandom (in JDK 7) provides efficient generation of pseudorandom numbers from multiple threads.
Refer to java.util.concurrent and java.util.concurrent.atomic packages too for other programming constructs.
Producing Thread-safe code is all about managing access to shared mutable states. When mutable states are published or shared between threads, they need to be synchronized to avoid bugs like race conditions and memory consistency errors.
I recently wrote a blog about thread safety. You can read it for more information.
You are clearly working in a WinForms environment. WinForms controls exhibit thread affinity, which means that the thread in which they are created is the only thread that can be used to access and update them. That is why you will find examples on MSDN and elsewhere demonstrating how to marshall the call back onto the main thread.
Normal WinForms practice is to have a single thread that is dedicated to all your UI work.
I find the concept of reentrancy (http://en.wikipedia.org/wiki/Reentrancy_%28computing%29) to be what I usually think of as unsafe threading: when a method has, and relies on, a side effect such as a global variable.
For example I have seen code that formatted floating point numbers to string, if two of these are run in different threads the global value of decimalSeparator can be permanently changed to '.'
//built-in global set to a locale-specific value (here a comma)
decimalSeparator = ','

function FormatDot(value : real):
    //save the current decimal character
    temp = decimalSeparator
    //set the global value to '.'
    decimalSeparator = '.'
    //format() uses decimalSeparator behind the scenes
    result = format(value)
    //put the original value back
    decimalSeparator = temp
To understand thread safety, read the sections below:
4.3.1. Example: Vehicle Tracker Using Delegation
As a more substantial example of delegation, let's construct a version of the vehicle tracker that delegates to a thread-safe class. We store the locations in a Map, so we start with a thread-safe Map implementation, ConcurrentHashMap. We also store the location using an immutable Point class instead of MutablePoint, shown in Listing 4.6.
Listing 4.6. Immutable Point class used by DelegatingVehicleTracker.
class Point{
public final int x, y;
public Point() {
this.x=0; this.y=0;
}
public Point(int x, int y) {
this.x = x;
this.y = y;
}
}
Point is thread-safe because it is immutable. Immutable values can be freely shared and published, so we no longer need to copy the locations when returning them.
DelegatingVehicleTracker in Listing 4.7 does not use any explicit synchronization; all access to state is managed by ConcurrentHashMap, and all the keys and values of the Map are immutable.
Listing 4.7. Delegating Thread Safety to a ConcurrentHashMap.
public class DelegatingVehicleTracker {
private final ConcurrentMap<String, Point> locations;
private final Map<String, Point> unmodifiableMap;
public DelegatingVehicleTracker(Map<String, Point> points) {
this.locations = new ConcurrentHashMap<String, Point>(points);
this.unmodifiableMap = Collections.unmodifiableMap(locations);
}
public Map<String, Point> getLocations(){
return this.unmodifiableMap; // User cannot update point(x,y) as Point is immutable
}
public Point getLocation(String id) {
return locations.get(id);
}
public void setLocation(String id, int x, int y) {
if(locations.replace(id, new Point(x, y)) == null) {
throw new IllegalArgumentException("invalid vehicle name: " + id);
}
}
}
If we had used the original MutablePoint class instead of Point, we would be breaking encapsulation by letting getLocations publish a reference to mutable state that is not thread-safe. Notice that we've changed the behavior of the vehicle tracker class slightly; while the monitor version returned a snapshot of the locations, the delegating version returns an unmodifiable but “live” view of the vehicle locations. This means that if thread A calls getLocations and thread B later modifies the location of some of the points, those changes are reflected in the Map returned to thread A.
4.3.2. Independent State Variables
We can also delegate thread safety to more than one underlying state variable as long as those underlying state variables are independent, meaning that the composite class does not impose any invariants involving the multiple state variables.
VisualComponent in Listing 4.9 is a graphical component that allows clients to register listeners for mouse and keystroke events. It maintains a list of registered listeners of each type, so that when an event occurs the appropriate listeners can be invoked. But there is no relationship between the set of mouse listeners and key listeners; the two are independent, and therefore VisualComponent can delegate its thread safety obligations to two underlying thread-safe lists.
Listing 4.9. Delegating Thread Safety to Multiple Underlying State Variables.
public class VisualComponent {
private final List<KeyListener> keyListeners
= new CopyOnWriteArrayList<KeyListener>();
private final List<MouseListener> mouseListeners
= new CopyOnWriteArrayList<MouseListener>();
public void addKeyListener(KeyListener listener) {
keyListeners.add(listener);
}
public void addMouseListener(MouseListener listener) {
mouseListeners.add(listener);
}
public void removeKeyListener(KeyListener listener) {
keyListeners.remove(listener);
}
public void removeMouseListener(MouseListener listener) {
mouseListeners.remove(listener);
}
}
VisualComponent uses a CopyOnWriteArrayList to store each listener list; this is a thread-safe List implementation particularly suited for managing listener lists (see Section 5.2.3). Each List is thread-safe, and because there are no constraints coupling the state of one to the state of the other, VisualComponent can delegate its thread safety responsibilities to the underlying mouseListeners and keyListeners objects.
4.3.3. When Delegation Fails
Most composite classes are not as simple as VisualComponent: they have invariants that relate their component state variables. NumberRange in Listing 4.10 uses two AtomicIntegers to manage its state, but imposes an additional constraint—that the first number be less than or equal to the second.
Listing 4.10. Number Range Class that does Not Sufficiently Protect Its Invariants. Don't do this.
public class NumberRange {
// INVARIANT: lower <= upper
private final AtomicInteger lower = new AtomicInteger(0);
private final AtomicInteger upper = new AtomicInteger(0);
public void setLower(int i) {
//Warning - unsafe check-then-act
if(i > upper.get()) {
throw new IllegalArgumentException(
"Can't set lower to " + i + " > upper ");
}
lower.set(i);
}
public void setUpper(int i) {
//Warning - unsafe check-then-act
if(i < lower.get()) {
throw new IllegalArgumentException(
"Can't set upper to " + i + " < lower ");
}
upper.set(i);
}
public boolean isInRange(int i){
return (i >= lower.get() && i <= upper.get());
}
}
NumberRange is not thread-safe; it does not preserve the invariant that constrains lower and upper. The setLower and setUpper methods attempt to respect this invariant, but do so poorly. Both setLower and setUpper are check-then-act sequences, but they do not use sufficient locking to make them atomic. If the number range holds (0, 10), and one thread calls setLower(5) while another thread calls setUpper(4), with some unlucky timing both will pass the checks in the setters and both modifications will be applied. The result is that the range now holds (5, 4)—an invalid state. So while the underlying AtomicIntegers are thread-safe, the composite class is not. Because the underlying state variables lower and upper are not independent, NumberRange cannot simply delegate thread safety to its thread-safe state variables.
NumberRange could be made thread-safe by using locking to maintain its invariants, such as guarding lower and upper with a common lock. It must also avoid publishing lower and upper to prevent clients from subverting its invariants.
If a class has compound actions, as NumberRange does, delegation alone is again not a suitable approach for thread safety. In these cases, the class must provide its own locking to ensure that compound actions are atomic, unless the entire compound action can also be delegated to the underlying state variables.
If a class is composed of multiple independent thread-safe state variables and has no operations that have any invalid state transitions, then it can delegate thread safety to the underlying state variables.

Parallel, but slower

I'm using the Monte Carlo method to calculate pi, as a basic experiment with parallel programming and OpenMP.
The problem is that when I use 1 thread with x iterations, it always runs faster than n threads with the same x iterations. Can anyone tell me why?
For example the code runs like this: "a.out 1 1000000", where 1 is the thread count and 1000000 the iterations.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
int main (int argc, char *argv[]) {
    double arrow_area_circle, pi;
    float xp, yp;
    int i, iterations;
    double pitg = atan(1.0)*4.0; //for pi error
    cout << "Number of processors: " << omp_get_num_procs() << endl;
    //Number of iterations
    iterations = atoi(argv[2]);
    arrow_area_circle = 0.0;
    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        srandom(omp_get_thread_num());
        #pragma omp for private(xp, yp) reduction(+:arrow_area_circle) //*,/,-,+
        for (i = 0; i < iterations; i++) {
            xp = rand()/(float)RAND_MAX;
            yp = rand()/(float)RAND_MAX;
            if (pow(xp,2.0)+pow(yp,2.0) <= 1) arrow_area_circle++;
        }
    }
    pi = 4*arrow_area_circle / iterations;
    cout << setprecision(18) << "PI = " << pi << endl << endl;
    cout << setprecision(18) << "Error = " << pitg-pi << endl << endl;
    return 0;
}
A CPU-intensive task like this will be slower if you do the work in more threads than there are CPUs in the system. If you are running it on a single-CPU system, you will definitely see a slowdown with more than one thread. This is due to the OS having to switch between the various threads - this is pure overhead. You should ideally have the same number of threads as cores for a task like this.
Another issue is that arrow_area_circle is shared between threads. If you have a thread running on each core, incrementing arrow_area_circle will invalidate the copy in the caches of the other cores, causing them to have to refetch. arrow_area_circle++ which should take a couple cycles will take dozens or hundreds of cycles. Try creating an arrow_area_circle per thread and combining them at the end.
EDIT: Joe Duffy just posted a blog entry on the cost of sharing data between threads.
It looks like you are using some kind of auto-parallelizing compiler. I am going to assume you have more than 1 core/CPU in your system (as that would be too obvious -- and no, hyperthreading on a Pentium 4 doesn't count as having two cores, regardless of what Intel's marketing would have you believe). There are two problems that I see. The first is trivial and probably not your problem:
If the variable arrow_area_circle is shared between your processes, then the act of executing arrow_area_circle++ will cause an interlocking instruction to be used to synchronize the value in a way that is atomically sound. You should increment a "private" variable, then add that value just once at the end to the common arrow_area_circle variable instead of incrementing arrow_area_circle in your inner loop.
The rand() function, to operate soundly, must internally execute with a critical section. The reason is that its internal state/seed is a static shared variable; if it were not, it would be possible for two different processes to get the same output from rand() with unusually high probability, just because they were calling rand() at nearly the same time. That means rand() runs slowly, and especially so as more threads/processes are calling it at the same time. Unlike the arrow_area_circle variable (which just needs an atomic increment), a true critical section has to be invoked by rand() because its state update is more complicated. To work around this, you should obtain the source code for your own random number generator and use it with a private seed or state. The source code for the standard rand() implementation in most compilers is widely available.
I'd also like to point out that you are using the pow() function to compute the same thing as x * x. The latter is about 300 times faster than the former, though this point is irrelevant to the question you are asking. :)
Context switching.
rand() is a blocking function. That means it has a critical section inside.
Just to stress that you have to be really careful using random numbers in a parallel setting. In fact you should use something like SPRNG.
Whatever you do, make sure that each thread isn't using the same random numbers.
