Performance loss in C++0x and static local variables?

Performance loss in C++0x and static local variables? - multithreading

What is a realistic performance loss due to the fact that in C++0x all other threads shall
wait in a case like this:
string& program_name() {
static string instance = "Parallel Pi";
return instance;
}
Lets assume the optimal scenario: The programmer was very careful that even with 100 threads only the main thread calls the function program_name, all the other 99 worker threads are busy doing useful stuff, which does not involve calling this "critical" function.
I quote from the new C++0x-Std § 6.7.(4) stmt.decl
...such an object is initialized the first time control passes through its declaration... If control enters the declaration concurrently while the object is being initialized, the concurrent execution shall wait for completion of the initialization...
What is a realistic overhead that a real-world compiler is needed to impose on me to ensure that that static initialization is done as required by the standard.
Is a lock/mutex required? I assume they are expensive, even when not really needed?
If they are expensive, will this be done by less expensive mechanisms?
edit: added string...

If control enters the declaration concurrently while the object is being initialized, the
concurrent execution shall wait for completion of the initialization...
I think this is reasonable and very normal thing to do in concurrent programming. Anyway, this statement doesn't say that all other threads must wait for this initialization. They have to wait in case they need to access the initializing object.
Is a lock/mutex required? I assume they are expensive, even when not really needed?
Could be. Mutex / lock aren't that expensive actually, they're expensive only when the locked code fragment needs to be accessed frequently by many or even all threads.
If they are expensive, will this be done by less expensive mechanisms?
There are also another non-lock based solutions AFAIK.

If you were really concerned about the price of the lock, you could simply call the function before you started your worker threads, which would initialise the static. If you call it after the threads start, either you or the compiler has got to arrange for locking of some sort, so there is no real extra overhead.

Related

Is synchronization for variable change cheaper then for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I never saw those things happen I assume that there is some system level synchronization when writing variables. I assume, that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a topic and not totally fictious from me.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?

Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads processes, OS, etc. It is not a matter of synchronization, just required to insure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 et locked when at 1.
The basic method to do that is
got_the_lock=0
while(!got_the_lock)
fetch lock value from memory
set lock value in memory to 1
got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one need atomic memory access. An atomic access is typically a read-modify-write cycle to a data in memory that cannot interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operation and it is realized thanks tp specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be a regular access. Atomic access is mostly use to synchronize threads to insure that only one thread is in a critical section.
So why are atomic access expensive ? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction order may be different from instruction program order, provided the semantic of the program is respected. This is heavily exploited to improve performances : compiler reorder instructions, processor execute them out-of-order, write-back caches write data in memory in any order, and memory write buffer do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffer that requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.

Synchronization always has a cost involved. And the cost increases with contention due to threads waking up, fighting for lock and only one gets it, and the rest go to sleep resulting in lot of context switches.
However, such contention can be kept at a minimum by using synchronization at a much granular level as in a CAS (compare and swap) operation by CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing a IO and therefore run high chance of creating a contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason the non-blocking algorithms like Treiber Stack Michael Scott exist. They do a their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.

isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, None of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

threading synchronization at high speed

I have a threading question and what I'd qualify as a modest threading background.
Suppose I have the following (oversimplified) design and behavior:
Object ObjectA - has a reference to object ObjectB and a method MethodA().
Object ObjectB - has a reference to ObjectA, an array of elements ArrayB and a method MethodB().
ObjectA is responsible for instantiating ObjectB. ObjectB.ObjectA will point to ObjectB's instantiator.
Now, whenever some conditions are met, a new element is added in ObjectB.ArrayB and a new thread is started for this element, say ThreadB_x, where x goes from 1 to ObjectB.ArrayB.Length. Each such thread calls ObjectB.MethodB() to pass some data in, which in turn calls ObjectB.ObjectA.MethodA() for data processing.
So multiple threads call the same method ObjectB.MethodB(), and it's very likely that they do so at the very same time. There's a lot of code in MethodB that creates and initializes new objects, so I don't think there are problems there. But then this method calls ObjectB.ObjectA.MethodA(), and I don't have the slightest idea of what's going on in there. Based on the results I get, nothing wrong, apparently, but I'd like to be sure of that.
For now, I enclosed the call to ObjectB.ObjectA.MethodA() in a lock statement inside ObjectB.MethodB(), so I'm thinking this will ensure there are no clashes to the call of MethodA(), though I'm not 100% sure of that. But what happens if each ThreadB_x calls ObjectB.MethodB() a lot of times and very very fast? Will I have a queue of calls waiting for ObjectB.ObjectA.MethodA() to finish?
Thanks.

Your question is very difficult to answer because of the lack of information. It depends on the average time spent in methodA, how many times this method is called per thread, how many cores are allocated to the process, the OS scheduling policy, to name a few parameters.
All things being equals, when the number of threads grows toward infinity, you can easily imagine that the probability for two threads requesting access to a shared resource simultaneously will tend to one. This probability will grow faster in proportion to the amount of time spent on the shared resource. That intuition is probably the reason of your question.
The main idea of multithreading is to parallelize code which can be effectively computed concurrently, and avoid contention as much as possible. In your setup, if methodA is not pure, ie. if it may change the state of the process - or in C++ parlance, if it cannot be made const, then it is a source of contention (recall that a function can only be pure if it uses pure functions or constants in its body).
One way of dealing with a shared resource is to protect it with a mutex, as you've done in your code. Another way is to try to turn its use into an async service, with one thread handling it, and others requesting that thread for computation. In effect, you will end up with an explicit queue of requests, but threads doing these requests will be free to work on something else in the mean time. The goal is always to maximize computation time, as opposed to thread management time, which happens each time a thread gets rescheduled.
Of course, it is not always possible to do so, eg. when the result of methodA belongs to a strongly ordered chain of computation.

Are "benaphores" worth implementing on modern OS's?

Back in my days as a BeOS programmer, I read this article by Benoit Schillings, describing how to create a "benaphore": a method of using atomic variable to enforce a critical section that avoids the need acquire/release a mutex in the common (no-contention) case.
I thought that was rather clever, and it seems like you could do the same trick on any platform that supports atomic-increment/decrement.
On the other hand, this looks like something that could just as easily be included in the standard mutex implementation itself... in which case implementing this logic in my program would be redundant and wouldn't provide any benefit.
Does anyone know if modern locking APIs (e.g. pthread_mutex_lock()/pthread_mutex_unlock()) use this trick internally? And if not, why not?

What your article describes is in common use today. Most often it's called "Critical Section", and it consists of an interlocked variable, a bunch of flags and an internal synchronization object (Mutex, if I remember correctly). Generally, in the scenarios with little contention, the Critical Section executes entirely in user mode, without involving the kernel synchronization object. This guarantees fast execution. When the contention is high, the kernel object is used for waiting, which releases the time slice conductive for faster turnaround.
Generally, there is very little sense in implementing synchronization primitives in this day and age. Operating systems come with a big variety of such objects, and they are optimized and tested in significantly wider range of scenarios than a single programmer can imagine. It literally takes years to invent, implement and test a good synchronization mechanism. That's not to say that there is no value in trying :)

Java's AbstractQueuedSynchronizer (and its sibling AbstractQueuedLongSynchronizer) works similarly, or at least it could be implemented similarly. These types form the basis for several concurrency primitives in the Java library, such as ReentrantLock and FutureTask.
It works by way of using an atomic integer to represent state. A lock may define the value 0 as unlocked, and 1 as locked. Any thread wishing to acquire the lock attempts to change the lock state from 0 to 1 via an atomic compare-and-set operation; if the attempt fails, the current state is not 0, which means that the lock is owned by some other thread.
AbstractQueuedSynchronizer also facilitates waiting on locks and notification of conditions by maintaining CLH queues, which are lock-free linked lists representing the line of threads waiting either to acquire the lock or to receive notification via a condition. Such notification moves one or all of the threads waiting on the condition to the head of the queue of those waiting to acquire the related lock.
Most of this machinery can be implemented in terms of an atomic integer representing the state as well as a couple of atomic pointers for each waiting queue. The actual scheduling of which threads will contend to inspect and change the state variable (via, say, AbstractQueuedSynchronizer#tryAcquire(int)) is outside the scope of such a library and falls to the host system's scheduler.

What are multi-threading DOs and DONTs? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am applying my new found knowledge of threading everywhere and getting lots of surprises
Example:
I used threads to add numbers in an
array. And outcome was different every
time. The problem was that all of my
threads were updating the same
variable and were not synchronized.
What are some known thread issues?
What care should be taken while using
threads?
What are good multithreading resources.
Please provide examples.
sidenote:(I renamed my program thread_add.java to thread_random_number_generator.java:-)

In a multithreading environment you have to take care of synchronization so two threads doesn't clobber the state by simultaneously performing modifications. Otherwise you can have race conditions in your code (for an example see the infamous Therac-25 accident.) You also have to schedule the threads to perform various tasks. You then have to make sure that your synchronization and scheduling doesn't cause a deadlock where multiple threads will wait for each other indefinitely.
Synchronization
Something as simple as increasing a counter requires synchronization:
counter += 1;
Assume this sequence of events:
counter is initialized to 0
thread A retrieves counter from memory to cpu (0)
context switch
thread B retrieves counter from memory to cpu (0)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (1)
context switch
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
At this point the counter is 1, but both threads did try to increase it. Access to the counter has to be synchronized by some kind of locking mechanism:
lock (myLock) {
counter += 1;
}
Only one thread is allowed to execute the code inside the locked block. Two threads executing this code might result in this sequence of events:
counter is initialized to 0
thread A acquires myLock
context switch
thread B tries to acquire myLock but has to wait
context switch
thread A retrieves counter from memory to cpu (0)
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
thread A releases myLock
context switch
thread B acquires myLock
thread B retrieves counter from memory to cpu (1)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (2)
thread B releases myLock
At this point counter is 2.
Scheduling
Scheduling is another form of synchronization and you have to you use thread synchronization mechanisms like events, semaphores, message passing etc. to start and stop threads. Here is a simplified example in C#:
AutoResetEvent taskEvent = new AutoResetEvent(false);
Task task;
// Called by the main thread.
public void StartTask(Task task) {
this.task = task;
// Signal the worker thread to perform the task.
this.taskEvent.Set();
// Return and let the task execute on another thread.
}
// Called by the worker thread.
void ThreadProc() {
while (true) {
// Wait for the event to become signaled.
this.taskEvent.WaitOne();
// Perform the task.
}
}
You will notice that access to this.task probably isn't synchronized correctly, that the worker thread isn't able to return results back to the main thread, and that there is no way to signal the worker thread to terminate. All this can be corrected in a more elaborate example.
Deadlock
A common example of deadlock is when you have two locks and you are not careful how you acquire them. At one point you acquire lock1 before lock2:
public void f() {
lock (lock1) {
lock (lock2) {
// Do something
}
}
}
At another point you acquire lock2 before lock1:
public void g() {
lock (lock2) {
lock (lock1) {
// Do something else
}
}
}
Let's see how this might deadlock:
thread A calls f
thread A acquires lock1
context switch
thread B calls g
thread B acquires lock2
thread B tries to acquire lock1 but has to wait
context switch
thread A tries to acquire lock2 but has to wait
context switch
At this point thread A and B are waiting for each other and are deadlocked.

There are two kinds of people that do not use multi threading.
1) Those that do not understand the concept and have no clue how to program it.
2) Those that completely understand the concept and know how difficult it is to get it right.

I'd make a very blatant statement:
DON'T use shared memory.
DO use message passing.
As a general advice, try to limit the amount of shared state and prefer more event-driven architectures.

I can't give you examples besides pointing you at Google. Search for threading basics, thread synchronisation and you'll get more hits than you know.
The basic problem with threading is that threads don't know about each other - so they will happily tread on each others toes, like 2 people trying to get through 1 door, sometimes they will pass though one after the other, but sometimes they will both try to get through at the same time and will get stuck. This is difficult to reproduce, difficult to debug, and sometimes causes problems. If you have threads and see "random" failures, this is probably the problem.
So care needs to be taken with shared resources. If you and your friend want a coffee, but there's only 1 spoon you cannot both use it at the same time, one of you will have to wait for the other. The technique used to 'synchronise' this access to the shared spoon is locking. You make sure you get a lock on the shared resource before you use it, and let go of it afterwards. If someone else has the lock, you wait until they release it.
Next problem comes with those locks, sometimes you can have a program that is complex, so much that you get a lock, do something else then access another resource and try to get a lock for that - but some other thread has that 2nd resource, so you sit and wait... but if that 2nd thread is waiting for the lock you hold for the 1st resource.. it's going to sit and wait. And your app just sits there. This is called deadlock, 2 threads both waiting for each other.
Those 2 are the vast majority of thread issues. The answer is generally to lock for as short a time as possible, and only hold 1 lock at a time.

I notice you are writing in java and that nobody else mentioned books so Java Concurrency In Practice should be your multi-threaded bible.

-- What are some known thread issues? --
Race conditions.
Deadlocks.
Livelocks.
Thread starvation.
-- What care should be taken while using threads? --
Using multi-threading on a single-processor machine to process multiple tasks where each task takes approximately the same time isn’t always very effective.For example, you might decide to spawn ten threads within your program in order to process ten separate tasks. If each task takes approximately 1 minute to process, and you use ten threads to do this processing, you won’t have access to any of the task results for the whole 10 minutes. If instead you processed the same tasks using just a single thread, you would see the first result in 1 minute, the next result 1 minute later, and so on. If you can make use of each result without having to rely on all of the results being ready simultaneously, the single
thread might be the better way of implementing the program.
If you launch a large number of threads within a process, the overhead of thread housekeeping and context switching can become significant. The processor will spend considerable time in switching between threads, and many of the threads won’t be able to make progress. In addition, a single process with a large number of threads means that threads in other processes will be scheduled less frequently and won’t receive a reasonable share of processor time.
If multiple threads have to share many of the same resources, you’re unlikely to see performance benefits from multi-threading your application. Many developers see multi-threading as some sort of magic wand that gives automatic performance benefits. Unfortunately multi-threading isn’t the magic wand that it’s sometimes perceived to be. If you’re using multi-threading for performance reasons, you should measure your application’s performance very closely in several different situations, rather than just relying on some non-existent magic.
Coordinating thread access to common data can be a big performance killer. Achieving good performance with multiple threads isn’t easy when using a coarse locking plan, because this leads to low concurrency and threads waiting for access. Alternatively, a fine-grained locking strategy increases the complexity and can also slow down performance unless you perform some sophisticated tuning.
Using multiple threads to exploit a machine with multiple processors sounds like a good idea in theory, but in practice you need to be careful. To gain any significant performance benefits, you might need to get to grips with thread balancing.
-- Please provide examples. --
For example, imagine an application that receives incoming price information from
the network, aggregates and sorts that information, and then displays the results
on the screen for the end user.
With a dual-core machine, it makes sense to split the task into, say, three threads. The first thread deals with storing the incoming price information, the second thread processes the prices, and the final thread handles the display of the results.
After implementing this solution, suppose you find that the price processing is by far the longest stage, so you decide to rewrite that thread’s code to improve its performance by a factor of three. Unfortunately, this performance benefit in a single thread may not be reflected across your whole application. This is because the other two threads may not be able to keep pace with the improved thread. If the user interface thread is unable to keep up with the faster flow of processed information, the other threads now have to wait around for the new bottleneck in the system.
And yes, this example comes directly from my own experience :-)

DONT use global variables
DONT use many locks (at best none at all - though practically impossible)
DONT try to be a hero, implementing sophisticated difficult MT protocols
DO use simple paradigms. I.e share the processing of an array to n slices of the same size - where n should be equal to the number of processors
DO test your code on different machines (using one, two, many processors)
DO use atomic operations (such as InterlockedIncrement() and the like)

YAGNI
The most important thing to remember is: do you really need multithreading?

I agree with pretty much all the answers so far.
A good coding strategy is to minimise or eliminate the amount of data that is shared between threads as much as humanly possible. You can do this by:
Using thread-static variables (although don't go overboard on this, it will eat more memory per thread, depending on your O/S).
Packaging up all state used by each thread into a class, then guaranteeing that each thread gets exactly one state class instance to itself. Think of this as "roll your own thread-static", but with more control over the process.
Marshalling data by value between threads instead of sharing the same data. Either make your data transfer classes immutable, or guarantee that all cross-thread calls are synchronous, or both.
Try not to have multiple threads competing for the exact same I/O "resource", whether it's a disk file, a database table, a web service call, or whatever. This will cause contention as multiple threads fight over the same resource.
Here's an extremely contrived OTT example. In a real app you would cap the number of threads to reduce scheduling overhead:
All UI - one thread.
Background calcs - one thread.
Logging errors to a disk file - one thread.
Calling a web service - one thread per unique physical host.
Querying the database - one thread per independent group of tables that need updating.
Rather than guessing how to do divvy up the tasks, profile your app and isolate those bits that are (a) very slow, and (b) could be done asynchronously. Those are good candidates for a separate thread.
And here's what you should avoid:
Calcs, database hits, service calls, etc - all in one thread, but spun up multiple times "to improve performance".

Don't start new threads unless you really need to. Starting threads is not cheap and for short running tasks starting the thread may actually take more time than executing the task itself. If you're on .NET take a look at the built in thread pool, which is useful in a lot of (but not all) cases. By reusing the threads the cost of starting threads is reduced.
EDIT: A few notes on creating threads vs. using thread pool (.NET specific)
Generally try to use the thread pool. Exceptions:
Long running CPU bound tasks and blocking tasks are not ideal run on the thread pool cause they will force the pool to create additional threads.
All thread pool threads are background threads, so if you need your thread to be foreground, you have to start it yourself.
If you need a thread with different priority.
If your thread needs more (or less) than the standard 1 MB stack space.
If you need to be able to control the life time of the thread.
If you need different behavior for creating threads than that offered by the thread pool (e.g. the pool will throttle creating of new threads, which may or may not be what you want).
There are probably more exceptions and I am not claiming that this is the definitive answer. It is just what I could think of atm.

I am applying my new found knowledge of threading everywhere
[Emphasis added]
DO remember that a little knowledge is dangerous. Knowing the threading API of your platform is the easy bit. Knowing why and when you need to use synchronisation is the hard part. Reading up on "deadlocks", "race-conditions", "priority inversion" will start you in understanding why.
The details of when to use synchronisation are both simple (shared data needs synchronisation) and complex (atomic data types used in the right way don't need synchronisation, which data is really shared): a lifetime of learning and very solution specific.

An important thing to take care of (with multiple cores and CPUs) is cache coherency.

I am surprised that no one has pointed out Herb Sutter's Effective Concurrency columns yet. In my opinion, this is a must read if you want to go anywhere near threads.

a) Always make only 1 thread responsible for a resource's lifetime. That way thread A won't delete a resource thread B needs - if B has ownership of the resource
b) Expect the unexpected

DO think about how you will test your code and set aside plenty of time for this. Unit tests become more complicated. You may not be able to manually test your code - at least not reliably.
DO think about thread lifetime and how threads will exit. Don't kill threads. Provide a mechanism so that they exit gracefully.
DO add some kind of debug logging to your code - so that you can see that your threads are behaving correctly both in development and in production when things break down.
DO use a good library for handling threading rather than rolling your own solution (if you can). E.g. java.util.concurrency
DON'T assume a shared resource is thread safe.
DON'T DO IT. E.g. use an application container that can take care of threading issues for you. Use messaging.

In .Net one thing that surprised me when I started trying to get into multi-threading is that you cannot straightforwardly update the UI controls from any thread other than the thread that the UI controls were created on.
There is a way around this, which is to use the Control.Invoke method to update the control on the other thread, but it is not 100% obvious the first time around!

Don't be fooled into thinking you understand the difficulties of concurrency until you've split your head into a real project.
All the examples of deadlocks, livelocks, synchronization, etc, seem simple, and they are. But they will mislead you, because the "difficulty" in implementing concurrency that everyone is talking about is when it is used in a real project, where you don't control everything.

While your initial differences in sums of numbers are, as several respondents have pointed out, likely to be the result of lack of synchronisation, if you get deeper into the topic, be aware that, in general, you will not be able to reproduce exactly the numeric results you get on a serial program with those from a parallel version of the same program. Floating-point arithmetic is not strictly commutative, associative, or distributive; heck, it's not even closed.
And I'd beg to differ with what, I think, is the majority opinion here. If you are writing multi-threaded programs for a desktop with one or more multi-core CPUs, then you are working on a shared-memory computer and should tackle shared-memory programming. Java has all the features to do this.
Without knowing a lot more about the type of problem you are tackling, I'd hesitate to write that 'you should do this' or 'you should not do that'.

Should access to a shared resource be locked by a parent thread before spawning a child thread that accesses it?

If I have the following psuedocode:
sharedVariable = somevalue;
CreateThread(threadWhichUsesSharedVariable);
Is it theoretically possible for a multicore CPU to execute code in threadWhichUsesSharedVariable() which reads the value of sharedVariable before the parent thread writes to it? For full theoretical avoidance of even the remote possibility of a race condition, should the code look like this instead:
sharedVariableMutex.lock();
sharedVariable = somevalue;
sharedVariableMutex.unlock();
CreateThread(threadWhichUsesSharedVariable);
Basically I want to know if the spawning of a thread explicitly linearizes the CPU at that point, and is guaranteed to do so.
I know that the overhead of thread creation probably takes enough time that this would never matter in practice, but the perfectionist in me is afraid of the theoretical race condition. In extreme conditions, where some threads or cores might be severely lagged and others are running fast and efficiently, I can imagine that it might be remotely possible for the order of execution (or memory access) to be reversed unless there was a lock.

I would say that your pseudocode is safe on any correctly functioning
multiprocessor system. The C++ compiler cannot generate a call to
CreateThread() before sharedVariable has received a correct value
unless it can prove to itself that doing so is safe. You are guaranteed
that your single-threaded code executes equivalently to a completely
non-reordered linear execution path. Any system that "time warps" the
thread creation ahead of the variable assignment is seriously broken.
I don't think declaring sharedVariable as volatile does anything
useful in this case.

Given your example and if you were using Java then the answer would be "No". In Java it is not possible for the thread to spawn and read your value before the assignment operation is complete. In some other languages this might be a different story.
"Variables shared between multiple threads (e.g., instance variables of objects) have atomic assignment guaranteed by the Java language specification for all data types except longs and doubles... If a method consists solely of a single variable access or assignment, there is no need to make it synchronized for thread-safety, and every reason not to do so for performance."
reference
If your double or long is declared volatile, then you are also guaranteed that the assignment is an atomic operation.
Update:
Your example is going to work in C++ just like it works in Java. Theoretically there is no way that the thread spawning will begin or complete before the assignment, even with Out of Order Execution.
Note that your example is VERY specific and in any other case it is recommended that you ensure the shared resource is protected properly. The new C++ standard is coming out with a lot of atomic stuff, so you could declare your variable as atomic and the assignment operation will be visible to all threads without the need of locking. CAS (compare and set) is a your next best option.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string