When I load a shared library dynamically, for example with dlopen on linux, do I have to worry about the visibility of the loaded library between processors, or will it be automatically fenced/ensured safe?
For example, say I have this function in the loaded library:
char const * get_string()
{ return "literal"; }
In the main program using such a string-literal pointer is safe between multiple threads as they are all guaranteed to see its initial value. However, I'm wondering how the rules of "initial values" really apply to a loaded library (as the standard doesn't deal much with it.
Say that I load the library, then immediately call the get_string function. I pass the pointer to another thread via a non-memory sequenced atomic (relaxed in C++11 parlance). Can the other thread use this pointer safely without having to issue any load fence or other syncronization instruction?
My assumption is that it is safe. Perhaps because the new library will be loaded into new pages the other core cannot have them loaded yet, and thus cannot have old visibility on them?
I would like some kind of authorative reference as part of the answer if possible. Or a technical description of how it is made thread-safe by default. Or of course a refutation if it isn't thread-safe on its own.
Your question is : will dlopen() load all my lib code properly before returning ? Yes it will. Otherwise you'd have the problem with only a single thread. It would be very difficult to handle if you had to sleep before dlopen completes asynchronously. It will also perform various checks and initialize what needs to be before you have a chance to get the function pointer you are looking for. That means that if you get that pointer, everything is here, you can use directly in any thread.
Now of course, you need to pass that pointer with the usual thread safety, but I assume you know how.
Please be aware that static initialization and modules don't play well together (see all the other questions on SO about that subject).
Your comment on cores is strange. Cores don't load memory. They prefetch it in their cache, but that's not a problem, just a bit slow.
I'll expand on what Basile said. I followed up with glibc and found out dlopen there does in deed use mmap. All guarantees of memory visibility are assumed from the mmap system call, dlopen itself doesn't make any additional guarantees.
Users of mmap generally assume that it will map memory correctly across all processors at the point of its return such that visibility is not a concern. This does not appear to be an explicit guarantee, but the OS would probably be unusable without such a guarantee. There is also no known system where this doesn't work as expected.
Related
I am programming in C on Linux x86-64. I'm using a library which creates a number of threads via a raw clone system call rather than using pthread_create. These threads run low-level code internal to the library.
I would like to hook one of these threads to introspect its behavior. Hooking the code is easy enough, but I've discovered that I can't call almost anything in libc because the thread state is not configured. pthread_create normally inserts a bunch of data into the thread-local storage area indexed by fs:. Some of that data, for example, is essential to libc's function, such as the function pointer encryption key (pointer_guard) and locale pointer.
So my question is: can I upgrade a clone'd thread to a full pthread via any mechanism? If not, is there any way that I can call C functions from a clone'd thread (such as printf, toupper, etc. which require libc's thread-local data)?
Some of that data, for example, is essential to libc's function, such as the function pointer encryption key (pointer_guard) and locale pointer.
Correct. Don't forget about errno, which is also in there.
can I upgrade a clone'd thread to a full pthread via any mechanism?
No.
is there any way that I can call C functions from a clone'd thread
No.
If you have sources to the library, it should be relatively easy to replace direct clone calls with pthread_create.
If you do not, but the library is available in archive form, you may be able to use obcopy --rename-symbol to redirect its clone calls to a replacement (e.g. my_clone), which can then create a new thread via pthread_create and invoke the target function in that thread. Whether this will succeed greatly depends on how much the library cares about details of the clone.
It's also probably not worth the trouble.
A better alternative may be to implement the introspection without calling into libc. Since your printf and toupper probably only need to deal with ASCII and C locale, it's not hard to implement limited versions of these functions and use direct system calls to write the output.
I want to find a reliable way (other than reading the kernel source code) to check if a given operation (or system call) is atomic (in the sense that other process can only see the state before or after that operation, but not something in between) on Linux. The goal of this is to avoid using unnecessary locks for some operations if the kernel already does that for me.
So far I can only find resources like this about this topic, which is by no means authoritative or exhaustive. Also, the Linux man pages contains little information about this. For example, for most functions mentioned in the above link, I don't find anything about their atomicity in the man pages.
Could anyone tell me if there is a standard or official documentation which provides this information? Any help would be much appreciated.
I think POSIX thread-safe functions are a good starting point. Thread-safe functions are functions that will give the same results when called from different threads. This is not at all the same as being atomic, but at least it gives a hint about which functions certainly are not atomic.
POSIX.1-2001 and POSIX.1-2008 require that all functions specified in the standard shall be thread-safe, except for a specific set of functions (most of which are implemented in the standard library and not in the kernel).
As an example of a function that is thread-safe but not atomic, consider fwrite(). fwrite() will write to a per-process buffer under pthread locks, so it is thread-safe. However, the buffer may be flushed in separate write() chunks, so other processes don't see it as an atomic write.
perlthrtut excerpt:
Note that a shared variable guarantees that if two or more threads try
to modify it at the same time, the internal state of the variable will
not become corrupted. However, there are no guarantees beyond this, as
explained in the next section.
Working on Linux supporting multiprocessor kernel threads.
Is there a guarantee that all threads will see the updated shared variable value ?
Consulting the perlthrtut doc as stated above there is no such guarantee.
Now the question: What can be done programmatically to guarantee that?
You ask
Is there a guarantee that all threads will see the updated shared variable value ?
Yes. :shared is that guarantee. The value will be safely and consistently and freshly updated.
The problem is simply that, without other synchronization, you don't know the order of these updates.
Consulting the perlthrtut doc as stated above there is no such guarantee.
You didn't read far enough. :)
The very next section in perlthrtut explains the kind of pitfalls you do face with perl threads: data races, which is to say, application logic races concerning shared data. Again, the shared data will be consistent and fresh and immune to corruption from (more-or-less) atomic perl opcodes. However, the high-level perl operations you perform on that shared data are not guaranteed to be atomic. $shared_var++, for instance, might be more than one atomic operation.
(If I may hazard a guess, you are perhaps thinking too much about other languages' lower level threading interfaces with their cache inconsistencies, torn words, reordered instructions, and lions and tigers and bears. Perl's model takes care of those low-level concerns for you.)
Using :shared on a variable causes all threads to reference it in the same physical memory address, so it doesn't matter which processor/core/hyper-thread they happen to be executing in. The perlthrtut talk of guarantees is in reference to race conditions, and in short, that you need to take into account that shared variables can be modified by any thread at any time. If this is a problem you'll need to make use of synchronization functions (e.g. lock() and cond_wait()) to control access.
You seem to be confused as to what :shared does. It makes it so a variable is shared by all threads.
A variable is indeed guaranteed to have the value it has, no matter which thread accesses it. It's a tautology, so nothing can be done to programmatically guarantee that.
It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think its not, since storing global variables in registers may change saved values
for other threads.
And how about using registers to store local variables defined within a function?
I think it is ok,since no other thread will be able to get these variables.
Please correct me if im wrong.
Thank you!
Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides, may not be what actually gets executed as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.
'Safe' is not really the right word to use. Many higher level languages (eg. C) do not have a threading model and so the language specification says nothing about mutli-threaded interactions.
If you are not using any kind of locking primitives then you have no guarantees what so ever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again the compiler still has no way of knowing if it has to read the variable from memory again, or can use the earlier value it stored in a register.
In C/C++ declaring a variable as volatile will force the compiler to always reload the variable from memory and solve this particular instance.
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics which can be used to ensure certain operations are threadsafe. Locking primitives are typically built on these low level operations.
In a multithreaded program, you have one of two cases: if it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work since the threads operate in the same virtual memory space) - all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context switch routines that the OS uses, and the register set is considered part of a threads (or processes) context. If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...
If you don't pass the CLONE_VM flag to clone(), then the new process shares memory with the original. Can this be used to make two distinct applications (two main()'s) run in the same process? Ideally this would be as simple as calling clone() with CLONE_VM and then calling exec(), but I realize it's probably more involved. At the very least, I assume that the spawned application would need to be compiled to be relocatable (-fPIC). I realize I could always recode applications to be libraries, and create a master app spawning the other 'apps' as threads, but I'm curious of this approach is possible.
Well, yes, that's what threads are, minus the "two distinct main()/application" part.
In fact, the reason clone(2) is there is to implement threads.
Clone(2) more-or-less requires you to declare a separate stack (if you don't it makes one), because without it the child won't be able to return from its current call level without destroying the parent's stack.
Once you start setting up stacks for each process then you might as well just use the posix thread library.
As for the part where two different applications are loaded, calling execve(2) would most likely not be the way to do it. These days the kernel doesn't precisely run programs anyway. It's more typical that the image is set up to run the Elf dynamic loader, and that's all that the kernel really runs. The loader then mmaps(2)s the process and its libraries into the address space. Certainly that could be done to get the "two distinct applications", and the thread scheduler would be happy to run them as two processes via clone(2).
Why not compile the applications into the same executable and just start them as threads in main?
What is the problem running them as separate tasks anyway? You can still share memory if you really want to.
Short answer: it's impossible.
Well, it's possible if you're willing to write your own custom ELF loader and simulate a lot of things that the kernel normally does for a process.
It's better to compile each of the apps into a library that exposes exactly one function, main (renamed to something else). Then the main stub program should link with the two libraries and call each one's exported function.