I am using the libloading crate to load a dynamic library that I need to use in multiple threads.
let lib = Library::new("lib.dylib").unwrap();
Do I load the library on each thread, or is there a way to inject/share the library into a thread when the thread is started?
Shared libraries are loaded into a process. It shouldn't matter which thread you used to load the library; once the library has been loaded, the code is equally available to all threads. This is because POSIX threads share the same memory space, namely the memory space of the process.
When a CPU jumps to a memory address belonging to a loaded library, the memory is already there, mapped by the operating system into the process's address space. Threads have no influence whatsoever on the availability of that memory.
Keep in mind, though, that according to the library documentation you might want to avoid calling Library::new from multiple threads simultaneously.
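The same principle is easy to demonstrate in C with dlopen(): load once on the main thread, then hand the handle to worker threads. This is a minimal sketch, not libloading-specific code; "some_symbol" is a placeholder for whatever your library exports, and error handling is kept minimal:

    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>

    /* The handle is created once on the main thread; the mapped code is
       part of the process image, so any thread may resolve and call it. */
    static void *worker(void *handle) {
        void (*fn)(void) = (void (*)(void))dlsym(handle, "some_symbol");
        if (fn)
            fn();
        return NULL;
    }

    int main(void) {
        void *handle = dlopen("./lib.dylib", RTLD_NOW);  /* load exactly once */
        if (!handle) {
            fprintf(stderr, "%s\n", dlerror());
            return 1;
        }
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, handle);
        pthread_create(&t2, NULL, worker, handle);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        dlclose(handle);
        return 0;
    }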
Related
I have learned (from "What resources are shared between threads?") that, apart from the data segment and the code segment, threads also share the heap segment.
If I create a variable dynamically using malloc() or calloc() inside a thread, will that variable be accessible to all the other threads of the same process?
Theoretically, yes: if you know the memory address, heap-allocated variables are accessible from any thread within the same process.
{malloc, calloc, realloc, free, posix_memalign} of glibc-2.2+ are thread safe
http://linux.derkeiler.com/Newsgroups/comp.os.linux.development.apps/2005-07/0323.html
Original post
Generally, malloc/new/free/delete on multi-threaded systems are thread-safe, so this should be no problem - and allocating in one thread and deallocating in another is quite a common thing to do.
As threads are an implementation feature, this is certainly implementation-dependent though - e.g. some systems require you to link with a multi-threaded runtime library.
And this is also answered in the link you posted:
Threads differ from traditional multitasking operating system processes in that:
- processes are typically independent, while threads exist as subsets of a process
- processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources
- processes have separate address spaces, whereas threads share their address space
- processes interact only through system-provided inter-process communication mechanisms
- context switching between threads in the same process is typically faster than context switching between processes.
So, yes it is.
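A minimal POSIX-threads sketch of the point above: one thread allocates on the heap and publishes the pointer, and another thread (here, main) dereferences and even frees it. The pthread_join() supplies the synchronization that makes reading the pointer safe:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *producer(void *arg) {
        int **slot = arg;
        int *p = malloc(sizeof *p);  /* allocated on the process-wide heap */
        *p = 42;
        *slot = p;                   /* publish the address to the caller  */
        return NULL;
    }

    int main(void) {
        int *shared = NULL;
        pthread_t t;
        pthread_create(&t, NULL, producer, &shared);
        pthread_join(t, NULL);       /* synchronizes: 'shared' is now set  */
        printf("%d\n", *shared);     /* main reads the other thread's heap block */
        free(shared);                /* ... and frees it, in a different thread  */
        return 0;
    }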
1) I tried searching for how memory is allocated when we use threads in a program, but couldn't find the answer. "What and where are the stack and heap?" explains how the stack and heap work when a single program runs. But what happens in a program with threads?
2) Using an OpenMP parallel region creates threads, and the parallel code is executed concurrently in each thread. Does this occupy more memory than the same code with sequential execution?
In general, yes: [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OSes), on Windows in particular, even a single-threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependent on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
Since I'm too lazy to draw this myself, see the comparative illustration in PThreads Programming by Nichols et al. (1996).
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
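A small C/OpenMP sketch of that stack-vs-heap split (compile with e.g. gcc -fopenmp; the sizes here are arbitrary). Each thread's local array consumes that thread's own stack, while the single malloc()ed buffer lives on the heap shared by all of them; OMP_STACKSIZE is the standard environment variable controlling the per-thread stack size:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* One allocation on the heap, shared by every thread. */
        double *shared_buf = malloc(64 * sizeof *shared_buf);

        #pragma omp parallel
        {
            /* Each thread gets its own stack, so this array is private to
               the thread and consumes per-thread stack space. */
            double local[256];
            int id = omp_get_thread_num();
            local[0] = id;
            if (id < 64)
                shared_buf[id] = local[0];  /* all threads see the same heap */
        }

        printf("thread 0 wrote %g\n", shared_buf[0]);
        free(shared_buf);
        return 0;
    }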
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. It differs from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you get a heap-like chunk of the process address space that's reserved to each thread. TLS is optional; you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. OpenMP THREADPRIVATE does not have to be implemented on top of the operating system's TLS support; however, there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
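A minimal sketch of per-thread data in C. __thread is the GCC/Clang spelling (C11 standardizes it as _Thread_local; OpenMP would express the same idea with THREADPRIVATE). Each thread increments only its own copy, and main's copy stays untouched:

    #include <pthread.h>
    #include <stdio.h>

    /* One copy of this variable per thread: conceptually a slot in the
       thread's TLS rather than in the shared data segment or heap. */
    static __thread int tls_counter = 0;

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (int i = 0; i < 5; ++i)
            ++tls_counter;               /* touches only this thread's copy */
        printf("thread %d: tls_counter = %d\n", id, tls_counter);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        pthread_create(&t1, NULL, worker, &id1);
        pthread_create(&t2, NULL, worker, &id2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main: tls_counter = %d\n", tls_counter);  /* still 0 */
        return 0;
    }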
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention on the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a fairly difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object size, as the diagrams in the Intel paper on TBB illustrate.
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined with multiheap), MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The two examples detailed above (Intel TBB and AIX) are just meant as concrete illustrations and shouldn't be understood as holding some exclusive secret sauce. The idea of a per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDCan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. That MS paper in turn cites the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."
mprotect() is used to protect memory pages, for example by making pages read-only. It sets this protection for the whole process; that is, if a page is read-only, no thread can write to that page. Is there a way to protect pages differently for different threads? For example, one thread can write to page P, while all other threads in my program can only read from P.
If you create a thread using the CLONE_VM flag in the clone() system call (this is what you would normally call a thread), then the MMU settings are the same as for the parent thread.
This means that write access is possible for both threads.
If you do not use the CLONE_VM flag, then the two threads do not share memory at all!
(pthread_create() sets the CLONE_VM flag internally).
It would be possible to do what you want - however it would be very difficult:
Allocate all memory blocks using shared memory functions (e.g. shmget()) instead of standard functions (e.g. malloc()).
When a new thread is created, use clone() directly (with the CLONE_VM flag not set) instead of pthread_create(), as in the sketch below.
Shared memory is shared between the threads, while memory created by "normal" memory allocation functions (e.g. malloc()) is not shared between them. The same is true for memory mapped with mmap() and MAP_PRIVATE.
When a new thread is created this way, such memory blocks (created by malloc() or a private mmap()) are copied, so that both threads have their own copy of the memory block at the same address. If one thread writes to this address, the other thread will not see the change.
Allocating more "shared" memory is rather tricky. It is easy if the memory only needs to be shared between the allocating thread and child threads that have not been created yet; it is difficult to share memory between already-running threads or between threads that are (indirect) children of different already-running threads.
The threads do not have shared stack memory so they cannot access each other's stack.
Global and "static" variables are not shared by default - to make them "shared" between the threads some tricky programming is required.
With newer Intel CPUs you can use Memory Protection Keys [1] for different access settings per thread within a process. On Linux, run lscpu and check for the pku and ospke flags.
The example on the man page [2] is a bit outdated, in the sense that the corresponding system calls no longer need to be invoked manually. Instead, glibc provides the following API calls:
pkey_alloc() to allocate a new key (16 are available)
pkey_set() to set the permission for a given key
pkey_mprotect() to apply the key to a given memory region
pkey_free() to free a key
Since the register that maintains the permission bits per protection key is thread-local, different protection settings are possible per thread. Protection key settings can only further lock down the general settings and do not affect instruction fetch.
[1] https://www.kernel.org/doc/Documentation/x86/protection-keys.txt
[2] http://man7.org/linux/man-pages/man7/pkeys.7.html
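A minimal sketch of the per-thread behavior (assuming glibc 2.27+ and a CPU with the pku/ospke flags; error checking omitted): the main thread keeps full access to the page, while the second thread drops its own write permission through the key:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static int   pkey;
    static char *page;

    static void *reader(void *arg) {
        (void)arg;
        /* Per-thread permission register: deny writes through this key in
           THIS thread only; the main thread's access is unaffected. */
        pkey_set(pkey, PKEY_DISABLE_WRITE);
        printf("reader sees: %s\n", page);  /* reads are still allowed    */
        /* page[0] = 'x';                      would fault in this thread */
        return NULL;
    }

    int main(void) {
        page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pkey = pkey_alloc(0, 0);  /* new key, no restrictions initially */
        pkey_mprotect(page, 4096, PROT_READ | PROT_WRITE, pkey);
        strcpy(page, "hello");    /* main thread may write freely       */

        pthread_t t;
        pthread_create(&t, NULL, reader, NULL);
        pthread_join(t, NULL);
        pkey_free(pkey);
        return 0;
    }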
Consider the following application: a web search server that upon start creates a large in-memory index of web pages based on data read from disk. Once initialized, the in-memory index cannot be modified, and multiple threads are started to serve user queries. Assume the server is compiled to native code and uses OS threads.
Now, the threading model gives no isolation between threads. A buggy thread, or any non-thread-safe code, can corrupt the index or corrupt memory that was allocated by, and logically belongs to, some other thread. Such problems are difficult to detect and debug.
Theoretically, Linux allows one to enforce better isolation. Once the index is initialized, the memory it occupies can be marked read-only. Threads can be replaced with processes that share the index (via shared memory) but otherwise have separate heaps and cannot corrupt each other. Illegal operations are automatically detected by the hardware and the operating system. No mutexes or other synchronization primitives are needed, and memory-related data races are completely eliminated.
Is such a model feasible in practice? Are you aware of any real-life applications that do such things? Or are there some fundamental difficulties that make such a model impractical? Do you think this approach would introduce a performance overhead compared to traditional threads? Theoretically, the memory that is used is the same, but are there implementation-related issues that would make things slower?
The obvious solution is to not use threads at all; use separate processes. Since the processes have the code and read-only structures in common, making the read-only data shared is trivial: format it as needed for in-memory use within a file and map the file into memory.
Using this scheme, only the variable per-process data would be independent. The code would be shared, and statically initialized data would be shared until written to. If a process croaks, there is zero impact on the other processes. No concurrency issues at all.
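A sketch of that scheme, assuming the index has already been written to a file ("index.bin" is a placeholder) in its in-memory format; error checking is omitted. Every process that maps the file shares the same physical pages, and a stray write raises SIGSEGV:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("index.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        /* Read-only shared mapping: all processes mapping this file share
           the same physical pages; any write attempt raises SIGSEGV. */
        const char *index = mmap(NULL, st.st_size, PROT_READ,
                                 MAP_SHARED, fd, 0);
        close(fd);  /* the mapping remains valid after close() */

        if (fork() == 0) {  /* child: serves queries against shared pages */
            printf("child  sees byte 0: %d\n", index[0]);
            _exit(0);
        }
        printf("parent sees byte 0: %d\n", index[0]);
        return 0;
    }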
You can use mprotect() to make your index read-only. On a 64-bit system you can map the local memory for each thread at a random address (see this Wikipedia article on address space randomization) which makes the odds of memory corruption from one thread touching another astronomically small (and of course any corruption that misses mapped memory altogether will cause a segfault). Obviously you'll need to have different heaps for each thread.
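A short sketch of the mprotect() part: build the index in a writable mapping, then seal it; after that, a write from any thread of the process raises SIGSEGV (error checking mostly omitted):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;  /* 1 MiB; mmap returns page-aligned memory */
        char *index = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        index[0] = 42;                    /* initialize the index ...     */
        if (mprotect(index, len, PROT_READ) != 0)   /* ... then seal it   */
            perror("mprotect");
        printf("%d\n", index[0]);         /* reads remain fine            */
        /* index[0] = 7;                     would now raise SIGSEGV      */
        return 0;
    }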
I think you might find memcached interesting. Also, you can create a shared memory region, map it read-only, and then create your threads. This should not cause much performance degradation.
After thinking about the whole concept of shared memory, a question came up:
Can two processes share the same shared memory segment? Can two threads share the same shared memory?
After thinking about it a little more clearly, I'm almost positive that two processes can share the same shared memory segment, where the first is the parent and the second is the child created with fork(). But what about two threads?
Thanks
can two processes share the same shared memory segment?
Yes and no. Typically with modern operating systems, when another process is forked from the first, they share the same memory space with copy-on-write set on all pages. Any update made to one of the read-write memory pages causes a copy to be made for that page, so there will be two copies, and the memory page will no longer be shared between the parent and child process. This means that only read-only pages, or pages that have not been written to, will be shared.
If a process has not been forked from another then they typically do not share any memory. One exception is if you are running two instances of the same program then they may share code and maybe even static data segments but no other pages will be shared. Another is how some operating systems allow applications to share the code pages for dynamic libraries that are loaded by multiple applications.
There are also specific memory-map calls to share the same memory segment. The call designates whether the map is read-only or read-write. How to do this is very OS dependent.
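For example, POSIX shared memory lets two entirely unrelated processes attach to the same segment by name ("/demo_shm" is a placeholder; on older glibc, link with -lrt; error checking omitted). Run one instance with the argument "write" and another without it:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        /* MAP_SHARED: both processes map the very same physical pages. */
        char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (argc > 1 && strcmp(argv[1], "write") == 0)
            strcpy(mem, "hello from another process");
        else
            printf("read: %s\n", mem);
        close(fd);
        return 0;
    }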
can two threads share the same shared memory?
Certainly. Typically all of the memory inside of a multi-threaded process is "shared" by all of the threads except for some relatively small stack spaces which are per-thread. That is usually the definition of threads in that they all are running within the same memory space.
Threads also have the added complexity of cached memory segments in high-speed memory tied to the processor/core. This cached memory is not shared, and updates to memory pages are flushed to central storage according to synchronization operations.
In general, a major point of processes is to prevent memory from being shared! Inter-process comms via a shared memory segment is certainly possible on the most common OSes, but the mechanisms are not there by default. Failing to set up, and manage, the shared area correctly will likely result in a segfault/AV if you're lucky, and UB if not.
Threads belonging to the same process, however, do not have such hardware memory-management protection and can pretty much share whatever they like, the obvious downside being that they can corrupt pretty much whatever they like. I've never actually found this to be a huge problem, esp. with modern OO languages that tend to 'structure' pointers as object instances (Java, C#, Delphi).
Yes, two processes can both attach to a shared memory segment. A shared memory segment wouldn't be much use if that were not true, as that is the basic idea behind a shared memory segment; that's why it's one of several forms of IPC (inter-process communication).
Two threads in the same process could also both attach to a shared memory segment, but given that they already share the entire address space of the process they are part of, there likely isn't much point (although someone will probably see that as a challenge to come up with a more-or-less valid use case for doing so).
In general terms, each process occupies a memory space isolated from all others in order to avoid unwanted interactions (including those which would represent security issues). However, there is usually a means for processes to share portions of memory. Sometimes this is done to reduce RAM footprint ("installed files" in VAX/VMS is/was one such example). It can also be a very efficient way for co-operating processes to communicate. How that sharing is implemented/structured/managed (e.g. parent/child) depends on the features provided by the specific operating system and design choices implemented in the application code.
Within a process, each thread has access to exactly the same memory space as all other threads of the same process. The only thing a thread has unique to itself is "execution context", part of which is its stack (although nothing prevents one thread from accessing or manipulating the stack "belonging to" another thread of the same process).
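That parenthetical is easy to demonstrate with a minimal POSIX-threads sketch: main hands a worker a pointer into its own stack, and the worker reads and writes it freely, because both threads live in one address space:

    #include <pthread.h>
    #include <stdio.h>

    static void *peeker(void *p) {
        /* p points into the MAIN thread's stack; nothing stops us reading
           (or writing) it, because both threads share one address space. */
        printf("peeker reads main's stack variable: %d\n", *(int *)p);
        *(int *)p = 99;
        return NULL;
    }

    int main(void) {
        int on_main_stack = 7;
        pthread_t t;
        pthread_create(&t, NULL, peeker, &on_main_stack);
        pthread_join(t, NULL);
        printf("main now sees: %d\n", on_main_stack);  /* prints 99 */
        return 0;
    }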