How to efficiently search in a database using a multithreaded scheme?

I have a RAM based database in the form of a linked list of trees (Each node in the list points to a tree of strings).
A set of words is given as an input and every word in this set must be searched for in the RAM database.
I thought of implementing a multithreaded search feature. The current implementation uses a two-level threading scheme. The first-level threads concurrently take words out of the input set, and each of them then spawns worker threads that search for that same word in the RAM DB.
The implementation works but suffers a lot from synchronization overhead (in addition to the overhead of creating and terminating threads and the load imbalance between them), so I want to improve the scheme for better performance.
Current implementation details: Each first-level thread creates (spawns) worker threads to search for the same word. Whenever one of the worker threads finds the word in the DB, it must kill the other threads and then return the result to the parent (first-level) thread. The parent thread then grabs another word and repeats the process until there are no words left to search for. The input set is protected with a lock, and every group of worker threads (threads searching for the same word) has a protected shared pointer into the RAM DB.
The question is: what more efficient schemes could you suggest for such a situation?

Presumably the fastest is to have the worker threads already "running", but blocked on a condition variable (or barrier). (If you know a priori how to load-balance the trees, the threads know which trees to search; otherwise, you need a work queue of some kind.) The thread which learns the (next) word to search for then stores it in a shared location and signals the variable (or joins the barrier). When one thread finds the word, it sets a flag that sends the other threads (which must unfortunately be written to check it periodically) back to waiting.
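A minimal Python sketch of that scheme, under several assumptions not in the question: each "tree" is modelled as a plain set of strings, the load balancing is static (worker i takes trees i, i+n, i+2n, ...), and the class name ParallelSearcher is made up for illustration. The workers are created once and block on a barrier between words; a shared Event plays the role of the "found" flag that the other workers poll.

```python
import threading

class ParallelSearcher:
    """Long-lived workers; each one owns a fixed slice of the trees."""

    def __init__(self, trees, num_workers=4):
        self.trees = trees                      # assumption: each tree is a set of strings
        self.num_workers = num_workers
        self.start = threading.Barrier(num_workers + 1)  # workers + dispatcher
        self.done = threading.Barrier(num_workers + 1)
        self.found = threading.Event()          # the "stop searching" flag
        self.word = None
        self.result = None                      # index of the tree holding the word
        self.shutdown = False
        self.threads = [
            threading.Thread(target=self._worker, args=(i,), daemon=True)
            for i in range(num_workers)
        ]
        for t in self.threads:
            t.start()

    def _worker(self, worker_id):
        while True:
            self.start.wait()                   # block until a word is published
            if self.shutdown:
                return
            # Static load balancing: this worker checks every n-th tree.
            for idx in range(worker_id, len(self.trees), self.num_workers):
                if self.found.is_set():         # someone else already found it
                    break
                if self.word in self.trees[idx]:
                    self.result = idx
                    self.found.set()
                    break
            self.done.wait()                    # report that this round is over

    def search(self, word):
        self.word, self.result = word, None
        self.found.clear()
        self.start.wait()                       # release the workers
        self.done.wait()                        # wait until all are idle again
        return self.result

    def close(self):
        self.shutdown = True
        self.start.wait()                       # wake the workers so they can exit
        for t in self.threads:
            t.join()
```

For example, ParallelSearcher([{"a"}, {"b"}], 2).search("b") returns 1, and the same two threads are reused for every later call to search. In this sketch the flag is only checked between trees; a real tree search would also have to check it inside the traversal.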

Related

tbb::task_group: Number of threads used by application does not go down after tbb::task_group object is destructed

I have created a tbb::task_group and added multiple tasks to it. At the end I wait() for the tasks to complete. While profiling the code, I saw that the number of threads used by my application had increased (as shown in Windows Task Manager). However, when the tbb::task_group object is destructed, the thread count does not decrease.
Additionally if I call the same code block again (without restarting the application), the number of threads sometimes increases and sometimes not.
Is this an expected behavior? If yes, how can I make sure the threads created previously are reused?

Yes, this is expected behavior. It is done specifically to reuse threads between parallel algorithms. You can verify it by marking threads with thread-local variables (TBB provides the combinable class) or by looking at the callbacks of task_scheduler_observer.
TBB always, though lazily, creates the number of threads specified at initialization time, even if you run only a single task. By default the number of TBB worker threads equals the number of HW threads (cores*HT) minus one for the application thread.
BTW, I would not recommend using tbb::task, which is for advanced cases. Check out tbb::parallel_invoke or tbb::task_group first, which are high-level interfaces to tasks. Or even better, see whether your algorithm can be expressed at an even higher level using things like parallel_for, parallel_reduce (possibly with a custom Range), parallel_pipeline, flow::graph, etc.

Relationship between Threads and Processes

I have been reading about threads in some operating system books, and I am confused about the following:
A. Some books describe these models:
many-to-one: many threads in user space map to one thread in the kernel.
one-to-one: each thread in user space maps to one thread in the kernel.
many-to-many: some number of user-space threads are multiplexed onto a smaller or equal number of kernel threads.
B. On the other hand, some books describe four relationships between threads and processes:
many-to-one: A process defines an address space and dynamic resource ownership. Multiple threads may be created and executed within that process.
one-to-one: Each thread of execution is a unique process with its own address space and resources.
one-to-many: A thread may migrate from one process environment to another. This allows a thread to be easily moved among distinct systems.
many-to-many: Combines attributes of the many-to-one and one-to-many cases.
The cases in A are clear, but in B I don't understand number 3. Could you please explain it?
Thanks.

I am not sure which book you are reading, but it seems to have been written a long time ago and no longer has much practical relevance. For instance, I know of no system that allows thread migration, and I doubt one was ever in practical use.
As for user-space threads, modern systems do not use them. All platforms I know of use threads which are managed by the kernel (i.e. kernel threads). All threads within the same process have access to that process's memory, but cannot go outside of it.

A thread is a piece of a process, while a process is a program in execution, which requires resources.

Do child processes copy entire arrays?

I'm writing a basic UNIX program that involves processes sending messages to each other. My idea to synchronize the processes is to simply have an array of flags to indicate whether or not a process has reached a certain point in the code.
For example, I want all the processes to wait until they've all been created. I also want them to wait until they've all finished sending messages to each other before they begin reading their pipes.
I'm aware that a process performs a copy-on-write operation when it writes to a previously defined variable.
What I'm wondering is: if I make an array of flags, will only the pointer to that array be copied, or will the entire array be copied (thus making my idea useless)?
I'd also like any tips on inter-process communication and process synchronization.
EDIT: The processes write to each other's pipes. Each process will send the following information:
typedef struct MessageCDT {
    pid_t destination;
    pid_t source;
    int num;
} Message;
So, just the source of the message and some random number. Then each process will print out the message to stdout: Something along the lines of "process 20 received 5724244 from process 3".

Unix processes have independent address spaces. This means that the memory in one is totally separate from the memory in another. When you call fork(), you get a new copy of the process. Immediately on return from fork(), the only thing different between the two processes is fork()'s return value. All of the data in the two processes are the same, but they are copies. Updating memory in one cannot be known by the other, unless you take steps to share the memory.
There are many choices for interprocess communication (IPC) in Unix, including shared memory, semaphores, pipes (named and unnamed), sockets, message queues and signals. If you Google these things you will find lots to read.
In your particular case, trying to make several processes wait until they all reach a certain point, I might use a semaphore or shared memory, depending on whether there is some master process that started them all or not.
If there is a master process that launches the others, then the master could set up the semaphore with a count equal to the number of processes to synchronize and then launch them. Each child could then decrement the semaphore value and wait for the semaphore value to reach zero.
If there is no master process, then I might create a shared memory segment that contains a count of processes and a flag for each process. But when you have two or more processes using shared memory, then you also need some kind of locking mechanism (probably a semaphore again) to ensure that two processes do not try to update the shared memory simultaneously.
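The master-process variant can be sketched in Python as follows. This is only an approximation of the idea: instead of a raw semaphore that each child decrements, it swaps in multiprocessing.Barrier, which packages the same "wait until all N have arrived" behaviour. The function name run_barrier_demo, the shared counter, and the use of the "fork" start method (so a Unix-like system) are assumptions made for the example.

```python
import multiprocessing as mp

def run_barrier_demo(n=4):
    ctx = mp.get_context("fork")      # assumption: Unix-like system with fork()
    barrier = ctx.Barrier(n)          # master sets the count before launching
    arrived = ctx.Value("i", 0)       # shared counter, comes with its own lock
    seen = ctx.Array("i", n)          # what each child observed after the barrier

    def child(i):
        with arrived.get_lock():
            arrived.value += 1        # "I have reached the sync point"
        barrier.wait()                # block until all n children have arrived
        seen[i] = arrived.value       # every child should now observe n

    procs = [ctx.Process(target=child, args=(i,)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(seen)
```

Calling run_barrier_demo(4) returns [4, 4, 4, 4]: no child gets past the barrier before all four have incremented the counter.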
Keep in mind that reading a pipe that nobody is writing to will block the reader until data appears. I don't know what your processes do, but perhaps that is synchronization enough? One other thing to consider: if you have multiple processes writing to a given pipe, their data may become interleaved if the writes are larger than PIPE_BUF. The value and location of this macro are system dependent.
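To illustrate the PIPE_BUF point with the Message struct from the question, here is a small Python sketch. It assumes the struct is three C ints on the wire (pid_t treated as int, which is common on Linux but not guaranteed), and the helper names send_message and receive_message are made up. Each message is written with a single write call that is far smaller than PIPE_BUF, so writes from several processes to the same pipe should not interleave.

```python
import os
import select
import struct

# Assumed wire layout for the question's Message struct: three C ints
# (destination, source, num). pid_t is treated as int here.
MESSAGE = struct.Struct("iii")

def send_message(write_fd, destination, source, num):
    payload = MESSAGE.pack(destination, source, num)
    # POSIX only guarantees atomic pipe writes up to PIPE_BUF bytes.
    assert len(payload) <= select.PIPE_BUF
    os.write(write_fd, payload)       # one write call per message

def receive_message(read_fd):
    data = os.read(read_fd, MESSAGE.size)   # blocks until a message arrives
    return MESSAGE.unpack(data)
```

With r, w = os.pipe(), calling send_message(w, 20, 3, 5724244) followed by receive_message(r) yields (20, 3, 5724244), which the receiver can print as "process 20 received 5724244 from process 3".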
-Kevin

The entire array of flags will seem to be copied. It will not actually be copied until one process or another writes to it of course. But that's an implementation detail and transparent to the individual processes. As far as each process is concerned, they each get a copy of the array.
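A quick way to see this for yourself, sketched in Python on a Unix-like system (the function name fork_copy_demo is made up): the child sets a flag in its copy of the array and reports its view back through a pipe, while the parent's array stays untouched.

```python
import os

def fork_copy_demo():
    flags = [0, 0, 0]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                              # child process
        try:
            flags[1] = 1                      # changes the child's private copy only
            os.write(write_fd, bytes(flags))  # report the child's view
        finally:
            os._exit(0)                       # never fall back into the parent's code
    os.waitpid(pid, 0)                        # parent: wait for the child to finish
    child_view = list(os.read(read_fd, len(flags)))
    os.close(read_fd)
    os.close(write_fd)
    return flags, child_view
```

fork_copy_demo() returns ([0, 0, 0], [0, 1, 0]): the parent never sees the child's write.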
There are ways to make this not happen. You can use mmap with the MAP_SHARED option for the memory used for your flags. Then each sub-process will share the same region of memory. There's also POSIX shared memory (which I, BTW, think is an awful hack). To find out about POSIX shared memory, look at the shm_overview(7) man page.
But using memory in this way isn't really a good idea. On multi-core systems, a value written to shared memory by one process (or thread) is not always seen by all other processes right away. Without synchronization, the write may sit for a while in a CPU-local buffer or cache, or be reordered, before the other cores observe it.
If you want to communicate using shared memory, you will have to use mutexes or the C++11 atomic operations to ensure that writes are properly seen by the other processes.
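A minimal Python sketch of the MAP_SHARED approach, assuming a Unix-like system (the function name shared_flags_demo is made up). The flags live in an anonymous shared mapping created before fork(), so every child writes into the same memory. To sidestep the visibility problem described above, the parent in this sketch only reads the flags after waitpid() has returned for every child; processes that poll the flags while others are still running would need a proper lock or atomic operation instead.

```python
import mmap
import os

def shared_flags_demo(n=4):
    # fileno -1 requests anonymous memory; MAP_SHARED keeps the region
    # shared (not copy-on-write) across fork().
    flags = mmap.mmap(-1, n, flags=mmap.MAP_SHARED)
    pids = []
    for i in range(n):
        pid = os.fork()
        if pid == 0:                  # child process
            try:
                flags[i] = 1          # "I have reached this point"
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)            # children have exited: writes are complete
    result = list(flags[:n])
    flags.close()
    return result
```

shared_flags_demo(4) returns [1, 1, 1, 1], in contrast to a plain array, where the parent would still see all zeros.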

Real World Examples of read-write in concurrent software

I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer), but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.

What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
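The live/offline scheme can be sketched in Python as below. The class name SnapshotList is made up, and the sketch leans on two assumptions: rebinding an attribute in CPython is a single reference swap that readers see either entirely before or entirely after, and the garbage collector takes care of "draining" old snapshots once the last reader drops them. Readers never take a lock; the lock only serializes updaters.

```python
import threading

class SnapshotList:
    def __init__(self, items=()):
        self._live = tuple(items)              # immutable snapshot that readers see
        self._writer_lock = threading.Lock()   # serializes updaters only

    def snapshot(self):
        # Readers take no lock: they just grab the current reference.
        return self._live

    def append(self, item):
        with self._writer_lock:
            offline = list(self._live)         # copy "live" into "offline"
            offline.append(item)               # update the private copy
            self._live = tuple(offline)        # publish with one reference swap
```

A reader that called snapshot() before an append keeps iterating over the old tuple undisturbed; the next call to snapshot() returns the new one.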
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
Producer/consumer workflows. A web service receives HTTP requests for data and places each request into an internal queue; a worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread safe.
Data shared between threads with change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing of the object. The memory management system (malloc/free) must be thread safe.
File system. This is almost always an OS service and already fully thread safe, but it's worth including in the list.
Reference counting. Releases the resource when the number of references drops to zero. The increment/decrement/test operations must be thread safe. These can usually be implemented using atomic primitives instead of full-stop kernel mutex locks.
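As an illustration of the first scenario, here is a small producer/consumer sketch in Python. The function name run_pipeline and the "work" it does (upper-casing a string) are placeholders; the point is that queue.Queue is the shared read/write structure and does its own locking, and a None sentinel per worker is one common way to shut the pool down.

```python
import queue
import threading

def run_pipeline(requests, num_workers=3):
    work = queue.Queue()              # thread-safe queue shared by all threads
    results = []
    results_lock = threading.Lock()   # protects the shared results list

    def worker():
        while True:
            item = work.get()         # blocks until a work item is available
            if item is None:          # sentinel: no more work
                break
            processed = item.upper()  # placeholder for the real work
            with results_lock:
                results.append(processed)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for r in requests:                # the producer side
        work.put(r)
    for _ in workers:                 # one sentinel per worker
        work.put(None)
    for t in workers:
        t.join()
    return sorted(results)
```

run_pipeline(["a", "b", "c"]) returns ["A", "B", "C"]; the results are sorted only because the order in which workers finish is not deterministic.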

Most real-world concurrent software has some form of requirement for synchronization at some level. Often, better-written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often do simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (e.g., use of thread-local state data), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time it's occurring on systems with 4-16 PEs (processing elements), which means I'm usually restricting to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking between tens of elements instead of potentially millions.
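A sketch of that pattern in Python, with a made-up workload (summing squares) standing in for the simulation: each thread accumulates into a local variable with no locking at all, and takes the shared lock exactly once at the end to merge its partial result.

```python
import threading

def parallel_sum_of_squares(values, num_threads=4):
    total = 0
    total_lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        local = 0
        for v in chunk:
            local += v * v            # per-unit work: no locking needed
        with total_lock:              # aggregation: one lock per thread
            total += local

    # Split the input into one strided chunk per thread.
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

With a million values and four threads, the lock is acquired four times rather than a million times. (In CPython the global interpreter lock limits the actual speedup for pure-Python arithmetic; the sketch is about the locking structure, not performance.)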

Thread and Process

What is the best definition of a thread and what is a process?
If I call a function, how do I know that a thread is calling it or a process (or am I not understanding it??!). This is in a multi-core system (quadcore).

From http://wiki.answers.com/Q/What_is_the_difference_between_a_computer_process_and_thread:
A single process can have multiple threads that share global data and address space with other threads running in the same process, and therefore can operate on the same data set easily. Processes do not share address space and a different mechanism must be used if they are to share data.
If we consider running a word processing program to be a process, then the auto-save and spell check features that occur in the background are different threads of that process which are all operating on the same data set (your document).

One thing to add is how a multi-core processor handles this. Think of a thread as the sequential execution of your code.
A core in a CPU can only execute one thread at a time. So if this thread is blocked because the program is waiting for an I/O operation to finish, the process is blocked (very simplified example: Word not responding). Multi-threading allows us to execute multiple code paths at the same time. "Same time" is a bit of a lie, since only one thread can actually execute at a time on a core, but the CPU gives some small chunk of time to each thread, so it appears as if all these threads are executing at the same time. A good example here is the spell checker in Word.
If you have multiple cores, the only difference is that on an N-core CPU you can have N threads executing at the same time. To simplify a lot, it doesn't matter what process the threads belong to. To simplify even further, you'd expect an N-times performance increase. :-D

In every modern OS I know of, everything runs in a thread, which runs in a process.
The OS can keep track of multiple processes, and each process can host an arbitrary number of threads. So all code is executed within a thread and within a process (since the thread runs in a process).
The main distinction between the two is that each process has its own virtual address space. Separate processes do not have access to each others' data, file handles or anything else, and are essentially not aware that other processes exist.
On the other hand, all threads in a process share the same address space, and can therefore inspect or modify each other's data, call the same functions and everything else.
It is often (but not always) the case that one program consists of one process and a number of threads.

A process is composed of one or more threads (one by default for most environments). A process can create additional threads though.
Like the previous answer says, each Process has its own memory space (each can have a pointer to 0x12345, with that memory location having different values for each process), while all the Threads of a process would actually point to the exact same memory location, since they're all in the same memory space.
When calling a function, it's almost always called on the same thread that the caller is running on. In Objective-C, there are exceptions (performSelectorOnMainThread), and there might be for other languages as well, but that sort of functionality is necessary only in special cases.
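To make this concrete, a function can simply ask which process and which thread it is currently running in. The Python sketch below (the function names are made up) calls the same function from two threads that are alive at the same time: both calls report the same process ID but different thread identifiers.

```python
import os
import threading

def where_am_i():
    # Every call runs in exactly one thread, which belongs to exactly one process.
    return os.getpid(), threading.get_ident()

def call_from_two_threads():
    seen = {}
    # The barrier keeps both threads alive together, so their identifiers
    # cannot be recycled from one thread to the other.
    barrier = threading.Barrier(2)

    def run(name):
        seen[name] = where_am_i()
        barrier.wait()

    threads = [threading.Thread(target=run, args=(n,)) for n in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

In the returned dictionary, seen["a"][0] == seen["b"][0] (same process), while seen["a"][1] != seen["b"][1] (different threads).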

From a user's point of view, the main distinction is that threads share memory with each other, while processes do not. That means you can easily share data between threads, while processes require some kind of OS call to do so.
Some call this a benefit of threads, but sharing data between multiple threads of control is fraught with danger, so it can be argued that processes lead to more reliable code.
There's a lot more to it, particularly if you are an OS person.
