I want to run bulk inserts from many threads at the same time, with each thread inserting data into a different collection. I know it's not thread safe if all threads insert into one collection, but what if each thread inserts into a completely different collection? In that case, can I assume it's thread safe and not worry about it?
If every thread uses its own connection, then it is thread safe. It makes no difference whether you insert into the same collection or different ones. The crucial part is that every thread must use its own separate connection to the database.
I'm not sure if reverse/inverse counting mutex is the name of the synchronization primitive I'm looking for, and search results are vague. What I need is for a thread that wants to write to an object to be able to "lock" that object so that it waits for all existing readers of the object to finish, while no new thread/task can acquire any access to the object (neither read nor write) until the writing thread finishes and "unlocks" the object.
The question is how to design a class that behaves as such a synchronization primitive using only preexisting binary/counting/recursive semaphores.
Using a standard counting semaphore isn't suitable, since that only limits the maximum number of tasks that can access the object simultaneously. It doesn't enforce that they may only read, it doesn't notify the thread that wants to write when they have finished, and it doesn't prevent other threads from starting to read in the meantime.
I need some kind of "counting" semaphore that is not bounded from above, but on which "register_read" or "lock_for_read" can be called (which keeps count how many simultaneous readers there are), but on which a task can call "lock_for_write", and then blocks until the count reaches 0, and after "lock_for_write" is called, any new calls to "lock_for_read" would have to block until the writing thread calls "unlock_from_write".
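One possible way to get this behaviour from plain semaphores is the "turnstile" pattern sketched below. It is only a sketch, assuming POSIX unnamed semaphores (sem_init) are available; the type rw_gate and the function rw_gate_init are hypothetical names, while lock_for_read, lock_for_write and unlock_from_write follow the names used in the question. Three binary semaphores and a plain counter are enough: a writer holds the turnstile so that no new reader or writer can enter, then waits for the room to become empty.

```c
#include <semaphore.h>

/* Hypothetical type; not part of any library. */
typedef struct {
    sem_t turnstile;   /* held by a writer to keep new readers and writers out */
    sem_t room_empty;  /* 1 when nobody is reading or writing */
    sem_t count_lock;  /* protects the readers counter */
    int   readers;     /* number of readers currently inside */
} rw_gate;

void rw_gate_init(rw_gate *g) {
    sem_init(&g->turnstile, 0, 1);
    sem_init(&g->room_empty, 0, 1);
    sem_init(&g->count_lock, 0, 1);
    g->readers = 0;
}

void lock_for_read(rw_gate *g) {
    /* Pass through the turnstile; this blocks while a writer holds it. */
    sem_wait(&g->turnstile);
    sem_post(&g->turnstile);

    sem_wait(&g->count_lock);
    if (++g->readers == 1)
        sem_wait(&g->room_empty);   /* first reader claims the room */
    sem_post(&g->count_lock);
}

void unlock_from_read(rw_gate *g) {
    sem_wait(&g->count_lock);
    if (--g->readers == 0)
        sem_post(&g->room_empty);   /* last reader releases the room */
    sem_post(&g->count_lock);
}

void lock_for_write(rw_gate *g) {
    sem_wait(&g->turnstile);        /* from now on no new reader gets in */
    sem_wait(&g->room_empty);       /* wait for existing readers to leave */
}

void unlock_from_write(rw_gate *g) {
    sem_post(&g->room_empty);
    sem_post(&g->turnstile);
}
```

The reader count is not bounded from above, and readers that arrive after lock_for_write has been called wait at the turnstile until unlock_from_write. Error checking on the semaphore calls is omitted to keep the sketch short.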
This is related to How to assign unique ids to threads in a pthread wrapper? and The need for id_callback when in a multithread environment?.
When we need to differentiate among unique threads, we cannot use functions like pthread_self because thread IDs are reused. In those questions, it was suggested to use a monotonically increasing counter to provide a unique ID, to counter potential thread ID reuse. The counter is then passed to the thread by way of arg in pthread_create.
I don't think we can maintain a map of external thread ids to unique ids because of the reuse problem. The same thread id could have multiple unique ids.
How do we retrieve the arg passed to pthread_create from outside the thread? Is it even retrievable?
"I don't think we can maintain a map of external thread ids to unique ids because of the reuse problem. The same thread id could have multiple unique ids."
You can, as long as in this map you only keep the external thread IDs corresponding to currently running threads. When a thread exits, you remove it from the map.
The user of the map obviously only cares about currently running threads, since apparently the only way it has to identify the thread it wants is the external thread ID.
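A minimal sketch of such a map, assuming each thread registers itself when it starts and removes itself just before it exits. All names here (registry_add_self, registry_remove_self, registry_lookup, MAX_LIVE_THREADS) are hypothetical, and a fixed-size table stands in for a real map:

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_LIVE_THREADS 64   /* arbitrary limit for this sketch */

typedef struct {
    pthread_t     tid;        /* external thread ID, may be reused later */
    unsigned long unique_id;  /* never reused */
    bool          in_use;
} registry_entry;

static registry_entry  registry[MAX_LIVE_THREADS];
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long   next_unique_id = 1;

/* Called by a thread when it starts. Returns its unique ID, or 0 if the table is full. */
unsigned long registry_add_self(void) {
    unsigned long id = 0;
    pthread_mutex_lock(&registry_lock);
    for (int i = 0; i < MAX_LIVE_THREADS; i++) {
        if (!registry[i].in_use) {
            registry[i].tid = pthread_self();
            registry[i].unique_id = id = next_unique_id++;
            registry[i].in_use = true;
            break;
        }
    }
    pthread_mutex_unlock(&registry_lock);
    return id;
}

/* Called by a thread just before it exits, so a reused pthread_t never matches a stale entry. */
void registry_remove_self(void) {
    pthread_mutex_lock(&registry_lock);
    for (int i = 0; i < MAX_LIVE_THREADS; i++) {
        if (registry[i].in_use && pthread_equal(registry[i].tid, pthread_self())) {
            registry[i].in_use = false;
            break;
        }
    }
    pthread_mutex_unlock(&registry_lock);
}

/* Callable from any thread. Returns the unique ID of a live thread, or 0 if it is not registered. */
unsigned long registry_lookup(pthread_t tid) {
    unsigned long id = 0;
    pthread_mutex_lock(&registry_lock);
    for (int i = 0; i < MAX_LIVE_THREADS; i++) {
        if (registry[i].in_use && pthread_equal(registry[i].tid, tid)) {
            id = registry[i].unique_id;
            break;
        }
    }
    pthread_mutex_unlock(&registry_lock);
    return id;
}
```

Because an entry exists only while its thread is running, each external thread ID maps to at most one unique ID at any moment.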
I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (or maybe this is a just a dumb example; in such a case please point me to a sensible one)
Thanks in advance.
Consider this:
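    #include <semaphore.h>

    // Initialise once before starting the threads:
    //   sem_init( &g_shared_variable_mutex, 0, 1 );
    sem_t g_shared_variable_mutex;
    int   g_shared_variable;

    void update_shared_variable() {
        sem_wait( &g_shared_variable_mutex );   // blocks while another thread is inside
        g_shared_variable++;
        sem_post( &g_shared_variable_mutex );
    }

    // The do_thing_* functions stand for independent work that needs no locking.
    void thread1() {
        do_thing_1a();
        do_thing_1b();
        do_thing_1c();
        update_shared_variable(); // may block
    }

    void thread2() {
        do_thing_2a();
        do_thing_2b();
        do_thing_2c();
        update_shared_variable(); // may block
    }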
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not all of the code that runs will block. For example, if two threads are reading from a queue, you would make sure that they never read from the queue at the same time, so that part is serialized, but it is probably the part that takes the least time. Once a thread has retrieved an item from the queue, the rest of its processing can run in parallel with the other thread.
The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
You only need to lock access to the file (the shared resource).
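A rough sketch of that situation, assuming pthreads. The names work_item, worker and run_two_workers are made up for illustration, and the summing loop stands in for "a lot of work". Each record is written with two fprintf calls, so the mutex is what keeps one worker's record from being interleaved with the other's:

```c
#include <pthread.h>
#include <stdio.h>

static FILE           *g_out;                                 /* the shared file */
static pthread_mutex_t g_out_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct { int id; long n; } work_item;                 /* hypothetical job description */

static void *worker(void *arg) {
    const work_item *w = arg;
    long sum = 0;
    for (long i = 1; i <= w->n; i++)   /* the expensive part runs unlocked, in parallel */
        sum += i;

    pthread_mutex_lock(&g_out_lock);   /* only the file access is serialized */
    fprintf(g_out, "worker %d: ", w->id);
    fprintf(g_out, "%ld\n", sum);
    pthread_mutex_unlock(&g_out_lock);
    return NULL;
}

/* Runs two workers that write their results to `out`. Returns 0 on success. */
int run_two_workers(FILE *out) {
    pthread_t t[2];
    work_item items[2] = { {1, 1000}, {2, 2000} };
    int created = 0, rc = 0;

    g_out = out;
    for (int i = 0; i < 2; i++) {
        if (pthread_create(&t[i], NULL, worker, &items[i]) != 0) {
            rc = -1;
            break;
        }
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(t[i], NULL);
    return rc;
}
```

The order of the two records in the file is not fixed; only the integrity of each record is guaranteed.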
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example: imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. Indeed, some compilers, like Intel's, will convert suitable for loops to threads automatically for you. In that particular circumstance no semaphores are needed, because the iterations' data are independent.
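As a hand-written illustration of the independent-iterations case (not what a compiler would generate), here is a sketch that splits a loop over an array into slices, one per thread. The names chunk, square_chunk, parallel_square and N_THREADS are hypothetical. No semaphore or mutex appears because no two threads touch the same element:

```c
#include <pthread.h>
#include <stddef.h>

#define N_THREADS 4   /* arbitrary choice for this sketch */

typedef struct { double *data; size_t begin, end; } chunk;

static void *square_chunk(void *arg) {
    chunk *c = arg;
    for (size_t i = c->begin; i < c->end; i++)
        c->data[i] *= c->data[i];      /* touches only this thread's slice */
    return NULL;
}

/* Squares data[0..n) in place using N_THREADS threads. Returns 0 on success. */
int parallel_square(double *data, size_t n) {
    pthread_t t[N_THREADS];
    chunk     c[N_THREADS];
    int created = 0, rc = 0;

    for (int k = 0; k < N_THREADS; k++) {
        c[k].data  = data;
        c[k].begin = n * k / N_THREADS;
        c[k].end   = n * (k + 1) / N_THREADS;
        if (pthread_create(&t[k], NULL, square_chunk, &c[k]) != 0) {
            rc = -1;
            break;
        }
        created++;
    }
    for (int k = 0; k < created; k++)
        pthread_join(t[k], NULL);      /* joining is the only synchronization needed */
    return rc;
}
```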
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
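The empty/filled handshake described above can be sketched as follows, assuming POSIX semaphores and pthreads. The function names (stage_a, stage_b, run_pipeline) and the arithmetic standing in for steps A and B are placeholders:

```c
#include <pthread.h>
#include <semaphore.h>

#define N_ITEMS 5

static int   buffer;             /* the single shared slot for the interim result */
static sem_t empty, filled;
static int   results[N_ITEMS];

static void *stage_a(void *arg) {
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        int interim = i * 10;    /* stand-in for step A */
        sem_wait(&empty);        /* wait until the slot may be overwritten */
        buffer = interim;
        sem_post(&filled);       /* tell stage B there is data */
    }
    return NULL;
}

static void *stage_b(void *arg) {
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        sem_wait(&filled);       /* wait until stage A has written */
        int interim = buffer;
        sem_post(&empty);        /* slot is free again */
        results[i] = interim + 1;   /* stand-in for step B */
    }
    return NULL;
}

/* Runs both stages to completion. Returns 0 on success. */
int run_pipeline(void) {
    pthread_t a, b;
    sem_init(&empty, 0, 1);      /* slot starts out free */
    sem_init(&filled, 0, 0);     /* nothing to read yet */
    if (pthread_create(&a, NULL, stage_a, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, stage_b, NULL) != 0) {
        pthread_cancel(a);       /* sem_wait is a cancellation point */
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    sem_destroy(&empty);
    sem_destroy(&filled);
    return 0;
}
```

Stage B copies the value out of the slot before posting empty, so step B itself runs while stage A is already producing the next item.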
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, that thread is sent a copy of the data down something a bit like a network socket or a pipe. The advantage of CSP (which limits the communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but it's actually not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability is built in by design.
I'm currently creating an SDL/OpenGL program, which renders objects based on a few state variables. These state variables are updated continuously in a separate thread, at a user-defined rate. Every once in a while, the main thread asynchronously needs to swap some of these state variables.
Now, these state variables are mostly pointers, so when I update them from the main thread (i.e. asynchronously with respect to the updating thread), I first create a mutex lock, delete the objects, create/swap them to their new ones, and then unlock the mutex. Again though, the update thread is still running during this time.
Because of that last point, I was curious. What happens if the thread attempts to access any of those state variables mid-asynchronous-update? I know that this isn't allowed (due to the mutex lock), but what happens behind-the-scenes?
Unless the update thread's code is also wrapped in a mutex lock and unlock, it won't care about the lock taken by the main thread. It will just go ahead and access that data.
You should lock the same mutex object (create it once, for the lifetime of both the update thread and the main thread) in the update thread before it touches the variables. This way, the main thread cannot access the data while the update thread is using it, and vice versa.
You may want to take a good look at how mutexes are used to synchronize threads.
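To make that concrete, here is a small sketch using pthreads rather than SDL's mutex API. The names state, g_state, update_loop, swap_state and run_swap_demo are invented for illustration. Both the update thread and the main thread take the same mutex, and the update thread never keeps a copy of the pointer outside the lock:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct { int ticks; } state;    /* hypothetical state object */

static state          *g_state;
static pthread_mutex_t g_state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Update thread: takes the same lock before every access to g_state. */
static void *update_loop(void *arg) {
    int steps = *(int *)arg;
    for (int i = 0; i < steps; i++) {
        pthread_mutex_lock(&g_state_lock);
        g_state->ticks++;
        pthread_mutex_unlock(&g_state_lock);
    }
    return NULL;
}

/* Main thread: replaces the object under the lock. Returns 0 on success. */
int swap_state(void) {
    state *fresh = calloc(1, sizeof *fresh);   /* allocate outside the lock */
    if (!fresh)
        return -1;
    pthread_mutex_lock(&g_state_lock);
    state *old = g_state;
    g_state = fresh;
    pthread_mutex_unlock(&g_state_lock);
    free(old);   /* safe: the update thread only dereferences g_state under the lock */
    return 0;
}

/* Runs the update thread while swapping once. Returns the final tick count, or -1 on error. */
int run_swap_demo(int steps) {
    pthread_t t;
    g_state = calloc(1, sizeof *g_state);
    if (!g_state)
        return -1;
    if (pthread_create(&t, NULL, update_loop, &steps) != 0) {
        free(g_state);
        g_state = NULL;
        return -1;
    }
    int rc = swap_state();
    pthread_join(t, NULL);
    int ticks = g_state->ticks;
    free(g_state);
    g_state = NULL;
    return rc == 0 ? ticks : -1;
}
```

If update_loop skipped the lock, it could dereference the old pointer after swap_state had freed it, which is undefined behaviour.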
UPDATE: FOR YOUR QUESTION
"So basically, everywhere I have a thread-unsafe variable, I should surround all accesses to that variable with the same mutex?"
Yes, but you should also be aware of scenarios where deadlock can occur. Deadlocks are a main reason why multithreading is avoided in many applications; put another way, they are why many people don't like multithreading.
I am creating an app that accesses a database. On every database access, the app waits for the job to be finished.
To keep the UI responsive, I want to put all the database stuff in a separate thread.
Here is my idea:
The db-thread creates all database components it needs when it is created
Now the thread just sits there and waits for a command
If it receives a command, it performs the action and goes back to idle. During that time the main thread waits.
the db-thread lives as long as the app is running
Does this sound ok?
What's the best way to get the database results from the db-thread into the main thread?
I haven't done much with threads so far, therefore I'm wondering if the db-thread can create a query component out of which the main thread reads the results. Main thread and db thread will never access the query at the same time. Will this still cause problems?
What you are looking for is a standard data access technique called asynchronous query execution. Some data access components implement this feature in an easy-to-use manner. At least dbGo (ADO) and AnyDAC implement it. Let's consider dbGo.
The idea is simple: you call the usual dataset methods, like Open. The method launches the required task in a background thread and returns immediately. When the task is completed, an appropriate event is fired, notifying the application that the task has finished.
The standard approach with the DB GUI applications and the Open method is the following (draft):
include eoAsyncExecute, eoAsyncFetch, eoAsyncFetchNonBlock into dataset ExecuteOptions;
disconnect TDataSource.DataSet from dataset;
set dataset OnFetchComplete to a proc P;
show "Hello ! We do the hard work to process your requests. Please wait ..." dialog;
call the dataset Open method;
when the query execution finishes, OnFetchComplete is called, and so is P. P hides the "Wait" dialog and connects TDataSource.DataSet back to the dataset.
Your "Wait" dialog may also have a Cancel button, which a user can use to cancel a query that is running too long.
First of all - if you don't have much experience with multi-threading, don't start with the VCL classes. Use the OmniThreadLibrary, for (among others) these reasons:
Your level of abstraction is the task, not the thread, a much better way of dealing with concurrency.
You can easily switch between executing tasks in their own thread and scheduling them with a thread pool.
All the low-level details like thread shutdown, bidirectional communication and much more are taken care of for you. You can concentrate on the database stuff.
The db-thread creates all database components it needs when it is created
This may not be the best way. I have generally created components only when needed, but not destroyed immediately. You should definitely keep the connection open in a thread pool thread, and close it only once the thread has been inactive for some time and the pool disposes of it. But it is also often a good idea to keep a cache of transaction and statement objects.
If it receives a command, it performs the action and goes back to idle. During that time the main thread waits.
The first part is being handled fine when OTL is used. However - don't have the main thread wait, this will bring little advantage over performing the database access directly in the VCL thread in the first place. You need an asynchronous design to make best use of multiple threads. Consider a standard database browser form that has controls for filtering records. I handle this by (re-)starting a timer every time one of the controls changes. Once the user finishes editing the timer event fires (say after 500 ms), and a task is started that executes the statement that fetches data according to the filter criteria. The grid contents are cleared, and it is repopulated only when the task has finished. This may take some time though, so the VCL thread doesn't wait for the task to complete. Instead the user could even change the filter criteria again, in which case the current task is cancelled and a new one started. OTL gives you an event for task completion, so the asynchronous design is easy to achieve.
What's the best way to get the database results from the db-thread into the main thread?
I generally don't use data aware components for multi-threaded db apps, but use standard controls that are views for business objects. In the database tasks I create these objects, put them in lists, and the task completion event transfers the list to the VCL thread.
Main thread and db thread will never access the query at the same time.
With all components that load data on-demand you can't be sure of that. Often only the first records are fetched from the db, and fetching continues after they have been consumed. Such components obviously must not be shared by threads.
I have implemented both strategies: Thread pool and adhoc thread creation.
I suggest beginning with ad hoc thread creation; it is simpler to implement and simpler to scale.
Only move to a thread pool if, after careful evaluation, (1) a lot of resources (and time) go into creating each thread and (2) you have a lot of creation requests.
In both cases you must deal with passing parameters and collecting results. I suggest extending the thread class with properties that allow this data passing.
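The same idea, sketched in C with pthreads instead of a Delphi TThread descendant: the struct passed to the thread plays the role of the thread class's properties, carrying the inputs in and the results out. The names query_job, run_query and run_job are hypothetical, and the "query" is a placeholder computation, not a real database call:

```c
#include <pthread.h>

typedef struct {
    int customer_id;   /* input: hypothetical query parameter */
    int row_count;     /* output: filled in by the thread */
    int error;         /* output: 0 on success */
} query_job;

static void *run_query(void *arg) {
    query_job *job = arg;
    /* Placeholder for the real database access. */
    job->row_count = job->customer_id * 2;
    job->error = 0;
    return NULL;
}

/* Creates a thread for one job and collects its results. Returns the job's error code, or -1. */
int run_job(query_job *job) {
    pthread_t t;
    if (pthread_create(&t, NULL, run_query, job) != 0)
        return -1;
    /* A GUI application would return here and pick up the results on a
       completion notification instead of blocking in pthread_join. */
    pthread_join(t, NULL);
    return job->error;
}
```

The output fields are read only after the thread has been joined, so no extra locking is needed for them.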
Refer to the documentation of the classes, components and functions that the thread uses to make sure they are thread safe, that is, that they can be used simultaneously from different threads. If not, you will need to synchronize the access. In some cases you may find slight differences regarding thread safety. As an example, see DateTimeToStr.
If you create your thread at start and reuse it later whenever you need it, you have to make sure that you disconnect the db components (grid..) from the underlying datasource (disableControls) each time you're "processing" data.
For the sake of simplicity, I would inherit from TThread and implement all the business logic in my own class. The result dataset would be a member of this class, and I would connect it to the DB-aware components inside a Synchronize call.
Anyway, it is also very important to delegate as much work as possible to the db server and keep the UI as lightweight as possible. Firebird is my favourite db server: triggers, for select, custom UDF dlls developed in Delphi, many thread safe db components with lots of examples and good support (forum) : jvUIB...
Good Luck