Lock vector element-wise - multithreading

I'm reading a huge file (25m records) and I'm trying to speed that up. I've tried mmap and multithreaded reading. The problem is that within each thread I have to write to the structure below, which is not thread-safe. I've tried a mutex, but I have to lock the whole vector, which makes reading even slower. Are there any suggestions for making the locking element-wise?
vector<unsigned> *flight_passengers = new vector<unsigned>[1400000];
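One way to get close to element-wise locking without allocating 1.4 million mutexes is lock striping: a fixed pool of mutexes, where each bucket maps to one of them by index. The following is only a minimal sketch under that assumption. StripedBuckets, fill_concurrently and the stripe count of 1024 are hypothetical names and values, not something from the question, and whether this beats a single mutex would have to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: many buckets guarded by a fixed pool of mutexes.
class StripedBuckets {
public:
    explicit StripedBuckets(std::size_t bucket_count) : buckets_(bucket_count) {}

    // Append a value to one bucket, locking only the stripe that owns it.
    void push(std::size_t bucket, unsigned value) {
        assert(bucket < buckets_.size());
        std::lock_guard<std::mutex> guard(locks_[bucket % kStripes]);
        buckets_[bucket].push_back(value);
    }

    // Only safe to call once all writer threads have been joined.
    const std::vector<unsigned>& at(std::size_t bucket) const { return buckets_[bucket]; }

private:
    static constexpr std::size_t kStripes = 1024;  // assumed value; tune by measurement
    std::vector<std::vector<unsigned>> buckets_;
    std::array<std::mutex, kStripes> locks_;
};

// Hypothetical driver: each thread appends per_thread values spread over all buckets.
void fill_concurrently(StripedBuckets& store, std::size_t bucket_count,
                       unsigned thread_count, unsigned per_thread) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&store, bucket_count, per_thread, t] {
            for (unsigned i = 0; i < per_thread; ++i) {
                store.push(i % bucket_count, t);
            }
        });
    }
    for (auto& worker : workers) worker.join();
}
```

Two threads only contend when their buckets happen to share a stripe, so the mutex is no longer a single global bottleneck. If the input can be partitioned so that each thread owns a disjoint range of buckets, no lock is needed at all.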

Related

Do I need always lock before read even if very occasionally write to the memory?

I have a shared data structure that is read in one thread and modified in another thread. However, its data changes only very occasionally; most of the time it is just read. I currently lock a mutex (or RW lock) before each read/write and unlock it afterwards.
Because the data rarely changes, locking and unlocking on every read seems inefficient. If the data never changed, I could drop the lock, because concurrent reads of the same structure are safe without one.
My question is:
Is there a lock-free solution that allows me to change the data without a lock?
Or does the lock-unlock on read (one thread, in other words no contention) cost little time/resources (no entry into the kernel) anyway?
If there's no contention, no kernel call is needed, but an atomic lock acquisition still is. If the resource is held only for a short period of time, spinning can be attempted before making a kernel call.
Mutex and RW lock implementations, such as (a typical quality implementation of) std::mutex / std::shared_mutex in C++ or CRITICAL_SECTION / SRWLOCK in Windows, already employ the techniques mentioned above on their own. Linux mutexes are usually based on futex, so they also avoid the kernel call when it is not needed. So you don't need to bother about saving a kernel call yourself.
And there are alternatives to locking. There are atomic types that can be accessed using lock-free reads and writes, so they avoid a lock entirely. There are other patterns, such as SeqLock. There is transactional memory.
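As a minimal sketch of the atomic-type alternative, assuming the shared data is a single value small enough for std::atomic: g_limit, read_limit, write_limit and change_from_other_thread are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical rarely-changed setting that a hot loop reads constantly.
std::atomic<int> g_limit{10};

// Reader path: a single atomic load, no mutex.
int read_limit() {
    return g_limit.load(std::memory_order_acquire);
}

// Writer path: a single atomic store; readers see either the old or the new value.
void write_limit(int new_limit) {
    assert(new_limit >= 0);  // assumption of this sketch: limits are non-negative
    g_limit.store(new_limit, std::memory_order_release);
}

// Hypothetical demo: change the value from another thread, then read it back.
int change_from_other_thread(int new_limit) {
    std::thread writer(write_limit, new_limit);
    writer.join();
    return read_limit();
}
```

This only covers a value that fits in one atomic object. A larger structure that must be read consistently as a whole still needs a lock or a pattern such as SeqLock.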
But before going there, you should make sure that locking really is a performance problem, because using atomics may not be simple (although it is simple for some languages and simple cases), and the other alternatives have their own pitfalls.
An uncontrolled data race may be dangerous, or it may not, and the boundary between the two cases can be very thin. For example, copying a bunch of integers without synchronization might only occasionally yield garbage integers. If the integers are properly sized and aligned, you may get only a mix of old and new values rather than a garbage value in any single integer. But if you add a more complex type, say a string, you may get a crash. So most of the time an uncontrolled data race is treated as undefined behavior.

Is it safe to update an object in a thread without locks if other threads will not access it?

I have a vector of entities. In the update cycle I iterate through the vector and update each entity: read its position, calculate the current speed, write the updated position. During the update I can also change some other objects in another part of the program, but each such object is related only to the current entity, and other entities will not touch it.
So, I want to run this code in threads. I split the vector into a few chunks and update each chunk in a different thread. As far as I can see, the threads are fully independent: on each iteration every thread works with its own memory regions and doesn't affect the other threads' work.
Do I need any locks here? I assume that everything should work without any mutexes, etc. Am I right?
Short answer
No, you do not need any lock or synchronization mechanism, as your problem appears to be an embarrassingly parallel task.
Longer answer
A data race can only occur if two threads might access the same memory at the same time and at least one of the accesses is a write operation. If your program has this characteristic, then you need to make sure that the threads access the memory in an ordered fashion. One way to do it is by using locks (it is not the only one though). Otherwise the result is undefined behavior (UB).
It seems that you found a way to split the work among your threads such that each thread can work independently of the others. This is the best-case scenario for concurrent programming, as it does not require any synchronization. The complexity of the code is dramatically decreased and the speedup is usually much better.
Please note that, as @acelent pointed out in the comment section, if you need changes made by one thread to be visible in another thread, then you might need some sort of synchronization, because depending on the memory model and on the hardware, changes made in one thread might not be immediately visible in the other.
This means that you might write to a variable from Thread 1, read the same memory from Thread 2 some time later, and still not be able to see the write made by Thread 1.
"I separate vector into few chunks and update each chunk in different threads" - in this case you do not need any lock or synchronization mechanism; however, performance might degrade considerably due to false sharing, depending on how the chunks are allocated to threads. Note that the compiler may eliminate false sharing by using thread-private temporary variables.
You can find plenty of information in books and wiki. Here is some info https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
There is also a Stack Overflow post on this: "Does false sharing occur when data is read in OpenMP?"
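The chunked update described in the question could be sketched roughly as follows. Entity, update_all and the position/speed fields are hypothetical stand-ins, since the question shows no code. Each thread gets one contiguous range, which also keeps false sharing limited to the chunk boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical entity with just the fields the question mentions.
struct Entity {
    double position;
    double speed;
};

// Update every entity, giving each thread its own contiguous chunk. No locks are taken.
void update_all(std::vector<Entity>& entities, double dt, unsigned thread_count) {
    assert(thread_count > 0);
    const std::size_t n = entities.size();
    const std::size_t chunk = (n + thread_count - 1) / thread_count;  // ceiling division

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&entities, begin, end, dt] {
            for (std::size_t i = begin; i < end; ++i) {
                entities[i].position += entities[i].speed * dt;
            }
        });
    }
    // join() synchronizes with the end of each thread, so the caller sees all updates.
    for (auto& worker : workers) worker.join();
}
```

The only synchronization point is the join at the end, which is what makes the updated positions visible to the thread that called update_all.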

Single write - single read big memory buffer sharing without locks

Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written by a thread (or even multiple threads, with the guarantee that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer to generate a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change made by the write thread(s) eventually appears in the reading thread. Neither the ordering nor some delay (negligible compared to the display refresh rate) matters.
Reading and writing the same memory location concurrently is a data race, which results in undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need a solution that does not require completely redesigning this legacy code. Any advice that is safe from a practical standpoint counts, regardless of whether it is theoretically correct. I am also open to not-fully-portable solutions.
Apart from the data race itself, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release on an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls at key points in the writer thread(s).
My ideas to target the data race:
1. Use assembly for the reading thread. I would try to avoid that.
2. Make the memory buffer volatile, thus preventing the compiler from applying the kind of nasty optimizations described in the referenced article.
3. Put the reading thread's code in a separate compilation unit, and compile it with -O0.
+1. Leave everything as is, and cross my fingers (as I currently do not notice any issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question is a more concrete version of a previous one that was a little too generic.)
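For reference, the acquire-release guard variable mentioned in the question might look roughly like this. It is only a sketch of the visibility part: g_frame_guard, writer_publish and reader_snapshot are hypothetical names, and the sketch assumes the writer has finished before the reader copies. It does not remove the data race if writers keep storing into the buffer while the copy runs.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical guard variable, used for nothing but ordering.
std::atomic<unsigned> g_frame_guard{0};

// Writer: plain stores into the buffer, then a release increment of the guard.
void writer_publish(std::vector<std::uint8_t>& framebuffer, std::uint8_t value) {
    assert(!framebuffer.empty());  // assumption of this sketch
    for (auto& pixel : framebuffer) pixel = value;
    g_frame_guard.fetch_add(1, std::memory_order_release);
}

// Reader: wait until the guard differs from the last value seen, then copy.
// The acquire load synchronizes with the release increment above, so every
// store made before the increment is visible to the copy.
std::vector<std::uint8_t> reader_snapshot(const std::vector<std::uint8_t>& framebuffer,
                                          unsigned last_seen) {
    while (g_frame_guard.load(std::memory_order_acquire) == last_seen) {
        std::this_thread::yield();
    }
    return framebuffer;
}
```

In the real scenario the reader would not wait on the guard but simply load it before each periodic copy. The wait is here only so that the example has a well-defined result.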

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, but why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (Or maybe this is just a dumb example; in that case please point me to a sensible one.)
Thanks in advance.
Consider this:
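#include <semaphore.h>

// Assumed to be initialised once elsewhere with sem_init(&g_shared_variable_mutex, 0, 1).
sem_t g_shared_variable_mutex;
int g_shared_variable = 0;

// The do_thing_xx() functions stand for arbitrary independent work.
void update_shared_variable() {
    sem_wait( &g_shared_variable_mutex );   // blocks while another thread holds the semaphore
    g_shared_variable++;
    sem_post( &g_shared_variable_mutex );
}
void thread1() {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable(); // may block
}
void thread2() {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable(); // may block
}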
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not all of the code that runs will be blocking. For example, if you had a queue with two threads reading from it, you would make sure that no two threads read from the queue at the same time, so that part would be blocking, but that's the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can run in parallel.
The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
You only need to lock access to the file (the shared resource).
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers like Intel's will convert suitable for loops to threads automatically for you. In those particular circumstances no semaphores are needed because of the iterations' data independence.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
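The empty/filled scheme above can be sketched with POSIX semaphores, the same API as the sem_wait/sem_post calls used earlier. This assumes a platform with unnamed POSIX semaphores (such as Linux). Handoff, run_pipeline and the two toy processing steps are hypothetical, and error handling (for example sem_wait being interrupted) is left out.

```cpp
#include <cassert>
#include <semaphore.h>
#include <thread>
#include <vector>

// Hypothetical single-slot buffer guarded by the two semaphores from the text.
struct Handoff {
    sem_t empty;   // 1 when the slot may be overwritten
    sem_t filled;  // 1 when the slot holds an unread result
    int slot = 0;

    Handoff() {
        sem_init(&empty, 0, 1);   // empty starts at 1
        sem_init(&filled, 0, 0);  // filled starts at 0
    }
    ~Handoff() {
        sem_destroy(&empty);
        sem_destroy(&filled);
    }
};

// Runs step A (square) in one thread and step B (add one) in another.
std::vector<int> run_pipeline(int count) {
    assert(count >= 0);
    Handoff handoff;
    std::vector<int> output;

    std::thread stage_a([&handoff, count] {
        for (int i = 0; i < count; ++i) {
            sem_wait(&handoff.empty);    // wait until the slot may be overwritten
            handoff.slot = i * i;        // step A writes the interim result
            sem_post(&handoff.filled);   // tell stage B the slot is ready
        }
    });
    std::thread stage_b([&handoff, &output, count] {
        for (int i = 0; i < count; ++i) {
            sem_wait(&handoff.filled);            // wait for an interim result
            output.push_back(handoff.slot + 1);   // step B consumes it
            sem_post(&handoff.empty);             // hand the slot back
        }
    });

    stage_a.join();
    stage_b.join();
    return output;
}
```

With a single slot the two stages strictly alternate on the buffer, so any speedup comes only from overlapping the work each stage does outside the handoff.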
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, the thread would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability built in by design.

What to lock and what to not lock in a multithreaded environment (semaphores and shared memory)

I was implementing some simple Producer/Consumer program that had some semaphores and shared memory. To keep things simple, let's assume that there's just a block of shared memory and a semaphore in my program.
At first, I thought that I only had to treat as critical sections the bits of code that try to write to the shared memory block. But as the shared memory block consists of, let's say, 1024 bytes, I can't read all the data at the same time (it's not an atomic operation), so it is indeed possible that while I'm reading from it, the Producer comes and starts writing to it, so the reader will get half old data, half new data. From this, I can only conclude that I also have to put the shared memory reading logic inside a "semaphore" block.
Now, I have lots of code that looks like this:
if (sharedMemory[0] == '0') { ... }
In this case, I am just looking at a single char in memory. I guess I don't have to worry about putting a semaphore around this, do I?
And what if instead I have something like
if (sharedMemory[0] == '0' && sharedMemory[1] == '1') { ... }
From my perspective, I guess that as these are 2 operations, I'd have to consider this a critical section, and thus put a semaphore around it. Am I right?
Thanks!
Technically, on a multicore or multiprocessor system the only things that are atomic are the assembly opcodes specifically documented as atomic. Even reading a single byte carries a (quite small) chance that another processor will come along and modify it before you're done reading it, except in some cases that depend on the CPU cache and aligned chunks of memory (Fun thread: http://software.intel.com/en-us/forums/showthread.php?t=76744, Interesting read: http://www.corensic.com/CorensicBlog/tabid/101/EntryId/8/Memory-Consistency-Models.aspx)
You must either use types which internally guarantee atomicity or specifically protect accesses on multithreaded multicore systems.
(The answer may change slightly on IL platforms like .NET and JVMs since they make their own guarantees about what's atomic and what isn't).
Definitely lock around non-atomic operations; checking two different values counts as a non-atomic operation, although there are tricks you can use to check up to four bytes or more, provided your processor doesn't cache the results. You have to consider how your data is used. But basically, any access to shared memory should have a semaphore around it.
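A minimal sketch of "a semaphore around any access", applied to the two-byte check from the question: SharedBlock, write_pair and pair_is_01 are hypothetical names, and the sketch assumes threads within one process on a platform with unnamed POSIX semaphores. For separate processes the semaphore itself would have to live in the shared memory and be created as process-shared, or be a named semaphore.

```cpp
#include <cassert>
#include <cstddef>
#include <semaphore.h>

// Hypothetical stand-in for the question's shared block and its semaphore.
struct SharedBlock {
    char bytes[1024] = {};
    sem_t lock;

    SharedBlock() { sem_init(&lock, 0, 1); }  // binary semaphore, initially free
    ~SharedBlock() { sem_destroy(&lock); }
};

// Producer side: both bytes change inside one critical section.
void write_pair(SharedBlock& block, std::size_t offset, char first, char second) {
    assert(offset + 1 < sizeof block.bytes);
    sem_wait(&block.lock);
    block.bytes[offset] = first;
    block.bytes[offset + 1] = second;
    sem_post(&block.lock);
}

// Consumer side: both bytes are read under the same semaphore, so the check
// never sees one byte from before a write_pair call and one from after it.
bool pair_is_01(SharedBlock& block, std::size_t offset) {
    assert(offset + 1 < sizeof block.bytes);
    sem_wait(&block.lock);
    const bool match = block.bytes[offset] == '0' && block.bytes[offset + 1] == '1';
    sem_post(&block.lock);
    return match;
}
```

The same pattern applies to the single-char check: it costs one uncontended wait/post pair, and it removes the need to reason about which individual reads happen to be atomic on a given CPU.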

Resources