What are the general design ideas for a thread-safe read-compute-write program based on its single-threaded version? - multithreading

Consider that a sequential version of the program already exists and implements a sequence of "read-compute-write" operations on a single input file and a single output file. The "read" and "write" operations are performed by third-party library functions which are hard (but possible) to modify, while the "compute" function is implemented by the program itself. The read/write library functions appear not to be thread-safe, since they operate on internal flags and internal memory buffers.
It was discovered that the program is CPU-bound, and the plan is to take advantage of multiple CPUs (up to 80) by designing a multi-processor version of the program using OpenMP. The idea is to instantiate multiple "compute" functions that share the same single input and single output.
Obviously, something needs to be done to ensure consistent access for reads, data transfers, computations and data storage. Possible solutions are: (hard) rewrite the IO library functions in a thread-safe manner, or (moderate) write a thread-safe wrapper for the IO functions that would also serve as a data cache.
Are there any general patterns that cover converting, wrapping or rewriting single-threaded code to comply with OpenMP thread-safety assumptions?
EDIT1: The program is fresh enough that it can still be changed to make it multi-threaded (or, more generally, parallel, whether implemented by multi-threading, multi-processing or other means).

As a quick response: if you are processing a single file and writing to another, with OpenMP it's easy to convert the sequential version of the program to a multi-threaded version without taking too much care over the IO part, provided that the compute algorithm itself can be parallelized.
This is true because usually the main thread takes care of the IO. If this cannot be achieved because the chunks of data are too big to read at once, and the compute algorithm cannot process smaller chunks, you can use the OpenMP API to synchronize the IO in each thread. This does not mean that the whole application will stop or wait until the other threads finish computing so that new data can be read or written; it means that only the read and write parts need to be done atomically (one thread at a time).
For example, if the flow of your sequential application is as follows:
1) Read
2) compute
3) Write
Given that it truly can be parallelized, and each chunk of data needs to be read from within each thread, each thread could follow this design:
1) Synchronized read of a chunk from the input (only one thread at a time can execute this section)
2) Compute chunk of data (done in parallel)
3) Synchronized write of the computed chunk to the output (only one thread at a time can execute this section)
If you need to write the chunks in the same order you read them, you need to buffer first, or adopt a different strategy such as fseek to the correct position, but that really depends on whether the output file size is known from the start, ...
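The three steps above can be sketched as follows. This is a Python illustration rather than OpenMP, and it assumes each result has the same length as its input chunk, so a worker can seek to the chunk's own offset when writing. The names `compute`, `CHUNK_SIZE` and `run` are hypothetical, and in-memory `BytesIO` objects stand in for the real files.

```python
import io
import threading

CHUNK_SIZE = 4  # hypothetical chunk size in bytes

def compute(chunk: bytes) -> bytes:
    """Stand-in for the real CPU-bound work (length-preserving by assumption)."""
    return chunk.upper()

def worker(src, dst, read_lock, write_lock):
    while True:
        # 1) Synchronized read: only one thread at a time touches the input.
        with read_lock:
            offset = src.tell()
            chunk = src.read(CHUNK_SIZE)
        if not chunk:
            return
        # 2) Compute runs outside any lock, so workers overlap here.
        result = compute(chunk)
        # 3) Synchronized write: seek to the chunk's own offset so the output
        #    order does not depend on which worker finishes first.
        with write_lock:
            dst.seek(offset)
            dst.write(result)

def run(data: bytes, n_workers: int = 4) -> bytes:
    src, dst = io.BytesIO(data), io.BytesIO()
    read_lock, write_lock = threading.Lock(), threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(src, dst, read_lock, write_lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dst.getvalue()
```

Recording the offset inside the read lock is what ties each chunk to its place in the output; without it the write step would need the buffering approach instead.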
Pay special attention to the OpenMP scheduling strategy, because the default may not be the best for your compute algorithm. And if you need to share results between threads, like the offset of the input file you have read, you may use the reduction operations provided by the OpenMP API, which are far more efficient than making a single part of your code run atomically across all threads just to update a global variable; OpenMP knows when it's safe to write.
Regarding the "read, process, write" operation, as long as you keep each read and write atomic across all workers, I can't think of any reason you would run into trouble. Even when the data read is stored in an internal buffer, having every worker access it atomically means the data is acquired in exactly the same order. You only need to pay special attention when saving a chunk to the output file, because you don't know the order in which the workers will finish processing their assigned chunks, so you could have a chunk ready to be saved that was read after others that are still being processed. You just need each worker to keep track of the position of its chunk, and you can keep a list of pointers to chunks waiting to be saved until you have an unbroken sequence of finished chunks following the last one saved to the output file. Some additional care may need to be taken here.
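The bookkeeping described here (hold finished chunks until the run of consecutive positions is complete) might look like the following Python sketch. `OrderedWriter`, `submit` and the `sink` callable are hypothetical names, and the sequence number is assumed to be assigned when the chunk is read.

```python
import threading

class OrderedWriter:
    """Buffers finished chunks and emits them strictly in read order."""

    def __init__(self, sink):
        self._sink = sink            # any callable that accepts one chunk
        self._lock = threading.Lock()
        self._pending = {}           # sequence number -> finished chunk
        self._next_seq = 0           # next sequence number to emit

    def submit(self, seq, chunk):
        with self._lock:
            self._pending[seq] = chunk
            # Flush the longest run of consecutive chunks now available.
            while self._next_seq in self._pending:
                self._sink(self._pending.pop(self._next_seq))
                self._next_seq += 1
```

A worker that finishes early simply parks its chunk in `_pending`; the chunk is written as soon as every earlier chunk has arrived.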
If you are worried about the internal buffer itself (and keeping in mind that I don't know the library you are talking about, so I may be wrong): if you request some chunk of data, that internal buffer should only be modified after you requested the data and before the data is returned to you. Since you made that request atomically (meaning every other worker has to wait in line for its turn), when the next worker asks for its piece of data, the internal buffer should be in the same state as when the previous worker received its chunk. Even if the library explicitly says it returns a pointer into the internal buffer rather than a copy of the chunk, you can copy the chunk into the worker's own memory before releasing the lock on the whole atomic read operation.
If the pattern I suggested is followed correctly, I really don't think you will run into any problem that you wouldn't also find in the sequential version of the algorithm.

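With a little synchronisation you can go even further. Consider something like this:
#pragma omp parallel sections num_threads(3)   /* 3 is assumed: one thread per section */
{
    #pragma omp section
    input();        /* read chunks; notify*() the compute section */
    #pragma omp section
    #pragma omp parallel num_threads(N)
    compute();      /* wait*() for a chunk, process it, notify*() the writer */
    #pragma omp section
    output();       /* wait*() for computed chunks and write them */
}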
So, the basic idea would be that input() and output() read/write chunks of data. The compute part then would work on a chunk of data while the other threads are reading/writing. It will take a bit of manual synchronization work in notify*() and wait*(), but that's not magic.
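As a rough illustration of that three-stage layout, here is a Python sketch in which blocking queues play the role of notify*() and wait*(). All names (`input_stage`, `compute_stage`, `output_stage`, `run_pipeline`, `compute`) are hypothetical, and summing a list stands in for the real computation.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # sentinel marking the end of the stream

def compute(chunk):
    return sum(chunk)  # stand-in for the real computation

def input_stage(chunks, to_compute):
    for chunk in chunks:
        to_compute.put(chunk)          # put() acts as notify*()
    to_compute.put(DONE)

def compute_stage(to_compute, to_output, n_threads):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # get() acts as wait*(); futures are queued in read order.
        while (chunk := to_compute.get()) is not DONE:
            to_output.put(pool.submit(compute, chunk))
    to_output.put(DONE)

def output_stage(to_output, results):
    while (future := to_output.get()) is not DONE:
        results.append(future.result())  # blocks until that chunk is computed

def run_pipeline(chunks, n_threads=4):
    to_compute, to_output = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
    results = []
    stages = [
        threading.Thread(target=input_stage, args=(chunks, to_compute)),
        threading.Thread(target=compute_stage, args=(to_compute, to_output, n_threads)),
        threading.Thread(target=output_stage, args=(to_output, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Because the futures are queued in the order the chunks were read, the output stage writes results in input order even when the computations finish out of order. The bounded queues keep the reader from running far ahead of the compute threads.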


Which one should I use in Clojure: go block or thread?

I want to see the intrinsic difference between a thread and a long-running go block in Clojure. In particular, I want to figure out which one I should use in my context.
I understand that a go block runs on a so-called thread pool whose default size is 8, whereas thread creates a new thread.
In my case, there is an input stream that takes values from somewhere, and each value is taken as an input. Some calculations are performed and the result is put on a result channel. In short, we have an input and an output channel, and the calculation is done in a loop. To achieve concurrency, I have two choices: use a go block or use thread.
I wonder what is the intrinsic difference between these two. (We may assume there is no I/O during the calculations.) The sample code looks like the following:
;; Option 1: go block
(go-loop []
  (when-let [input (<! input-stream)]
    ... ; calculations here
    (>! result-chan result)
    (recur)))

;; Option 2: dedicated thread
(thread
  (loop []
    (when-let [input (<!! input-stream)]
      ... ; calculations here
      (put! result-chan result)
      (recur))))
I realize the number of threads that can run simultaneously is exactly the number of CPU cores. In that case, do go blocks and threads show no difference if I create more than 8 threads or go blocks?
I could measure the performance difference on my own laptop, but the production environment is quite different from the simulated one, so I could draw no conclusions.
By the way, the calculation is not so heavy. If the inputs are not so large, 8,000 loops can be run in 1 second.
Another consideration is whether go-block vs thread will have an impact on GC performance.
There are a few things to note here.
Firstly, the thread pool that threads are created on via clojure.core.async/thread is what is known as a cached thread pool, meaning that although it will re-use recently used threads inside that pool, it's essentially unbounded. This of course means it could potentially hog a lot of system resources if left unchecked.
But given that what you're doing inside each asynchronous process is very lightweight, threads seem a little overkill to me. Of course, it's also important to take into account the number of items you expect to hit the input stream: if this number is large you could potentially overwhelm core.async's thread pool for go macros, to the point where work is waiting for a thread to become available.
You also didn't mention precisely where you're getting the input values from: are the inputs some fixed data set that remains constant from the start of the program, or are inputs continuously fed into the input stream from some source over time?
If it's the former then I would suggest you lean more towards transducers and I would argue that a CSP model isn't a good fit for your problem since you aren't modelling communication between separate components in your program, rather you're just processing data in parallel.
If it's the latter then I presume you have some other process that's listening to the result channel and doing something important with those results, in which case I would say your usage of go-blocks is perfectly acceptable.

Do I need synchronization to read and write a common cache file in a multithread environment?

Consider the following algorithm, which is running on multiple threads at the same time:
for (i=0; i<10000; i++) {
    z = rand(0,50000);
    if (isset(cache[z])) results[z] = cache[z];
    else {
        result = z*100;
        cache[z] = result;
        results[z] = result;
    }
}
The cache and results are both variables shared among the threads. If this algorithm runs as it is, without synchronization, what kinds of errors can occur? If two threads try to write concurrently to cache[z] or results[z], can data be lost, or will the data simply be whatever was written by the thread that won the race?
A more concrete example: let's say Thread A and Thread B both try to write the number 1000 to cache[10] at the same time, and at the same moment Thread C tries to read the data in cache[10]. Can Thread C's read finish in an intermediate state, returning, say, 100, so that Thread C continues working with incorrect data?
USE CASE: The real-life use case behind this question is hash-table caches. If all of the threads use the same hash-table cache, reading and writing data from and to it, and the data they write to a specific key is always the same, do I need to synchronize these read and write operations?
Nobody could possibly know. Different languages, compilers, CPUs, platforms, and threading standards could handle this in entirely different ways. There's no way anyone can know what some future compiler, CPU, or platform might do. Unless the documentation or specification for the language or threading standard says what will happen in this case, there is absolutely no way to know what might happen. Of course, if something you're using guarantees particular behavior in this case, then what is guaranteed to happen will happen (unless it's broken).
At one time, there didn't exist any CPUs that buffered writes such that they could be visible out-of-order. But if you wrote code under the assumption that this meant that writes would never become visible out-of-order, that code would be broken on pretty much every modern platform.
This sad tale repeated over and over with numerous compiler optimizations that people never expected compilers to make but that compilers later made. Some of the aliasing fiascos come to mind.
Making decisions that require you to imagine correctly possible future evolutions of computing seems extremely unwise and has failed repeatedly, sometimes catastrophically, in the past.
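When nothing you rely on documents what unsynchronized access does, the conservative route is to make the synchronization explicit. The following Python sketch of the question's loop is one way to do that; it is an illustration of the idea rather than a statement about any particular platform, and `worker`, `run` and the single coarse `lock` are assumptions made for the example.

```python
import random
import threading

cache = {}
results = {}
lock = threading.Lock()

def worker(iterations, seed):
    rng = random.Random(seed)
    for _ in range(iterations):
        z = rng.randint(0, 50000)
        # Check, fill and copy under one lock, so no thread can observe
        # a partially updated entry.
        with lock:
            if z not in cache:
                cache[z] = z * 100
            results[z] = cache[z]

def run(n_threads=4, iterations=10000):
    threads = [
        threading.Thread(target=worker, args=(iterations, seed))
        for seed in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the lock in place the outcome no longer depends on what the runtime happens to do with concurrent writes; the cost is that every cache access is serialized.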

Using threadsafe initialization in a JRuby gem

We want to be sure we're using the correct synchronization (and no more than necessary) when writing thread-safe code in JRuby; specifically, in a Puma-instantiated Rails app.
UPDATE: Extensively re-edited this question to be very clear and to use the latest code we are implementing. This code uses the atomic gem written by @headius (Charles Nutter) for JRuby, but we are not sure it is totally necessary, or in which ways it's necessary, for what we're trying to do here.
Here's what we've got. Is this overkill (meaning, are we over/uber-engineering this), or perhaps incorrect?
require 'atomic' # gem from @headius

SUPPORTED_SERVICES = %w(serviceABC anotherSvc andSoOnSvc).freeze

module Foo
  def self.included(cls)
    cls.extend(ClassMethods)
    cls.send :__setup
  end

  module ClassMethods
    def get(service_name, method_name, *args)
      __cached_client(service_name).send(method_name.to_sym, *args)
      # we also capture exceptions here, but leaving those out for brevity
    end

    def __client(service_name)
      # obtain and return a client handle for the given service_name
      # we definitely want to cache the value returned from this method
      # **AND**
      # it is a requirement that this method ONLY be called *once PER service_name*.
    end

    def __cached_client(service_name)
      # return the cached client for service_name from @@_clients
    end

    def __setup
      @@_clients = Atomic.new({})
      @@_clients.update do |current_services|
        SUPPORTED_SERVICES.inject(Atomic.new({}).value) do |memo, service_name|
          if current_services[service_name]
            memo # already cached (this branch is assumed; it is missing in the source)
          else
            memo.merge({service_name => __client(service_name)})
          end
        end
      end
    end
  end
end

require 'ourgem'

class GetStuffFromServiceABC
  include Foo

  def self.get_some_stuff
    result = get('serviceABC', 'method_bar', 'arg1', 'arg2', 'arg3')
    puts result
  end
end
Summary of the above: we have @@_clients (a mutable class variable holding a Hash of clients) which we only want to populate ONCE for all available services, which are keyed on service_name.
Since the hash is in a class variable (and hence threadsafe?), are we guaranteed that the call to __client will not get run more than once per service name (even if Puma is instantiating multiple threads with this class to service all the requests from different users)? If the class variable is threadsafe (in that way), then perhaps the Atomic.new({}) is unnecessary?
Also, should we be using an Atomic.new(ThreadSafe::Hash) instead? Or again, is that not necessary?
If not (meaning: you think we do need the Atomic.new calls at least, and perhaps also the ThreadSafe::Hash), then why couldn't a second (or third, etc.) thread interrupt between the Atomic.new(nil) and the @@_clients.update do ... meaning the Atomic.new calls from EACH thread will EACH create two (separate) objects?
Thanks for any thread-safety advice, we don't see any questions on SO that directly address this issue.
Just a friendly piece of advice, before I attempt to tackle the issues you raise here:
This question, and the accompanying code, strongly suggests that you don't (yet) have a solid grasp of the issues involved in writing multi-threaded code. I encourage you to think twice before deciding to write a multi-threaded app for production use. Why do you actually want to use Puma? Is it for performance? Will your app handle many long-running, I/O-bound requests (like uploading/downloading large files) at the same time? Or (like many apps) will it primarily handle short, CPU-bound requests?
If the answer is "short/CPU-bound", then you have little to gain from using Puma. Multiple single-threaded server processes would be better. Memory consumption will be higher, but you will keep your sanity. Writing correct multi-threaded code is devilishly hard, and even experts make mistakes. If your business success, job security, etc. depends on that multi-threaded code working and working right, you are going to cause yourself a lot of unnecessary pain and mental anguish.
That aside, let me try to unravel some of the issues raised in your question. There is so much to say that it's hard to know where to start. You may want to pour yourself a cold or hot beverage of your choice before sitting down to read this treatise:
When you talk about writing "thread-safe" code, you need to be clear about what you mean. In most cases, "thread-safe" code means code which doesn't concurrently modify mutable data in a way which could cause data corruption. (What a mouthful!) That could mean that the code doesn't allow concurrent modification of mutable data at all (using locks), or that it does allow concurrent modification, but makes sure that it doesn't corrupt data (probably using atomic operations and a touch of black magic).
Note that when your threads are only reading data, not modifying it, or when working with shared stateless objects, there is no question of "thread safety".
Another definition of "thread-safe", which probably applies better to your situation, has to do with operations which affect the outside world (basically I/O). You may want some operations to only happen once, or to happen in a specific order. If the code which performs those operations runs on multiple threads, they could happen more times than desired, or in a different order than desired, unless you do something to prevent that.
It appears that your __setup method is only called when ourgem.rb is first loaded. As far as I know, even if multiple threads require the same file at the same time, MRI will only ever let a single thread load the file. I don't know whether JRuby is the same. But in any case, if your source files are being loaded more than once, that is symptomatic of a deeper problem. They should only be loaded once, on a single thread. If your app handles requests on multiple threads, those threads should be started up after the application has loaded, not before. This is the only sane way to do things.
Assuming that everything is sane, ourgem.rb will be loaded using a single thread. That means __setup will only ever be called by a single thread. In that case, there is no question of thread safety at all to worry about (as far as initialization of your "client cache" goes).
Even if __setup were to be called concurrently by multiple threads, your atomic code won't do what you think it does. First of all, you use Atomic.new({}).value. This wraps a Hash in an atomic reference, then unwraps it so you just get back the Hash. It's a no-op. You could just write {} instead.
Second, your Atomic#update call will not prevent the initialization code from running more than once. To understand this, you need to know what Atomic actually does.
Let me pull out the old, tired "increment a shared counter" example. Imagine the following code is running on 2 threads:
i += 1
We all know what can go wrong here. You may end up with the following sequence of events:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A writes its incremented value back to i.
Thread B writes its incremented value back to i.
So we lose an update, right? But what if we store the counter value in an atomic reference, and use Atomic#update? Then it would be like this:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A tries to write its incremented value back to i, and succeeds.
Thread B tries to write its incremented value back to i, and fails, because the value has already changed.
Thread B reads i again and increments it.
Thread B tries to write its incremented value back to i again, and succeeds this time.
Do you get the idea? Atomic never stops 2 threads from running the same code at the same time. What it does do, is force some threads to retry the #update block when necessary, to avoid lost updates.
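The retry behaviour can be made visible with a toy model. The Python sketch below is not the atomic gem's implementation; `AtomicRef`, `block_runs` and `hammer` are hypothetical names, and a small lock stands in for a hardware compare-and-swap. The point is only that the update block itself is never protected and may run more often than the number of successful updates.

```python
import threading

class AtomicRef:
    """Toy atomic reference. The lock only makes compare_and_set itself
    indivisible, standing in for a hardware compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._cas_lock = threading.Lock()
        self.block_runs = 0          # how many times an update block ran

    @property
    def value(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._cas_lock:
            self.block_runs += 1     # one CAS attempt per block execution
            if self._value is expected:
                self._value = new
                return True
            return False

    def update(self, block):
        while True:
            old = self._value
            new = block(old)         # NOT protected: may run in many threads at once
            if self.compare_and_set(old, new):
                return new
            # Another thread won the race: loop and run the block again.

def hammer(n_threads=8, n_updates=1000):
    counter = AtomicRef(0)

    def work():
        for _ in range(n_updates):
            counter.update(lambda v: v + 1)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, counter.block_runs
```

No update is lost, so the final value equals the number of updates requested, while `block_runs` is at least that large and grows with every retry.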
If your goal is to ensure that your initialization code will only ever run once, using Atomic is a very inappropriate choice. If anything, it could make it run more times, rather than less (due to retries).
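If the requirement really is "run the initializer once per service name", mutual exclusion is the more usual tool. A minimal Python sketch of that idea follows; `ClientCache` and `factory` are hypothetical names, and this is an illustration of the technique, not a drop-in replacement for the Ruby code.

```python
import threading

class ClientCache:
    """Creates each client at most once, even under concurrent access."""

    def __init__(self, factory):
        self._factory = factory      # hypothetical: builds a client for a service name
        self._lock = threading.Lock()
        self._clients = {}

    def get(self, service_name):
        # Only one thread at a time may check and initialize, so the
        # factory cannot run twice for the same service name.
        with self._lock:
            if service_name not in self._clients:
                self._clients[service_name] = self._factory(service_name)
            return self._clients[service_name]
```

Unlike the retry loop of an atomic update, a thread that loses the race here waits and then reuses the client the winner created.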
So, that is that. But if you're still with me here, I am actually more concerned about whether your "client" objects are themselves thread-safe. Do they have any mutable state? Since you are caching them, it seems that initializing them must be slow. Be that as it may, if you use locks to make them thread-safe, you may not be gaining anything from caching and sharing them between threads. Your "multi-threaded" server may be reduced to what is effectively an unnecessarily complicated, single-threaded server.
If the client objects have no mutable state, good for you. You can be "free and easy" and share them between threads with no problems. If they do have mutable state, but initializing them is slow, then I would recommend caching one object per thread, so they are never shared. Thread[] is your friend there.
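The one-object-per-thread idea can be sketched in Python with `threading.local`, which plays a role comparable to Ruby's Thread[] here. `make_client` and `client_for_current_thread` are hypothetical names, and a bare `object()` stands in for a slow-to-build, stateful client.

```python
import threading

_local = threading.local()   # each thread sees its own attributes

def make_client():
    return object()          # stand-in for a slow, stateful client

def client_for_current_thread():
    # Built lazily on first use in each thread, then reused by that thread only.
    if not hasattr(_local, "client"):
        _local.client = make_client()
    return _local.client
```

Since no client is ever visible to more than one thread, no locking is needed around its mutable state.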

Speed Up with multithreading

I have a parse method in my program, which first reads a file from disk, then parses the lines and creates an object for every line. For every file, a collection with the objects from the lines is saved afterwards. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: Can I expect a significant speed-up if I split the work into one thread just reading files from disk, another parsing the lines and a third saving the collections? Or might this slow the process down?
How long does a modern notebook hard disk typically take to read 300MB? I think the bottleneck in my task is the CPU, because when I execute the method one CPU core is always at 100% while the disk is idle more than half the time.
greetings, rain
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
            return null;
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the performance profiling tools are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code now to use threads. It looks now like this:
foreach (string str in checkedListBoxImport.CheckedItems)
    toImport.Add(str); // loop body assumed: it is missing in the source
for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called in importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems that I still have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that we can give you consistent figures for laptop hard disk performance, as we have no idea how old your laptop is, nor do we know whether the disk is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you would most likely be bottlenecked on IO.
That said, your best bet would be to create a new task for each file you are processing and send it to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were processing 300MB files, after all), the maximum amount of RAM you can use for the process.
Finally, to explain why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing, not IO. This means that if you split your application into separate read, process and write threads, your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you made them work asynchronously, you run the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start every thread immediately; instead, let them be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
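A capped pool of this kind might be sketched as below, in Python rather than C#. `import_all` and `import_operation` are hypothetical names and a short sleep stands in for the per-file work. Note that in CPython truly CPU-bound parsing would need processes rather than threads to use several cores; the capping idea is the same either way.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def import_all(names, max_workers):
    """Run one task per file on a pool capped at max_workers threads.

    Returns (results in input order, peak number of tasks running at once).
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def import_operation(name):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)             # stand-in for read + parse + save
        with lock:
            state["active"] -= 1
        return name.upper()

    # Tasks beyond max_workers wait in the executor's internal queue
    # instead of each getting a thread of its own.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(import_operation, names))
    return results, state["peak"]
```

The peak counter is only there to show that, however many files are submitted, no more than `max_workers` of them are in progress at any moment.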
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
Different things could be impacting performance depending on the situation. Typically, reading from the hard disk is the biggest bottleneck unless you have something intensive going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300MB from a hard disk (unless it's badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to use multithreading to read and write at the same time; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of your time may be profiling and improving the calculation itself. Using multiple threads for the calculation will speed things up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk-class of lines and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free, (looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large r/w are performed, (except for the last chunk which will be, in general, only partly filled), provides inherent flow-control, (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context-switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, be waiting on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, so supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
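The chunk-pool design described above can be sketched as follows, in Python for brevity. All names are hypothetical, the pool and chunk sizes are shrunk so that recycling actually happens on a small input, and splitting a line on whitespace stands in for building a message object.

```python
import itertools
import queue
import threading

POOL_SIZE = 4         # the text suggests 16; small here to exercise recycling
LINES_PER_CHUNK = 3   # the text suggests 1024
STOP = None           # sentinel passed down the pipeline at end of input

class Chunk:
    def __init__(self):
        self.lines = []
        self.messages = []

def read_thread(lines, pool, to_process):
    source = iter(lines)
    while True:
        chunk = pool.get()            # blocks if every chunk is in flight
        chunk.lines = list(itertools.islice(source, LINES_PER_CHUNK))
        if not chunk.lines:           # input exhausted
            pool.put(chunk)
            to_process.put(STOP)
            return
        to_process.put(chunk)

def process_thread(to_process, to_write):
    while (chunk := to_process.get()) is not STOP:
        chunk.messages = [line.split() for line in chunk.lines]
        to_write.put(chunk)
    to_write.put(STOP)

def write_thread(to_write, pool, out):
    while (chunk := to_write.get()) is not STOP:
        out.extend(chunk.messages)    # stand-in for writing to disk
        chunk.lines, chunk.messages = [], []
        pool.put(chunk)               # recycle for the read thread

def run(lines):
    pool, to_process, to_write = queue.Queue(), queue.Queue(), queue.Queue()
    for _ in range(POOL_SIZE):
        pool.put(Chunk())
    out = []
    threads = [
        threading.Thread(target=read_thread, args=(lines, pool, to_process)),
        threading.Thread(target=process_thread, args=(to_process, to_write)),
        threading.Thread(target=write_thread, args=(to_write, pool, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The flow control comes for free: once all `POOL_SIZE` chunks are in flight, `pool.get()` blocks the read thread until the write thread returns one.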
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk-disking by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..

C++/CLI efficient multithreaded circular buffer

I have four threads in a C++/CLI GUI I'm developing:
Collects raw data
The GUI itself
A background processing thread which takes chunks of raw data and produces useful information
Acts as a controller which joins the other three threads
I've got the raw data collector working and posting results to the controller, but the next step is to store all of those results so that the GUI and background processor have access to them.
New raw data is fed in one result at a time at regular (frequent) intervals. The GUI will access each new item as it arrives (the controller announces new data and the GUI then accesses the shared buffer). The data processor will periodically read a chunk of the buffer (a second's worth, for example) and produce a new result. So effectively, there's one producer and two consumers which need access.
I've hunted around, but none of the CLI-supplied stuff sounds all that useful, so I'm considering rolling my own: a shared circular buffer which allows write locks for the collector and read locks for the GUI and data processor. This will allow multiple threads to read the data as long as those sections of the buffer are not being written to.
So my question is: Are there any simple solutions in the .NET libraries which could achieve this? Am I mad for considering rolling my own? Is there a better way of doing this?
Is it possible to rephrase the problem so that:
The Collector collects a new data point ...
... which it passes to the Controller.
The Controller fires a GUI "NewDataPointEvent" ...
... and stores the data point in an array.
If the array is full (or otherwise ready for processing), the Controller sends the array to the Processor ...
... and starts a new array.
If the values passed between threads are not modified after they are shared, this might save you from needing the custom thread-safe collection class, and reduce the amount of locking required.
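The rephrased flow above might be sketched like this, in Python rather than C++/CLI. `Controller`, `accept`, the `on_new_point` callback and the batch size are hypothetical; the batch is handed over as an immutable tuple so the processor can read it without locks.

```python
import queue

class Controller:
    """Receives points from the collector, notifies the GUI, batches for the processor."""

    def __init__(self, batch_size, on_new_point, to_processor):
        self._batch_size = batch_size
        self._on_new_point = on_new_point   # stand-in for the GUI "NewDataPointEvent"
        self._to_processor = to_processor   # queue.Queue read by the processor thread
        self._batch = []

    def accept(self, point):
        self._on_new_point(point)
        self._batch.append(point)
        if len(self._batch) >= self._batch_size:
            # Hand off an immutable snapshot, then start a new array.
            self._to_processor.put(tuple(self._batch))
            self._batch = []
```

Only the controller ever mutates the current batch, and a batch is never touched again after it has been queued, which is what removes the need for a shared, lock-protected buffer.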
