Please tell me what is wrong with my threading! - multithreading

I have a function that compresses a bunch of files into a single compressed file. It takes a long time, so I tried adding threading to my application. Say I have 20 files to compress: I split them into 4 groups of 5, one group per thread. To avoid locks, each of the 4 threads has its own set of variables for the compression, and I wait until all 4 threads finish. The threads run, but I see no improvement in performance. Compressing 20 files normally takes about 1 minute (for example); with threading the difference is only 3 to 5 seconds, and sometimes the time is the same.
Here is the code for one thread (the other 3 threads are the same):
// main thread
myClassObject->thread1 = AfxBeginThread((AFX_THREADPROC)MyThreadFunction1, myClassObject);
....
HANDLE threadHandles[4];
threadHandles[0] = myClassObject->thread1->m_hThread;
....
WaitForSingleObject(myClassObject->thread1->m_hThread, INFINITE);
// worker thread
UINT MyThreadFunction1(LPARAM lparam)
{
    CMerger* myClassObject = (CMerger*)lparam;
    CString outputPath = myClassObject->compressedFilePath.GetAt(0); // contains the o/p path
    wchar_t* compressInputData[] = {myClassObject->thread1outPath,
                                    COMPRESS, (wchar_t*)(LPCTSTR)(outputPath)};
    HINSTANCE loadmyDll;
    loadmyDll = LoadLibrary(myClassObject->thread1outPath);
    fp_Decompress callCompressAction = NULL;
    int getCompressResult = 0;
    myClassObject->MyCompressFunction(compressInputData, loadClient7zdll, callCompressAction, myClassObject->thread1outPath,
                                      getCompressResult, minIndex, myClassObject->firstThread, myClassObject);
    return 0;
}

Firstly, you only wait on one of the threads. I think you want WaitForMultipleObjects.
As for the lack of speed up have you considered that your actual bottleneck is NOT the compression but the file loading? File loading is slow and 4 threads contending for time slices of the hard disk "could" even result in lower performance.
This is why premature optimisation is evil. You need to profile, profile and profile again to work out where your REAL bottlenecks are.
Edit: I can't really comment on your WaitForMultipleObjects unless I see the code. I have never had any problems with it myself ...
As for the bottleneck: it's a metaphor. If you pour a large amount of liquid out of a cylinder by tipping it upside down, the liquid leaves at a constant rate. If you try this with a bottle, you will notice that it can't empty as fast, because only so much liquid can flow through the thin part of the bottle (not to mention the air that has to enter). The rate at which the container empties is thus limited by the neck of the bottle.
In programming, a bottleneck is the slowest part of the code. In this case, if your threads spend most of their time waiting for disk loads to complete, then you will get very little speed up from multithreading, as you can only load so much at once. In fact, when you try to load 4 times as much at once, you will find that you have to wait around just as long for the loads to complete. In the single-threaded case you wait for a file and, once it's loaded, you compress it. In the 4-threaded case you wait around 4 times as long for all the loads to complete and then compress all 4 files simultaneously. This is why you get only a small speed up. Because you spend most of your time waiting for the loads to complete, you won't see anything approaching a 4x speed up. The limiting factor of your method is not the compression but loading the files from disk, which is why it is called a bottleneck.
Edit2: In a case like yours, the best speed up will come from reducing the time you spend waiting for data to load from disk.
1) If you load a file in multiples of the disk sector size (commonly 512 or 4096 bytes; you can query Windows to get the size), you get the best possible load performance. If you load sizes that aren't a multiple of this, you will take quite a serious performance hit.
2) Look at asynchronous loading. For example, you could be loading all of file 2 (or more) into memory while you are processing file 1, so that you aren't waiting around for the load to complete. It's unlikely, though, that you'll get a vast speed up here, as you'll probably still end up waiting for the load. The other thing to try is to load "chunks" of the file asynchronously, i.e.:
Load chunk 1.
Start chunk 2 loading.
Process chunk 1.
Wait for chunk 2 to load.
Start chunk 3 loading.
Process chunk 2.
(And so on)
3) You could just buy a faster disk drive.
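The chunk schedule in point 2 can be sketched in a few lines. This is only an illustration in Python rather than the MFC/C++ of the question; `compress_with_prefetch`, `_load` and the 64 KB default chunk size are hypothetical names and values, and `zlib` stands in for whatever compressor the real code calls.

```python
import io
import threading
import zlib


def _load(stream, size, box):
    # Helper run on the background thread: read the next chunk into `box`.
    box["data"] = stream.read(size)


def compress_with_prefetch(stream, chunk_size=64 * 1024):
    """Compress `stream` while the next chunk is being read in the background."""
    compressor = zlib.compressobj()
    out = []
    current = stream.read(chunk_size)            # Load chunk 1.
    while current:
        box = {}
        loader = threading.Thread(target=_load, args=(stream, chunk_size, box))
        loader.start()                           # Start the next chunk loading.
        out.append(compressor.compress(current)) # Process the current chunk.
        loader.join()                            # Wait for the next chunk to load.
        current = box["data"]
    out.append(compressor.flush())
    return b"".join(out)
```

Only one thread touches the stream at any time, so no lock is needed. Whether this helps at all depends on how much of the run time is spent waiting for the disk, which is exactly what profiling should tell you first.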

Nodejs and calculation heavy operations, utilizing cpu to the maximum with worker threads, while still getting some responsiveness

I'm trying to solve the following scenario in Node.js in a performant manner.
I have 100 MB worth of JSON files to process, and the time complexity of processing each entry is about O(sweet_jesus(n)). In real time it takes about 4-5 seconds per entry.
The only silver lining is that I can process each entry individually (about 900 entries in total); they are unrelated.
My first choice was to go for worker_threads with node-worker-threads-pool:
import fs from 'fs';
import path from 'path';
import _ from 'lodash';
import moment from 'moment';
import workerPool from 'node-worker-threads-pool';
function generateShortEvaluationsByWorkers() {
  const pool = new workerPool.StaticPool({
    size: 10,
    task: path.resolve('src/simulator/evaluationGenerator.js')
  });
  let simulationEvaluations = [];
  const promises = [];
  fs.readdirSync(path.resolve(`results/companies`)).forEach(file => {
    const rawData = fs.readFileSync(path.resolve(`results/companies/${file}`));
    const company = JSON.parse(rawData);
    console.log(new Date(), ": company parsed, sending it for processing:", file);
    promises.push(pool.exec(company).then(result => {
      simulationEvaluations.push(result);
    }));
  });
  Promise.all(promises).then(() => {
    fs.writeFileSync(
      path.resolve(`results/bundles/simulationEvaluations.json`), JSON.stringify(simulationEvaluations, null, 2)
    );
    pool.destroy();
  });
}
The above code runs beautifully, it shows that the I/O - of reading all the files and feeding it to the pool - takes about 5-6 seconds...
But after that there is absolutely no difference whatsoever compared to running the whole thing in a single thread. The logs do show that the individual files are no longer processed in order as before, so I guess some threading is happening in the background, but the total time does not change one bit. It takes about an hour either way.
Also, my hyper-threaded Intel 8750 with 6 cores (12 logical) shows 86% utilization for the node process. So my alleged 10 separate threads don't even manage to utilize one full core. EDIT: I was wrong, it does make a huge difference; I wrote down the times wrong...
After this I cranked the thread pool size up to 100 and cut the number of files down to 100. And that's where freaky stuff starts to happen. First, all my CPU cores go brrrr and my laptop properly melts through the table, as one would expect. The OS gives zero responsiveness; everything is a slideshow.
The first 20 or so files get processed within the same second, after which the individual files take ~3 seconds each (neatly after each other, one message 3-5 seconds after the other). The last 10 or so files get processed within the same second again.
1) Why don't 10 threads make a difference compared to 1 thread?
2) Shouldn't I see files processed in clusters, where the cluster size is comparable to the number of logical cores, instead of timestamps one after the other?
3) Is there a way to "leave" a core free to process something else, while calculations still go to Neptune with all the other cores?
EDIT: I wont delete this, maybe somebody will learn from it :)
So to answer my own questions:
1) It does. I could not measure, could not write, and could not read my CPU meter either at this point... totally my fault.
2) This one I still don't fully get, but after a few runs I suspect that when you start a whole load of threads, the strain of starting them all makes the system hang so much that by the time it is able to spew out the first log, it is already done with a bunch of calculations.
3) Yeah, this is also kind of obvious: do not use so many threads that the thread management itself overwhelms the OS.
In the end I got the best results with 11 threads btw.
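A rule of thumb consistent with that result is to size the pool from the number of logical cores and hold one back for the OS and the main thread. The sketch below is in Python rather than Node.js, and `pool_size` and its `reserve` parameter are hypothetical names used only for illustration.

```python
import os


def pool_size(logical_cores=None, reserve=1):
    """Workers to start: all logical cores minus `reserve`, but at least one."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1   # cpu_count() may return None
    return max(1, logical_cores - reserve)


if __name__ == "__main__":
    # Size a worker pool for the machine this runs on.
    print("workers to start:", pool_size())
```

With the 12 logical cores mentioned above, `pool_size(12)` gives 11, which matches the pool size that worked best here. Treat it as a starting point to measure from, not a guarantee.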

Invoke progressively getting slower

I've been investigating performance issues in my app, and it boils down to the time taken to call Invoke progressively getting longer. I am using System.Diagnostics.Stopwatch to time the Invoke call itself, and while it starts off at 20ms, after a few hundred calls it is around 4000ms. Logging shows the time steadily increasing (at first by ~2ms per call, then by ~100ms and more). I have three Invokes, all are exhibiting the same behaviour.
I am loading medical images, and I need to keep my UI responsive while doing so, hence the use of a background worker where I can load and process images. Once loaded, they need to be added to the main UI for the user to see.
The problem didn't present itself until I tried to load a study of over 800 images. Previously my test sets have been ~100 images, ranging in total size from 400MB to 16GB. The problem set is only 2GB in size and takes close to 10 minutes to approach 50%, and the 16GB set loads in ~30s total, thus ruling out total image size as the issue. For reference my development machine has 32GB RAM. I have ensured that it is not the contents of the invoked method by commenting the entire thing out.
What I want to understand is how is it possible for the time taken to invoke to progressively increase? Is this actually a thing? My call stacks are not getting deeper, Number of threads is consistent, what resource is being consumed to cause this? What am I missing!?
public void UpdateThumbnailInfo(Thumbnail thumb, ThumbnailInfo info)
{
    if (InvokeRequired)
    {
        var sw = new Stopwatch();
        sw.Start();
        Invoke((Action<Thumbnail, ThumbnailInfo>) UpdateThumbnailInfo, thumb, info);
        Log.Debug("Update Thumbnail Info Timer: {Time} ms - {File}", (int) sw.ElapsedMilliseconds, info.Filename);
    }
    else
    {
        // Do stuff here
    }
}
Looks like you are calling UpdateThumbnailInfo from a different thread. If so, then this is the expected behavior. What is happening is you are queuing hundreds of tasks on the UI thread. For every loaded image the UI needs to do a lot of things, so as the number of images increases, the overall operations grow slow.
A few things that you can do:
* Use BeginInvoke in place of Invoke. As your function is void type, you will not need EndInvoke
* Use SuspendLayout and ResumeLayout to prevent UI from incrementally updating, and rather update everything once when all images are loaded.
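To see why batching the layout can matter so much, here is a rough cost model, written in Python purely as arithmetic. It rests on an assumption that is not confirmed by the question: that every incremental update triggers a layout pass whose cost is proportional to the number of thumbnails already shown. Both function names are hypothetical.

```python
def incremental_layout_cost(n_items, cost_per_item=1):
    # One layout pass after every added item; pass k touches k items.
    return sum(cost_per_item * k for k in range(1, n_items + 1))


def batched_layout_cost(n_items, cost_per_item=1):
    # A single layout pass after all items have been added.
    return cost_per_item * n_items


if __name__ == "__main__":
    for n in (100, 800):
        print(n, incremental_layout_cost(n), batched_layout_cost(n))
```

Under this assumption, going from 100 to 800 images multiplies the incremental cost by roughly 63 (5050 versus 320400 units) while the batched cost only grows 8 times. That would fit a problem that stays hidden with ~100 images and shows up at 800, but only profiling the UI thread can confirm it.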

How to measure multithreaded process time on a multitasking environment?

Since I am running performance evaluation tests of my multithreaded program on a (preemptive) multitasking, multicore environment, the process can get swapped out periodically. I want to compute the latency, i.e., only the duration when the process was active. This will allow me to extrapolate how the performance would be on a non-multitasking environment, i.e., where only one program is running (most of the time), or on different workloads.
Usually two kinds of time are measured:
* The wall-clock time (i.e., the time since the process started), but this includes the time when the process was swapped out.
* The processor time (i.e., the sum total of CPU time used by all threads), but this is not useful for computing the latency of the process.
I believe what I need is makespan of times of individual threads, which can be different from the maximum CPU time used by any thread due to the task dependency structure among the threads. For example, in a process with 2 threads, thread 1 is heavily loaded in the first two-third of the runtime (for CPU time t) while thread 2 is loaded in the later two-third of the runtime of the process (again, for CPU time t). In this case:
* wall-clock time would return 3t/2 + context switch time + time used by other processes in between,
* max CPU time of all threads would return a value close to t, and
* total CPU time is close to 2t.
What I hope to receive as output of measure is the makespan, i.e., 3t/2.
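The arithmetic of this example can be written down as the length of the union of the threads' busy intervals. The sketch below is in Python and only shows the calculation; `makespan` is a hypothetical helper, and it assumes you already have a (start, end) busy interval per thread, which is the hard part in practice.

```python
def makespan(busy_intervals):
    """Total length of the union of (start, end) busy intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(busy_intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


t = 2.0
thread1 = (0.0, t)             # busy during the first two thirds of the run
thread2 = (t / 2, 3 * t / 2)   # busy during the last two thirds of the run

span = makespan([thread1, thread2])                            # 3t/2
longest = max(end - start for start, end in (thread1, thread2))  # t
total_cpu = sum(end - start for start, end in (thread1, thread2))  # 2t
```

For t = 2 this gives a makespan of 3, a maximum per-thread time of 2 and a total of 4, matching the 3t/2, t and 2t above.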
Furthermore, multi-threading brings indeterminacy of its own. This issue can probably be taken care of by running the test multiple times and summarizing the results.
Moreover, the latency also depends on how the OS schedules the threads; things get more complicated if some threads of a process wait for the CPU while others run. But let's forget about this.
Is there an efficient way to compute or approximate this makespan time? For code examples, please use any programming language, but preferably C or C++ on Linux.
PS: I understand this definition of makespan is different from what is used in scheduling problems. The definition used in scheduling problems is similar to wall-clock time.
Reformulation of the Question
I have written a multi-threaded application which takes X seconds to execute on my K-core machine.
How do I estimate how long the program will take to run on a single-core computer?
Empirically
The obvious solution is to get a computer with one core, and run your application, and use Wall-Clock time and/or CPU time as you wish.
...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).
How to do this will depend on the Operating System, but one of the first results I found from Google explains a few approaches for Windows XP and Vista.
http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html
Following that you could:
* Set your application's process affinity to a single core (you can also do this in your code).
* Start your operating system so that it only knows about one of your cores (and switch back afterwards).
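Since the question prefers Linux, here is a minimal sketch of the first option done in code. It is in Python and uses `os.sched_setaffinity`, which exists only on some platforms (Linux among them); `run_on_single_core` is a hypothetical wrapper name.

```python
import os


def run_on_single_core(fn):
    """Run fn() pinned to one CPU, then restore the previous affinity mask.

    Where os.sched_setaffinity is not available, fn() simply runs unpinned.
    """
    if not hasattr(os, "sched_setaffinity"):
        return fn()
    original = os.sched_getaffinity(0)
    # Pick a CPU we are actually allowed to use (matters inside containers).
    os.sched_setaffinity(0, {min(original)})
    try:
        # On Linux, threads started inside fn() inherit this mask.
        return fn()
    finally:
        os.sched_setaffinity(0, original)
```

Calling `run_on_single_core(my_benchmark)` and timing it with the wall clock gives the empirical single-core figure. From a shell, `taskset -c 0 <your_prog>` has a similar effect without touching the code.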
Independent Parallelism
Estimating this analytically requires knowledge about your program, the method of parallelism, etc.
As a simple example, suppose I write a multi-threaded program that calculates the ten billionth decimal digit of pi and the ten billionth decimal digit of e.
My code looks like:
public static void Main()
{
    Task t1 = new Task(calculatePiDigit);
    Task t2 = new Task(calculateEDigit);
    t1.Start();
    t2.Start();
    Task.WaitAll(t1, t2);
}
And the happens-before graph looks like:
[Figure: happens-before graph, not reproduced]
Clearly these are independent.
In this case:
1) Time calculatePiDigit() by itself.
2) Time calculateEDigit() by itself.
3) Add the times together.
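The three steps above can be sketched as follows. This is Python rather than the C#-style pseudocode of the answer, and `timed` and `estimate_sequential_time` are hypothetical helper names; the two lambdas in the usage part are placeholders for the real tasks.

```python
import time


def timed(fn):
    """Wall-clock seconds taken by one call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def estimate_sequential_time(tasks):
    """Estimate for independent tasks: the sum of their stand-alone times."""
    return sum(timed(task) for task in tasks)


if __name__ == "__main__":
    # Placeholder workloads standing in for calculatePiDigit / calculateEDigit.
    estimate = estimate_sequential_time([
        lambda: sum(range(200_000)),
        lambda: sum(range(300_000)),
    ])
    print(f"estimated single-core time: {estimate:.4f} s")
```

This only holds when the tasks really are independent; once they share data or wait on each other, the simple sum no longer applies.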
2-Stage Pipeline
When the tasks are not independent, you won't be able to just add the individual times together.
In this next example, I create a multi-threaded application to take 10 images, convert them to grayscale, and then run a line detection algorithm. For some external reason, the images are not allowed to be processed out of order. Because of this, I use a pipeline pattern.
My code looks something like this:
ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();
public static void Main()
{
    // Stage 1 converts to grayscale, stage 2 runs line detection.
    PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
    PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);
    p1.Start();
    p2.Start();
    originalImages.Enqueue(image1);
    originalImages.Enqueue(image2);
    //...
    originalImages.Enqueue(image10);
    originalImages.Enqueue(CancellationToken); // pseudocode: marks the end of the input
    Task.WaitAll(p1, p2);
}
A data-centric happens-before graph: [figure not reproduced]
If this program had been designed as a sequential program to begin with, for cache reasons it would be more efficient to take each image one at a time and move them to completed, before moving to the next image.
Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can just time each independently and then multiply them by 10.
But what about the costs of pushing/popping/polling the ConcurrentQueues?
Assuming the images are large, that time will be negligible.
If there are millions of small images, with many consumers at each stage, then you will probably find that the overhead of waiting on locks, mutexes, etc, is very small when a program is run sequentially (assuming that the amount of work performed in the critical sections is small, such as inside the concurrent queue).
Costs of Context Switching?
Take a look at this question:
How to estimate the thread context switching overhead?
Basically, you will have context switches in multi-core environments and in single-core environments.
The overhead to perform a context switch is quite small, but they also occur very many times per second.
The danger is that the cache gets fully disrupted between context switches.
For example, ideally:
image1 gets loaded into the cache as a result of doing GrayScale
LineDetection will run much faster on image1, since it is in the cache
However, this could happen:
image1 gets loaded into the cache as a result of doing GrayScale
image2 gets loaded into the cache as a result of doing GrayScale
now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
Conclusion
Nothing beats timing on the same environment it will be run in.
Next best is to simulate that environment as well as you can.
Regardless, understanding your program's design should give you an idea of what to expect in a new environment.

Need thoughts on profiling of multi-threading in C on Linux

My application scenario is like this: I want to evaluate the performance gain one can achieve on a quad-core machine for processing the same amount of data. I have following two configurations:
i) 1-Process: A program without any threading and processes data from 1M .. 1G, while system was assumed to run only single core of its 4-cores.
ii) 4-threads-Process: A program with 4-threads (all threads performing same operation) but processing 25% of the input data.
In my program for creating 4 threads, I used pthread's default options (i.e., without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-process configuration should be close to 400% (or somewhere between 350% and 400%).
I profiled the time spent creating the threads like this:
timer_start(&threadCreationTimer);
pthread_create(&thread0, NULL, fun0, NULL);
pthread_create(&thread1, NULL, fun1, NULL);
pthread_create(&thread2, NULL, fun2, NULL);
pthread_create(&thread3, NULL, fun3, NULL);
threadCreationTime = timer_stop(&threadCreationTimer);
/* pthread_join takes the pthread_t by value, not a pointer to it */
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all data in advance is definitely not a workable option. Therefore, to keep the memory requirement of each thread from growing, each thread reads data in small chunks: it reads a chunk, processes it, reads the next chunk, processes it, and so on. Hence, the code of the functions run by the threads is structured like this:
timer_start(&threadTimer[i]);
while (!dataFinished[i])
{
    threadTime[i] += timer_stop(&threadTimer[i]);
    data_source();
    timer_start(&threadTimer[i]);
    process();
}
threadTime[i] += timer_stop(&threadTimer[i]);
The variable dataFinished[i] is set to true by process() once it has received and processed all the needed data. process() knows when to do that :-)
In the main function, I am calculating the time taken by 4-threaded configuration as below:
execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime.
And performance gain is calculated by simply
gain = execTime1process / execTime4Thread * 100
Issue:
On small data sizes of around 1M to 4M, the performance gain is generally good (between 350% and 400%). However, the performance gain decreases exponentially as the input size increases. It keeps decreasing up to a data size of 50M or so, and then becomes stable at around 200%. Once it reaches that point, it remains almost stable even for 1 GB of data.
My question is: can anybody suggest the main reason for this behaviour (i.e., the performance drop at the start, followed by stability later)?
And any suggestions on how to fix that?
For your information, I also investigated the behaviour of threadCreationTime and threadTime for each thread to see what's happening. For 1M of data the values of these variables are small, but as the data size increases both variables increase exponentially (whereas threadCreationTime should remain almost the same regardless of data size, and threadTime should increase at a rate corresponding to the data being processed). After increasing until 50M or so, threadCreationTime becomes stable (just as the performance drop becomes stable) and threadTime keeps increasing at a constant rate corresponding to the increase in data to be processed (which is understandable).
Do you think increasing the stack size of each thread, changing the process priority, or setting custom values for other parameters such as the scheduler type (using pthread_attr_init) can help?
PS: The results are obtained while running the programs under Linux's fail safe mode with root (i.e., minimal OS is running without GUI and networking stuff).
"Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all data in advance is definitely not a workable option. Therefore, to keep the memory requirement of each thread from growing, each thread reads data in small chunks: it reads a chunk, processes it, reads the next chunk, processes it, and so on."
Just this, alone, can cause a drastic speed decrease.
If there is sufficient memory, reading one large chunk of input data will always be faster than reading data in small chunks, especially from each thread. Any I/O benefits from chunking (caching effects) disappear when you break the input down into small pieces. Even allocating one big chunk of memory is much cheaper than allocating small chunks many, many times.
As a sanity check, you can run htop to ensure that at least all your cores are being topped out during the run. If not, your bottleneck could be outside of your multi-threading code.
Within the threading:
* thread context switches due to many threads can cause sub-optimal speedup
* as mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns
But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?
Some algorithm in your worker threads is likely to be suboptimal/expensive.
Are your threads starting on creation? If so, the following will happen:
While your parent thread is creating threads, the threads already created will start to run. When you hit timer_stop (for threadCreationTimer), the four have already run for a certain time. So threadCreationTime overlaps threadTime[i].
As it is now, you don't know what you are measuring. Fixing this won't solve your problem, because you obviously have a problem since threadTime does not increase linearly, but at least you won't add overlapping times.
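One way to keep the two measurements from overlapping is to hold the workers at a barrier until all of them have been created. The sketch below shows the idea in Python rather than C with pthreads (where `pthread_barrier_wait` would play the same role); `run_with_start_barrier` is a hypothetical name, and the per-thread times here are wall-clock times for the whole worker body, not the I/O-excluded times of the original code.

```python
import threading
import time


def run_with_start_barrier(worker, n_threads=4):
    """Create all threads first, then release them together."""
    barrier = threading.Barrier(n_threads + 1)   # the workers plus the main thread
    started_at = [0.0] * n_threads
    thread_time = [0.0] * n_threads

    def body(i):
        barrier.wait()                    # block until every thread exists
        started_at[i] = time.perf_counter()
        worker(i)
        thread_time[i] = time.perf_counter() - started_at[i]

    t0 = time.perf_counter()
    threads = [threading.Thread(target=body, args=(i,)) for i in range(n_threads)]
    for th in threads:
        th.start()
    creation_end = time.perf_counter()    # no worker has begun its work yet
    creation_time = creation_end - t0
    barrier.wait()                        # release the workers
    for th in threads:
        th.join()
    return creation_time, creation_end, started_at, thread_time
```

With this layout every worker starts after the creation timer has stopped, so adding the creation time to the longest worker time no longer counts any interval twice.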
To have more info you can use the perf tool if it is available on your distro.
for example :
perf stat -e cache-misses <your_prog>
and see what happens with a two thread version, a three thread version etc...

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk, then parses the lines and creates an object for every line. For every file, a collection with the objects from the lines is saved afterwards. The files are about 300 MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed up if I split the work into one thread that just reads files from disk, another that parses the lines, and a third that saves the collections? Or might this slow the process down?
How long does a modern notebook hard disk typically take to read 300 MB? I think the bottleneck in my task is the CPU, because when I execute the method one CPU core is always at 100% while the disk is idle more than half the time.
Greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the tools for performance profiling are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code to use threads. It now looks like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}
for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called in importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor do we know whether the disk is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were seeing 5% CPU usage instead, you would most likely be bottlenecked on IO.
That said, your best bet would be to create a new task for each file you are processing and send it to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were handling 300 MB files, after all), the maximum amount of RAM you can use for the process.
Finally, to explain why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing and not IO. This means that if you split your application into separate threads, your read and write threads would be starved most of the time, waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you run the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately and let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
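That arrangement (one reader, a pool of parsers, bounded memory) can be sketched as below. It is written in Python rather than C#, and `parse_files`, `read_file`, `parse`, `n_parsers` and `max_pending` are all hypothetical names. Note that in CPython the global interpreter lock keeps threads from running pure-Python parsing in parallel, so the sketch shows the structure only, not the speed up you would get from .NET threads.

```python
import queue
import threading


def parse_files(read_file, parse, paths, n_parsers=4, max_pending=2):
    """One reader thread loads files one at a time; a pool of threads parses them."""
    pending = queue.Queue(maxsize=max_pending)   # bounds how much data sits in memory
    results = {}
    lock = threading.Lock()

    def reader():
        for path in paths:
            pending.put((path, read_file(path)))   # blocks while the queue is full
        for _ in range(n_parsers):
            pending.put(None)                      # one stop marker per parser

    def parser():
        while True:
            item = pending.get()
            if item is None:
                return
            path, data = item
            parsed = parse(data)
            with lock:
                results[path] = parsed

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=parser) for _ in range(n_parsers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

The bounded queue is what stops the reader from racing ahead and filling memory with unparsed files.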
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
Several things could be impacting performance depending on the situation. Typically, reading the hard disk is the biggest bottleneck unless something intense is going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300 MB from a hard disk (unless it is badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to multithread reading and writing at the same time; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of time may be profiling and improving the calculation itself. Using multiple threads for the calculation will speed it up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk-class of lines and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free, (looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large r/w are performed, (except for the last chunk which will be, in general, only partly filled), provides inherent flow-control, (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context-switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, be waiting on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, so supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
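A minimal sketch of this read / process / write design with a recycled chunk pool follows. It is in Python, not C#, and everything named here (`run_pipeline`, `STOP`, the use of plain lists as chunks, collecting output in a list instead of writing a file) is a stand-in chosen for illustration.

```python
import queue
import threading

STOP = object()   # end-of-input marker passed down the pipeline


def run_pipeline(lines, parse, chunk_lines=1024, n_chunks=16):
    pool = queue.Queue()          # free chunks; caps memory use and gives flow control
    to_process = queue.Queue()
    to_write = queue.Queue()
    for _ in range(n_chunks):
        pool.put([])
    written = []

    def read():
        chunk = pool.get()        # blocks when all chunks are in flight
        for line in lines:
            chunk.append(line)
            if len(chunk) == chunk_lines:
                to_process.put(chunk)
                chunk = pool.get()
        to_process.put(chunk)     # last chunk, in general only partly filled
        to_process.put(STOP)

    def process():
        while (chunk := to_process.get()) is not STOP:
            chunk[:] = [parse(line) for line in chunk]
            to_write.put(chunk)
        to_write.put(STOP)

    def write():
        while (chunk := to_write.get()) is not STOP:
            written.extend(chunk) # stand-in for writing to the output file
            chunk.clear()
            pool.put(chunk)       # recycle the chunk for the read thread

    threads = [threading.Thread(target=f) for f in (read, process, write)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return written
```

With a single processing thread the output order matches the input order, which is the reason given above for starting with just one. The read thread cannot run away because it blocks on the pool once all chunks are in use.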
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk-disking by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin
