How do you regulate concurrency/relative process performance in Erlang? - multithreading

Let's say I have to read from a directory that has many large XML files in it, and I have to parse them, send them to some service over the network, and then write the responses to disk again.
If it were Java or C++ etc., I may do something like this (hope this makes sense):
(File read & xml parsing process) -> bounded-queue -> (sender process) -> service
service -> bounded-queue -> (process to parse result and write to disk)
And then I'd assign whatever suitable number of threads to each process. This way I can limit the concurrency of each process at its optimal value, and the bounded queue will ensure there won't be memory shortage etc.
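For reference, the thread-pool and bounded-queue layout described above can be sketched in Python roughly as follows. This is only an illustration, not a recommendation for Erlang: `run_stage`, `run_pipeline`, the worker counts and the queue depth are all made-up names and values, and only the parse and send stages are shown.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage's workers to shut down


def run_stage(func, inbox, outbox, workers):
    """Start `workers` threads that apply `func` to items taken from `inbox`."""
    def loop():
        while True:
            item = inbox.get()
            if item is _STOP:
                return
            outbox.put(func(item))  # blocks while a bounded outbox is full

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    return threads


def run_pipeline(items, parse, send, parse_workers=2, send_workers=4, depth=8):
    """Push `items` through parse -> send with bounded queues in between."""
    to_parse = queue.Queue(maxsize=depth)
    to_send = queue.Queue(maxsize=depth)
    results = queue.Queue()  # unbounded: only drained after the run

    parsers = run_stage(parse, to_parse, to_send, parse_workers)
    senders = run_stage(send, to_send, results, send_workers)

    for item in items:
        to_parse.put(item)  # the producer blocks when the parsers fall behind

    # Shut the stages down in order: one sentinel per worker thread.
    for _ in parsers:
        to_parse.put(_STOP)
    for t in parsers:
        t.join()
    for _ in senders:
        to_send.put(_STOP)
    for t in senders:
        t.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return out
```

The bounded `maxsize` is what caps memory use: a fast stage simply blocks in `put` until the slower stage behind it catches up.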
What should I do, though, when coding in Erlang? I guess I could just implement the whole flow in a function, then iterate over the directory and spawn these "start-to-end" processes as fast as possible. This sounds suboptimal, though: if parsing the XML takes longer than reading the files, the application could run short of memory from holding many XML documents in memory at once, and you can't keep the concurrency at the optimal level. E.g. if the "service" is most efficient when concurrency is 4, it would be very inefficient to hit it with enormous concurrency.
How should Erlang programmers deal with such a situation? I.e. what is the Erlang substitute for a fixed thread pool and a bounded queue?

There is no real way to limit the size of a process's message queue except by handling all messages in a timely fashion. The best way is simply to check the available resources before spawning, and to wait if they are insufficient. So if you are worried about memory, check memory before spawning a new process; if disk space, check disk space, etc.
Limiting the number of processes spawned is also possible. A simple construction would be:
pool(Max) ->
    process_flag(trap_exit, true),
    pool(0, Max).

pool(Current, Max) ->
    receive
        {'EXIT', _, _} ->
            %% A linked worker finished (or crashed): free its slot.
            pool(Current - 1, Max);
        {work, F, Pid} when Current < Max ->
            Pid ! accepted,
            spawn_link(F),
            pool(Current + 1, Max);
        {work, _, Pid} ->
            %% Pool is full: tell the caller to retry later.
            Pid ! rejected,
            pool(Current, Max)
    end.
This is a rough sketch of how a process could limit the number of processes it spawns. It is, however, considered better to limit on the real constraints instead of an artificial number.

You can definitely run your own process pool in Erlang, but it is a poor way to manage memory usage, since it doesn't take into account the size of the XML data being read (or the total memory used by the processes, for that matter).
I would suggest implementing the whole workflow in a functional library, as you suggested, and spawning processes that execute this workflow. Add a check for memory usage which looks at the size of the data to be read in and the available memory (hint: use memsup).

I would suggest you do it in an event-driven paradigm.
Imagine you start an OTP gen_server with the list of file names.
1) The gen_server checks resources and, if permitted, spawns the next worker, removing a file name from the list and passing it to the worker.
2) The worker processes the file and casts a message back to the gen_server when done (or you can just trap EXIT).
3) The gen_server receives that message and repeats step 1 until the file list is empty.
So workers do the heavy lifting, and the gen_server controls the flow.
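The control flow above is not specific to OTP. As a rough illustration only, here it is in Python with a thread pool standing in for spawned workers; `process_all`, `worker` and the `may_start` resource check are hypothetical names, and a real gen_server would react to messages rather than block in `wait`.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def process_all(file_names, worker, max_active=4, may_start=lambda: True):
    """Run `worker(name)` for every name with at most `max_active` in flight.

    `may_start` is a hypothetical resource check (memory, disk space, ...).
    Returns a dict mapping each file name to its worker's result.
    """
    pending = list(file_names)
    active = {}   # future -> file name currently being processed
    results = {}
    with ThreadPoolExecutor(max_workers=max_active) as pool:
        while pending or active:
            # Step 1: start workers while slots and resources allow.
            while pending and len(active) < max_active and may_start():
                name = pending.pop(0)
                active[pool.submit(worker, name)] = name
            if not active:
                time.sleep(0.01)  # resources busy and nothing running: retry
                continue
            # Steps 2-3: wait for a "done" notification, then go back to step 1.
            done, _ = wait(list(active), return_when=FIRST_COMPLETED)
            for future in done:
                results[active.pop(future)] = future.result()
    return results
```

The coordinator never holds more than `max_active` files' worth of work at once, which is the property the question was asking for.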
You can also create a distributed system, but it's a bit more complex, as you need to spawn intermediate gen_servers on each computer, query them about whether resources are available there, and then choose which computer should process the next file based on the replies. And you probably need something like NFS to avoid sending long messages.
Workers can be further split if you need more concurrency.

Related

Which one I should use in Clojure? go block or thread?

I want to see the intrinsic difference between a thread and a long-running go block in Clojure. In particular, I want to figure out which one I should use in my context.
I understand if one creates a go-block, then it is managed to run in a so-called thread-pool, the default size is 8. But thread will create a new thread.
In my case, there is an input stream that takes values from somewhere, and each value is taken as an input. Some calculations are performed and the result is inserted into a result channel. In short, we have an input and an output channel, and the calculation is done in the loop. To achieve concurrency, I have two choices: either use a go-block or use thread.
I wonder what is the intrinsic difference between these two. (We may assume there is no I/O during the calculations.) The sample code looks like the following:
(go-loop []
  (when-let [input (<! input-stream)]
    ... ; calculations here
    (>! result-chan result)
    (recur))) ; recur only while the input channel is open
(thread
  (loop []
    (when-let [input (<!! input-stream)]
      ... ; calculations here
      (put! result-chan result)
      (recur)))) ; recur only while the input channel is open
I realize the number of threads that can be run simultaneously is exactly the number of CPU cores. Then in this case, is go-block and thread showing no differences if I am creating more than 8 thread or go-blocks?
I might want to simulate the differences in performance in my own laptop, but the production environment is quite different from the simulated one. I could draw no conclusions.
By the way, the calculation is not so heavy. If the inputs are not so large, 8,000 loops can be run in 1 second.
Another consideration is whether go-block vs thread will have an impact on GC performance.
There's a few things to note here.
Firstly, the thread pool that threads are created on via clojure.core.async/thread is what is known as a cached thread pool, meaning although it will re-use recently used threads inside that pool, it's essentially unbounded. Which of course means it could potentially hog a lot of system resources if left unchecked.
But given that what you're doing inside each asynchronous process is very lightweight, threads seem a little overkill to me. Of course, it's also important to take into account the quantity of items you expect to hit the input stream: if this number is large, you could overwhelm core.async's thread pool for go macros, potentially to the point where you're waiting for a thread to become available.
You also didn't mention precisely where you're getting the input values from. Are the inputs some fixed data set that remains constant from the start of the program, or are inputs continuously fed into the input stream from some source over time?
If it's the former then I would suggest you lean more towards transducers and I would argue that a CSP model isn't a good fit for your problem since you aren't modelling communication between separate components in your program, rather you're just processing data in parallel.
If it's the latter then I presume you have some other process that's listening to the result channel and doing something important with those results, in which case I would say your usage of go-blocks is perfectly acceptable.

Comparison of Nodejs EventLoop (with cluster module) and Golang Scheduler

In nodejs, the main criticisms are based on its single-threaded event loop model.
The biggest disadvantage of nodejs is that one cannot perform CPU-intensive tasks in the application. For demonstration purposes, let's take the example of a while loop (which is perhaps analogous to a db function returning hundreds of thousands of records and then processing those records in nodejs).
while (1) {
    x++;
}
This sort of code will block the main stack, and consequently all other tasks waiting in the event queue will never get the chance to be executed (and in a web application, new users will not be able to connect to the app).
However, one could use a module like cluster to leverage a multi-core system and partially solve the above issue. The cluster module allows one to create a small network of separate processes which can share server ports, which gives the Node.js application access to the full power of the server. (However, one of the biggest disadvantages of using cluster is that state cannot be maintained in the application code.)
But again, there is a high possibility that we would end up in the same situation (as described above) if there is too much server load.
When I started learning the Go language and had a look at its architecture and goroutines, I thought it would solve the problem that arises from the single-threaded event loop model of nodejs, and that it would avoid the above scenario of CPU-intensive tasks, until I came across this interesting code, which blocks the whole Go application so that nothing happens, much like a while loop in nodejs.
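package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    var x int
    // Start one busy-looping goroutine per available scheduler thread.
    threads := runtime.GOMAXPROCS(0)
    for i := 0; i < threads; i++ {
        go func() {
            for { x++ }
        }()
    }
    time.Sleep(time.Second)
    fmt.Println("x =", x)
}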
//or perhaps even if we use some number that is just greater than the threads.
So, the question is: if I have an application which is load-intensive and which also has a lot of CPU-intensive tasks, I could probably get stuck in the above sort of scenario (where the db returns a large number of rows and the application then needs to process and modify something in those rows). Wouldn't incoming users be blocked, and all other tasks as well?
So, how could the above problem be solved?
P.S.
Or perhaps the use cases I have mentioned do not make much sense? :)
Currently (Go 1.11 and earlier versions) your so-called tight loop will indeed clog the code.
This happens simply because the Go compiler currently inserts code which does "preemption checks" («should I yield to the scheduler so it runs another goroutine?») only in the prologues of the functions it compiles (almost, but let's not digress). If your loop does not call any function, no preemption checks will be made.
The Go developers are well aware of this and are working on eventually alleviating this issue.
Still, note that your alleged problem is a non-issue in most real-world scenarios: code which performs long runs of CPU-intensive work without calling any function is few and far between.
In the cases where you really have such code and you have detected that it really makes other goroutines starve (let me underline: you have detected that through profiling, as opposed to just conjuring up "it must be slow"), you may apply several techniques to deal with this:
1) Insert calls to runtime.Gosched() at certain key points of your long-running CPU-intensive code. This forcibly relinquishes control to another goroutine while not actually suspending the calling goroutine (so it will run again as soon as it is scheduled).
2) Dedicate OS threads to the goroutines running those CPU hogs:
   a) Bound the set of such CPU hogs to, say, N "worker goroutines";
   b) Put a dispatcher in front of them (this is called "fan-out");
   c) Make sure that N is sensibly smaller than runtime.GOMAXPROCS, or raise the latter so that you have those N extra threads;
   d) Shovel units of work to those dedicated goroutines via the dispatcher.
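The first technique, yielding explicitly from a long CPU loop, exists in other cooperative schedulers too. Purely as an analogy (this is Python's asyncio, not Go, and `busy_sum`, `heartbeat` and `demo` are made-up names), `await asyncio.sleep(0)` plays a role similar in spirit to runtime.Gosched(): without it, the heartbeat task below would not get to run until the busy loop finished.

```python
import asyncio


async def busy_sum(n, yield_every=1000):
    """CPU-bound loop that periodically hands control back to the event loop."""
    total = 0
    for i in range(n):
        total += i
        if i % yield_every == 0:
            await asyncio.sleep(0)  # cooperative yield to other tasks
    return total


async def heartbeat(ticks, stop):
    """Records one tick each time the event loop lets this task run."""
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0)


async def demo(n=20_000):
    """Run the busy loop next to a heartbeat; return (sum, heartbeat ticks)."""
    ticks = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    total = await busy_sum(n)
    stop.set()
    await beat
    return total, len(ticks)
```

Calling `asyncio.run(demo())` returns the sum together with the number of times the heartbeat managed to run in between.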

File writing from multiple threads.

I have an application A which calls another application B which does some calculation and writes to a file File.txt
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same file File.txt.
Here comes the actual problem:
Since multiple threads try to access the same file, the file access throws an exception, which is to be expected.
I tried an approach using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class takes care of dequeuing the items from the queue and writing them to File.txt. The queue is read synchronously and the write operation succeeds. This works fine.
If I have too many threads and too many items in the queue, the file writing works, but if for some reason my queue crashes or stops abruptly, all the information which was supposed to be written to the file is lost.
If I make the file writing synchronous from B without using the queue, it will be slow, as it needs to check for file locking, but there is less chance of data going missing, since B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed. I can't make B wait for the file writing to be completed.
Would async/await file writing be of any use here?
I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers are born to solve this problem. Why continue with a file based solution? Is it possible to explore an alternative?
is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move to new files, and some aggregation thread wakes up, reads the previous files (of each producer) and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery, since the rolling files exist until you process and delete them.
This also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
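A minimal sketch of the rolling-file idea, in Python for brevity. The file naming scheme, the "generation" counter and both function names are assumptions made for the illustration; the aggregator must only be given a generation that every producer has already moved on from.

```python
import os


def producer_write(spool_dir, producer_id, generation, lines):
    """Append `lines` to this producer's own rolling file; no locking needed."""
    path = os.path.join(spool_dir, f"{producer_id}.{generation}.part")
    with open(path, "a", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the data really is on disk
    return path


def aggregate(spool_dir, generation, final_path):
    """Merge every rolling file of a closed generation into the final file.

    Returns the number of rolling files that were merged.
    """
    suffix = f".{generation}.part"
    parts = sorted(n for n in os.listdir(spool_dir) if n.endswith(suffix))
    with open(final_path, "a", encoding="utf-8") as out:
        for name in parts:
            with open(os.path.join(spool_dir, name), encoding="utf-8") as fh:
                out.write(fh.read())
        out.flush()
        os.fsync(out.fileno())
    # Delete the rolling files only after the merged data is on disk.
    for name in parts:
        os.remove(os.path.join(spool_dir, name))
    return len(parts)
```

Note that a crash between the merge and the delete would cause the same lines to be merged twice on the next run; a real implementation would need some marker to make the aggregation step idempotent.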
Would async await file writing could be of any use here ?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) shared resources (the output file) and 2) lack of consistency (when the queue crashes), neither of which is what async programming is about.
Why the async is in picture is because I don't want to delay the existing work by B because of this file writing operation
async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async by simply using the asynchronous IO APIs.

What are the general design ideas for a thread-safe read-compute-write program based on its single-threaded version?

Consider that the sequential version of the program already exists and implements a sequence of "read-compute-write" operations on a single input file and a single output file. The "read" and "write" operations are performed by 3rd-party library functions which are hard (but possible) to modify, while the "compute" function is performed by the program itself. The read/write library functions do not seem to be thread-safe, since they operate with internal flags and internal memory buffers.
It was discovered that the program is CPU-bound, and the plan is to improve it by taking advantage of multiple CPUs (up to 80) by designing a multi-processor version of the program and using OpenMP for that purpose. The idea is to instantiate multiple "compute" functions with the same single input and single output.
It is obvious that something needs to be done to ensure consistent access to reads, data transfers, computations and data storage. Possible solutions are: (hard) rewrite the IO library functions in a thread-safe manner, or (moderate) write a thread-safe wrapper for the IO functions that would also serve as a data cache.
Are there any general patterns that cover the subject of converting, wrapping or rewriting single-threaded code to comply with OpenMP thread-safety assumptions?
EDIT1: The program is fresh enough for changes to make it multi-threaded (or, generally a parallel one, implemented either by multi-threading, multi-processing or other ways).
As a quick response: if you are processing a single file and writing to another, with OpenMP it's easy to convert the sequential version of the program to a multi-threaded version without taking too much care over the IO part, provided that the compute algorithm itself can be parallelized.
This is true because usually the main thread takes care of the IO. If this cannot be achieved because the chunks of data are too big to read at once, and the compute algorithm cannot process smaller chunks, you can use the OpenMP API to synchronize the IO in each thread. This does not mean that the whole application will stop or wait until the other threads finish computing so that new data can be read or written; it means that only the read and write parts need to be done atomically.
For example, if the flow of your sequential application is as follows:
1) Read
2) compute
3) Write
Given that it truly can be parallelized, and each chunk of data needs to be read from within each thread, each thread could follow the next design:
1) Synchronized read of chunk from input (only one thread at the time could execute this section)
2) Compute chunk of data (done in parallel)
3) Synchronized write of computed chunk to output (only one thread at the time could execute this section)
If you need to write the chunks in the same order you read them, you need to buffer them first, or adopt a different strategy such as using fseek to move to the correct position, but that really depends on whether the output file size is known from the start, ...
Pay special attention to the OpenMP scheduling strategy, because the default may not be the best for your compute algorithm. And if you need to share results between threads, like the offset of the input file you have read, you may use the reduction operations provided by the OpenMP API, which are far more efficient than making a single part of your code run atomically across all threads just to update a global variable; OpenMP knows when it is safe to write.
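The three-step per-thread design above can be sketched as follows. This uses Python threads and locks rather than OpenMP, purely to show the structure; `run_workers` and the `None` end-of-input marker are assumptions, and in CPython the compute step would not actually run in parallel for pure-Python code.

```python
import threading


def run_workers(read_chunk, compute, write_chunk, n_workers=4):
    """Each worker loops: locked read -> unlocked compute -> locked write."""
    read_lock = threading.Lock()
    write_lock = threading.Lock()

    def worker():
        while True:
            with read_lock:            # 1) only one thread reads at a time
                chunk = read_chunk()
            if chunk is None:          # hypothetical end-of-input marker
                return
            result = compute(chunk)    # 2) no lock held while computing
            with write_lock:           # 3) only one thread writes at a time
                write_chunk(result)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Results reach `write_chunk` in completion order, not in read order, which is exactly the ordering issue raised next.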
EDIT:
In regards to the "read, process, write" operation, as long as you keep each read and write atomic across all workers, I can't think of any reason you'd run into trouble. Even when the data read is stored in an internal buffer, with every worker accessing it atomically, that data is acquired in the exact same order. You only need to pay special attention when saving a chunk to the output file, because you don't know the order in which the workers will finish processing their assigned chunks, so you could have a chunk ready to be saved that was read after others that are still being processed. You just need each worker to keep track of the position of each chunk, and you can keep a list of pointers to chunks that need to be saved until you have a sequence of finished chunks following the last one saved to the output file. Some additional care may need to be taken here.
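One way to keep that list of finished-but-not-yet-saved chunks is a small reorder buffer keyed by sequence number. The sketch below is in Python and assumes each chunk was given a sequence number (0, 1, 2, ...) when it was read; `OrderedWriter` is a made-up name.

```python
import threading


class OrderedWriter:
    """Buffers finished chunks and emits them in their original read order."""

    def __init__(self, write):
        self._write = write      # callable that stores one chunk
        self._next_seq = 0       # sequence number that must be written next
        self._pending = {}       # seq -> chunk that finished out of order
        self._lock = threading.Lock()

    def submit(self, seq, chunk):
        """Hand over a finished chunk; writes as many in-order chunks as possible."""
        with self._lock:
            self._pending[seq] = chunk
            # Flush the longest run of consecutive chunks now available.
            while self._next_seq in self._pending:
                self._write(self._pending.pop(self._next_seq))
                self._next_seq += 1
```

For example, submitting chunk 2 first writes nothing; once chunks 0 and 1 arrive, all three are written in order.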
If you are worried about the internal buffer itself (and keeping in mind I don't know the library you are talking about, so I can be wrong) if you make a request to some chunk of data, that internal buffer should only be modified after you requested that data and before the data is returned to you; and as you made that request atomically (meaning that every other worker will need to keep in line for its turn) when the next worker asks for his piece of data, that internal buffer should be in the same state as when the last worker received its chunk. Even in the case that the library particularly says it returns a pointer to a position of the internal buffer and not a copy of the chunk itself, you can make a copy to the worker's memory before releasing the lock on the whole atomic read operation.
If the pattern I suggested is followed correctly, I really don't think you would find any problem you wouldn't find in the same sequential version of the algorithm.
With a little synchronisation you can go even further. Consider something like this:
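/* Three sections run concurrently: input, compute, output.
   num_threads(3) is an assumption (one thread per section);
   the original left the argument out. */
#pragma omp parallel sections num_threads(3)
{
    #pragma omp section
    {
        input();
        notify_read_complete();
    }
    #pragma omp section
    {
        wait_read_complete();
        /* Nested region: requires nested parallelism to be enabled. */
        #pragma omp parallel num_threads(N)
        {
            do_compute_with_threads();
        }
        notify_compute_complete();
    }
    #pragma omp section
    {
        wait_compute_complete();
        output();
    }
}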
So, the basic idea would be that input() and output() read/write chunks of data. The compute part then would work on a chunk of data while the other threads are reading/writing. It will take a bit of manual synchronization work in notify*() and wait*(), but that's not magic.
Cheers,
-michael

Speed Up with multithreading

I have a parse method in my program, which first reads a file from disk, then parses the lines and creates an object for every line. For every file, a collection with the objects from the lines is saved afterwards. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed-up if I split the tasks up, with one thread just reading files from disk, another parsing the lines and a third saving the collections? Or would this maybe slow down the process?
How long does it usually take a modern notebook hard disk to read 300MB? I think the bottleneck in my task is the CPU, because when I execute the method one CPU core is always at 100% while the disk is idle more than half the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the tools for performance profiling are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code now to use threads. It looks now like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}
for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called in importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I got some concurrency problems that I have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor do we know whether the disk is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you would most likely be bottlenecking on IO.
That said your best bet would be to create a new thread task for each file you are processing and send that to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or if memory is an issue (you did say you were generating 300MB files after all) the maximum amount of ram you can use for the process.
Finally, to answer the reason why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottle necked on cpu processing and not IO. This means that if you split your application into separate threads your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you have the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately and let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
Different things could be affecting performance depending on the situation. Typically, reading from the hard disk is the biggest bottleneck, unless something intense is going on during parsing, which seems to be the case here, because it takes only several seconds to read 300MB from a hard disk (unless it is badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to read and write at the same time from multiple threads; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of time may be profiling and improving the calculation itself. Using multiple threads for the calculation will speed it up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk-class of lines and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free, (looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large r/w are performed, (except for the last chunk which will be, in general, only partly filled), provides inherent flow-control, (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context-switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, be waiting on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, so supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued and it will process them all without any further intervention.
The processing thread gets chunks from its input queue and splits the lines up into the message objects in the chunk and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
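The chunk-pool design above can be sketched as follows, in Python for brevity. The function name, the use of plain lists as chunks and the `None` end marker are assumptions; the point is only that a fixed pool of reusable chunks gives flow control for free, because the reader blocks when the pool is empty.

```python
import queue
import threading


def run_chunked(lines, process, write, chunk_size=1024, pool_size=16):
    """Reader -> processor -> writer, recycling a fixed pool of chunk buffers."""
    free = queue.Queue()                 # the storage pool of idle chunks
    for _ in range(pool_size):
        free.put([])
    to_process = queue.Queue()
    to_write = queue.Queue()

    def reader():
        chunk = free.get()               # blocks when every chunk is in use
        for line in lines:
            chunk.append(line)
            if len(chunk) == chunk_size:
                to_process.put(chunk)
                chunk = free.get()
        to_process.put(chunk)            # last chunk may be only partly filled
        to_process.put(None)             # end-of-input marker

    def processor():
        while (chunk := to_process.get()) is not None:
            chunk[:] = [process(line) for line in chunk]
            to_write.put(chunk)
        to_write.put(None)

    def writer():
        while (chunk := to_write.get()) is not None:
            for item in chunk:
                write(item)
            chunk.clear()
            free.put(chunk)              # hand the buffer back for reuse

    threads = [threading.Thread(target=f) for f in (reader, processor, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a single processing thread the output order matches the input order; adding more processing threads would require the resequencing discussed in the suggestions that follow.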
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk-disking by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin
