Spring Batch - How to read one big file in multiple threads? - multithreading

Problem: Read a file larger than 10 MB and load it into a staging table using Spring Batch. How can we maintain state while reading the file, so that the job can be restarted if it fails?
As per the documentation, FlatFileItemReader is not thread-safe, and if we try to make it thread-safe we end up losing restartability. So the basic questions are:
Is there a way to read the file in blocks, such that each thread knows which block it needs to read?
If we make the read synchronous, what changes are required to make the job restartable in this scenario?
If anyone has faced similar issues, any analysis of how this performs would help us make a decision.
Also, any pointers or sample codes are appreciated.

Multithreading is only useful if your threads are doing different things at the same time. For example, you can have two threads running on separate CPUs. Or one thread can be waiting for a network message while the other is painting the screen.
But in your case, both threads would be waiting for the same IO from the same device, so there's no point using more than one.
See also this question Reading a file by multiple threads
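For reference, the question's "read the file in blocks" idea can be expressed at the OS level. Below is a rough, hypothetical POSIX C sketch (nothing Spring Batch specific) in which each thread pread()s its own byte range, so each thread "knows its block" by simple arithmetic. Whether this beats a single reader depends entirely on the device, per the answer above.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 4

struct block { int fd; off_t offset; size_t len; };

/* each thread reads only its own byte range via pread(), so there is
   no shared file position and no locking is needed */
static void *read_block(void *arg) {
    struct block *b = arg;
    char *buf = malloc(b->len);
    ssize_t n = pread(b->fd, buf, b->len, b->offset);
    printf("offset %lld: read %zd bytes\n", (long long)b->offset, n);
    free(buf);
    return NULL;
}

int main(void) {
    int fd = open("big.file", O_RDONLY);   /* hypothetical input file */
    struct stat st;
    fstat(fd, &st);
    off_t chunk = st.st_size / NTHREADS;

    pthread_t tid[NTHREADS];
    struct block blocks[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        blocks[i].fd = fd;
        blocks[i].offset = i * chunk;
        blocks[i].len = (i == NTHREADS - 1) ? st.st_size - i * chunk : chunk;
        pthread_create(&tid[i], NULL, read_block, &blocks[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    close(fd);
    return 0;
}

Note that this splits the file at arbitrary byte offsets; a record-oriented reader would still have to find line boundaries, which is one reason restartable readers keep a single read position.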

Related

Running every GStreamer pipeline into a separate (GLib) thread

All of the GStreamer samples are initializing GLib main thread through some form of:
loop = g_main_loop_new(NULL, FALSE);
g_main_loop_run(loop);
As far as I understood this main loop is used for all the signals processing.
Also Bus messages are processed in it.
So I'm a little concerned about what will happen if I run multiple pipelines simultaneously, or if one of them has an issue or an improper implementation.
Most probably, for heavy loads, the best solution is to separate the pipelines into multiple processes, thus mitigating all possible problems with memory leaks, hangs, deadlocks, etc., without affecting the main application.
In any case, at least running them in separate threads should be beneficial.
Obviously it is possible to start more than one GLib main loop by creating a GMainContext first. But I cannot understand (apparently I'm missing some knowledge) how to then "assign" pipelines to them, signal them, etc.
For example, g_signal_connect and g_signal_emit do not specify on which "main" thread the handler is executed.
Some posts here claim it is possible (that GStreamer supports a different main thread), but I wasn't able to find details.
A similar problem is discussed in this thread, but to be honest I wasn't able to understand it.
This StackOverflow post discusses how timeouts can be attached to different GLib main contexts.
I suppose something similar could be done for the GStreamer pipelines and objects, but I'm not sure.
Could someone enlighten me a little bit?
I'm posting here the answer to the same question from the GStreamer-devel forum:
http://gstreamer-devel.966125.n4.nabble.com/Running-every-GStremer-pipeline-into-a-separate-GLib-thread-td4694469.html
quote:
GMainLoop is optional with GStreamer (convenient but optional). You can use the GstBus API directly. As for signals, these are synchronous and do not use a message.
If you decide to use a GMainLoop, you will only need one to handle asynchronous messages, as all messages get serialized into the loop queue. Multiple pipelines are where that becomes convenient, as you don't have to deal with multiple GstBus objects.
GStreamer splits the streaming into separate threads already. The main-loop thread is always free for other tasks (like UI tasks).
:end of quote
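To make the per-pipeline-context idea concrete, here is a rough C sketch (the thread function and pipeline description are hypothetical; error handling omitted). The key detail, assuming GStreamer 1.x behaviour, is that gst_bus_add_watch attaches the watch to the thread-default main context, so pushing a fresh GMainContext with g_main_context_push_thread_default before adding the watch routes that pipeline's bus messages to that thread's loop:

#include <gst/gst.h>

/* hypothetical per-pipeline thread: each pipeline gets its own
   GMainContext + GMainLoop, and its bus watch lands in that context
   because the context is the thread default when the watch is added */
static gboolean on_bus_message(GstBus *bus, GstMessage *msg, gpointer data) {
    GMainLoop *loop = data;
    switch (GST_MESSAGE_TYPE(msg)) {
    case GST_MESSAGE_ERROR:
    case GST_MESSAGE_EOS:
        g_main_loop_quit(loop);   /* stops only this pipeline's loop */
        break;
    default:
        break;
    }
    return TRUE;   /* keep the watch installed */
}

static gpointer pipeline_thread(gpointer data) {
    const gchar *desc = data;   /* e.g. "videotestsrc ! autovideosink" */

    GMainContext *ctx = g_main_context_new();
    g_main_context_push_thread_default(ctx);

    GstElement *pipeline = gst_parse_launch(desc, NULL);
    GMainLoop *loop = g_main_loop_new(ctx, FALSE);

    GstBus *bus = gst_element_get_bus(pipeline);
    gst_bus_add_watch(bus, on_bus_message, loop);   /* thread-default ctx */
    gst_object_unref(bus);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    g_main_loop_run(loop);

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    g_main_loop_unref(loop);
    g_main_context_pop_thread_default(ctx);
    g_main_context_unref(ctx);
    return NULL;
}

Each pipeline thread would be started with g_thread_new() after gst_init(); signals connected with g_signal_connect remain synchronous and run on whichever streaming thread emits them, as the quoted answer says.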

Concurrent processes a lot slower than single process

I am modelling and solving a nonlinear program (NLP) using single-threaded CPLEX with AMPL (I am explicitly constraining CPLEX to use only one thread) on CentOS 7. I am using a processor with 6 independent cores (Intel i7 8700) to solve 6 independent test instances.
When I run these tests sequentially, it is much faster (about 63% in elapsed time) than when I run the 6 instances concurrently. They are executed as independent processes, reading distinct data files and writing results to distinct output files. I have also tried solving these tests sequentially with multiple threads, and I got times similar to the single-threaded sequential case.
I have checked the behaviour of these processes using top/htop, and they do get scheduled on different processors. So my question is: how can running these tests concurrently have such an impact on elapsed time if they are solving on different cores, each with only one thread, as separate processes?
Any thoughts would be appreciated.
It's very easy to make many threads perform worse than a single thread. The key to successful multi-threading and speedup is to understand not just the fact that the program is multi-threaded, but to know exactly how your threads interact. Here are a few questions you should ask yourself as you review your code:
1) Do the individual threads share resources? If so what are those resources and when you are accessing them do they block other threads?
2) What's the slowest resource your multi-threaded code relies on? A common bottleneck (and oft neglected) is disk IO. Multiple threads can process data much faster but they won't make a disk read faster and in many cases multithreading can make it much worse (e.g. thrashing).
3) Is access to common resources properly synchronized?
To this end, and without knowing more about your problem, I'd recommend:
a) Not reading different files from different threads. You want to keep your disk IO as sequential as possible, and this is easier from a single thread. Maybe batch-read files from a single thread and then farm them out for processing, as sketched below.
b) Keep your threads as autonomous as possible - any communication back and forth will cause thread contention and slow things down.
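Here is a rough pthread sketch of recommendation (a) (the file names are hypothetical, and the "read" is reduced to a strdup of the name to keep it short): a single thread feeds a bounded queue so disk access stays sequential, while a fixed pool of workers does the CPU-bound part in parallel.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4
#define QCAP 8

/* a tiny bounded queue of work items, protected by a mutex + condvars */
struct queue {
    char *items[QCAP];
    int head, tail, count, done;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
} q = { .mu = PTHREAD_MUTEX_INITIALIZER,
        .not_empty = PTHREAD_COND_INITIALIZER,
        .not_full = PTHREAD_COND_INITIALIZER };

static void put(char *item) {
    pthread_mutex_lock(&q.mu);
    while (q.count == QCAP) pthread_cond_wait(&q.not_full, &q.mu);
    q.items[q.tail] = item; q.tail = (q.tail + 1) % QCAP; q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.mu);
}

static char *get(void) {
    pthread_mutex_lock(&q.mu);
    while (q.count == 0 && !q.done) pthread_cond_wait(&q.not_empty, &q.mu);
    char *item = NULL;
    if (q.count > 0) {
        item = q.items[q.head]; q.head = (q.head + 1) % QCAP; q.count--;
        pthread_cond_signal(&q.not_full);
    }
    pthread_mutex_unlock(&q.mu);
    return item;   /* NULL means "no more work" */
}

static void *worker(void *arg) {
    char *data;
    while ((data = get()) != NULL) {
        /* CPU-bound processing happens here, off the reader thread */
        printf("processing %zu bytes\n", strlen(data));
        free(data);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    /* the single reader: all disk I/O stays sequential on this thread */
    const char *files[] = { "a.dat", "b.dat", "c.dat" };   /* hypothetical */
    for (int i = 0; i < 3; i++)
        put(strdup(files[i]));   /* stand-in for reading the file contents */

    pthread_mutex_lock(&q.mu);
    q.done = 1;
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.mu);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}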

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, but why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (Or maybe this is just a dumb example; in that case please point me to a sensible one.)
Thanks in advance.
Consider this:
#include <semaphore.h>

/* shared state; a binary semaphore (initialised to 1) acts as a mutex */
int g_shared_variable;
sem_t g_shared_variable_mutex;   /* sem_init(&g_shared_variable_mutex, 0, 1) elsewhere */

void update_shared_variable() {
    sem_wait(&g_shared_variable_mutex);   /* enter critical section */
    g_shared_variable++;
    sem_post(&g_shared_variable_mutex);   /* leave critical section */
}

void thread1() {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable();   /* may block */
}

void thread2() {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable();   /* may block */
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not all the code that runs will be blocking. For example, if you had a queue and two threads reading from it, you would make sure that no two threads read from the queue at the same time, so that part would be blocking, but it is also the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can run in parallel.
The idea behind threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks and starvation. If something takes a while to process, why not run multiple instances of that work so it finishes faster? The bottleneck is just what you mentioned: when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
Let's say you have 2 worker threads that do a lot of work and write the results to a file; you only need to lock access to the file (the shared resource). This is of course an SSCCE (Short, Self-Contained, Correct Example).
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example: imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers, like Intel's, will convert suitable for loops to threads automatically for you. In that particular circumstance no semaphores are needed, because the iterations' data are independent.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data, doing A, then B, and then outputting the data before reading more input. Or you could have one thread reading and doing A, and another thread doing B and the output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
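A minimal sketch of that scheme with POSIX semaphores (a single-slot buffer; the "work" in each step is just a stand-in):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* single-slot buffer handed between two threads, exactly as described:
   'empty' starts at 1 (buffer may be written), 'filled' starts at 0 */
static int buffer;
static sem_t empty, filled;

static void *producer(void *arg) {   /* does step A */
    for (int i = 0; i < 5; i++) {
        sem_wait(&empty);    /* wait until the consumer has read the slot */
        buffer = i * i;      /* interim result of step A */
        sem_post(&filled);   /* tell the consumer there is data */
    }
    return NULL;
}

static void *consumer(void *arg) {   /* does step B */
    for (int i = 0; i < 5; i++) {
        sem_wait(&filled);   /* wait until the producer has written */
        printf("step B got %d\n", buffer);
        sem_post(&empty);    /* slot may be overwritten again */
    }
    return NULL;
}

int main(void) {
    sem_init(&empty, 0, 1);    /* as in the text: empty = 1 */
    sem_init(&filled, 0, 0);   /* filled = 0 */
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&empty);
    sem_destroy(&filled);
    return 0;
}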
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, it is sent a copy of that data down something a bit like a network socket or a pipe. The advantage of CSP (which restricts a channel so that a send finishes only when the receiver has read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability is built in by design.
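A toy illustration of that style using an ordinary pipe between two threads (note: a real CSP channel, as in JCSP, is unbuffered, so a send completes only when the receiver reads; pipe(2) buffers, but the shape is the same):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* CSP in miniature: instead of sharing a variable under a semaphore,
   one thread sends copies of its data down a pipe and the other
   receives them; the pipe itself does the synchronisation */
static int fds[2];   /* fds[0] = read end, fds[1] = write end */

static void *sender(void *arg) {
    for (int i = 0; i < 5; i++) {
        int msg = i * i;
        write(fds[1], &msg, sizeof msg);   /* send a copy of the data */
    }
    close(fds[1]);   /* closing the channel signals "no more messages" */
    return NULL;
}

int main(void) {
    pipe(fds);
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);

    int msg;
    while (read(fds[0], &msg, sizeof msg) == sizeof msg)   /* receive */
        printf("received %d\n", msg);

    pthread_join(t, NULL);
    close(fds[0]);
    return 0;
}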

C# TPL Tasks - How many at one time

I'm learning how to use the TPL for parallelizing an application I have. The application processes ZIP files, extracting all of the files held within them and importing the contents into a database. There may be several thousand ZIP files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.
This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disk I/O is going to be your bottleneck, so I think you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads, but I'm not sure how much control you have (if any) over Parallel.For/ForEach as far as how much parallelism goes on at once, which could choke your process and actually slow it down.
Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.
I would have thought that this would depend on whether the process is limited by CPU or by disk. If the process is disk-limited, I'd have thought it might be a bad idea to kick off too many threads, since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.
I have to disagree with certain statements here guys.
First of all, I do not see any difference between the ThreadPool and Tasks in coordination or control, especially when tasks run on the ThreadPool and you have easy control over them; exceptions are nicely propagated to the caller during await or when awaiting Task.WhenAll(tasks), etc.
Second, I/O won't necessarily be the only bottleneck here; depending on the data and the level of compression, the zipping is most likely going to take more time than reading the file from the disk.
It can be thought of in many ways, but I would go for something like the number of CPU cores, or a little less.
Load the file paths into a ConcurrentQueue and then allow the running tasks to dequeue file paths, load the files, zip them, and save them.
From there you can tweak the number of cores and play with load balancing.
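As a rough sketch of that shape in C (process_zip and the file list are hypothetical; a mutex-guarded index stands in for the ConcurrentQueue): fix the worker count near the core count and let workers pull paths, rather than spawning a task per file.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 6   /* roughly the number of CPU cores */

static const char *zips[] = { "a.zip", "b.zip", "c.zip", "d.zip" };
static const int nzips = 4;
static int next;   /* index of the next file to claim */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

static void process_zip(const char *path) {   /* hypothetical work */
    printf("extracting %s\n", path);
}

/* each worker repeatedly claims the next unprocessed file, so at most
   NWORKERS extractions ever run at once, however many files there are */
static void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&mu);
        int i = next < nzips ? next++ : -1;
        pthread_mutex_unlock(&mu);
        if (i < 0) return NULL;   /* queue drained */
        process_zip(zips[i]);
    }
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}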
I do not know if ZIP supports file partitioning during compression, but in some advanced/complex cases it could be a good idea, especially on large files...
Wow, this is a 6-year-old question, bummer! I hadn't noticed... :)

Threads or asynch?

How do you make your application multithreaded ?
Do you use asynch functions ?
or do you spawn a new thread ?
I think that asynch functions already spawn a thread, so if your job is just doing some file reading, being lazy and just spawning your job on a thread would just "waste" resources...
So is there some kind of design when using thread or asynch functions ?
If you are talking about .Net, then don't forget the ThreadPool. The thread pool is also what asynch functions often use. Spawning too many threads can actually hurt your performance. A thread pool is designed to spawn just enough threads to do the work the fastest. So do use a thread pool instead of spawning your own threads, unless the thread pool doesn't meet your needs.
PS: And keep an eye on the Parallel Extensions from Microsoft.
Spawning threads is only going to waste resources if you start spawning tons of them; one or two extra threads isn't going to affect the platform's performance. In fact, System currently has over 70 threads for me, and MSN is using 32 (I really have no idea how a messenger can use that many threads, especially when it's minimised and not really doing anything...).
Usually a good time to spawn a thread is when something will take a long time but you need to keep doing something else.
E.g. say a calculation will take 30 seconds. The best thing to do is spawn a new thread for the calculation, so that you can continue to update the screen and handle any user input, because users will hate it if your app freezes until it's finished the calculation.
On the other hand, creating threads to do something that can be done almost instantly is nearly pointless, since the overhead of creating the thread (or even just passing work to an existing thread via a thread pool) will be higher than just doing the job in the first place.
Sometimes you can break your app into a couple of separate parts which run in their own threads. For example, in games the updates/physics may be one thread, while graphics is another, sound/music a third, and networking another. The problem here is that you really have to think about how these parts will interact, or else you may have worse performance, bugs that happen seemingly "randomly", or even deadlock.
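To make the 30-second-calculation example above concrete, here's a minimal pthread sketch (the calculation and the "UI loop" are both stand-ins): the expensive work runs on its own thread while the main thread keeps servicing the user.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical long calculation from the example above */
static void *long_calculation(void *arg) {
    int *result = arg;
    sleep(30);        /* stand-in for the real number crunching */
    *result = 42;
    return NULL;
}

int main(void) {
    int result = 0;
    pthread_t calc;
    pthread_create(&calc, NULL, long_calculation, &result);

    /* the main thread stays free to repaint and handle input */
    for (int i = 0; i < 30; i++) {
        printf("still responsive... (tick %d)\n", i);
        sleep(1);     /* stand-in for one pass of the UI event loop */
    }

    pthread_join(calc, NULL);   /* by now the worker is done */
    printf("calculation finished: %d\n", result);
    return 0;
}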
I'll second Fire Lancer's answer: creating your own threads is an excellent way to process big tasks or to handle a task that would otherwise be "blocking" to the rest of a synchronous app, but you have to have a clear understanding of the problem you must solve, and develop in a way that clearly defines the task of a thread and limits the scope of what it does.
For an example I recently worked on - a Java console app runs periodically to capture data by essentially screen-scraping urls, parsing the document with DOM, extracting data and storing it in a database.
As a single-threaded application it, as you would expect, took an age, averaging around 1 url a second for a 50kb page. Not too bad, but when you scale up to needing to process thousands of urls in a batch, it's no good.
Profiling the app showed that most of the time the active thread was idle: it was waiting on I/O operations (opening a socket to the remote URL, opening a connection to the database, etc.). This is exactly the sort of situation that can easily be improved with multithreading. Rewriting it to be multi-threaded, with just 5 threads instead of one, gave an increase in throughput of over 20 times, even on a single-core CPU.
In this example, each "worker" thread was explicitly limited in what it did: open a remote url, parse the data, store it in the db. All the "high level" processing (generating the list of urls to parse, working out which was next, handling errors) remained under the control of the main thread.
The use of threads makes you think more about the way your application needs threading and can in the long run make it easier to improve / control your performance.
Async methods are quicker to use but they are a bit magic: a lot of things happen to make them possible, so it's probable that at some point you will need something they can't give you. Then you can try to roll some custom threading code.
It all depends on your needs.
The answer is "it depends".
It depends on what you're trying to achieve. I'm going to assume that you're aiming for more performance.
The simplest solution is to find another way to improve your performance. Run a profiler. Look for hot spots. Reduce unnecessary IO.
The next solution is to break your program into multiple processes, each of which can run in their own address space. This is easiest because there is no chance of the individual processes messing each other up.
The next solution is to use threads. At this point you're opening a major can of worms, so start small, and only multi-thread the critical path of the code.
The next solution is to use asynch IO. This is generally only recommended for people writing a very heavily loaded server, and even then I would rather re-use one of the existing frameworks that abstract away the details, e.g. the C++ framework ICE, or an EJB server under Java.
Note that each of these solutions has multiple sub-solutions - there are different breeds of threads and different kinds of asynch IO, each with slightly different performance characteristics, but again, it's generally best to let the framework handle it for you.
