Spring Batch multithreading to read a big file

I have the following use case which has to be implemented using Spring Batch:
Read a big file
Process each record/line from the file
Write it to the DB
Currently it is single-threaded and working fine, but I want to make it multithreaded so that multiple threads process and write to the DB.
I see two approaches:
If I use ItemReader, ItemProcessor, and ItemWriter, then reading the file with multiple threads is not a good idea and might degrade performance, because multiple threads would be reading the same file through the same ItemReader class.
The second approach is: in step1 I read the file and place the lines in a BlockingQueue, and step2 is multithreaded, reading, processing, and writing from that BlockingQueue. Then step1 and step2 run in parallel, so step1 is a single-threaded producer and step2 is a multi-threaded consumer.
Please suggest which one is the right way to do it.
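For reference, a common third option is to keep a single reader but let Spring Batch run the chunk processing and writing on a thread pool, with the reader wrapped so that concurrent threads cannot corrupt its state. Below is a minimal sketch, assuming Spring Batch 4.x; the file path, chunk size, throttle limit, and the processor/writer beans are illustrative placeholders, not anything from the original question.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class MultiThreadedStepConfig {

    @Bean
    public SynchronizedItemStreamReader<String> fileReader() {
        FlatFileItemReader<String> delegate = new FlatFileItemReader<>();
        delegate.setName("bigFileReader");
        delegate.setResource(new FileSystemResource("/data/big-file.txt")); // placeholder path
        delegate.setLineMapper(new PassThroughLineMapper());
        delegate.setSaveState(false); // restart state is unreliable in a multi-threaded step

        // The wrapper serializes read() calls so several chunk threads can share one reader.
        SynchronizedItemStreamReader<String> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }

    @Bean
    public Step multiThreadedStep(StepBuilderFactory steps,
                                  ItemProcessor<String, String> lineProcessor, // hypothetical bean
                                  ItemWriter<String> dbWriter) {               // hypothetical bean
        return steps.get("multiThreadedStep")
                .<String, String>chunk(1000)
                .reader(fileReader())
                .processor(lineProcessor)
                .writer(dbWriter)
                .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))
                .throttleLimit(8) // cap on concurrent chunk workers
                .build();
    }
}

With this shape, chunks are processed and written concurrently on up to 8 threads while the file is still read by one thread at a time; the trade-off is that item ordering and restartability are lost, which is why saveState is disabled on the delegate.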

Related

Can I do writes to a sqlite database from multiple threads if I create a dedicated agent for it

My scenario is as follows:
I have 10 datasets I need to process. I will be using 10 threads to process them all in parallel (each can take up to an hour). Once I find the info I want in a dataset, I will write it to the SQLite database. I might also have to update an existing row. I won't be doing any selects or deletes until all the datasets have finished processing.
From what I understand, SQLite will not handle this scenario well, since only one thread can lock the file to write and I don't want to hold up the other threads waiting for the lock to be acquired.
So my idea is to create another thread to handle all these writes. When a processing thread finds something it wants to write to the DB, it sends it to the writer thread. The writer thread can then spin off a thread to do the actual write, so that if another request comes in while something is already being written, it can be queued. That way only one thread is ever actually trying to write to the DB.
My main question is as follows:
Will this work / is this sane? Also is there something that does this already?
I'm using python if that matters.
Thanks
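The asker is working in Python, but the single-writer idea is language-agnostic; here is a rough sketch of it in Java, where worker threads enqueue results onto a BlockingQueue and one dedicated thread owns the SQLite connection and performs every write. The table name, columns, and the jdbc:sqlite URL (assuming the xerial sqlite-jdbc driver is on the classpath) are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriter implements Runnable {

    // What the 10 processing threads hand over; fields are illustrative.
    static final class Result {
        final String datasetId;
        final String payload;
        Result(String datasetId, String payload) { this.datasetId = datasetId; this.payload = payload; }
    }

    private static final Result POISON = new Result(null, null); // shutdown marker
    private final BlockingQueue<Result> queue = new LinkedBlockingQueue<>();

    public void submit(Result r) { queue.add(r); } // called from any worker thread
    public void shutdown()       { queue.add(POISON); }

    @Override
    public void run() {
        // Only this thread ever opens the connection or executes statements.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:results.db");
             PreparedStatement upsert = conn.prepareStatement(
                     "INSERT OR REPLACE INTO results(dataset_id, payload) VALUES (?, ?)")) {
            while (true) {
                Result r = queue.take();   // blocks until a worker submits something
                if (r == POISON) break;    // stop once all workers are done
                upsert.setString(1, r.datasetId);
                upsert.setString(2, r.payload);
                upsert.executeUpdate();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

The workers never block on the database lock; they only pay the cost of an in-memory queue insert. In Python the same shape falls out of queue.Queue plus a single thread holding the sqlite3 connection.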

TensorFlow: More than one thread in shuffle_batch for single sample files

I'm trying to understand the significance of using num_threads>1 in tf.train.shuffle_batch connected to a tf.WholeFileReader reading image files (each file contains a single data sample). Will setting num_threads>1 make any difference in such a case compared to num_threads=1? What are the mechanics of the file and batch queues in that case?
A short answer: it will probably make the execution faster. Here is some authoritative explanation from the guide:
You can use a single reader via the tf.train.shuffle_batch with num_threads bigger than 1. This will make it read from a single file at the same time (but faster than with 1 thread), instead of N files at once. This can be important:
If you have more reading threads than input files, to avoid the risk that you will have two threads reading the same example from the same file near each other.
Or if reading N files in parallel causes too many disk seeks.
How many threads do you need? The tf.train.shuffle_batch* functions add a summary to the graph that indicates how full the example queue is. If you have enough reading threads, that summary will stay above zero.

File writing from multiple threads.

I have an application A which calls another application B that does some calculation and writes to a file File.txt.
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same file File.txt.
Here comes the actual problem: since multiple threads try to access the same file, the file access throws an exception, which is expected.
I tried an approach of using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class takes care of dequeuing items from the queue and writing them to File.txt. The queue is drained synchronously and the write operations succeed. This works fine.
If I have too many threads and too many items in the queue, the file writing still works, but if for some reason my queue crashes or stops abruptly, all the information that was supposed to be written to the file is lost.
If I make the file writing synchronous from B without using the queue, then it will be slow because it needs to check for file locking, but there is less chance of data being lost, since B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed, and I can't make B wait for the file writing to complete.
Would async/await file writing be of any use here?
I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers were born to solve this problem. Why continue with a file-based solution? Is it possible to explore an alternative?
is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move to new files, and an aggregation thread wakes up, reads the previous files (of each producer), and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery, since the rolling files exist until you process and delete them.
It also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
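A rough sketch of that rolling-file idea, in Java: each producer appends to its own time-bucketed file, and a scheduled aggregator merges files from older buckets into File.txt and deletes them. The directory layout, file naming, roll interval, and grace period are all illustrative assumptions, not part of the original discussion.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RollingFiles {

    static final long ROLL_MILLIS = 10_000;              // "every X seconds"
    static final Path SPOOL = Paths.get("spool");        // per-producer files live here
    static final Path FINAL_FILE = Paths.get("File.txt");

    // Each producer thread calls this with its own id; no shared file, no locks.
    static void write(int producerId, String line) throws IOException {
        Files.createDirectories(SPOOL);
        long bucket = System.currentTimeMillis() / ROLL_MILLIS;
        Path part = SPOOL.resolve("producer-" + producerId + "-" + bucket + ".part");
        Files.writeString(part, line + System.lineSeparator(), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // The aggregator wakes up periodically, merges files from older buckets into
    // File.txt and deletes them. Unmerged files survive a crash, so nothing is lost.
    static void startAggregator() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            long current = System.currentTimeMillis() / ROLL_MILLIS;
            try (DirectoryStream<Path> parts = Files.newDirectoryStream(SPOOL, "producer-*.part")) {
                for (Path part : parts) {
                    String name = part.getFileName().toString();
                    long bucket = Long.parseLong(
                            name.substring(name.lastIndexOf('-') + 1, name.lastIndexOf('.')));
                    if (bucket >= current - 1) continue; // leave a grace bucket for in-flight writes
                    Files.write(FINAL_FILE, Files.readAllBytes(part),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    Files.delete(part);
                }
            } catch (IOException e) {
                e.printStackTrace(); // sketch only; real code would log and retry
            }
        }, ROLL_MILLIS, ROLL_MILLIS, TimeUnit.MILLISECONDS);
    }
}

The original thread is presumably .NET given the async/await mention; the sketch is in Java purely to show the moving parts.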
Would async/await file writing be of any use here?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) a shared resource (the output file) and 2) lack of consistency (when the queue crashes), neither of which is what async programming addresses.
The reason async is in the picture is that I don't want to delay B's existing work because of this file-writing operation.
async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async by simply using the asynchronous IO APIs.

Multi threaded Lucene Index Writer with Spring Batch

I have a multi-step Spring Batch job, and in one of the steps I create Lucene indices for the data read in the reader so that subsequent steps can search those Lucene indices.
Based on the data read in the ItemReader, I spread the indices across a few separate directories.
If I specify the step's task executor to be a SimpleAsyncTaskExecutor, I don't get any issue as long as the indices are always written to different directories, but sometimes I get a locking exception. I guess two threads tried to write to the same index.
If I remove the SimpleAsyncTaskExecutor, I don't get any issues, but the writes become sequential and slow.
Is it possible to use multithreading with a Lucene IndexWriter if indices are being written to a single directory?
Do I need to make the index creation code thread-safe to use SimpleAsyncTaskExecutor?
The index creation code is in the step's processor.
I am using Lucene 6.0.0, and as per the IndexWriter API doc:
NOTE: IndexWriter instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.
I was creating multiple instances of the writer, and that was causing problems. A single writer instance can be passed to as many threads as you like, provided the rest of the code around that writer is thread safe.
I used a single writer instance and parallelized the chunks. Each parallel chunk wrote to the same directory without any issues.
To parallelize the chunks, I had to make my chunk components (reader, processor, and writer) thread safe.
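To make that concrete, here is a minimal sketch of the "one IndexWriter, many threads" arrangement against a single directory using the Lucene 6.x API; the index path, analyzer, field names, and sample data are placeholders rather than the poster's actual code.

import java.nio.file.Paths;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SharedIndexWriter {

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // One writer for the whole directory; IndexWriter itself is thread safe,
        // so the same instance is handed to every indexing task.
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index-dir")), config)) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (String text : Arrays.asList("alpha", "beta", "gamma", "delta")) {
                pool.submit(() -> {
                    try {
                        Document doc = new Document();
                        doc.add(new TextField("body", text, Field.Store.YES));
                        writer.addDocument(doc); // safe to call from multiple threads
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            writer.commit(); // make the documents visible to readers
        }
    }
}

The locking exception the poster saw is what typically happens when a second IndexWriter is opened on the same directory; sharing a single instance avoids it.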

How to process rows of a CSV file using Groovy/GPars most efficiently?

The question is a simple one and I am surprised it did not pop up immediately when I searched for it.
I have a CSV file, a potentially really large one, that needs to be processed. Each line should be handed to a processor until all rows are processed. For reading the CSV file I'll be using OpenCSV, which essentially provides a readNext() method that gives me the next row. If no more rows are available, all processors should terminate.
For this I created a really simple Groovy script, defined a synchronized readNext() method (as reading the next line is not really time-consuming), and then created a few threads that read the next line and process it. It works fine, but...
Shouldn't there be a built-in solution that I could just use? It's not the GPars collection processing, because that always assumes an existing collection in memory. I cannot afford to read it all into memory and then process it; that would lead to OutOfMemory exceptions.
So... does anyone have a nice template for processing a CSV file "line by line" using a couple of worker threads?
Concurrently accessing a file might not be a good idea, and GPars' fork/join processing is only meant for in-memory data (collections). My suggestion would be to read the file sequentially into a list. When the list reaches a certain size, process the entries in the list concurrently using GPars, clear the list, and then move on with reading lines.
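In other words, something along these lines: read rows into a batch, process the whole batch in parallel, wait for it to finish, clear it, and keep reading. The sketch below uses plain Java with an ExecutorService (and a BufferedReader instead of OpenCSV) purely to show the shape; the batch size, file name, and process() body are placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedCsv {

    static final int BATCH_SIZE = 1000; // tune so a batch comfortably fits in memory

    static void process(String row) {
        // per-row work goes here
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big.csv"))) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(pool, batch); // blocks until the whole batch is done
                    batch.clear();             // then start filling the next one
                }
            }
            processBatch(pool, batch);         // leftover rows
        } finally {
            pool.shutdown();
        }
    }

    // Process one in-memory batch concurrently and wait for it to finish,
    // so only BATCH_SIZE rows are ever held in memory at once.
    private static void processBatch(ExecutorService pool, List<String> batch) throws InterruptedException {
        List<Callable<Void>> tasks = new ArrayList<>(batch.size());
        for (String row : batch) {
            tasks.add(() -> { process(row); return null; });
        }
        pool.invokeAll(tasks);
    }
}

The GPars equivalent of processBatch would be roughly GParsPool.withPool { batch.eachParallel { process(it) } }.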
This might be a good problem for actors. A synchronous reader actor could hand off CSV lines to parallel processor actors. For example:
#Grab(group='org.codehaus.gpars', module='gpars', version='0.12')
import groovyx.gpars.actor.DefaultActor
import groovyx.gpars.actor.Actor

// Reader actor: hands out one CSV row per request. readCsv() is a placeholder
// for the row-reading logic (e.g. wrapping OpenCSV's readNext()).
class CsvReader extends DefaultActor {
    void act() {
        loop {
            react {
                reply readCsv()
            }
        }
    }
}

// Processor actor: keeps asking the reader for rows and terminates when it
// receives null (end of file). processCsv() is the per-row work.
class CsvProcessor extends DefaultActor {
    Actor reader
    void act() {
        loop {
            reader.send(null)
            react {
                if (it == null)
                    terminate()
                else
                    processCsv(it)
            }
        }
    }
}

def N_PROCESSORS = 10
def reader = new CsvReader().start()
(0..<N_PROCESSORS).collect { new CsvProcessor(reader: reader).start() }*.join()
I'm just wrapping up an implementation of a problem just like this in Grails (you don't specify if you're using Grails, plain Hibernate, plain JDBC, or something else).
There isn't anything out of the box that I'm aware of. You could look at integrating with Spring Batch, but the last time I looked at it, it felt very heavy to me (and not very Groovy).
If you're using plain JDBC, doing what Christoph recommends is probably the easiest thing to do (read in N rows and use GPars to spin through those rows concurrently).
If you're using Grails or Hibernate and want your worker threads to have access to the Spring context for dependency injection, things get a bit more complicated.
The way I solved it is by using the Grails Redis plugin (disclaimer: I'm the author) and the Jesque plugin, which is a Java implementation of Resque.
The Jesque plugin lets you create "Job" classes that have a "process" method with arbitrary parameters, which are used to process work enqueued on a Jesque queue. You can spin up as many workers as you want.
I have a file upload that an admin user can post a file to; it saves the file to disk and enqueues a job for the ProducerJob that I've created. That ProducerJob spins through the file and, for each line, enqueues a message for a ConsumerJob to pick up. The message is simply a map of the values read from the CSV file.
The ConsumerJob takes those values, creates the appropriate domain object for its line, and saves it to the database.
We were already using Redis in production, so using it as a queueing mechanism made sense. We had an old synchronous load that ran through file loads serially. I'm currently using one producer worker and 4 consumer workers, and loading things this way is over 100x faster than the old load was (with much better progress feedback to the end user).
I agree with the original question that there is probably room for something like this to be packaged up, as it is a relatively common need.
UPDATE: I put up a blog post with a simple example of doing imports with Redis + Jesque.
