Spring Batch multithreaded processing for single file to multiple files - multithreading

My problem statement: read a CSV file with 10 million records and store it in a DB, in as little time as possible.
I had implemented it using a plain multithreaded executor in Java, with logic almost identical to Spring Batch's chunking: read a preconfigured number of records from the CSV file, create a thread, and pass the data to that thread, which validates it and then writes it to a file; this runs across multiple threads. Once all the tasks are done, I call SQL*Loader to load each file. Now I want to move this code to Spring Batch (I'm a newbie to Spring Batch).
Here are my questions:
1. In a step, is it possible to make the ItemReader-to-ItemWriter flow multithreaded (as I read the file, create a new thread to process the data before that thread writes it out)? If not, I need to create two steps: a first, single-threaded step that reads the file, and a second, multithreaded step that writes to individual files. But then how do I pass the list of data from the first step to the second?
2. If there is a failure in a single thread, how can I stop the whole batch job?
3. How can I retry the batch job after a certain interval in case of failure? I know there is a retry option on failure, but I could not find an option to retry the step after a certain interval. I'm not talking about a scheduler here, because the batch job already runs under a scheduler; on failure it has to be re-run after 3 minutes or so.

Here is how I solved the problem.
Read the file and chunk it (split the file) using buffered and FileChannel readers and writers (the fastest way to read/write files; even Spring Batch uses the same). I implemented this so that it runs before the job starts (however, it can also run as a job step using a method-invoking tasklet).
Start the job with the directory location as a job parameter.
Use a MultiResourcePartitioner, which takes the directory location and creates a slave step for each file, each in a separate thread (see the sketch below).
In the slave step, get the file passed by the partitioner and use Spring Batch's ItemReader to read it.
Use a database item writer (I'm using the MyBatis batch ItemWriter) to push the data to the database.
It's best to keep the split count equal to the step's commit count.
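To make the partitioned part concrete, here is a minimal sketch in Spring Batch Java config (the split-file directory, bean names and the injected slaveStep are my own placeholders; the slave step is assumed to define a @StepScope FlatFileItemReader bound to #{stepExecutionContext['fileName']}):

import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class PartitionedImportConfig {

    // One partition per split file; MultiResourcePartitioner exposes each file to its
    // slave step under the step-execution-context key "fileName".
    @Bean
    public MultiResourcePartitioner filePartitioner() throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:/tmp/splits/*.csv")); // hypothetical split directory
        return partitioner;
    }

    // Master step: runs the slave step once per partition, each on its own thread.
    @Bean
    public Step masterStep(StepBuilderFactory steps, Step slaveStep, TaskExecutor taskExecutor)
            throws IOException {
        return steps.get("masterStep")
                .partitioner("slaveStep", filePartitioner())
                .step(slaveStep)
                .gridSize(10)
                .taskExecutor(taskExecutor)
                .build();
    }
}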

About multi-threaded reading: the "How to set up multi-threading in Spring Batch?" answer will point you in the right direction. Also, in this sample there are some considerations about restartability for CSV files.
The job should automatically fail if there is an error on a thread: I have never tried it, but this should be the default behaviour.
The "Spring Batch: How to set time interval between each call in a Chunk tasklet" question can be a start. Also, from the official doc about backoff policies:
When retrying after a transient failure it often helps to wait a bit before trying again, because usually the failure is caused by some problem that will only be resolved by waiting. If a RetryCallback fails, the RetryTemplate can pause execution according to the BackoffPolicy in place.
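As a concrete illustration of that back-off (my own sketch, not from the linked answers; the jobLauncher, job and parameters are assumed to be defined elsewhere), a RetryTemplate with a FixedBackOffPolicy can re-launch the job up to 3 times, waiting 3 minutes between attempts:

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.retry.backoff.FixedBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class RetryingJobLauncher {

    // Re-launch the job up to 3 times, pausing 3 minutes between attempts.
    public JobExecution launchWithBackOff(JobLauncher jobLauncher, Job job,
                                          JobParameters params) throws Exception {
        FixedBackOffPolicy backOff = new FixedBackOffPolicy();
        backOff.setBackOffPeriod(3 * 60 * 1000L); // wait 3 minutes before the next attempt

        RetryTemplate retryTemplate = new RetryTemplate();
        retryTemplate.setBackOffPolicy(backOff);
        retryTemplate.setRetryPolicy(new SimpleRetryPolicy(3));

        return retryTemplate.<JobExecution, Exception>execute(context -> {
            JobExecution execution = jobLauncher.run(job, params);
            if (execution.getStatus() != BatchStatus.COMPLETED) {
                // Throwing makes the RetryTemplate wait (per the BackOffPolicy) and retry.
                throw new IllegalStateException("Job ended with " + execution.getStatus());
            }
            return execution;
        });
    }
}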
Let me know if this helps, or how you solve the problem, because I'm interested for my (future) work!
I hope my indications can be helpful.

You can split your input file into many files, then use a Partitioner to load the small files with threads; but on error, you must restart the whole job after the DB has been cleaned.
<batch:job id="transformJob">
    <batch:step id="deleteDir" next="cleanDB">
        <batch:tasklet ref="fileDeletingTasklet" />
    </batch:step>
    <batch:step id="cleanDB" next="split">
        <batch:tasklet ref="countThreadTasklet" />
    </batch:step>
    <batch:step id="split" next="partitionerMasterImporter">
        <batch:tasklet>
            <batch:chunk reader="largeCSVReader" writer="smallCSVWriter"
                         commit-interval="#{jobExecutionContext['chunk.count']}" />
        </batch:tasklet>
    </batch:step>
    <batch:step id="partitionerMasterImporter" next="partitionerMasterExporter">
        <batch:partition step="importChunked" partitioner="filePartitioner">
            <batch:handler grid-size="10" task-executor="taskExecutor" />
        </batch:partition>
    </batch:step>
</batch:job>
Full example code (on Github).
Hope this helps.

Related

How to set skipLimit when using spring batch framework to execute a step concurrently?

I use Spring Batch to load data from a file into a database. The job contains only one step. I use a ThreadPoolTaskExecutor to execute the step concurrently. The step is similar to this one:
public Step myStep() {
    return stepBuilderFactory.get("MyStep")
            .chunk(10000)
            .reader(flatFileItemReader)
            .writer(jdbcBatchItemWriter)
            .faultTolerant()
            .skip(NumberFormatException.class)
            .skip(FlatFileParseException.class)
            .skipLimit(3)
            .throttleLimit(10)
            .taskExecutor(taskExecutor)
            .build();
}
There are 3 "NumberFormat" errors in my file, so I set skipLimit to 3, but I find that when I execute the job, it starts 10 threads and each thread gets 3 skips, so I have 3 * 10 = 30 skips in total, while I only need 3.
So the question is: will this cause any problems? And is there any other way to skip exactly 3 times while executing a step concurrently?
GitHub issue:
Robert Kasanicky opened BATCH-926 and commented
When chunks execute concurrently, each chunk works with its own local copy of skipCount (from StepContribution). Given skipLimit=1 and 10 chunks executing concurrently, we can end up with a successful job execution and 10 skips. As the number of concurrent chunks increases, the skipLimit becomes more a limit per chunk than per job.
Dave Syer commented
I vote for documenting the current behaviour.
Robert Kasanicky commented
documented current behavior in the factory bean
However, this refers to a very old version of Spring Batch.
I have a different problem in my code, but the skipLimit does seem to be aggregated when I use multiple threads, although the job sometimes hangs without properly failing when the SkipLimitExceededException is thrown.
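One possible workaround (my own sketch, not something from the issue or the docs) is to drop skip()/skipLimit() and plug in a custom SkipPolicy via .faultTolerant().skipPolicy(new GlobalSkipPolicy(3)), counting skips across all threads with a shared AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.batch.core.step.skip.SkipLimitExceededException;
import org.springframework.batch.core.step.skip.SkipPolicy;
import org.springframework.batch.item.file.FlatFileParseException;

// Counts skips across all worker threads instead of relying on the per-chunk
// StepContribution count.
public class GlobalSkipPolicy implements SkipPolicy {

    private final AtomicInteger skips = new AtomicInteger();
    private final int limit;

    public GlobalSkipPolicy(int limit) {
        this.limit = limit;
    }

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        boolean skippable = t instanceof NumberFormatException || t instanceof FlatFileParseException;
        if (!skippable) {
            return false; // non-skippable exceptions fail the step as usual
        }
        if (skips.incrementAndGet() > limit) {
            throw new SkipLimitExceededException(limit, t);
        }
        return true;
    }
}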

Netsuite Map Reduce yielding

I read in the documentation that soft governance limits cause map/reduce scripts to yield and reschedule. My problem is that I cannot see where the docs explain what happens during the yield. Is getInputData called again to regather the same data set to be mapped, or is the initial data set persisted somewhere, with already mapped and reduced records excluded from processing?
With yielding, the getInputData stage is not called again. From the docs:
If a job monopolizes a processor for too long, the system can naturally finish the job after the current map or reduce function has completed. In this case, the system creates a new job to continue executing remaining key/value pairs. Based on its priority and submission timestamp, the new job either starts right after the original job has finished, or it starts later, to allow higher-priority jobs processing other scripts to execute. For more details, see SuiteScript 2.0 Map/Reduce Yielding.
This is different from server restarts or interruptions, however.

Spark stateSnapshots() not working with saveAsHadoopFiles

In my Spark Streaming 1.6 application, I want to store certain values with mapWithState and then periodically save them to disk as a backup option.
JavaMapWithStateDStream<String, SwMessage, CcState, Tuple2<String, CcState>> SwMessageWithState =
        pairSwMsg.mapWithState(StateSpec.function(mappingFunc).initialState(cStateMap));
For backing up I am using the stateSnapshots() method as follows.
SwMessageWithState.stateSnapshots()
        .saveAsHadoopFiles("/ccd/snap", "txt", String.class, CcState.class, TextOutputFormat.class);
The problem I am facing is that the program stops consuming the messages after the first batch and then does nothing.
If I comment the stateSnapshots() line the program works fine.
Can somebody suggest what exactly is wrong with the above statement?
Also, I was thinking of using the snapshot directory as the initialState file the next time the Spark Streaming job runs.

Spark Streaming Execution Flow

I am a newbie to Spark Streaming and I have some doubts, such as:
Do we always need more than one executor, or can we do our job with one?
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. So is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
If it is processed simultaneously, then how is it ensured that my processing finishes within the batch duration?
How do I use the web UI for monitoring and debugging?
Do we always need more than one executor, or can we do our job with one?
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine suffices in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor crashing could make your entire stream fail.
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. So is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
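For reference, a minimal sketch (assuming Spark 1.6 with the spark-streaming-kafka artifact; the broker address and topic name are made up) of the receiver-less direct stream with a one-minute batch duration:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("direct-stream-sketch");
        // One-minute batches: each batch is read, then fully processed before the
        // next batch's job starts (unless spark.streaming.concurrentJobs > 1).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
        Set<String> topics = new HashSet<>(Arrays.asList("events")); // hypothetical topic

        // Receiver-less direct stream: offsets are tracked by Spark itself.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        stream.count().print(); // trivial output action so each batch produces a job

        jssc.start();
        jssc.awaitTermination();
    }
}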
If it is processed simultaneously, then how is it ensured that my processing finishes within the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How do I use the web UI for monitoring and debugging?
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How do I use the web UI for monitoring and debugging?
Monitor your application in the Streaming tab on localhost:4040; the main metrics to look for are Processing Time and Scheduling Delay. Have a look at the official doc: http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration is a bit long; try adjusting it to lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853

Spring Integration - batch process to process 18000 jobs in 15 mins

I have the scenario below and am currently leveraging Spring Integration as the technology to achieve it.
I have around 18000 staff ID records.
For each staff member, a process needs to kick off: 1 HTTP call to retrieve staff profile information from the mail calendar server, then 1 HTTP call to retrieve some other information, then possibly 3-5 more HTTP calls, all in a single task.
I need to finish this process for the above 50000 staff in 15 mins.
This whole batch process needs to run every 15 mins, again and again.
Assume each job takes 5 seconds to finish... I still need 30 mins to finish.
=================
Initial thinking
I can use Spring Integration to have something like:
- Create one job for each staff member - 18000 jobs. The job request likely only contains a staff ID, so the request is very lightweight.
- Add all the jobs to the int:queue at once so it triggers the input channel - calenderSynRequestChannel.
- Have a poller - 100 concurrent workers to clear the jobs within 15 mins.
Questions:
Is this a good way to do this kind of batch processing? One concern I have is the size of the queue needed to hold 18000 jobs at once.
Should I use a file-based approach, storing all the staff IDs in multiple files that get picked up later by the poller? However, this would also complicate the design, as there could be concurrency issues when the workers read/write/delete the files.
Current solution:
<int:service-activator ref="synCalenderService" method="synCalender" input-channel="calenderSynRequestChannel">
    <int:poller fixed-delay="50" time-unit="MILLISECONDS" task-executor="taskExecutor" receive-timeout="0" />
</int:service-activator>
<task:executor id="taskExecutor" pool-size="50" keep-alive="120" queue-capacity="500"/>
Has anyone encountered a similar problem who might give a bit of insight on how to address this using Spring Integration?
Why not use a Spring Batch job with:
a Reader that reads the staff data,
a Processor that makes the HTTP calls,
a Writer that writes the result to a log file (for example).
Then utilize the TaskScheduler (from the Spring framework) to schedule execution every 15 minutes, or maybe even better with a fixed delay.
If you want to do more of the work in parallel, utilize org.springframework.batch.integration.async.AsyncItemProcessor (and the corresponding AsyncItemWriter), as sketched below.
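A rough sketch of that AsyncItemProcessor/AsyncItemWriter wiring (bean names are invented, and staff IDs and results are modeled as plain Strings just to keep the example self-contained):

import java.util.concurrent.Future;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.TaskExecutor;

public class StaffSyncStepConfig {

    // Runs the HTTP-calling processor on the task executor's threads; the async
    // writer unwraps the resulting Futures before delegating to the real writer.
    @Bean
    public Step staffSyncStep(StepBuilderFactory steps,
                              ItemReader<String> staffIdReader,
                              ItemProcessor<String, String> httpCallProcessor,
                              ItemWriter<String> logFileWriter,
                              TaskExecutor taskExecutor) {
        AsyncItemProcessor<String, String> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(httpCallProcessor); // the per-staff HTTP calls
        asyncProcessor.setTaskExecutor(taskExecutor);

        AsyncItemWriter<String> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(logFileWriter);

        return steps.get("staffSyncStep")
                .<String, Future<String>>chunk(100)
                .reader(staffIdReader)
                .processor(asyncProcessor)
                .writer(asyncWriter)
                .build();
    }
}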
