How to implement multithreading in Spring Batch

We have implemented a Spring Batch job that handles a large data set, and because of that it takes too much time to finish (around 2 to 4 hours). We have an ItemReader, ItemProcessor, and ItemWriter.
We are trying to improve performance. The code itself is fine, so we are planning to implement multithreading.
Can anybody suggest how to implement multithreading in Spring Batch?
I think we should also take care of member variables that are injected via @Autowired.
Please give your suggestions.

We used an Executor to achieve this. We did the following:
Found the code, or the portion of the code, that took the most time to execute.
Analyzed whether we could apply multithreading to it.
Used an Executor with our own framework, which makes sure that threads return results in the same sequence in which the work was inserted.
Converted the member variables the batch execution depends on into local variables, since shared member variables can cause multithreading problems.
We were able to reduce the total time of all the Spring Batch jobs from 12 hours to 4 hours.
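The "returning in sequence" idea from the steps above can be sketched with the plain JDK: an ExecutorService runs the work in parallel, but collecting the Futures in submission order yields the results in the same order the items were inserted. This is an illustrative sketch, not the answerer's actual framework; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedExecutorDemo {
    // Processes the items in parallel but returns the results in the same
    // order in which the items were submitted.
    static List<String> processInOrder(List<String> items, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String item : items) {
                // The real per-item work goes here; toUpperCase() is a stand-in.
                futures.add(pool.submit(() -> item.toUpperCase()));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // iterating the futures preserves submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each `Future.get()` blocks until that particular item is done, so later threads may finish first, but the result list always matches the input order.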

Related

Invisible Delays between Spark Jobs

There are 4 major actions (JDBC writes) in the application, plus a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see there are certain jobs named "run at ThreadPoolExecutor.java:1149". Just before such a job appears on the Spark UI, the invisible long delays occur.
I want to know the possible causes of these delays.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since the data is not that large.
It's in production where I get the app executed, run by my manager.
If anyone has come across such delays in their job, please share your experience of the potential cause. Currently I am working around the unions, i.e. caching the associated DataFrames and calling count so as to get the benefit of the cache in the subsequent union (yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count in between, to shorten the lineage.
The time was reduced from the initial 18 minutes to 12 minutes, but the issue with invisible delays still persists.
Thanks in advance
I assume you don't have CPU- or IO-heavy code between your Spark jobs.
So if it really is Spark, 99% of the time it is query-planning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query-planning performance.
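For reference, the listener registration mentioned above looks roughly like this in Java (a sketch, assuming Spark 2.4+ where `QueryExecution.tracker()` exposes per-phase planning timings; the class name is illustrative):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.execution.QueryExecution;
import org.apache.spark.sql.util.QueryExecutionListener;

class PlanningMetricsSetup {
    static void install(SparkSession spark) {
        spark.listenerManager().register(new QueryExecutionListener() {
            @Override
            public void onSuccess(String funcName, QueryExecution qe, long durationNs) {
                // durationNs covers the whole execution; on Spark 2.4+,
                // qe.tracker() additionally records how long the analysis,
                // optimization, and planning phases took before execution.
                System.out.println(funcName + " finished in " + durationNs / 1_000_000 + " ms");
            }

            @Override
            public void onFailure(String funcName, QueryExecution qe, Exception exception) {
                System.err.println(funcName + " failed: " + exception);
            }
        });
    }
}
```

Comparing the planning-phase timings against the total duration should show whether the "invisible" gap is spent in the Catalyst optimizer rather than in the executors.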

How to set skipLimit when using spring batch framework to execute a step concurrently?

I use Spring Batch to load data from a file to a database. The job contains only one step. I use a ThreadPoolTaskExecutor to execute the step concurrently. The step is similar to this one.
public Step myStep() {
    return stepBuilderFactory.get("MyStep")
            .chunk(10000)
            .reader(flatFileItemReader)
            .writer(jdbcBatchItemWriter)
            .faultTolerant()
            .skip(NumberFormatException.class)
            .skip(FlatFileParseException.class)
            .skipLimit(3)
            .throttleLimit(10)
            .taskExecutor(taskExecutor)
            .build();
}
There are 3 NumberFormatException errors in my file, so I set skipLimit to 3, but I find that when I execute the job, it starts 10 threads and each thread allows 3 skips, so I get 3 * 10 = 30 skips in total, while I only need 3.
So the question is: will this cause any problems? And is there any other way to skip exactly 3 times while executing a step concurrently?
github issue
Robert Kasanicky opened BATCH-926 and commented
When chunks execute concurrently, each chunk works with its own local copy of skipCount (from StepContribution). Given skipLimit=1 and 10 chunks executing concurrently, we can end up with a successful job execution and 10 skips. As the number of concurrent chunks increases, the skipLimit becomes more a limit per chunk than per job.
Dave Syer commented
I vote for documenting the current behaviour.
Robert Kasanicky commented
documented current behavior in the factory bean
However, this appears to describe the behaviour of a very old version of Spring Batch.
I have a different problem in my code, but the skipLimit seems to be aggregated when I use multiple threads, albeit the job sometimes hangs without failing properly when a SkipLimitExceededException is thrown.
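If a strict global limit is needed regardless of version, one workaround (a JDK-only sketch of the idea behind a custom skip policy backed by a shared counter, not a built-in Spring Batch mechanism; the class name is illustrative) is to make every worker thread reserve a skip from one shared budget before swallowing an exception:

```java
import java.util.concurrent.atomic.AtomicInteger;

// A thread-safe skip budget shared by all worker threads: each thread must
// reserve a skip before skipping an item, so the total number of skips
// across all threads can never exceed the limit.
class SharedSkipBudget {
    private final AtomicInteger remaining;

    SharedSkipBudget(int limit) {
        this.remaining = new AtomicInteger(limit);
    }

    /** Returns true if a skip was reserved, false if the budget is exhausted. */
    boolean trySkip() {
        while (true) {
            int r = remaining.get();
            if (r <= 0) {
                return false; // limit reached: the caller should fail the step instead
            }
            if (remaining.compareAndSet(r, r - 1)) {
                return true;
            }
        }
    }
}
```

In Spring Batch this counter would live inside a `SkipPolicy` implementation passed to `.skipPolicy(...)` on the fault-tolerant step builder, returning `false` (don't skip) once the budget is spent.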

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex piece of software which performs really complex SQL queries (well, not queries; Spark plans, you know). <-- The plans are dynamic; they change based on user input so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY", and it takes 1 minute 20 seconds for the explain to execute.
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (I already did, but this task can't be overlapped with anything else).
Any ideas/guidance on how to improve the plan builder in Spark? (for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can build them in parallel and then execute them)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution is longer, so :).
Thanks!
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause of the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no one provides more insights):
I couldn't manage to reduce the optimizer times (and I'm not sure reducing them would even be good, as I might lose execution time).
One of the latest parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations, etc.).
So I split my code into two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates the two big tables (cached)
And added one more part:
3- Left join/enrich the big plan with the output of "2" (this takes about 10 seconds; the dataset is not that big) and finish the remaining computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait, and afterwards perform 3.
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in around 10-15 seconds, which is a good improvement in execution time on top of the 1:30 I save while calculating the plan.
Comparing times:
Before I would have:
1:30 Spark optimizing time + 6 minutes execution time
Now I have:
max(
  1:30 Spark optimizing time + 4 minutes execution time,
  0:02 Spark optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
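The driver-level parallelism described here can be sketched with a plain ExecutorService (the two Callables stand in for triggering the cached big plan and the small plan; the names are illustrative, not Spark API):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelActionsDemo {
    // Runs two independent actions concurrently and waits for both, so the
    // total wall-clock time is roughly max(action1, action2) instead of
    // action1 + action2.
    static String runBothThenJoin(Callable<String> bigPlan, Callable<String> smallPlan)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> big = pool.submit(bigPlan);     // e.g. bigDf.cache().count()
            Future<String> small = pool.submit(smallPlan); // e.g. smallDf.cache().count()
            // Both futures are resolved before the final join/enrich step runs.
            return big.get() + " joined with " + small.get();
        } finally {
            pool.shutdown();
        }
    }
}
```

While one driver thread is blocked in Catalyst planning the big query, the other thread's action keeps the executors busy, which is exactly where the overlap in the timing comparison above comes from.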
Not that much, but quite a few "expensive" people will be waiting for it to finish :)

Is there any python wrapper to submit my single job to High Performance Cluster managed by SGE?

I have already used the multiprocessing package and used up all the CPUs in one node. I would like to use another 10 nodes to complete my job. Thus, I need 10 * 10 task threads for the calculation. Is there some example code? I found the post "How to use multiple nodes/cores on a cluster with parellelized Python code", but I am still confused. For instance, with the interface implemented there,
job_task = perform_job(job_params, nodes, cups)
Any suggestion is greatly appreciated.

Quartz multiple independent Spring jobs to run together using same CRON expression

I have multiple Spring Batch jobs that I am required to schedule such that they are triggered at the same time. So far I am not having any luck with this, and I also do not understand why. I read somewhere that having multiple jobs trigger on the same CRON expression (e.g. after every 10 minutes) is not possible with Quartz, and that I have to increase the size of the thread pool somehow to make it possible. I just want to understand: is this correct? If yes, is there an alternate way of achieving this sort of concurrency, and if not, how can I increase the thread pool size in the XML configuration? Many thanks in advance.
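For what it's worth, the Quartz worker pool size can be raised either in quartz.properties via `org.quartz.threadPool.threadCount`, or when using Spring's `SchedulerFactoryBean` in XML by passing that same property through (a sketch; the bean id and the value 10 are illustrative):

```xml
<bean id="schedulerFactory"
      class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="quartzProperties">
        <props>
            <!-- number of Quartz worker threads available to fire triggers -->
            <prop key="org.quartz.threadPool.threadCount">10</prop>
        </props>
    </property>
</bean>
```

With only one worker thread, triggers that share a fire time are executed one after another; more threads let them actually fire concurrently.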
