How to set skipLimit when using the Spring Batch framework to execute a step concurrently?

I use Spring Batch to load data from a file into a database. The job contains only one step. I use a ThreadPoolTaskExecutor to execute the step concurrently. The step is similar to this one:
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .chunk(10000)
            .reader(flatFileItemReader)
            .writer(jdbcBatchItemWriter)
            .faultTolerant()
            .skip(NumberFormatException.class)
            .skip(FlatFileParseException.class)
            .skipLimit(3)                 // tolerate at most 3 skippable exceptions
            .taskExecutor(taskExecutor)   // process chunks concurrently
            .throttleLimit(10)            // at most 10 concurrent threads
            .build();
}
There are 3 NumberFormatException errors in my file, so I set skipLimit to 3. But when I execute the job, it starts 10 threads and each thread gets 3 skips, so I can end up with 3 * 10 = 30 skips in total, while I only need 3.
So the question is: will this cause any problems? And is there any other way to skip exactly 3 times while executing a step concurrently?

GitHub issue
Robert Kasanicky opened BATCH-926 and commented
When chunks execute concurrently, each chunk works with its own local copy of the skipCount (from StepContribution). Given skipLimit=1 and 10 chunks executing concurrently, we can end up with a successful job execution and 10 skips. As the number of concurrent chunks increases, the skipLimit becomes more a limit per chunk than per job.
Dave Syer commented
I vote for documenting the current behaviour.
Robert Kasanicky commented
documented current behavior in the factory bean
However, this appears to describe the behavior of a very old version of Spring Batch.
I have a different problem in my code, but the skipLimit now seems to be aggregated when I use multiple threads, although the job sometimes hangs without failing properly when a SkipLimitExceededException is thrown.
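To make the two behaviors concrete, here is a toy sketch, independent of Spring Batch internals, contrasting a per-thread skip count (the old behavior, where 10 threads can each skip 3 items) with a single shared counter that enforces the limit for the step as a whole:

import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration only -- not Spring Batch code.
public class SkipCountDemo {
    static final int SKIP_LIMIT = 3;
    static final AtomicInteger sharedSkips = new AtomicInteger();

    // Per-chunk check: each thread compares its own local count, so
    // 10 concurrent threads may skip up to 3 * 10 = 30 items in total.
    static boolean skipAllowedLocally(int localSkips) {
        return localSkips <= SKIP_LIMIT;
    }

    // Aggregated check: one counter shared by all threads, so the
    // fourth skip fails the step no matter which thread hits it.
    static boolean skipAllowedGlobally() {
        return sharedSkips.incrementAndGet() <= SKIP_LIMIT;
    }
}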

Related

Dataflow exceeds number_of_worker_harness_threads

I deployed a Dataflow job with the param --number_of_worker_harness_threads=5 (streaming mode).
Next I sent 20 Pub/Sub messages, each triggering the load of a big CSV file from GCS and starting processing.
In the logs I see that the job took 10 messages and processed them in parallel on 6-8 threads (I checked several times; sometimes it was 6, sometimes 8).
Either way, it was always more than 5.
Any idea how this works? It does not seem to be the expected behavior.
Judging from the flag name, you are using the Beam Python SDK.
For Python streaming, the total number of threads running DoFns on one worker VM in the current implementation may be up to the value of --number_of_worker_harness_threads times the number of SDK processes running on the worker, which by default is the number of vCPU cores. There is a way to limit the number of processes to 1 regardless of the number of vCPUs: set --experiments=no_use_multiple_sdk_containers.
For example, if you are using --machine_type=n1-standard-2 and --number_of_worker_harness_threads=5, you may have up to 10 DoFn instances running concurrently in different threads on the same machine.
If --number_of_worker_harness_threads is not specified, up to 12 threads per process are used. See also: https://cloud.google.com/dataflow/docs/resources/faq#how_many_instances_of_dofn_should_i_expect_dataflow_to_spin_up_
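Putting these flags together, a launch command might look like the following (the pipeline module name is a hypothetical placeholder for your own pipeline):

python -m my_pipeline \
    --runner=DataflowRunner \
    --streaming \
    --machine_type=n1-standard-2 \
    --number_of_worker_harness_threads=5 \
    --experiments=no_use_multiple_sdk_containers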

Spark structured streaming asynchronous batch blocking

I’m using Apache Spark Structured Streaming to read from Kafka. Sometimes my micro-batches take longer to process than the specified interval, due to heavy write I/O. I was wondering if there is an option to start the next batch before the first one has finished, but have the second batch blocked by the first?
I mean: if the first batch takes 7 seconds and the trigger is set to 5 seconds, the second batch would start at the fifth second. But if the second batch finishes first, it would be blocked so that it does not write before its predecessor (to keep the messages in the correct order).
No. The next batch only starts once the previous one has completed. I think the term you mean is trigger interval. It would become a mess otherwise.
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
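For reference, here is a minimal sketch of a processing-time trigger using the Java API (the broker address, topic, and sink paths are hypothetical placeholders); even with a 5-second trigger, the next micro-batch does not start until the previous one finishes:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaStreamJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-stream").getOrCreate();

        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
                .option("subscribe", "my-topic")                     // hypothetical topic
                .load();

        StreamingQuery query = df.writeStream()
                .format("parquet")
                .option("path", "/data/out")              // hypothetical sink path
                .option("checkpointLocation", "/data/cp") // hypothetical checkpoint dir
                // A new micro-batch is *attempted* every 5 seconds, but it
                // will not start until the previous batch has completed.
                .trigger(Trigger.ProcessingTime("5 seconds"))
                .start();

        query.awaitTermination();
    }
}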

JCL should read internal reader, then completely submit outer JCL

I have a batch job that has 10 steps. In STEP5 I have written an internal JCL (submitted via the internal reader), and I want my next step in the parent job, STEP06, to execute only after the internally submitted job has completed successfully. Could you please suggest a resolution to this problem?
For what you have described, there are 2 approaches:

1. Break your process into 3 jobs: steps 1-5 as one job, the second job consisting of the JCL submitted in step 5, and the third job consisting of step 6 (or steps 6-10; you do not make it clear whether the main JCL has 6 steps and the 'inner' JCL 4 steps, making the 10 steps you mention, or whether the main JCL has 10 steps). The execution of the 3 jobs will need to be serialised somehow.

2. Simply have the 'inner' JCL as a series of steps in the 'outer' JCL, so that you have only one job, with steps that run in order.
The typical approach to this sort of issue would be to use a scheduler to handle the three-part process as 3 jobs, the middle one perhaps not submitted by the scheduler but monitored/tracked by it.
With a good scheduler setup, there is a good chance that even if the jobs were run on different machines, or even different types of machines, they could still be tracked.
To have a single job delayed halfway through is possible, but it would require some sort of program to run in a loop (waiting between checks so as not to use excessive CPU), checking for an event: a dataset being created or deleted, the status of the submitted job itself, or numerous other signals.
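As a rough, non-mainframe-specific sketch of that polling idea (the marker file name is a hypothetical placeholder for whatever event the inner job produces), the waiting step could run a small program like this:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WaitForInnerJob {
    public static void main(String[] args) throws InterruptedException {
        Path marker = Paths.get("/tmp/inner-job.done"); // hypothetical marker written by the inner job's last step

        // Poll with a sleep so the loop does not consume excessive CPU.
        while (!Files.exists(marker)) {
            Thread.sleep(30_000); // check every 30 seconds
        }
        // Marker found: exit so the next step (STEP06) can run.
    }
}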
Another way could be to have part 1 submit a job to do part 2, and that job then submit another job as part 3.
Yet another way, perhaps frowned upon depending upon its importance, would be to have three jobs: the first submitted to run, the third submitted but on hold. The first submits the second, which releases the third.
Of course, there is also the possibility that one job could do everything as a single job.

Best practice beanstalkd (queue) and node.js

I currently run a service using beanstalkd and node.js.
When a job fails, I would like to retry it n times before giving up on it.
If the job succeeds, I want to run the same job 10 times.
So what is the best practice: store the error and success counts in MongoDB keyed by the jobId, or delete the job and put a new job with the error and success counts in the body?
I don't know if I'm being clear, so tell me. Thanks a lot.
There is a stats-job <id>\r\n command, which should also be available via your API library, that returns, among other things, how many times a specific job has been reserved, released, buried, and so on.
This allows a failed job to be retried a bounded number of times by checking its previous reservations/releases.
To run the same job multiple times, I would personally create either one additional job with a success count that is then incremented (into another new job), or all nine new jobs at once, with optional delays before they start.
You have a couple of ways to do this:

you can release the job, and obtain the number of reserves from its stats
you can put a new job with a retry count, and keep track of history in the data payload

You should do the latter, and then you don't need MongoDB as a second dependency.
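Here is a minimal sketch of that second approach, with the retry count carried in the job payload. The BeanstalkClient interface is a hypothetical stand-in for your actual client library (its put/reserve/delete methods mirror the beanstalkd protocol commands of the same names), and the payload format is an assumption:

// Hypothetical minimal abstraction over a beanstalkd client library.
interface BeanstalkClient {
    long put(long priority, int delaySeconds, int ttrSeconds, byte[] payload);
    ReservedJob reserve();
    void delete(long jobId);
}

record ReservedJob(long id, byte[] payload) {}

public class RetryWorker {
    private static final int MAX_RETRIES = 5;

    static void handleOne(BeanstalkClient client) {
        ReservedJob job = client.reserve();
        String data = new String(job.payload());      // assumed format: "<retryCount>|<body>"
        int sep = data.indexOf('|');
        int retries = Integer.parseInt(data.substring(0, sep));
        String body = data.substring(sep + 1);
        try {
            process(body);
            client.delete(job.id());                  // success: remove the job
        } catch (Exception e) {
            client.delete(job.id());                  // remove the failed attempt...
            if (retries + 1 < MAX_RETRIES) {
                // ...and re-queue a fresh job with an incremented retry count
                // and a short delay; the history lives in the payload itself.
                byte[] next = ((retries + 1) + "|" + body).getBytes();
                client.put(1024, 10, 60, next);
            }
            // else: give up after MAX_RETRIES attempts.
        }
    }

    private static void process(String body) { /* your job logic here */ }
}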

How to implement multithreading in Spring Batch

We have implemented a Spring Batch job which handles a large data set, and because of that it takes too much time to finish (around 2 to 4 hours). We have an ItemReader, ItemProcessor and ItemWriter.
We are trying to improve performance. The code itself is fine, so we are planning to introduce multithreading.
Can anybody give suggestions on how to implement multithreading in Spring Batch?
I think we should also take care of member variables which are injected as @Autowired.
Please give your suggestions.
We used an Executor to achieve this. We did the following things (a sketch of the ordering technique follows below):

Found the code, or the portion of the code, that took the most time to execute.
Analysed whether we could use multithreading there.
Used an Executor with our own framework, which makes sure that threads return results in the same order the work was inserted.
Converted all member variables that the batch execution depends on into local variables, since shared state can cause multithreading problems.

We were able to reduce the total time of all the Spring Batch jobs from 12 hours to 4 hours.
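A minimal sketch of that "ordered Executor" idea (the work items are hypothetical): chunks are processed in parallel on a thread pool, but the Futures are consumed in submission order, so results come back in the same sequence the work was inserted:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedExecutorDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> chunks = List.of("chunk-1", "chunk-2", "chunk-3", "chunk-4"); // hypothetical work items

        // Submit all chunks for parallel processing.
        List<Future<String>> results = new ArrayList<>();
        for (String chunk : chunks) {
            results.add(pool.submit(() -> process(chunk)));
        }

        // Consume the Futures in submission order: processing happens in
        // parallel, but results are handled in the order the work was inserted.
        for (Future<String> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }

    private static String process(String chunk) {
        return chunk + " processed by " + Thread.currentThread().getName();
    }
}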
