Writing to the StepExecutionContext in multi-threaded steps in Spring Batch

I am using Spring Batch and I've created a tasklet that is run by a SimpleAsyncTaskExecutor. In this step, I am retrieving the StepExecutionContext with
@BeforeStep
public void saveStepExecution(StepExecution stepExecution) {
    this.stepExecution = stepExecution;
}
In the processing method of the tasklet, I try to update the context:
stepExecution.getExecutionContext().put("info", contextInfo);
This leads to ConcurrentModificationExceptions on the stepExecution.
How can I avoid these and update my context in this multi-threaded environment?

The step execution context is a shared resource. Are you really trying to put one "info" per thread? Depending on your context there are many ways to solve this, since it is a threading issue, not a Spring Batch one.
1) If there is one info per thread, have each thread put a ThreadLocal in the context (once), and then use that ThreadLocal to store its "info".
2) If the context info is global, do the put in a synchronized block and check for the key's existence before putting, as sketched below.
Hope this helps.
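A minimal sketch of option 2, assuming the stepExecution field captured by the @BeforeStep method above and the contextInfo value from the question:

import org.springframework.batch.item.ExecutionContext;

// Inside the tasklet's processing method:
ExecutionContext context = stepExecution.getExecutionContext();
// All threads share this ExecutionContext, so guard the check-then-put with a
// single lock; otherwise two threads can interleave between the containsKey()
// check and the put().
synchronized (context) {
    if (!context.containsKey("info")) {
        context.put("info", contextInfo);
    }
}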

Related

Spring Batch thread-safe ItemReader (process indicator pattern)

I've already implemented remote chunking using AMQP (RabbitMQ). Now I need to run parallel jobs from within a web container.
My simple controller (testJob uses remote chunking):
@Controller
public class JobController {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job testJob;

    @RequestMapping("/job/test")
    public void test() {
        JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
        jobParametersBuilder.addDate("date", new Date());
        try {
            jobLauncher.run(testJob, jobParametersBuilder.toJobParameters());
        } catch (JobExecutionAlreadyRunningException | JobRestartException
                | JobParametersInvalidException | JobInstanceAlreadyCompleteException e) {
            e.printStackTrace();
        }
    }
}
testJob reads data from the filesystem (master chunk) and sends it to the remote chunk (slave chunk). The problem is that the ItemReader is not thread-safe.
There are some practical limitations of using multi-threaded Steps for some common Batch use cases. Many participants in a Step (e.g. readers and writers) are stateful, and if the state is not segregated by thread, then those components are not usable in a multi-threaded Step. In particular most of the off-the-shelf readers and writers from Spring Batch are not designed for multi-threaded use. It is, however, possible to work with stateless or thread safe readers and writers, and there is a sample (parallelJob) in the Spring Batch Samples that shows the use of a process indicator (see Section 6.12, “Preventing State Persistence”) to keep track of items that have been processed in a database input table.
I'm looking at the parallelJob sample in the Spring Batch GitHub repository:
https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/StagingItemReader.java
I'm a bit confused about the process indicator pattern. Where can I find more detailed information about this pattern?
If all you're concerned with is that the ItemReader instance would be shared across job invocations, you can declare the ItemReader as step scoped and you'll get a new instance per invocation, which removes the threading concerns.
But to answer your direct question about the process indicator pattern, I'm not sure where good documentation on it by itself is. There is a sample of its implementation in the Spring Batch Samples (the parallelJob uses it).
The idea behind it is that you give the records you are going to process a status flag. At the beginning of the job/step you mark those records as in process. As each record is committed, you mark it as processed. This removes the need to track the state in the reader, since your state is actually in the db (your query only looks for records marked as in process).
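As a rough sketch of that idea (this is not the actual sample code; the STAGING table and its PROCESSED flag are made-up names), the writer side could flip the flag inside the chunk transaction:

import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

public class ProcessedFlagItemWriter implements ItemWriter<Long> {

    private final JdbcTemplate jdbcTemplate;

    public ProcessedFlagItemWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends Long> ids) {
        // Runs in the chunk's transaction: if the chunk rolls back, the flag
        // flips back too, so a restart simply re-selects the unprocessed rows
        // (e.g. SELECT ID FROM STAGING WHERE PROCESSED = 'N').
        for (Long id : ids) {
            jdbcTemplate.update("UPDATE STAGING SET PROCESSED = 'Y' WHERE ID = ?", id);
        }
    }
}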

<Spring Batch> Why does making the ItemReader thread-safe lead to losing restartability?

I have a multi-threaded batch job reading from a DB and I am concerned about different threads re-reading records, as the ItemReader is not thread-safe in Spring Batch. I went through the Spring Batch FAQ section, which states that
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why I would lose restartability in this case. What does restartability have to do with synchronizing my read operations? It can always try again, right?
Also, will this piece of code be enough for synchronizing the reader?
public class SynchronizedItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    public synchronized T read() throws Exception {
        return delegate.read();
    }
}
When using an ItemReader with multiple threads, the lack of restartability is not about the read itself. It's about saving the state of the reader, which occurs in the update() method. The issue is that there needs to be coordination between the calls to read() - the method providing the data - and update() - the method persisting the state. When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take for example the FlatFileItemReader using a chunk size of 5 and running on multiple threads. You could have thread 1 having read 5 items (time to update), yet thread 2 could have read an additional 3. This means that the call to update would save that 8 items have been read. If the chunk on thread 2 fails, the state would be incorrect and the restart would miss the three items that were already read.
This is not to say that it is impossible to write a thread-safe ItemReader. However, as your example above illustrates, if delegate is a stateful ItemReader (one that implements ItemStream as well), the state will not be persisted correctly with calls to update() (in fact, your example above doesn't even take the ItemStream aspect of stateful readers into account).
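For completeness, a sketch of a delegating wrapper that also propagates the ItemStream callbacks might look like the following (Spring Batch later shipped a similar SynchronizedItemStreamReader). Per the explanation above, synchronizing read() and update() still cannot guarantee that the saved state matches the items each thread has actually processed:

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;

public class SynchronizedItemStreamReader<T> implements ItemReader<T>, ItemStream {

    private final ItemReader<T> delegate;

    public SynchronizedItemStreamReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        if (delegate instanceof ItemStream) {
            ((ItemStream) delegate).open(executionContext);
        }
    }

    @Override
    public synchronized void update(ExecutionContext executionContext) throws ItemStreamException {
        // Persists whatever the delegate currently believes its state is,
        // which is exactly the coordination problem described above.
        if (delegate instanceof ItemStream) {
            ((ItemStream) delegate).update(executionContext);
        }
    }

    @Override
    public void close() throws ItemStreamException {
        if (delegate instanceof ItemStream) {
            ((ItemStream) delegate).close();
        }
    }
}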
If you want to make your job restartable with parallel processing of items, you can persist each item the reader reads, together with its processing state, yourself.

Distributed\Parallel computing using app-engine (java api)

I want to use the master-slave (worker) paradigm to solve a problem. I have read that opening new threads manually (for example using a thread pool) is not available and that I need to use a queue. Attached code example:
class MyDeferred implements DeferredTask {
    @Override
    public void run() {
        // Do something interesting
    }
}

MyDeferred task = new MyDeferred();
// Set instance variables etc. as you wish
Queue queue = QueueFactory.getDefaultQueue();
queue.add(withPayload(task));
How can I get the results of the workers (which were added to the queue)?
I need this info in order to solve the bigger problem.
Actually you can use threads on GAE, but there are limitations. If you need long-running threads you can use background threads, but this requires you to use backend instances.
If you opt to use the task queue, then keep in mind that tasks do not "return" to the caller. To aggregate results you'll need to use the datastore.
You will have to write the results into the datastore.
Just as a starting point to think about it, you might pass a JobId as a parameter to the tasks, have each task write an entity with the result and the JobId, and then later query the datastore for the given JobId to get all the results.
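A rough sketch of that JobId idea against the low-level datastore API (the "JobResult" kind and its property names are invented for illustration):

import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.taskqueue.DeferredTask;

public class WorkerTask implements DeferredTask {

    private final String jobId;
    private final long shard;

    public WorkerTask(String jobId, long shard) {
        this.jobId = jobId;
        this.shard = shard;
    }

    @Override
    public void run() {
        long result = compute(); // the actual work for this shard
        // Persist the partial result keyed by the job it belongs to.
        Entity entity = new Entity("JobResult");
        entity.setProperty("jobId", jobId);
        entity.setProperty("result", result);
        DatastoreServiceFactory.getDatastoreService().put(entity);
    }

    private long compute() {
        return shard * shard; // placeholder computation
    }
}

Once all tasks have run, the aggregating code queries the "JobResult" kind filtered by jobId and combines the partial results.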

Best NHibernate multithreading pattern?

As we know, NHibernate sessions are not thread-safe. But we have a code path split into several long-running threads, all using objects loaded in the initial thread.
using (var session = factory.OpenSession())
{
    var parent = session.Get<T>(parentId);
    DoSthWithParent(session, parent);
    foreach (var child in parent.children)
    {
        parallelThreadMethodLongRunning.BeginInvoke(session, child);
        //[Thread #1] DoSthWithChild(child #1) -> SaveOrUpdate(child #1) + Flush()
        //[Thread #2] DoSthWithChild(child #2) -> SaveOrUpdate(child #2) + Flush()
        //[Thread #3] DoSthWithChild(child #3) -> SaveOrUpdate(child #3) + Flush()
        // -> etc... changes to be persisted immediately, not all at the end.
        EndInvoke();
    }
    DoFinalChangesOnParentAndChildren(parent);
    session.Flush();
}
One way would be a session for each thread, but that would require the parent object to be reloaded in each. Plus, the final method also makes changes to the children and would run into a StaleObjectStateException if another session had changed them in the meantime, or they would have to be evicted/reloaded.
So all threads have to use the same session. What is the best way to do this?
Use a save queue in the initial thread (a thread-safe implementation) that is polled in a loop (instead of EndInvoke()) from the main thread. Child threads can insert NHibernate objects to be saved by the main thread.
Use some callback mechanism to save/flush objects in the main thread. Is there something similar to the UI thread callback in WPF, Control.Invoke(), or BackgroundWorker?
Put Save/Flush accesses into lock(session) blocks? Maybe dangerous, because modifying the NHibernate objects might change the session, even without calling Save()/Flush().
Or should I live with the database overhead of loading the same objects in separate sessions per thread, evicting and reloading them in the main thread, and then making the changes again? [edit: a bad "solution" due to object concurrency/risk of stale objects]
Consider also that the application has a business logic layer above NHibernate, which has similar objects, but sends its property values to the NHibernate objects on its own Save() command, only then modifying them and doing the NHibernate Save()/Flush() immediately.
Edit:
It's important to note that any read operation on NHibernate objects may change the session - lazy loading, or children collection changes under certain conditions. So it is really better to have a business object layer on top which synchronizes all access to the NHibernate objects. Considering that the database operations take only a minimal share of the threads' time (mainly occasional status settings), and most of it goes to calculations, watching, web service access and the like, the performance loss from data layer synchronization is negligible.
Firstly, if I understand correctly, different threads may be updating the same objects. In that case, NHibernate or not, you're performing several updates on the same objects concurrently, which may lead to unexpected results.
You may want to tweak your design a bit to ensure that an object can be updated by (at most) a single thread.
Now, assuming your flow may include having the same threads reading the same data (but writing different data), I'd suggest using different sessions - one per thread - and utilizing the 2nd level cache;
the 2nd level cache is kept at the SessionFactory level (rather than in the Session), and is therefore shared by all session instances.
The session object is not thread-safe; you can't use it across different threads. The SaveOrUpdate calls in your separate threads will most likely crash your program or corrupt your database. However, what about creating the data set you want to update in the worker threads and doing the SaveOrUpdate actions in your main thread (where your session is created)?
You should observe the following practices when creating NHibernate sessions:
• Never create more than one concurrent ISession or ITransaction instance per database connection.
• Be extremely careful when creating more than one ISession per database per transaction. The ISession itself keeps track of updates made to loaded objects, so a different ISession might see stale data.
• The ISession is not threadsafe! Never access the same ISession in two concurrent threads. An ISession is usually only a single unit-of-work!

Spring @Async limit number of threads

My question is very similar to this one: @Async prevent a thread to continue until other threads have finished.
Basically I need to run hundreds of computations in multiple threads, but I want only a limited number of them running in parallel, e.g. 5 threads with 5 computations at a time.
I am using the Spring Framework and the @Async option is the natural choice. I do not need a full-featured JMS queue; that's a bit of an overhead for me.
Any ideas?
Thank you
If you are using Spring's Java configuration, your config class needs to implement AsyncConfigurer:
@Configuration
@EnableAsync
public class AppConfig implements AsyncConfigurer {

    [...]

    @Override
    public Executor getAsyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(2);
        executor.setMaxPoolSize(5);
        executor.setQueueCapacity(50);
        executor.setThreadNamePrefix("MyExecutor-");
        executor.initialize();
        return executor;
    }
}
See the @EnableAsync documentation for more details: http://docs.spring.io/spring/docs/3.1.x/javadoc-api/org/springframework/scheduling/annotation/EnableAsync.html
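As a usage sketch for the configuration above (the service class and method names are illustrative):

import java.util.concurrent.Future;

import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.AsyncResult;
import org.springframework.stereotype.Service;

@Service
public class ComputationService {

    @Async // each call is handed to the configured executor
    public Future<Integer> compute(int input) {
        int result = input * input; // stand-in for the real long-running computation
        return new AsyncResult<>(result);
    }
}

One caveat worth knowing: a ThreadPoolTaskExecutor only grows beyond its core pool size once the queue is full, so with corePoolSize=2 and queueCapacity=50 you will usually see two parallel threads; for a steady five, set the core pool size to 5 as well.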
Have you checked out Task Executor? You can define a thread pool with a maximum number of threads to execute your tasks.
If you want to use it with @Async, use this in your Spring config:
<task:annotation-driven executor="myExecutor" scheduler="myScheduler"/>
<task:executor id="myExecutor" pool-size="5"/>
<task:scheduler id="myScheduler" pool-size="10"/>
Full reference here (25.5.3). Hope this helps.
Since Spring Boot 2.1 you can use auto-configuration and change the maximum number of threads in the application properties file:
spring.task.execution.pool.max-size=4
See the full documentation:
https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#boot-features-task-execution-scheduling
