Spring Batch: problems (mix data) when converting to multithread

Spring Batch: problems (mix data) when converting to multithread - multithreading

Maybe this is a recurrent question, but I need some customization with my context.
I'm using Spring Batch 3.0.1.RELEASE
I have a simple job with some steps. One step is a chunk like this:
<tasklet transaction-manager="myTransactionManager">
<batch:chunk reader="myReader" processor="myProcessor" writer="myWriter" commit-interval="${commit.interval}">
</batch:chunk>
<bean id="myProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor" scope="step">
<property name="delegates">
<list>
<bean class="...MyFirstProcessor">
</bean>
<bean class="...MySecondProcessor">
</bean>
</list>
</property>
Reader: JdbcCursorItemReader
Processor: CompositeProcessor with my delegates
Writer: CompositeWriter with my delegates
With this configuration, my job works perfectly.
Now, I want to convert this to a multi-threaded job.
Following the documentation to basic multi-thread jobs, I included a SympleAsyncTaskExecutor in the tasklet, but it failed.
I have readed JdbcCursorItemReader does not work properly with multi-thread execution (is it right?). I have changed the reader to a JdbcPagingItemReader, and it has been a nightmare: job does not fail, writing process are ok, but data has mixed among the threads, and customer data were not right and coherent (customers have got services, addreses, etc. from others).
So, why does it happen? How could I change to a multi-thread job?
Are the composite processor and writer right for multithread?
How could I make a custom thread-safe composite processor?
Maybe could it be the JDBC reader: Is there any thread-safe JDBC reader for multi-thread?
I'm very locked and confused with this, so any help would be very appreciated.
Thanks a lot.
[EDIT - SOLVED]
Well, the right and suitable fix to my issue is to design the job for multithread and thread-safe execution from the beggining. It's habitual to practice first with one-thread step execution, to understand and know Spring Batch concepts; but if you consider you are leaving this phase behind, considerations like immutable objects, thread-safe list, maps, etc... must raise.
And the current fix in the current state of my issue has been the next I describe later. After test Martin's suggestions and taking into account Michael's guidelines, I have finally fix my issue as good as I could. The next steps aren't good practice, but I couldn't rebuild my job from the beggining:
Change itemReader to JdbcPagingItemReader with setState to false.
Change List by CopyOnWriteArrayList.
Change HashMap by ConcurrentHashMap.
In each delegated processor, get a new instance of every bean property (fortunately, there was only one injected bean) by passing the context (implements ApplicationContextAware) and getting a unique instance of the bean (configure every injected bean as scope="prototype").
So, if the delegated bean was:
<bean class="...MyProcessor">
<property name="otherBean" ref="otherBeanID" />
Change to:
<bean class="...MyProcessor">
<property name="otherBean" value="otherBeanID" />
And, inside MyProcessor, get a single instance for otherBeanID from the context; otherBeanID must be configurated with scope="protoype".
As I tell before, they're no good style, but it was my best option, and I can assert each thread has its own and different item instance and other bean instance.
It proves that some classes has not been well designed for a right multithread execution.
Martin, Michael, thanks for your support.
I hope it helps to anyone.

You have asked a lot in your question (in the future, please break this type of question up into multiple, more specific questions). However, item by item:
Is JdbcCursorItemReader thread-safe?
As the documentation states, it is not. The reason for this is that the JdbcCursorItemReader wraps a single ResultSet which is not thread safe.
Are the composite processor and writer right for multithread?
The CompositeItemProcessor provided by Spring Batch is considered thread safe as long as the delegate ItemProcessor implementations are thread safe as well. You provide no code in relation to your implementations or their configurations so I can't verify their thread safety. However, given the symptoms you are describing, my hunch is that there is some form of thread safety issues going on within your code.
You also don't identify what ItemWriter implementations or their configurations you are using so there may be thread related issues there as well.
If you update your question with more information about your implementations and configurations, we can provide more insight.
How could I make a custom thread-safe composite processor?
There are two things to consider when implementing any ItemProcessor:
Make it thread safe: Following basic thread safety rules (read the book Java Concurrency In Practice for the bible on the topic) will allow you to scale your components by just adding a task executor.
Make it idempotent: During skip/retry processing, items may be re-processed. By making your ItemProcessor implementation idempotent, this will prevent side effects from this multiple trips through a processor.
Maybe could it be the JDBC reader: Is there any thread-safe JDBC reader for multi-thread?
As you have noted, the JdbcPaginingItemReader is thread safe and noted as such in the documentation. When using multiple threads, each chunk is executed in it's own thread. If you've configured the page size to match the commit-interval, that means each page is processed in the same thread.
Other options for scaling a single step
While you went down the path of implementing a single, multi-threaded step, there may be better options. Spring Batch provides 5 core scaling options:
Multithreaded step - As you are trying right now.
Parallel Steps - Using Spring Batch's split functionality you can execute multiple steps in parallel. Given that you're working with composite ItemProcessor and composite ItemWriters in the same step, this may be something to explore (breaking your current composite scenarios into multiple, parallel steps).
Async ItemProcessor/ItemWriters - This option allows you to execute the processor logic in a different thread. The processor spins the thread off and returns a Future to the AsyncItemWriter which will block until the Future returns to be written.
Partitioning - This is the division of the data into blocks called partitions that are processed in parallel by child steps. Each partition is processed by an actual, independent step so using step scoped components can prevent thread safety issues (each step gets it's own instance). Partition processing can be preformed either locally via threads or remotely across multiple JVMs.
Remote Chunking - This option farms the processor logic out to other JVM processes. It really should only be used if the ItemProcessor logic is the bottle neck in the flow.
You can read about all of these options in the documentation for Spring Batch here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
Thread safety is a complex problem. Just adding multiple threads to code that used to work in a single threaded environment will typically uncover issues in your code.

Related

Which reader, writer, processor should I use for restartable multithreaded spring batch?

I want to create multi threaded step in my restartable job. As I saw many of Spring Batch readers are not thread-safe.
I have some questions related with that;
Is there any readers/writer/processor that can i use for restartable multi threaded step ?
If not, how can I do this without adding processors or status column to table ? Because our all tables are stable and can not be change for this.
I'm doing some researches but I want to ask here also.

Restartability is not possible with a multi-threaded step, because threads can (and will) override each others state in the execution context. That's why the javadoc of stateful readers mentions to disable state management in multi-threaded steps. Here is an excerpt from the JdbcPagingItemReader:
The implementation is thread-safe in between calls to open(ExecutionContext),
but remember to use saveState=false if used in a multi-threaded client
(no restart available)
Same mention in the Javadoc of JpaPagingItemReader and others. The process indicator pattern is the way to go, but you said you can't add the column in your table..
What you can probably try is create a temporary table with the required data + column (in a first step) and remove it (in a final step) once the job is successfully executed. The lifecycle of the temporary table will be the same as the job instance (ie it can survive a job execution failure and be available for the next run). The successful job execution would remove it (this can be done automatically in a final step of the job to avoid manual house-keeping of temporary tables in your db).

Replacing bad performing workers in pool

I have a set of actors that are somewhat stateless and perform similar tasks.
Each of these workers is unreliable and potentially low performing. In my design- I can easily spawn more actors to replace lazy ones.
The performance of an actor is assessed by itself. Is there a way to make the supervisor/actor pool do this assessment, to help decide which workers are slow enough for me to replace? Or is my current strategy "the" right strategy?

I'm new to akka myself, so only trying to help, but my attack would be something along the following lines:
Write your own routing logic, something along the following lines https://github.com/akka/akka/blob/v2.3.5/akka-actor/src/main/scala/akka/routing/SmallestMailbox.scala Keep in mind that a new instance is created for every pool, so each instance can store information about how many messages have been processed by each actor so far. In this instance, once you find an actor underperforming, mark it as 'removable' (once it is no longer processing any new messages) in a separate data structure and stop sending further messages.
Write your own router pool: override createRouterActor https://github.com/akka/akka/blob/v2.3.5/akka-actor/src/main/scala/akka/routing/RouterConfig.scala:236 to provide your own CustomRouterPoolActor
Write your CustomRouterPoolActor along the following lines: https://github.com/akka/akka/blob/8485cd2ebb46d2fba851c41c03e34436e498c005/akka-actor/src/main/scala/akka/routing/Resizer.scala (See ResizablePoolActor). This actor will have access to your strategy instance. From this strategy instance- remove the routees already marked for removal. Look at ResizablePoolCell to see how to remove actors.

Question is - why some of your workers perform badly? Is there anything difference between them (I assume not). If not, that maybe some payloads simply require more work the the others - what's the point of terminating them then?
Once we had similar problem - and used SmallestMailboxRoutingLogic. It basically try to distribute the workload based on mailbox sizes.
Anyway, I would rather try to answer the question - why some of the workers are unstable and perform poorly - because this looks like a biggest problem you are just trying to cover elsewhere.

Are there greenDAO thread safety best practices?

I'm having a go with greenDAO and so far it's going pretty well. One thing that doesn't seem to be covered by the docs or website (or anywhere :( ) is how it handles thread safety.
I know the basics mentioned elsewhere, like "use a single dao session" (general practice for Android + SQLite), and I understand the Java memory model quite well. The library internals even appear threadsafe, or at least built with that intention. But nothing I've seen covers this:
greenDAO caches entities by default. This is excellent for a completely single-threaded program - transparent and a massive performance boost for most uses. But if I e.g. loadAll() and then modify one of the elements, I'm modifying the same object globally across my app. If I'm using it on the main thread (e.g. for display), and updating the DB on a background thread (as is right and proper), there are obvious threading problems unless extra care is taken.
Does greenDAO do anything "under the hood" to protect against common application-level threading problems? For example, modifying a cached entity in the UI thread while saving it in a background thread (better hope they don't interleave! especially when modifying a list!)? Are there any "best practices" to protect against them, beyond general thread safety concerns (i.e. something that greenDAO expects and works well with)? Or is the whole cache fatally flawed from a multithreaded-application safety standpoint?

I've no experience with greenDAO but the documentation here:
http://greendao-orm.com/documentation/queries/
Says:
If you use queries in multiple threads, you must call forCurrentThread() on the query to get a Query instance for the current thread. Starting with greenDAO 1.3, object instances of Query are bound to their owning thread that build the query. This lets you safely set parameters on the Query object while other threads cannot interfere. If other threads try to set parameters on the query or execute the query bound to another thread, an exception will be thrown. Like this, you don’t need a synchronized statement. In fact you should avoid locking because this may lead to deadlocks if concurrent transactions use the same Query object.
To avoid those potential deadlocks completely, greenDAO 1.3 introduced the method forCurrentThread(). This will return a thread-local instance of the Query, which is safe to use in the current thread. Every time, forCurrentThread() is called, the parameters are set to the initial parameters at the time the query was built using its builder.
While so far as I can see the documentation doesn't explicitly say anything about multi threading other than this this seems pretty clear that it is handled. This is talking about multiple threads using the same Query object, so clearly multiple threads can access the same database. Certainly it's normal for databases and DAO to handle concurrent access and there are a lot of proven techniques for working with caches in this situation.

By default GreenDAO caches and returns cached entity instances to improve performance. To prevent this behaviour, you need to call:
daoSession.clear()
to clear all cached instances. Alternatively you can call:
objectDao.detachAll()
to clear cached instances only for the specific DAO object.
You will need to call these methods every time you want to clear the cached instances, so if you want to disable all caching, I recommend calling them in your Session or DAO accessor methods.
Documentation:
http://greenrobot.org/greendao/documentation/sessions/#Clear_the_identity_scope
Discussion: https://github.com/greenrobot/greenDAO/issues/776

Lucene NIOFSDirectory and SimpleFSDirectory with multiple threads

My basic question is: what's the proper way to create/use instances of NIOFSDirectory and SimpleFSDirectory when there's multiple threads that need to make queries (reads) on the same index. More to the point: should an instance of the XXXFSDirectory be created for each thread that needs to do a query and retrieve some results (and then in the same thread have it closed immediatelly after), or should I make a "global" (singleton?) instance which is passed to all threads and then they all use it at the same time (and it's no longer up to each thread to close it when it's done with a query)?
Here's more details:
I've read the docs on both NIOFSDirectory and SimpleFSDirectory and what I got is:
they both support multithreading:
NIOFSDirectory : "An FSDirectory implementation that uses java.nio's FileChannel's positional read, which allows multiple threads to read from the same file without synchronizing."
SimpleFSDirectory : "A straightforward implementation of FSDirectory using java.io.RandomAccessFile. However, this class has poor concurrent performance (multiple threads will bottleneck) as it synchronizes when multiple threads read from the same file. It's usually better to use NIOFSDirectory or MMapDirectory instead."
NIOFSDirectory is better suited (basically, faster) than SimpleFSDirectory in a multi threaded context (see above)
NIOFSDIrectory does not work well on Windows. On Windows SimpleFSDirectory is recomended. However on *nix OS NIOFSDIrectory works fine, and due to better performance when multi threading, it's recommended over SimpleFSDirectory.
"NOTE: NIOFSDirectory is not recommended on Windows because of a bug in how FileChannel.read is implemented in Sun's JRE. Inside of the implementation the position is apparently synchronized."
The reason I'm asking this is that I've seen some actual projects, where the target OS is Linux, NIOFSDirectory is used to read from the index, but an instance of it is created for each request (from each thread), and once the query is done and the results returned, the thread closes that instance (only to create a new one at the next request, etc). So I was wondering if this is really a better approach than to simply have a single NIOFSDirectory instance shared by all threads, and simply have it opened when the application starts, and closed much later on when a certain (multi threaded) job is finished...
More to the point, for a web application, isn't it better to have something like a context listener which creates an instance of NIOFSDirectory , places it in to the Application Context, all Servlets share and use it, and then the same context listener closes it when the app shuts down?

Official Lucene FAQ suggests the following:
Share a single IndexSearcher across queries and across threads in your
application.
IndexSearcher requires single IndexReader and the latter can be produced with a DirectoryReader.open(Directory) which would only require a single instance of Directory.

struts action singleton

Why is the struts action class is singleton ?
Actually I am getting point that it is multithreaded. but at time when thousand of request hitting same action, and we put synchronized for preventing threading issue, then it not give good performance bcoz thread going in wait state and it take time to proced.
Is that any way to remove singleton from action class ?
for more info Please visit : http://rameshsengani.in

You are asking about why the Action class is a singleton but I think you also have some issues understanding thread safety so I will try to explain both.
First of all, a Struts Action class is not implemented as a singleton, the framework just uses one instance of it. But because only one instance is used to process all incoming requests, care must be taken not to do something with in the Action class that is not thread safe. But the thing is: by default, Struts Action classes are not thread safe.
Thread safety means that a piece of code or object can be safely used in a multi-threaded environment. An Action class can be safely used in a multi-threaded environment and you can have it used in a thousand threads at the same time with no issues... that is if you implement it correctly.
From the Action class JavaDoc:
Actions must be programmed in a thread-safe manner, because the controller will share the same instance for multiple simultaneous requests. This means you should design with the following items in mind:
Instance and static variables MUST NOT be used to store information related to the state of a particular request. They MAY be used to share global resources across requests for the same action.
Access to other resources (JavaBeans, session variables, etc.) MUST be synchronized if those resources require protection. (Generally, however, resource classes should be designed to provide their own protection where necessary.
You use the Struts Action by deriving it and creating your own. When you do that, you have to take care to respect the rules above. That means something like this is a NO-NO:
public class MyAction extends Action {
private Object someInstanceField;
public ActionForward execute(...) {
// modify someInstanceField here without proper synchronization ->> BAD
}
}
You don't need to synchronize Action classes unless you did something wrong with them like in the code above. The thing is that the entry point of execution into your action is the execute method.
This method receives all it needs as parameters. You can have a thousand threads executed at the same time in the execute method with no issues because each thread has its own execution stack for the method call but not for data that is in the heap (like someInstanceField) which is shared between all threads.
Without proper synchronization when modifying someInstanceField all threads will do as they please with it.
So yes, Struts 1 Action classes are not thread safe but this is in the sense that you can't safely store state in them (i.e.make them statefulf) or if you do it must be properly synchronized.
But if you keep your Action class implementation stateless you are OK, no synchronization needed and threads don't wait for one another.
Why is the struts action class is singleton ?
It's by design. Again the JavaDoc explains it:
An Action is an adapter between the contents of an incoming HTTP request and the corresponding business logic that should be executed to process this request
The request parameters are tied to the web tier and you don't want to send that type of data into your business logic classes because that will create a tight coupling
between the two layers which will then make it impossible to easily reuse your business layer.
Because transforming web objects into model objects (and I don't mean the ActionForm beans here) should be the main purpose of Action classes, they don't need to maintain any state (and shoudn't) and also, there is no reason to have more instances of these guys, all doing the same thing. Just one will do.
If you want you can safely maintain state in your model by persisting info to a database for example, or you can maintain web state by using the http session. It is wrong to maintain state in the Action classes as this introduces the need for syncronisation as explained above.
Is there a way to remove singleton from action class?
I guess you could extend Struts and override the default behavior of RequestProcessor.processActionCreate to create yourself an Action per request
but that means adding another layer on top of Struts to change its "normal" behavior. I've already seen stuff like this go bad in a few applications so I would not recommend it.
My suggestion is to keep your Action classes stateless and go for the single instance that is created for it.
If your app is new and you absolutely need statefull Actions, I guess you could go for Struts 2 (they changed the design there and the Action instances are now one per request).
But Struts 2 is very different from Struts 1 so if you app is old it might be difficult to migrate to Struts 2.
Hope this makes it clear now.

This has changed in Struts2 http://struts.apache.org/release/2.1.x/docs/comparing-struts-1-and-2.html
*Struts 2 Action objects are instantiated for each request, so there are no thread-safety issues. (In practice, servlet containers generate many throw-away objects per request, and one more object does not impose a performance penalty or impact garbage collection.) *

I don't know much about struts, but it appears that this changed in Struts 2, so perhaps you should switch to Struts 2?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string