I have a Spring Batch job that takes in a flat file, processes the records, and writes the output to another flat file.
I have used FlatFileItemReader and FlatFileItemWriter as reader and writer respectively.
However, when I try to implement multi-threaded steps, my job does not work properly. I get the following warnings in my log file:
WARN ChunkMonitor:109 - No ItemReader set (must be concurrent step), so ignoring offset data.
WARN ChunkMonitor:141 - ItemStream was opened in a different thread. Restart data could be compromised.
Can you please help me implement a multi-threaded step?
This is because FlatFileItemReader is not thread-safe, as the Javadoc of its abstract superclass says:
/**
 * Abstract superclass for {@link ItemReader}s that supports restart by storing
 * item count in the {@link ExecutionContext} (therefore requires item ordering
 * to be preserved between runs).
 *
 * Subclasses are inherently *not* thread-safe.
 *
 * @author Robert Kasanicky
 */
To implement a multi-threaded reader, you will have to write a custom reader that synchronizes
the calls to open and update from the ItemStream interface. Otherwise, your job cannot be restarted safely.
Hope it helps
Regards
Can anyone explain the difference between rcu_dereference() and rcu_dereference_protected()?
rcu_dereference() contains barrier code and rcu_dereference_protected() does not.
When to use rcu_dereference() and when to use rcu_dereference_protected()?
In short:
rcu_dereference() should be used on the read side, protected by rcu_read_lock() or similar.
rcu_dereference_protected() should be used on the write side (update side), either by the single writer or under the lock that prevents several writers from concurrently modifying the dereferenced pointer. In such cases the pointer cannot be modified outside the current thread, so neither compiler nor CPU barriers are needed.
If in doubt, using rcu_dereference() is always safe, and its performance penalty (compared to rcu_dereference_protected()) is low.
Exact description for rcu_dereference_protected in the kernel 4.6:
/**
* rcu_dereference_protected() - fetch RCU pointer when updates prevented
* @p: The pointer to read, prior to dereferencing
* @c: The conditions under which the dereference will take place
*
* Return the value of the specified RCU-protected pointer, but omit
* both the smp_read_barrier_depends() and the READ_ONCE(). This
* is useful in cases where update-side locks prevent the value of the
* pointer from changing. Please note that this primitive does -not-
* prevent the compiler from repeating this reference or combining it
* with other references, so it should not be used without protection
* of appropriate locks.
*
* This function is only for update-side use. Using this function
* when protected only by rcu_read_lock() will result in infrequent
* but very ugly failures.
*/
I have already implemented Remote Chunking using AMQP (RabbitMQ). Now I need to run parallel jobs from within a web container.
My simple controller (testJob uses remote chunking):
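import java.util.Date;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.JobParametersInvalidException;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException;
import org.springframework.batch.core.repository.JobRestartException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class JobController {
    @Autowired
    private JobLauncher jobLauncher;
    @Autowired
    private Job testJob;

    @RequestMapping("/job/test")
    public void test() {
        // A fresh date parameter makes every launch a new job instance
        JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
        jobParametersBuilder.addDate("date", new Date());
        try {
            // Launch the autowired testJob
            jobLauncher.run(testJob, jobParametersBuilder.toJobParameters());
        } catch (JobExecutionAlreadyRunningException | JobRestartException | JobParametersInvalidException | JobInstanceAlreadyCompleteException e) {
            e.printStackTrace();
        }
    }
}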
testJob reads data from the filesystem (master chunk) and sends it to the remote chunk (slave chunk). The problem is that the ItemReader is not thread-safe.
There are some practical limitations of using multi-threaded Steps for some common Batch use cases. Many participants in a Step (e.g. readers and writers) are stateful, and if the state is not segregated by thread, then those components are not usable in a multi-threaded Step. In particular most of the off-the-shelf readers and writers from Spring Batch are not designed for multi-threaded use. It is, however, possible to work with stateless or thread safe readers and writers, and there is a sample (parallelJob) in the Spring Batch Samples that show the use of a process indicator (see Section 6.12, “Preventing State Persistence”) to keep track of items that have been processed in a database input table.
I have looked at the parallelJob sample in the Spring Batch GitHub repository:
https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/StagingItemReader.java
I'm a bit confused about the process indicator pattern. Where can I find more detailed information about it?
If all you're concerned with is that the ItemReader instance would be shared across job invocations, you can declare the ItemReader as step-scoped and you'll get a new instance per invocation, which removes the threading concerns.
But to answer your direct question about the process indicator pattern: I'm not sure where good documentation on it by itself is. There is a sample of its implementation in the Spring Batch Samples (the parallel job uses it).
The idea behind it is that you provide a status to the records you are going to process. At the beginning of the job/step you mark those records as in process. As the records are committed, you mark them as processed. This removes the need to track the state in the reader since your state is actually in the db (your query only looks for records marked as in process).
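To make the idea concrete, here is a minimal sketch of the status bookkeeping. It is not the Spring Batch sample: the class name, the in-memory map standing in for the database table, and the run/pending/commit methods are all made up for illustration. It only shows that a restart needs no reader-side offset, because the query for records still flagged as in process finds the remaining work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ProcessIndicatorSketch {
    enum Status { NEW, IN_PROCESS, PROCESSED }

    // Hypothetical stand-in for a database table: record id -> status column.
    private final Map<Integer, Status> table = new TreeMap<>();

    ProcessIndicatorSketch(int rows) {
        for (int id = 0; id < rows; id++) table.put(id, Status.NEW);
    }

    // Start of the step: flag every new record as in process.
    synchronized void markNewAsInProcess() {
        table.replaceAll((id, s) -> s == Status.NEW ? Status.IN_PROCESS : s);
    }

    // The "query": only records still flagged as in process, in id order.
    synchronized List<Integer> pending() {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Status> e : table.entrySet()) {
            if (e.getValue() == Status.IN_PROCESS) ids.add(e.getKey());
        }
        return ids;
    }

    // Chunk commit: flag the records of a finished chunk as processed.
    synchronized void commit(List<Integer> chunk) {
        for (Integer id : chunk) table.put(id, Status.PROCESSED);
    }

    // Processes pending records in chunks; stops after maxChunks to imitate a failure.
    int run(int chunkSize, int maxChunks) {
        markNewAsInProcess();
        int chunks = 0;
        int processed = 0;
        while (chunks < maxChunks) {
            List<Integer> todo = pending();
            if (todo.isEmpty()) break;
            List<Integer> chunk =
                    new ArrayList<>(todo.subList(0, Math.min(chunkSize, todo.size())));
            commit(chunk);
            processed += chunk.size();
            chunks++;
        }
        return processed;
    }

    public static void main(String[] args) {
        ProcessIndicatorSketch job = new ProcessIndicatorSketch(10);
        job.run(3, 2);                      // first run "fails" after two chunks
        System.out.println(job.pending());  // records left for the restart
        job.run(3, Integer.MAX_VALUE);      // restart picks up only the pending records
        System.out.println(job.pending());
    }
}
```

In a real job the status column lives in the input table and the commit happens in the same transaction as the chunk write, so the flag and the output cannot drift apart.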
I have a multi-threaded batch job reading from a DB, and I am concerned about different threads re-reading records, as ItemReader is not thread-safe in Spring Batch. I went through the Spring Batch FAQ section, which states that
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why I will lose restartability in this case. What has restartability got to do with synchronizing my read operations? It can always try again, right?
Also, will this piece of code be enough for synchronizing the reader?
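import org.springframework.batch.item.ItemReader;

public class SynchronizedItemReader<T> implements ItemReader<T> {
    private final ItemReader<T> delegate;

    public SynchronizedItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    // Only one thread at a time may pull an item from the delegate
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}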
When using an ItemReader with multiple threads, the lack of restartability is not about the read itself. It's about saving the state of the reader, which occurs in the update method. The issue is that there needs to be coordination between the calls to read() (the method providing the data) and update() (the method persisting the state). When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take for example the FlatFileItemReader using a chunk size of 5 and running on multiple threads. You could have thread 1 having read 5 items (time to update), yet thread 2 could have read an additional 3. This means that the call to update would save that 8 items have been read. If the chunk on thread 2 fails, the state would be incorrect and the restart would miss the three items that were already read.
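The interleaving described above can be replayed in a small sketch. CountingReader is a made-up stand-in that only counts items, not a Spring Batch class, and the two "threads" are replayed sequentially so the outcome is deterministic.

```java
class SharedReaderStateSketch {
    // Hypothetical stand-in for a stateful reader: it only counts items read.
    static final class CountingReader {
        private int itemsRead = 0;   // shared by every thread that calls read()
        private int savedCount = 0;  // what update() would store for a restart

        synchronized int read() { return itemsRead++; }
        synchronized void update() { savedCount = itemsRead; }
        synchronized int savedCount() { return savedCount; }
    }

    // Replays the interleaving from the text sequentially.
    static int simulate() {
        CountingReader reader = new CountingReader();
        for (int i = 0; i < 5; i++) reader.read(); // thread 1 reads its chunk of 5
        for (int i = 0; i < 3; i++) reader.read(); // thread 2 has read 3 of its chunk
        reader.update();                           // thread 1 commits and saves state
        return reader.savedCount();                // 8, although only 5 items are committed
    }

    public static void main(String[] args) {
        int saved = simulate();
        int committed = 5;
        System.out.println("saved=" + saved + ", committed=" + committed
                + ", skipped on restart=" + (saved - committed));
    }
}
```

Note that read() and update() are both synchronized here and the saved state is still wrong: the problem is the shared counter, not a data race.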
This is not to say that it is impossible to write a thread-safe ItemReader. However, as your example above illustrates, if the delegate is a stateful ItemReader (one that implements ItemStream as well), the state will not be persisted correctly with calls to update (in fact, your example above doesn't even take the ItemStream aspect of stateful readers into account).
If you want your job to be restartable while items are processed in parallel, you can persist each item the reader has read, together with that item's state, yourself.
I have 50,000 tasks and want to execute them with 10 threads.
In Java I would create a thread pool with Executors.newFixedThreadPool(10), pass the runnables to it, and then wait for all of them to finish. Scala, as I understand it, is especially useful for this kind of task, but I can't find a solution in the docs.
Scala 2.9.3 and later
The simplest approach is to use the scala.concurrent.Future class and associated infrastructure. The scala.concurrent.future method asynchronously evaluates the block passed to it and immediately returns a Future[A] representing the asynchronous computation. Futures can be manipulated in a number of non-blocking ways, including mapping, flatMapping, filtering, recovering errors, etc.
For example, here's a sample that creates 10 tasks, where each task sleeps for an arbitrary amount of time and then returns the square of the value passed to it.
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Start 10 asynchronous tasks; each one sleeps, then yields i squared
val tasks: Seq[Future[Int]] = for (i <- 1 to 10) yield future {
  println("Executing task " + i)
  Thread.sleep(i * 1000L)
  i * i
}
// Turn Seq[Future[Int]] into Future[Seq[Int]], then block for the result
val aggregated: Future[Seq[Int]] = Future.sequence(tasks)
val squares: Seq[Int] = Await.result(aggregated, 15.seconds)
println("Squares: " + squares)
In this example, we first create a sequence of individual asynchronous tasks that, when complete, provide an Int. We then use Future.sequence to combine those async tasks into a single async task, swapping the position of the Future and the Seq in the type. Finally, we block the current thread for up to 15 seconds while waiting for the result. In the example, we use the global execution context, which is backed by a fork/join thread pool. For non-trivial examples, you would probably want to use an application-specific ExecutionContext.
Generally, blocking should be avoided when at all possible. There are other combinators available on the Future class that can help program in an asynchronous style, including onSuccess, onFailure, and onComplete.
Also, consider investigating the Akka library, which provides actor-based concurrency for Scala and Java, and interoperates with scala.concurrent.
Scala 2.9.2 and prior
The simplest approach is to use Scala's Future class, which is a sub-component of the Actors framework. The scala.actors.Futures.future method creates a Future for the block passed to it. You can then use scala.actors.Futures.awaitAll to wait for all tasks to complete.
For example, here's a sample that creates 10 tasks, where each task sleeps for an arbitrary amount of time and then returns the square of the value passed to it.
import scala.actors.Futures._
val tasks = for (i <- 1 to 10) yield future {
println("Executing task " + i)
Thread.sleep(i * 1000L)
i * i
}
val squares = awaitAll(20000L, tasks: _*)
println("Squares: " + squares)
You want to look at either the Scala actors library or Akka. Akka has cleaner syntax, but either will do the trick.
So it sounds like you need to create a pool of actors that know how to process your tasks. An Actor can basically be any class with a receive method - from the Akka tutorial (http://doc.akkasource.org/tutorial-chat-server-scala):
class MyActor extends Actor {
  def receive = {
    case "test" => println("received test")
    case _ => println("received unknown message")
  }
}
val myActor = Actor.actorOf[MyActor]
myActor.start
You'll want to create a pool of actor instances and fire your tasks to them as messages. Here's a post on Akka actor pooling that may be helpful: http://vasilrem.com/blog/software-development/flexible-load-balancing-with-akka-in-scala/
In your case, one actor per task may be appropriate (actors are extremely lightweight compared to threads so you can have a LOT of them in a single VM), or you might need some more sophisticated load balancing between them.
EDIT:
Using the example actor above, sending it a message is as easy as this:
myActor ! "test"
The actor will then output "received test" to standard output.
Messages can be of any type, and when combined with Scala's pattern matching, you have a powerful pattern for building flexible concurrent applications.
In general Akka actors will "do the right thing" in terms of thread sharing, and for the OP's needs, I imagine the defaults are fine. But if you need to, you can set the dispatcher the actor should use to one of several types:
* Thread-based
* Event-based
* Work-stealing
* HawtDispatch-based event-driven
It's trivial to set a dispatcher for an actor:
class MyActor extends Actor {
self.dispatcher = Dispatchers.newExecutorBasedEventDrivenDispatcher("thread-pool-dispatch")
.withNewThreadPoolWithBoundedBlockingQueue(100)
.setCorePoolSize(10)
.setMaxPoolSize(10)
.setKeepAliveTimeInMillis(10000)
.build
}
See http://doc.akkasource.org/dispatchers-scala
In this way, you could limit the thread pool size, but again, the original use case could probably be satisfied with 50K Akka actor instances using default dispatchers and it would parallelize nicely.
This really only scratches the surface of what Akka can do. It brings a lot of what Erlang offers to the Scala language. Actors can monitor other actors and restart them, creating self-healing applications. Akka also provides Software Transactional Memory and many other features. It's arguably the "killer app" or "killer framework" for Scala.
If you want to "execute them with 10 threads", then use threads. Scala's actor model, which is usually what people are speaking of when they say Scala is good for concurrency, hides such details so you won't see them.
With actors or futures, if all you have are simple computations, you'd just create 50000 of them and let them run. You might run into issues, but they are of a different nature.
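If you really do want explicit control over the ten threads, plain java.util.concurrent is enough, and it is usable from Scala too. The following is only a sketch: the class name FixedPoolSketch and the trivial task body are placeholders for your real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedPoolSketch {
    // Runs numTasks trivial tasks on a pool of numThreads threads and sums the results.
    static long runAll(int numTasks, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 1; i <= numTasks; i++) {
                final int n = i;
                tasks.add(() -> n); // placeholder for the real work
            }
            long sum = 0;
            // invokeAll blocks until every task has completed
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("task execution failed", e);
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(50_000, 10)); // sum of 1..50000
    }
}
```

The pool queues the 50,000 tasks and never runs more than ten at a time, which is the behaviour the question asks for.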
Here's another answer similar to mpilquist's response but without deprecated API and including the thread settings via a custom ExecutionContext:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Await, Future}
import scala.concurrent.duration._
val numJobs = 50000
val numThreads = 10
// customize the execution context to use the specified number of threads
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numThreads))
// define the tasks
val tasks = for (i <- 1 to numJobs) yield Future {
// do something more fancy here
i
}
// aggregate and wait for final result
val aggregated = Future.sequence(tasks)
val oneToNSum = Await.result(aggregated, 15.seconds).sum
What I need is a system on which I can define simple objects (say, a "Server" that can have "Operating System" and "Version" fields, alongside other metadata such as IP and MAC address).
I'd like to be able to request objects from the system in a safe way, such that if I define that a "Server", for example, can be used by 3 clients concurrently, then if 4 clients ask for a Server at the same time, one will have to wait until the server is freed.
Furthermore, I need to be able to perform requests in some sort of query style, for example allocate(type=System, os='Linux', version=2.6).
Language doesn't matter too much, but Python is an advantage.
I've been googling for something like this for the past few days and came up with nothing, maybe there's a better name for this kind of system that I'm not aware of.
Any recommendations?
Thanks!
Resource limitation in concurrent applications - like your "up to 3 clients" example - is typically implemented by using semaphores (or more precisely, counting semaphores).
You usually initialize a semaphore with some "count" - that's the maximum number of concurrent accesses to that resource - and you decrement this counter every time a client starts using that resource and increment it when a client finishes using it. The implementation of semaphores guarantees the "increment" and "decrement" operations will be atomic.
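As a rough illustration of the counting behaviour, here is a sketch using java.util.concurrent.Semaphore with three permits. The class ServerPermitsSketch and its method names are invented for the example. It uses the non-blocking tryAcquire() so the outcome is easy to follow in a single thread; a real client would normally call acquire(), which waits until a permit is free.

```java
import java.util.Arrays;
import java.util.concurrent.Semaphore;

class ServerPermitsSketch {
    // At most three clients may hold the (hypothetical) server at once.
    private final Semaphore permits = new Semaphore(3, true);

    // Non-blocking attempt; acquire() would instead wait until a permit is free.
    boolean tryUse() { return permits.tryAcquire(); }

    // Called by a client when it has finished with the server.
    void release() { permits.release(); }

    static boolean[] demo() {
        ServerPermitsSketch server = new ServerPermitsSketch();
        boolean c1 = server.tryUse();
        boolean c2 = server.tryUse();
        boolean c3 = server.tryUse();
        boolean c4 = server.tryUse();      // no permit left, the fourth client is refused
        server.release();                  // one client finishes
        boolean c4Retry = server.tryUse(); // a permit is free again
        return new boolean[] { c1, c2, c3, c4, c4Retry };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```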
You can read more about semaphores on Wikipedia. I'm not too familiar with Python but I think these two links can help:
Python Threading Library
Semaphore Objects in Python.
For Java there is a very good standard library that has this functionality:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html
Just create a class with a Semaphore field:
import java.util.concurrent.Semaphore;

class Server {
    private static final int MAX_AVAILABLE = 100;
    // static, so that the static factory method below can use it
    private static final Semaphore available = new Semaphore(MAX_AVAILABLE, true);
    // ... put all other fields (OS, version) here...

    private Server() {}

    // factory method: blocks until a permit is available
    public static Server getServer() throws InterruptedException {
        available.acquire();
        //... do the rest here
        return new Server();
    }
}
Edit:
If you want things to be more "configurable", look into AOP techniques, i.e. create a semaphore-based synchronization aspect.
Edit:
If you want a completely standalone system, I guess you can try to use any modern database (e.g. PostgreSQL) that supports row-level locking as the semaphore. For example, create 3 rows, each representing a server, select them with locking if they are free (e.g. "select * from server where is_used = 'N' for update"), mark the selected server as used, commit the transaction, and unmark the server when you are done with it.