Passing data between subsequent Consumers in LMAX Disruptor (from unmarshaller to business logic)? - disruptor-pattern

As explained in https://martinfowler.com/articles/lmax.html, I need to process my RingBuffer's events first with an Unmarshaller and then with the Business Logic Processor. Suppose it is configured like this (https://lmax-exchange.github.io/disruptor/docs/com/lmax/disruptor/dsl/Disruptor.html):
Disruptor<MyEvent> disruptor = new Disruptor<MyEvent>(MyEvent.FACTORY, 32, Executors.newCachedThreadPool());
EventHandler<MyEvent> handler1 = new EventHandler<MyEvent>() { ... };
EventHandler<MyEvent> handler2 = new EventHandler<MyEvent>() { ... };
disruptor.handleEventsWith(handler1);
disruptor.after(handler1).handleEventsWith(handler2);
The idea is that handler1 is the unmarshaller and handler2 consumes what handler1 has produced.
Question: How exactly do I code the "unmarshal and put back into the Disruptor" part? I found this explanation https://groups.google.com/forum/#!topic/lmax-disruptor/q6h5HBEBRUk but didn't quite understand it. Suppose an event arrives at the callback for handler1
void onEvent(T event, long sequence, boolean endOfBatch)
(javadoc: https://lmax-exchange.github.io/disruptor/docs/com/lmax/disruptor/EventHandler.html)
which unmarshals some data from the event. Now I need to attach the unmarshalled data to the event so that handler2, which deals with the unmarshalled object, can pick it up.
What needs to be done to "update" the event? Is modifying the "event" object enough?

The impact of this really depends on your particular scenario, and as always, if you are after low latency, you should try both and benchmark.
The most straightforward approach is to update the 'event' object (a minimal sketch appears at the end of this answer); however, depending on your particular approach, this might forfeit a lot of the single-writer benefits of the Disruptor. I will explain and offer some options.
Suppose for example you have handler1 and handler2, handler1 is running in thread1 and handler2 is running in thread2. The initial event publisher is on thread0.
Thread0 writes an entry into the buffer at slot 1
Thread1 reads the entry in slot 1 and writes into slot 1
Thread0 writes an entry into the buffer at slot 2
Thread2 reads from slot 1 and writes to output
Thread1 reads the entry in slot 2 and writes into slot 2
Thread2 reads from slot 2 and writes to output
If you think of the physical memory layout, slot 1 and slot 2 are hopefully next to each other in memory; for example, they could be some subset of a byte array. As you can see, you are reading and writing alternately from different threads (probably different CPU cores) into very adjacent chunks of memory, which can lead to false sharing / cache lines bouncing around. On top of that, your reads and writes through memory are not likely to be linear, so you will miss out on some of the benefits of the CPU caches.
Some other options which might be nicer:
Have separate ringbuffers, where the first ringbuffer is raw data, and the second ringbuffer is unmarshalled events. This way the data is sufficiently separated in memory to avoid these costs. However this will have a bandwidth impact.
Have the unmarshaller and the work done directly in the same handler. Depending on the amount of work in your unmarshaller and your handler this might be viable.
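Returning to the straightforward option, a minimal sketch of "modify the event in place" is shown below. The field names (rawBytes, unmarshalled) and the helper methods are assumptions for illustration; the point is that handler1 writes the unmarshalled object into the same MyEvent slot that handler2 later reads, and the gating set up by disruptor.after(handler1).handleEventsWith(handler2) guarantees handler1 has finished with a slot before handler2 sees it.
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;

// Hypothetical event type: carries the raw payload and the unmarshalled result in the same slot.
class MyEvent {
    byte[] rawBytes;       // filled in by the publisher
    Object unmarshalled;   // filled in by handler1, consumed by handler2

    static final EventFactory<MyEvent> FACTORY = MyEvent::new;
}

class Handlers {
    // handler1: unmarshal the raw bytes and store the result back on the event.
    static final EventHandler<MyEvent> UNMARSHALLER = (event, sequence, endOfBatch) -> {
        event.unmarshalled = unmarshal(event.rawBytes);
    };

    // handler2: runs only after handler1 has finished with this slot.
    static final EventHandler<MyEvent> BUSINESS_LOGIC = (event, sequence, endOfBatch) -> {
        process(event.unmarshalled);
    };

    static Object unmarshal(byte[] raw) { /* placeholder for real unmarshalling */ return new Object(); }
    static void process(Object o)       { /* placeholder for real business logic  */ }
}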

Related

Why does the Disruptor hold lots of data when the producer is much faster than the consumer?

I'm learning about the LMAX Disruptor and have a problem: when I have a very large ring buffer, like 1024 slots, and my producer is much faster than my consumer, the ring buffer holds lots of data, but the events are not processed until my application ends. That means my application loses lots of data (it is not a daemon).
I've tried slowing down the rate of the producer, which works, but I can't use this approach in my application because it would greatly reduce its performance.
val ringBufferSize = 1024
val disruptor = new Disruptor[util.Map[String, Object]](new MessageEventFactory, ringBufferSize, new MessageThreadFactory, ProducerType.MULTI, new BlockingWaitStrategy)
disruptor.handleEventsWith(new MessageEventHandler(batchSize, this))
disruptor.setDefaultExceptionHandler(new MessageExceptionHandler)
val ringBuffer = disruptor.start
val producer = new MessageEventProducer(ringBuffer)
part.foreach { row =>
  // Thread.sleep(2000)
  accm.add(1)
  producer.onData(row)
  // flush(row)
}
I want to find a way to control the batch size of the Disruptor myself, and is there any method to consume the rest of the data still held when my application ends?
If you let your application end abruptly, your consumers will of course end abruptly too. There is no need to slow down the producer; you simply need to block your application from exiting until all consumers (i.e. event handlers) have finished working on the outstanding events.
The normal way to do this is to invoke Disruptor.shutdown() on the main thread, thus blocking the application from exiting until Disruptor.shutdown() has returned.
In your code snippet above, you'd add that call after the part.foreach statement, before you exit the routine; it blocks until the Disruptor has drained, ensuring that all events are properly handled to completion.
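A minimal sketch of that shutdown pattern, written here in Java with the Disruptor 3.x ThreadFactory-based constructor (the event type, the handler body and the publishing loop are placeholders standing in for the question's part.foreach and MessageEventProducer):
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class ShutdownExample {
    // Simple mutable event holding one value.
    static class ValueEvent { long value; }

    public static void main(String[] args) {
        Disruptor<ValueEvent> disruptor =
            new Disruptor<>(ValueEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        EventHandler<ValueEvent> handler =
            (event, sequence, endOfBatch) -> System.out.println("consumed " + event.value);
        disruptor.handleEventsWith(handler);
        disruptor.start();

        // Publish everything (stand-in for the part.foreach loop in the question).
        for (long i = 0; i < 10_000; i++) {
            final long v = i;
            disruptor.getRingBuffer().publishEvent((event, seq) -> event.value = v);
        }

        // Block the main thread until every outstanding event has been handled,
        // then stop the handler threads. Only after this returns is it safe to exit.
        disruptor.shutdown();
    }
}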
The Disruptor excels mainly at buffering (smoothing out) bursts of data coming from a single extremely fast producer thread, or from multiple still pretty fast ones, and feeding that data to consumers that perform in a predictable manner, eliminating as much latency and lock-contention overhead as possible. If your producers really are much faster than your consumers, you may find that simply invoking the consumer code from within your lambda yields similar or better results, unless you use advanced techniques such as batching, or set the Disruptor up to run multiple instances of the same consumer in parallel threads, which requires modifying the event handler implementation (see the Disruptor FAQ).
In your example, it seems that all you are trying to accomplish is to feed an already available set of data (your "part" collection) into a single event handler (MessageEventHandler). In such a use case, you might be better off with something like parts.stream().parallel().forEach(... messageEventHandler.onEvent(event) ...)

Serial Dispatch Queue with Asynchronous Blocks

Is there ever any reason to add blocks to a serial dispatch queue asynchronously as opposed to synchronously?
As I understand it, a serial dispatch queue only starts executing the next task in the queue once the preceding task has finished executing. If this is the case, I can't see what you would gain by submitting some blocks asynchronously - the act of submission may not block the thread (since it returns straight away), but the task won't be executed until the last task finishes, so it seems to me that you don't really gain anything.
This question has been prompted by the following code - taken from a book chapter on design patterns. To prevent the underlying data array from being modified simultaneously by two separate threads, all modification tasks are added to a serial dispatch queue. But note that returnToPool adds tasks to this queue asynchronously, whereas getFromPool adds its tasks synchronously.
class Pool<T> {
    private var data = [T]();
    // Create a serial dispatch queue
    private let arrayQ = dispatch_queue_create("arrayQ", DISPATCH_QUEUE_SERIAL);
    private let semaphore:dispatch_semaphore_t;

    init(items:[T]) {
        data.reserveCapacity(items.count);
        for item in items {
            data.append(item);
        }
        semaphore = dispatch_semaphore_create(items.count);
    }

    func getFromPool() -> T? {
        var result:T?;
        if (dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER) == 0) {
            dispatch_sync(arrayQ, {() in
                result = self.data.removeAtIndex(0);
            })
        }
        return result;
    }

    func returnToPool(item:T) {
        dispatch_async(arrayQ, {() in
            self.data.append(item);
            dispatch_semaphore_signal(self.semaphore);
        });
    }
}
Because there's no need to make the caller of returnToPool() block. It could perhaps continue on doing other useful work.
The thread which called returnToPool() is presumably not just working with this pool. It presumably has other stuff it could be doing. That stuff could be done simultaneously with the work in the asynchronously-submitted task.
Typical modern computers have multiple CPU cores, so a design like this improves the chances that CPU cores are utilized efficiently and useful work is completed sooner. The question isn't whether tasks submitted to the serial queue operate simultaneously — they can't because of the nature of serial queues — it's whether other work can be done simultaneously.
Yes, there are reasons why you'd add tasks to a serial queue asynchronously. It's actually extremely common.
The most common example would be when you're doing something in the background and want to update the UI. You'll often dispatch that UI update asynchronously back to the main queue (which is a serial queue). That way the background thread doesn't have to wait for the main thread to perform its UI update, but rather it can carry on processing in the background.
Another common example is, as you've demonstrated, using a GCD queue to synchronize interaction with some object. If you're dealing with immutable objects, you can dispatch these updates asynchronously to this synchronization queue (i.e. why make the current thread wait? Let it carry on instead). You'll do reads synchronously (because you're obviously going to wait until you get the synchronized value back), but writes can be done asynchronously.
(You actually see this latter example frequently implemented with the "reader-writer" pattern and a custom concurrent queue, where reads are performed synchronously on the concurrent queue with dispatch_sync, but writes are performed asynchronously with a barrier using dispatch_barrier_async. But the idea is equally applicable to serial queues, too.)
The choice of synchronous vs. asynchronous dispatch has nothing to do with whether the destination queue is serial or concurrent. It's simply a question of whether you have to block the current thread until the other queue finishes its task or not.
Regarding your sample code, that is correct. getFromPool should dispatch synchronously (because you have to wait for the synchronization queue to actually return the value), but returnToPool can safely dispatch asynchronously. Obviously, I'm wary of seeing code wait on semaphores if it might be called from the main thread (so make sure you don't call getFromPool from the main thread!), but with that one caveat, this code should achieve the desired purpose, offering reasonably efficient synchronization of the pool object, with a getFromPool that will block if the pool is empty until something is added to the pool.

<Spring Batch> Why does making ItemReader thread-safe lead to losing restartability?

I have a multi-threaded batch job reading from a DB, and I am concerned about different threads re-reading records, as ItemReader is not thread-safe in Spring Batch. I went through the Spring Batch FAQ section, which states that
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why I would lose restartability in this case. What has restartability got to do with synchronizing my read operations? It can always try again, right?
Also, will this piece of code be enough for synchronizing the reader?
import org.springframework.batch.item.ItemReader;

public class SynchronizedItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}
When using an ItemReader with multiple threads, the lack of restartability is not about the read itself. It's about saving the state of the reader, which occurs in the update method. The issue is that there needs to be coordination between the calls to read() - the method providing the data - and update() - the method persisting the state. When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take for example the FlatFileItemReader using a chunk size of 5 and running on multiple threads. Thread 1 could have read 5 items (time to update), yet thread 2 could have read an additional 3. That means the call to update would record that 8 items have been read. If the chunk on thread 2 fails, the state would be incorrect and the restart would miss the three items that were already read.
This is not to say that it is impossible to write a thread-safe ItemReader. However, as your example above illustrates, if the delegate is a stateful ItemReader (one that also implements ItemStream), the state will not be persisted correctly by calls to update (in fact, your example above doesn't even take the ItemStream aspect of stateful readers into account).
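To illustrate the ItemStream point, a synchronized delegating reader that also forwards the stream callbacks might look roughly like the sketch below. It only makes the calls mutually exclusive; it does not solve the read()/update() coordination problem described above, which is why restartability is still lost (newer versions of Spring Batch ship a SynchronizedItemStreamReader in org.springframework.batch.item.support along these lines).
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamReader;

public class SynchronizedDelegatingReader<T> implements ItemStreamReader<T> {

    private final ItemStreamReader<T> delegate;

    public SynchronizedDelegatingReader(ItemStreamReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }

    @Override
    public synchronized void open(ExecutionContext executionContext) throws ItemStreamException {
        delegate.open(executionContext);
    }

    @Override
    public synchronized void update(ExecutionContext executionContext) throws ItemStreamException {
        // The state persisted here may lag or lead what other threads have already read.
        delegate.update(executionContext);
    }

    @Override
    public synchronized void close() throws ItemStreamException {
        delegate.close();
    }
}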
If you want to make your job restartable while processing items in parallel, you can persist each item the reader has read, together with its processing state, yourself.

Effects of swapping buffers on concurrent access

Consider an application with two threads, Producer and Consumer.
Both threads run approximately equally frequently, multiple times per second.
Both threads access the same memory region, where Producer writes to the memory, and Consumer reads the current chunk of data and does something with it, without invalidating the data.
A classical approach is this one:
int[] sharedData = new int[LENGTH];

//Called frequently by thread Producer
void WriteValues(int[] data)
{
    lock (sharedData)
    {
        Array.Copy(data, sharedData, LENGTH);
    }
}

//Called frequently by thread Consumer
void WriteValues()
{
    int[] data = new int[LENGTH];
    lock (sharedData)
    {
        Array.Copy(sharedData, data, LENGTH);
    }
    DoSomething(data);
}
If we assume that the Array.Copy takes time, this code runs slowly, since the Producer always has to wait for the Consumer while copying, and vice versa.
One approach to this problem is to create two buffers, one that the Consumer reads from and one that the Producer writes to, and to swap the buffers as soon as writing has finished.
int[] frontBuffer = new int[LENGTH];
int[] backBuffer = new int[LENGTH];

//Called frequently by thread Producer
void WriteValues(int[] data)
{
    lock (backBuffer)
    {
        Array.Copy(data, backBuffer, LENGTH);
        int[] temp = frontBuffer;
        frontBuffer = backBuffer;
        backBuffer = temp;
    }
}

//Called frequently by thread Consumer
void WriteValues()
{
    int[] data = new int[LENGTH];
    int[] currentFrontBuffer = frontBuffer;
    lock (currentFrontBuffer)
    {
        Array.Copy(currentFrontBuffer, data, LENGTH);
    }
    DoSomething(data);
}
Now, my questions:
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
Will the code in the 2nd example execute faster than the code in the 1st example?
Are there any better methods to efficiently solve the problem described above?
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
Note: this is no classical producer/consumer problem: It is possible for Consumer to read the values multiple times before Producer writes it again - the old data stays valid until Producer writes new data.
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
As far as I can tell, because reference assignment is atomic, this may be safe, but it is not ideal. Because the consumer's WriteValues() method reads the frontBuffer reference without a lock or memory barrier forcing a cache refresh, there is no guarantee that the variable will ever be updated with new values from main memory. There is then a potential to continuously read the stale, cached value of that reference from a register or the CPU cache. I'm unsure whether the compiler/JIT might force a refresh anyway because of the local variable; maybe somebody with more specific knowledge can speak to this area.
Even if the values aren't stale, you may also run into more contention than you would like. For example...
Thread A calls WriteValues()
Thread A takes a lock on the instance in frontBuffer and starts copying.
Thread B calls WriteValues(int[])
Thread B writes its data, moves the currently locked frontBuffer instance into backBuffer.
Thread B calls WriteValues(int[])
Thread B waits on the lock for backBuffer because Thread A still has it.
Will the code in the 2nd example execute faster than the code in the 1st example?
I suggest that you profile it and find out. X being faster than Y only matters if Y is too slow for your particular needs, and you are the only one who knows what those are.
Are there any better methods to efficiently solve the problem described above?
Yes. If you are using .NET 4 or above, there is a BlockingCollection type in System.Collections.Concurrent that models the producer/consumer pattern well. If you consistently read more than you write, or have multiple readers and very few writers, you may also want to consider the ReaderWriterLockSlim class. As a general rule of thumb, you should do as little within a lock as you can, which will also help to alleviate your timing issue.
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
You might be able to, but I wouldn't suggest trying that unless you are extremely familiar with multi-threading, cache coherency, and potential compiler/JIT optimizations. Locking will most likely be fine for your situation and it will be much easier for you (and others reading your code) to reason about and maintain.

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that runs until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/** intended to be called by a thread
 \param start the first item to get from the vector
 \param incr how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
    uint curr = start;
    while ((!interpDone) || (lineVector.size() > curr)) {
        if (lineVector.size() > curr) {
            if (lineVector[curr]->isMotion()) {
                ((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
                ((canonMotion*)lineVector[curr])->computeSolid();
            }
            lineVector[curr]->setDispMode(BEST);
            lineVector[curr]->display();
            curr += incr;
        } else {
            uio::sleep(); // wait a little bit for interp
        }
    }
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+; how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce negligible benefit. If that's the case (yes, I'll answer your original question soon in case it's not), then think about processing multiple objects simultaneously. Given your processing takes quite a while, the thread-creation overhead isn't terribly significant, so you could simply have your main file-reading/object-creation thread spawn a new thread and point it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this would create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10 and join 10, read/create 10 and join 10, etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that involves just the main thread (the one the OS creates when your program starts) and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually-exclusive, i.e. not-concurrent, access to data) and a condition variable, which allows the worker thread to block efficiently until the main thread has provided more work. The terms - mutex and condition variable - are the standard terms in the POSIX threading that Linux uses, so they should appear in the documentation of whichever libraries you're interested in. In summary, the worker thread waits until the main read/create thread broadcasts a wake-up signal indicating that another object is ready for processing. You may want a counter holding the index of the last fully created, ready-for-processing object, so the worker thread can maintain its own count of processed objects and work through the ready ones before checking the condition variable again.
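A minimal sketch of that main-thread/worker coordination, written in Java to keep all added examples in one language (the same structure maps directly onto pthread_mutex_t / pthread_cond_t; the object names and the 100-item loop are placeholders):
import java.util.ArrayList;
import java.util.List;

public class ProducerWorkerSketch {
    private final Object lock = new Object();                   // plays the role of the mutex
    private final List<String> lineVector = new ArrayList<>();  // stand-in for the real objects
    private boolean interpDone = false;                         // set by the main thread when reading is finished

    void produce() throws InterruptedException {
        Thread worker = new Thread(this::makeSolids);
        worker.start();
        for (int i = 0; i < 100; i++) {                          // stand-in for "read file, create object"
            synchronized (lock) {
                lineVector.add("object " + i);
                lock.notifyAll();                                // wake the worker: another object is ready
            }
        }
        synchronized (lock) {
            interpDone = true;
            lock.notifyAll();
        }
        worker.join();                                           // wait for the worker to finish processing
    }

    private void makeSolids() {
        int curr = 0;
        while (true) {
            String item;
            synchronized (lock) {
                while (!interpDone && lineVector.size() <= curr) {
                    try {
                        lock.wait();                             // block until more work or reading is done
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (lineVector.size() <= curr) {
                    return;                                      // interpDone and nothing left to process
                }
                item = lineVector.get(curr++);
            }
            process(item);                                       // the slow per-object work, outside the lock
        }
    }

    private void process(String item) { /* placeholder for the .5s+ computation */ }

    public static void main(String[] args) throws InterruptedException {
        new ProducerWorkerSketch().produce();
    }
}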
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is that each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data-integrity issues, so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (! eof)
{
readfile;
object O(data);
push_back(O);
pthread_create(...., O, makeSolid);
}
while(x < vector.size())
{
pthread_join();
x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter in the creation loop to limit the number of threads that can be created before running ones are joined.
@Caleb: quite -- perhaps I should have emphasized active threads. The GUI thread should always be considered one.
