LMAX Disruptor Partition and join batch - disruptor-pattern

Currently I have an Executor implementation backed by a blocking queue. The specifics are: I have a list of items per request, I divide them into partitions, each partition is then computed, and finally the results are joined to produce the final list.
How do I go about implementing this with the LMAX Disruptor? I see that once I create the partitions and push them into the RingBuffer, each partition is treated as a separate item, so I am joining them by hand.
Something like this:
ConcurrentHashMap<Long, LongAdder> map = new ConcurrentHashMap<>();

@Override
public List<SomeTask> score(final List<SomeTask> tasks) {
    long id = tasks.get(0).id;
    map.put(id, new LongAdder());
    for (SomeTask task : tasks) {
        producer.onData(task);
    }
    // busy-wait until the consumers have counted every task of this request
    while (map.get(id).intValue() != tasks.size()) ;
    map.remove(id);
    return tasks;
}
Is there a clean way to do it? I looked at https://github.com/LMAX-Exchange/disruptor/tree/master/src/test/java/com/lmax/disruptor/example, and at KeyedBatching specifically, but they seem to batch and execute on one thread.
Currently each partition takes around 200 ms, and I want to execute them in parallel.
Any help is greatly appreciated.

I think you should take a look at the worker-pool option, followed by a final event processor which recombines the shards.
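For illustration, here is a minimal sketch of that layout using the Disruptor DSL. The TaskEvent class, the scoreOne(...) method and the pending map of latches are hypothetical names, not from the question; SomeTask is the asker's type. Each published event is claimed by exactly one worker in the pool, so the partitions of a request are scored in parallel, and the single joiner handler runs after the workers and re-combines the shards per request id.

import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;

public class ShardedScoring {
    // Hypothetical event: one task of a partitioned request, tagged with its request id.
    static class TaskEvent {
        long requestId;
        SomeTask task;
    }

    // One latch per in-flight request; the producer awaits it instead of busy-waiting.
    private final ConcurrentHashMap<Long, CountDownLatch> pending = new ConcurrentHashMap<>();

    public RingBuffer<TaskEvent> build(int workers) {
        Disruptor<TaskEvent> disruptor =
                new Disruptor<>(TaskEvent::new, 1024, Executors.defaultThreadFactory());

        // Worker pool: every event is claimed by exactly one of these handlers,
        // so the tasks of a request are scored in parallel across threads.
        @SuppressWarnings("unchecked")
        WorkHandler<TaskEvent>[] scorers = new WorkHandler[workers];
        for (int i = 0; i < workers; i++) {
            scorers[i] = event -> scoreOne(event.task);
        }

        // Final event processor: runs after the worker pool, sees every event,
        // and re-combines the shards by counting down the request's latch.
        EventHandler<TaskEvent> joiner = (event, sequence, endOfBatch) ->
                pending.get(event.requestId).countDown();

        disruptor.handleEventsWithWorkerPool(scorers).then(joiner);
        return disruptor.start();
    }

    private void scoreOne(SomeTask task) {
        // per-task scoring (the ~200 ms of work) would go here
    }
}

A score(...) method in this style would register a new CountDownLatch(tasks.size()) in pending, publish each task with ringBuffer.publishEvent(...), and then await the latch, replacing the spin loop from the question.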

Related

Azure Event Hubs: Offset vs Sequence number

I see this question asked on a lot of forums, but none of them resolve my confusion.
This documentation seems to suggest that both offset and sequence number are unique within a partition.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.eventdata?view=azure-dotnet
It is clearly understood that sequence number is an integer which increments sequentially :
https://social.msdn.microsoft.com/Forums/azure/en-US/acc25820-a28a-4da4-95ce-4139aac9bc44/sequence-number-vs-offset?forum=azureiothub#:~:text=SequenceNumber%20is%20the%20logical%20sequence,the%20Event%20Hub%20partition%20stream.&text=The%20sequence%20number%20can%20be,authority%20and%20not%20by%20clients.
But what about the offset? Is it unique only within a partition, or across all partitions within a consumer group? If it is the former, why have two different variables?
An offset is a relative position within the partition's event stream. In the current Event Hubs implementation, it represents the number of bytes from the beginning of the partition to the first byte in a given event.
Within the context of a partition, the offset is unique. The same offset value may appear in other partitions - it should not be treated as globally unique across the Event Hub.
If it is the former condition, why have two different variables?
The offset is guaranteed only to uniquely identify an event within the partition. It is not safe to reason about its value or about how it changes from event to event.
Sequence Number, on the other hand, follows a predictable pattern where numbering is contiguous and unique within the scope of a partition. Because of this, it is safe to use in calculations like "if I want to rewind by 5 events, I'll take the current sequence number and subtract 5."
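To make that concrete, here is a small sketch using the Azure Event Hubs Java SDK (com.azure.messaging.eventhubs); the rewindBy helper is a hypothetical name. Sequence numbers can safely be used in arithmetic like this, while an offset must be treated as an opaque token:

import com.azure.messaging.eventhubs.models.EventPosition;

public final class ReplayPositions {
    // Rewind by a fixed number of events within ONE partition. This is safe with
    // sequence numbers (contiguous per partition); it is NOT valid with offsets,
    // whose byte-based values must not be used in arithmetic.
    public static EventPosition rewindBy(long lastSequenceNumber, long events) {
        return EventPosition.fromSequenceNumber(lastSequenceNumber - events);
    }
}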
Offset refers to consumer groups, not to partitions.
The offset is kept in a small storage container created for every consumer group. Every consumer group has its own read offset; you can have several consumer groups, and every group will read the event hub data at its own pace. In other words, the offset container holds a small blob with data regarding the read checkpoint, which advances every time you execute context.CheckpointAsync(). If you delete the container created for the consumer group, the group will begin to read the data from the beginning:
List<EventProcessorHost> eventProcessorHosts = new List<EventProcessorHost>();

var eventProcessorHost = new EventProcessorHost(
    EventHubName,
    PartitionReceiver.DefaultConsumerGroupName,
    EventHubConnectionString,
    StorageConnectionString,
    StorageContainerName);

eventProcessorHosts.Add(eventProcessorHost);
eventProcessorHosts[0].RegisterEventProcessorAsync<SimpleEventProcessor>();
...
public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
    foreach (var eventData in messages)
    {
        var data = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
        Console.WriteLine($"messages count: {messages.Count()} Message received. Partition: '{context.PartitionId}', Data: '{data}', thread:{Thread.CurrentThread.ManagedThreadId}");
    }
    // Writes the current offset and sequenceNumber to the checkpoint store via the
    // checkpoint manager.
    return context.CheckpointAsync();
}
Check the storage container that was passed to the EventProcessorHost constructor.

Parallel For-Each vs Scatter-Gather in Mule

I have multiple records:
{
  "item_id": 1,
  "key1": "data1"
}
{
  "item_id": 2,
  "key1": "data1"
}
{
  "item_id": 2,
  "key1": "data1"
}
{
  "item_id": 1,
  "key1": "data1"
}
I do not want to process them sequentially. There can be more than 200 records. Should I process them using parallel for-each or scatter-gather? Which approach would be best for my requirement?
I do not need the accumulated response, but if there is an exception while processing any one of the records (I hit an API for each record, based on an if condition), the processing of the other records must remain unaffected.
Why not then use the VM module, break the collection into its individual records and push them to a VM queue? Then have another flow with a VM listener picking up the individual records (in parallel) and processing them.
Here's more details: https://docs.mulesoft.com/mule-runtime/4.2/reliability-patterns
Scatter-gather is meant for cases where you have a static number of routes. Imagine one route to send to the HR system and another to the Accounting system.
For processing a variable number of records you should use parallel for-each.
Use for-each with async, parallel for-each, or a JMS pattern. Scatter-gather sends one payload to all routes, so you won't be able to iterate over a variable number of records.

Time-Based Eviction in Hazelcast

I am working on a requirement where I'd have N Hazelcast instances running in a cluster, with Kafka consumers running on all of them.
Now the requirement is that each message that comes in on Kafka should be added to the distributed map, and the entry must be evicted after 20 seconds, which I did by using a combination of the time-to-live and max-idle-seconds parameters in the map config.
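For illustration, a TTL plus max-idle configuration of that kind might look like the following sketch (the map name "events" is hypothetical):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;

public class MapTtlConfig {
    public static void main(String[] args) {
        Config config = new Config();
        config.addMapConfig(new MapConfig("events")
                .setTimeToLiveSeconds(20)   // evict 20 s after the entry was written
                .setMaxIdleSeconds(20));    // or 20 s after it was last read/updated
        Hazelcast.newHazelcastInstance(config);
    }
}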
But what I really want is that when an entry is evicted, only one of the nodes should process it; right now, the eviction is reported to all the nodes.
Let me know if any more information is needed.
You have to add a localEntryListener to your distributed map so that a member only receives notifications for entries it owns.
e.g.
if (map != null) {
    map.addLocalEntryListener(new EntryAddedListener<Long, Long>() {
        @Override
        public void entryAdded(EntryEvent<Long, Long> event) {
            log.info("LOCAL ENTRY ADDED : {} at {}", event, System.currentTimeMillis());
        }
    });
}
The above example is for an EntryAddedListener; you can implement an EntryEvictedListener similarly.
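For example, a minimal sketch of the eviction case (reusing the map and the Long key/value types from the snippet above; com.hazelcast.map.listener.EntryEvictedListener is the listener interface, and handleExpiredEntry is a hypothetical method):

map.addLocalEntryListener(new EntryEvictedListener<Long, Long>() {
    @Override
    public void entryEvicted(EntryEvent<Long, Long> event) {
        // Only the member that owns the evicted key receives this callback,
        // so the entry is processed by exactly one node.
        handleExpiredEntry(event.getKey());
    }
});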

Replaying an RDD in spark streaming to update an accumulator

I am actually running out of options.
In my Spark Streaming application, I want to keep state for some keys. I get events from Kafka, then I extract a key from each event, say a userID. When there are no events coming from Kafka, I want to keep updating a counter for each userID every 3 seconds, since I configured the batch duration of my StreamingContext to 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my spark streaming context as the following:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for (i <- 1 to 100) {
  rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
  Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This lets me access my accumulableCollection and update the counters of all userIDs. Up to now everything works fine; however, when I change my loop from:
for (i <- 1 to 100) {} // this is for testing
To:
while (true) {} // this is to let me access and update my accumulator through the whole application life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this? Is there a more straightforward way to update the values of my userIDs (rather than creating a useless RDD and pushing it periodically to the queue stream)?
The reason the while (true) ... version does not work is that control never returns to the main execution line, so nothing below that line gets executed. To solve that specific problem, we should execute the while loop in a separate thread; Future { while (true) ... } should probably work.
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed. Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
That said, the current approach looks fragile to me and would need revision.

Number of threads decreases as Parallel.Foreach loop goes on

I have a Parallel Foreach loop which loops through a list of items, and performs some actions against them. Some of these actions take longer than others, depending on the item.
Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = 5 }, item =>
{
    var subItems = item.subItems;
    foreach (var subItem in subItems)
    {
        // do some actions for subItem
    }
    Console.WriteLine("Action Complete for {0}", item);
});
After a while, when there are only about 5-10 items left in the list to run, it seems that there is only 1 thread left running. This is not ideal, because some items will then be stuck behind another one to finish.
If I stop the script, and then start it again, with only the leftover 5-10 items in the list, it spins up multiple threads to do each of the items in parallel again.
How can I ensure that the other threads will keep being used, without me needing to restart the script?
The problem here is that the default partitioner is chunking the work up into blocks of N items per task. It assumes that the number of items is large and that each item takes about the same amount of time; under those assumptions you would expect the five threads to run through the last ~N * 5 items between them and all finish at roughly the same time.
However, in your case this is not true. You could write your own Partitioner to use a smaller number of items per block; see the Partitioner class. This may improve performance, but if the work done per item is very small, you will increase the proportion of time spent managing tasks relative to doing useful work, and possibly degrade performance.
You could also write a dynamic partitioner that decreases the partition size so that the last few items are in smaller partitions, thus ensuring that you are still using all the available threads. This MSDN article covers writing custom partitioners: Custom Partitioners for PLINQ and TPL.
