Parallel For-Each vs Scatter-Gather in Mule - multithreading

I have multiple records:
{
  "item_id": 1,
  "key1": "data1"
}
{
  "item_id": 2,
  "key1": "data1"
}
{
  "item_id": 2,
  "key1": "data1"
}
{
  "item_id": 1,
  "key1": "data1"
}
I do not want to process them sequentially. There can be more than 200 records. Should I process them using parallel for-each or scatter-gather? Which approach would be best for my requirement?
I do not need the accumulated response, but if there is an exception while processing any one of the records (I hit an API for each record based on an if condition), the processing of the other records must remain unaffected.

Why not use the VM module: break the collection into its individual records and push them to a VM queue, then have another flow with a VM listener picking up the individual records (in parallel) and processing them?
Here are more details: https://docs.mulesoft.com/mule-runtime/4.2/reliability-patterns
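A minimal sketch of that pattern with the Mule 4 VM connector (queue, flow, and config names are assumptions):

<vm:config name="vmConfig">
  <vm:queues>
    <vm:queue queueName="recordsQueue" />
  </vm:queues>
</vm:config>

<flow name="splitRecordsFlow">
  <!-- publish one VM message per record -->
  <foreach collection="#[payload]">
    <vm:publish config-ref="vmConfig" queueName="recordsQueue" />
  </foreach>
</flow>

<flow name="processRecordFlow">
  <!-- numberOfConsumers controls how many records are processed in parallel -->
  <vm:listener config-ref="vmConfig" queueName="recordsQueue" numberOfConsumers="8" />
  <!-- call the API for the single record in payload; a failure here
       only affects this one message -->
</flow>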

Scatter-gather is meant for cases where you have a static number of routes. Imagine one route to send to the HR system and another to the Accounting system.
For processing a variable number of records you should use parallel for-each.
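To keep one failing record from affecting the others, each iteration can wrap the API call in a try scope with on-error-continue. A sketch, assuming Mule 4.2+ (the HTTP config name, path, and maxConcurrency value are assumptions):

<parallel-foreach collection="#[payload]" maxConcurrency="16">
  <try>
    <choice>
      <when expression="#[payload.key1 == 'data1']">
        <http:request method="POST" config-ref="httpRequestConfig" path="/process" />
      </when>
    </choice>
    <error-handler>
      <!-- swallow the error so the remaining records keep processing -->
      <on-error-continue type="ANY">
        <logger level="WARN" message="#['Record failed: ' ++ error.description]" />
      </on-error-continue>
    </error-handler>
  </try>
</parallel-foreach>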

Use for-each with async, parallel for-each, or a JMS/VM queue pattern. Scatter-gather sends the same payload to every route, so every thread receives the whole payload and you cannot cycle through the records with it.

Azure Event Hubs: Offset vs Sequence Number

I see this question being asked on a lot of forums, but none of them resolve my confusion.
This documentation seems to suggest that both the offset and the sequence number are unique within a partition:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.eventdata?view=azure-dotnet
It is clearly understood that the sequence number is an integer which increments sequentially:
https://social.msdn.microsoft.com/Forums/azure/en-US/acc25820-a28a-4da4-95ce-4139aac9bc44/sequence-number-vs-offset?forum=azureiothub#:~:text=SequenceNumber%20is%20the%20logical%20sequence,the%20Event%20Hub%20partition%20stream.&text=The%20sequence%20number%20can%20be,authority%20and%20not%20by%20clients.
But what of the offset? Is it unique only within a partition, or across all partitions within a consumer group? If it is the former, why have two different variables?
An offset is a relative position within the partition's event stream. In the current Event Hubs implementation, it represents the number of bytes from the beginning of the partition to the first byte in a given event.
Within the context of a partition, the offset is unique. The same offset value may appear in other partitions - it should not be treated as globally unique across the Event Hub.
If it is the former condition, why have two different variables?
The offset is guaranteed only to uniquely identify an event within the partition. It is not safe to reason about the value or how it changes from event-to-event.
Sequence Number, on the other hand, follows a predictable pattern where numbering is contiguous and unique within the scope of a partition. Because of this, it is safe to use in calculations like "if I want to rewind by 5 events, I'll take the current sequence number and subtract 5."
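For example, with the Microsoft.Azure.EventHubs SDK (partitionId and lastSequenceNumber are assumed variables), that rewind is simple arithmetic on the sequence number; there is no safe equivalent using the offset:

using Microsoft.Azure.EventHubs;

var client = EventHubClient.CreateFromConnectionString(EventHubConnectionString);
// Sequence numbers are contiguous per partition, so subtracting 5 is safe.
long rewindTo = lastSequenceNumber - 5;
var receiver = client.CreateReceiver(
    PartitionReceiver.DefaultConsumerGroupName,
    partitionId,
    EventPosition.FromSequenceNumber(rewindTo, inclusive: true));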
The checkpointed offset is tracked per consumer group, not shared across groups.
A checkpoint store (a small storage container) is created for every consumer group; each consumer group has its own read offset, so you can have several consumer groups and every group will read the event hub data at its own pace. In other words, the offset container holds a small blob with the read checkpoint, which advances every time you execute context.CheckpointAsync(). If you delete the container created by the consumer group, the group will begin reading the data from the beginning.
List<EventProcessorHost> eventProcessorHosts = new List<EventProcessorHost>();
var eventProcessorHost = new EventProcessorHost(
    EventHubName,
    PartitionReceiver.DefaultConsumerGroupName,
    EventHubConnectionString,
    StorageConnectionString,
    StorageContainerName);
eventProcessorHosts.Add(eventProcessorHost);
eventProcessorHosts[0].RegisterEventProcessorAsync<SimpleEventProcessor>();
...
public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
    foreach (var eventData in messages)
    {
        var data = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
        Console.WriteLine($"messages count: {messages.Count()} Message received. Partition: '{context.PartitionId}', Data: '{data}', thread: {Thread.CurrentThread.ManagedThreadId}");
    }
    // Writes the current offset and sequence number to the checkpoint store via the
    // checkpoint manager.
    return context.CheckpointAsync();
}
Check the storage container that was passed to EventProcessorHost constructor.

How to use synchronous messages on a RabbitMQ queue?

I have a Node.js function that needs to be executed for each order in my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds 1 to that number in the database (this needs to be the last thing in the function, because the order can fail and then the number will not be used).
If all orders received at time T are processed at the same time (asynchronously), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this situation, since it is a queue. The processes seem to finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I have the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time? To only start processing message n+1 when message n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
id NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
ID DATA
---------- --------------------
1 abc
2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
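If you do stay on RabbitMQ, "one message at a time" comes down to a single consumer with a prefetch count of 1 and manual acknowledgements. A minimal sketch using the amqplib package (the connection URL, queue name, and processOrder are assumptions):

const amqp = require('amqplib');

async function consumeSequentially() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('orders', { durable: true });
  // Deliver at most one unacknowledged message to this consumer at a time.
  await ch.prefetch(1);
  await ch.consume('orders', async (msg) => {
    try {
      await processOrder(JSON.parse(msg.content.toString()));
      ch.ack(msg); // the next message is only delivered after this ack
    } catch (err) {
      ch.nack(msg, false, true); // requeue the message on failure
    }
  }, { noAck: false });
}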

LMAX Disruptor: partition and join batch

Currently I have an Executor implementation with a blocking queue. The specifics are: I have a list of items per request, which I divide into partitions; each partition is then computed, and finally they are joined to produce the final list.
How do I go about implementing this in LMAX? I see that once I have the partitions and push them into the RingBuffer, each partition is treated as a separate item, so I am custom-joining them.
Something like:
ConcurrentHashMap<Long, LongAdder> map = new ConcurrentHashMap<>();

@Override
public List<SomeTask> score(final List<SomeTask> tasks) {
    long id = tasks.get(0).id;
    map.put(id, new LongAdder());
    for (SomeTask task : tasks) {
        producer.onData(task);
    }
    // Busy-wait until every task in this batch has been counted by the consumers.
    while (map.get(id).intValue() != tasks.size()) ;
    map.remove(id);
    return tasks;
}
Is there a clean way to do it? I looked at https://github.com/LMAX-Exchange/disruptor/tree/master/src/test/java/com/lmax/disruptor/example and KeyedBatching specifically, but they seem to batch and execute on one thread.
Currently each partition takes around 200 ms, and I want to execute them in parallel.
Any help is greatly appreciated.
I think you should take a look at the worker-pool options, followed by a final event processor which re-combines the shards.
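A sketch of that shape with Disruptor 3.x (TaskEvent, ScoringWorker, and JoinHandler are hypothetical names): each WorkHandler in the pool receives a distinct subset of events, so partitions are scored in parallel, and the trailing EventHandler sees every event after the pool and can re-combine the shards.

import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

Disruptor<TaskEvent> disruptor =
    new Disruptor<>(TaskEvent::new, 1024, DaemonThreadFactory.INSTANCE);

// Four workers: each event (partition) is handed to exactly one of them.
WorkHandler<TaskEvent>[] workers = new WorkHandler[] {
    new ScoringWorker(), new ScoringWorker(),
    new ScoringWorker(), new ScoringWorker()
};

// The join handler runs after the worker pool and re-combines the shards.
disruptor.handleEventsWithWorkerPool(workers)
         .then(new JoinHandler());

disruptor.start();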

How do I use multithreading to iterate a list with many records?

I have a list with 200 records of an object. For each record in that list, I need to call a REST API (findOrderByID) to receive the complete data. The problem is that this process is taking more than 10 seconds, and I want to use multithreading to shorten the time.
Here is the iteration over my list:
// My list with ~200 records
for (Order order : orderList) {
    OrderRestVO response = findOrderByID(order.getID()); // Calling an API method
}
My question is, how do I utilize multi-threading to optimize my search?
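A sketch of one common approach: fan the calls out over a fixed thread pool with CompletableFuture. The pool size of 20 is an arbitrary assumption; Order, OrderRestVO, and findOrderByID are taken from the question.

import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

ExecutorService pool = Executors.newFixedThreadPool(20);
try {
    // Start all ~200 calls; each runs on one of the pool's threads.
    List<CompletableFuture<OrderRestVO>> futures = orderList.stream()
        .map(order -> CompletableFuture.supplyAsync(() -> findOrderByID(order.getID()), pool))
        .collect(Collectors.toList());

    // Block until every call has finished and collect the responses.
    List<OrderRestVO> responses = futures.stream()
        .map(CompletableFuture::join)
        .collect(Collectors.toList());
} finally {
    pool.shutdown();
}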

Node: check a Firebase DB and execute a function when an object's time matches the current time

Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database like so. Images are arranged in order of ascending update time.
user-name: {
  images: [
    {
      src: 'image-src-url',
      updateTime: 1503953587727
    },
    {
      src: 'image-src-url',
      updateTime: 1503958424838
    }
  ]
}
Scale
My application's database could potentially get very large, with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image object's time has been reached and then execute a function? (I do not need assistance with the actual function being run, just with checking the DB for a specific time.)
Attempts
I've thought about doing a cron job using node-cron that checks the entire database every 60 s (users can only specify the minute the image will update, not the seconds). Then, if it finds a matching updateTime, it executes my function. My concern is that at a large scale the cron job will take a while to search the DB and could potentially miss a time.
I've also thought about dynamically creating a specific cron job whenever a user schedules a new update. I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?
There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now somewhere (e.g. in your database) so that you can re-use it next time to retrieve the next batch of items:
var previous = ... previous value of now
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
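Persisting that cut-off after a successful run could be as simple as (the lastProcessed path is an assumption, chosen to match the validation rule below):

firebase.database().ref("lastProcessed").set(now);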
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that you've already processed. If this is a concern for your use-case, you can prevent them from doing so with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items. So there is a scalability limit to this approach, but this approach should work well for anything up to a few hundreds of thousands of nodes (I haven't tested in a while so ymmv).
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. The clients add the items that they want processed to the queue, with an updateTime of when they want them processed. Your server picks the items from the queue, performs the necessary updates, and removes each item from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now);
query.once("value").then(function(snapshot) {
  snapshot.forEach(function(child) {
    // TODO: process the child node

    // remove the child node from the queue
    child.ref.remove();
  });
});
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.
