Microservices sequential data processing - multithreading

Suppose I am receiving a stream of sequential data over time, arriving in no particular order.
For example, input could be:
[
{id:1, timestamp:1},
{id:2, timestamp:1},
{id:2, timestamp:2},
{id:1, timestamp:2},
{id:3, timestamp:1}
]
Each entity is identified by the 'id' field. There could be a large number of entities, and processing each input could take some time.
The problem is that I need to process the events for each entity in the order they were received.
I was considering solutions such as putting the messages into a Kafka topic with partitions to get parallelism.
Or creating local storage for received messages and marking each message of an entity as processed after successful processing (on another machine, or on the same one in a thread pool)?
Questions:
Is this a good solution?
How can I achieve this functionality while scaling the data consumers (with a fixed number of services / by creating new instances)?
Maybe there is a better way to solve this kind of problem?

"IF" the sequential data you mention just divided by id, 1 2 and 3,
Then Would be the best you make 3 background services as an consumer, just need 1 partition for the case (you can decided this on your own)
Then make 3 topic based on the data
ex :
TOPIC 1
TOPIC 2
TOPIC 3
which mean you need to make 3 kind of consumer, each of consumer would be listen to only 1 topic
Then you would be spawn new process / Thread for every new stream data,
It would work in parallel
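If the set of ids is not fixed in advance, a common alternative is a single topic with several partitions and the entity id as the message key, so that all events of one entity land in the same partition and are consumed in the order they were produced. A minimal sketch with the Kafka Java client (the topic name, broker address and JSON payloads are assumptions, not taken from the question):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EntityEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The entity id is the record key: Kafka hashes the key to a partition, so
            // every event of entity 1 lands in the same partition and keeps its order.
            producer.send(new ProducerRecord<>("entity-events", "1", "{\"id\":1,\"timestamp\":1}"));
            producer.send(new ProducerRecord<>("entity-events", "2", "{\"id\":2,\"timestamp\":1}"));
            producer.send(new ProducerRecord<>("entity-events", "1", "{\"id\":1,\"timestamp\":2}"));
        }
    }
}

On the consuming side the service instances join one consumer group; Kafka assigns each partition to exactly one member of the group, so adding instances scales throughput up to the partition count while per-entity order is preserved.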

Related

Azure Function with Event Hub trigger receives weird amount of events

I have an Event Hub and an Azure Function connected to it. With small amounts of data everything works well, but when I tested it with 10,000 events, I got very peculiar results.
For test purposes I send numbers from 0 to 9999 into the Event Hub and log the data in Application Insights and in Service Bus. In the first test I can see in Azure that the hub received exactly 10,000 events, but Service Bus and AI received all messages between 0 and 4500 and only every second message after 4500 (so about 30% were lost). In the second test I got all messages from 0 to 9999, but every second message between 3500 and 3200 was duplicated. I would like to get every message exactly once; what did I do wrong?
public async Task Run([EventHubTrigger("%EventHubName%", Connection = "AzureEventHubConnectionString")] EventData[] events, ILogger log)
{
    int id = _random.Next(1, 100000);
    _context.Log.TraceInfo("Started. Count: " + events.Length + ". " + id); //AI log
    foreach (var message in events)
    {
        //log with ASB
        var mess = new Message();
        mess.Body = message.EventBody.ToArray();
        await queueClient.SendAsync(mess);
    }
    _context.Log.TraceInfo("Completed. " + id); //AI log
}
By using EventData[] events you are reading events from the hub in batch mode; that's why you see X events being processed at one moment and the next batch a second later.
Instead of EventData[], simply use EventData.
When you send events to the hub, check that all events are sent with the same partition key if you want to try batch processing; otherwise they can be split across several partitions depending on TUs (throughput units), PUs (processing units) and CUs (capacity units).
Throughput limits for Basic, Standard, Premium...:
Egress: up to 2 MB per second or 4096 events per second.
Refer to this document.
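For illustration only (the question's function is C#, and the connection string, hub name and payloads below are placeholders), a hedged sketch of publishing a batch with a single partition key using the azure-messaging-eventhubs Java client:

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import com.azure.messaging.eventhubs.models.CreateBatchOptions;

public class PartitionKeyPublisher {
    public static void main(String[] args) {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<connection-string>", "<event-hub-name>") // placeholders
                .buildProducerClient();

        // All events in this batch share one partition key, so they are stored in the
        // same partition and read in order by a single processor.
        EventDataBatch batch = producer.createBatch(
                new CreateBatchOptions().setPartitionKey("load-test"));
        for (int i = 0; i < 100; i++) {
            batch.tryAdd(new EventData(Integer.toString(i))); // tryAdd returns false when the batch is full; ignored in this sketch
        }
        producer.send(batch);
        producer.close();
    }
}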
There are a couple of things likely happening, though I can only speculate with the limited context that we have. Knowing more about the testing methodology, tier of your Event Hubs namespace, and the number of partitions in your Event Hub would help.
The first thing to be aware of is that the timing between when an event is published and when it is available in a partition to be read is non-deterministic. When a publish operation completes, the Event Hubs broker has acknowledged receipt of the events and taken responsibility for ensuring they are persisted to multiple replicas and made available in a specific partition. However, it is not a guarantee that the event can immediately be read.
Depending on how you sent the events, the broker may also need to route events from a gateway by performing a round-robin or applying a hash algorithm. If you're looking to optimize the time from publish to availability, taking ownership of partition distribution and publishing directly to a partition can help, as can ensuring that you're publishing with the right degree of concurrency for your host environment and scenario.
With respect to duplication, it's important to be aware that Event Hubs offers an "at least once" guarantee; your consuming application should expect some duplicates and needs to be able to handle them in the way that is appropriate for your application scenario.
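Purely as an illustration of that point (not part of the original answer), one common way to handle at-least-once delivery is a bounded set of already-processed identifiers on the consumer side; a framework-free Java sketch, assuming each event carries a unique id:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Bounded "seen ids" set: the oldest entries are evicted once the cap is reached,
// which is enough to absorb the duplicates produced during a rebalance window.
public class DuplicateFilter {
    private final Set<String> seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > 100_000; // assumed cap
                }
            });

    /** Returns true only the first time a given event id is offered. */
    public boolean firstTime(String eventId) {
        return seen.add(eventId);
    }
}

The consumer would call firstTime(id) before forwarding a message to Service Bus and drop the message when it returns false; the cap only needs to cover the duplication window around rebalances.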
Azure Functions uses a set of event processors in its infrastructure to read events. The processors collaborate with one another to share work and distribute the responsibility for partitions between them. Because collaboration takes place using storage as an intermediary to synchronize, there is an overlap of partition ownership when instances are scaled up or scaled down, during which time the potential for duplication is increased.
Functions decides to scale based on the number of events that it sees waiting in partitions to be read. In the case of your test, if your publishing rate ramps up quickly and Functions sees the event backlog grow to the point that it feels the need to scale out by multiple instances, you'll see more duplication than you otherwise would for a period of 10-30 seconds until partition ownership normalizes. To mitigate this, gradually increasing the publishing rate over a 1-2 minute period can help to smooth out the scaling and reduce (but not eliminate) duplication.

Parallel ordered processing of spring integration events

I am pretty new to the integration framework. I have a Spring message listener which receives updates for stocks (S1, S2, S3 ...). The updates for the same stock need to be processed sequentially, whereas updates for different stocks should be processed in parallel.
e.g. if the sequence of updates is S1-1, S1-2, S2-1, S1-3, S3-1, S2-2, S1-3, S3-2 ... then there should be three parallel streams of processing:
S1-1, S1-2, S1-3
S2-1, S2-2
S3-1, S3-2
Note that there could be thousands of such stocks.
Currently I am processing everything in parallel using an executor on the channel. How can I achieve my requirements? Please advise. Thanks.
Well, we can't run "thousands of such stocks" fully in parallel, because spawning that many threads is really inefficient, so I would go with something like partitioning, with a queue channel per partition.
The router component can decide from the request message which partition it belongs to and send it to the respective queue channel.
This way a poller on each queue channel pulls messages sequentially, while the different queues are processed in parallel.
See more info in docs:
https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#router
https://docs.spring.io/spring-integration/docs/current/reference/html/core.html#channel-implementations-queuechannel
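Outside of Spring Integration, the same partition-and-queue idea can be sketched with plain Java executors: one single-threaded executor per partition, chosen by hashing the stock symbol (the class and method names here are illustrative, not from the answer):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedStockProcessor {
    private final ExecutorService[] partitions;

    public PartitionedStockProcessor(int partitionCount) {
        partitions = new ExecutorService[partitionCount];
        for (int i = 0; i < partitionCount; i++) {
            // one thread per partition => updates submitted to it run in order
            partitions[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void submit(String stock, Runnable update) {
        // the same stock always hashes to the same partition, preserving its order;
        // different stocks spread across partitions and run in parallel
        int idx = Math.abs(stock.hashCode() % partitions.length);
        partitions[idx].execute(update);
    }
}

Each single-threaded executor plays the role of a queue channel plus poller: submissions for the same stock are serialized, while different stocks are spread across executors and processed concurrently.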

Hazelcast Jet stream processing end window emission

I've stumbled across an interesting observation while trying to cross-check the results of an aggregation in my stream processing. I created a test case where a pre-defined data set was fed into a journaled map, and the aggregation was supposed to produce one result, since the data and its pre-determined timestamps lined up with the window size/slide. However, the result was never published: the window was not emitted, although a few accumulate/combine operations were executed. It works differently with real data, but the aggregation result is always 'behind' the amount of data drawn from the source. I guess this has something to do with watermarks? How can I make sure that my test case doesn't wait for more data to arrive? Will allowed lateness help?
First, I'll refer you to the two sections in the manual which describe how watermarks work and also talk about the concept of stream skew:
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#unbounded-stream-processing
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#stream-skew
The concept of "current time" in Jet only advances as long as there's events with advancing timestamps. There's typically several factors at play here:
Allowed lateness: This defines your lag per partition, assuming you are using a partitioned source like Kafka. This describes the tolerable degree of out of orderness in terms of timestamps in a single partition. If allowed lateness is 2 sec, the window will only close when you have received an event at N + 2 seconds across all input partitions.
Stream skew: This can happen when for example you have 10 Kafka partitions but only 3 are producing any events. As Jet coalesces watermarks from all partitions, this will cause the stream to wait until the other 7 partitions have some data. There's a timeout after which these partitions are considered idle, but this is by default 60 sec and currently not configurable in the pipeline API. So in this case you won't have any output until these partitions are marked as idle.
When using test data, it's quite common to have very low volume of events and many partitions, which can make it a challenge to advance the time correctly.
The points in Can Gencer's answer are valid. But for a test you can also use a batch source, such as Sources.list. By adding timestamps to a BatchStage you convert it to a StreamStage, on which you can do window aggregation. The aggregate transform will emit any pending windows at the end of the batch.
JetInstance inst = Jet.newJetInstance();

// Pre-populate the IList that serves as the batch source.
IListJet<TimestampedEntry<String, Integer>> list = inst.getList("data");
list.add(new TimestampedEntry<>(1, "a", 1));
list.add(new TimestampedEntry<>(1, "b", 2));
list.add(new TimestampedEntry<>(1, "a", 3));
list.add(new TimestampedEntry<>(1, "b", 4));

Pipeline p = Pipeline.create();
p.drawFrom(Sources.<TimestampedEntry<String, Integer>>list("data"))
 .addTimestamps(TimestampedEntry::getTimestamp, 0) // allowedLag = 0
 .groupingKey(TimestampedEntry::getKey)
 .window(tumbling(1))
 .aggregate(AggregateOperations.summingLong(TimestampedEntry::getValue))
 .drainTo(Sinks.logger());

inst.newJob(p).join();
inst.shutdown();
The above code prints:
TimestampedEntry{ts=01:00:00.002, key='a', value='4'}
TimestampedEntry{ts=01:00:00.002, key='b', value='6'}
Remember to keep the data in the list ordered by time, since we use allowedLag=0.
Answer is valid for Jet 0.6.1.

Optimizing a Kafka Streams Application with Multiple Sub-Topologies

I'm running a Kafka Streams application with three sub-topologies. The stages of activity are roughly as follows:
stream Topic A
selectKey and repartition Topic A to Topic B
stream Topic B
foreach Topic B to Topic C Producer
stream Topic C
Topic C to Topic D
Topics A, B, and C are each materialized, which means that if each topic has 40 partitions, my maximum parallelism is 120.
At first I was running 5 streams applications with 8 threads apiece. With this setup I was experiencing inconsistent performance: it seems like some sub-topologies sharing the same thread were hungrier for CPU than others, and after a while I'd get this error: Member [client_id] in group [consumer_group] has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator). Everything would then get rebalanced, which could lead to decreased performance until the next failure and rebalance.
My questions are as follows:
How is it that multiple sub-topologies are able to be run on a single thread? A poll queue?
How does each thread decide how to allocate compute resources to each of its sub-topologies?
How do you optimize your thread to topic-partition ratio in such cases to avoid periodic consumer failures? e.g., will a 1:1 ratio ensure more consistent performance?
If you use a 1:1 ratio, how do you ensure that every thread gets assigned its own topic-partition and some threads aren't left idle?
The thread will poll() for all topics of its different sub-topologies and check each record's topic metadata to feed it into the correct task.
Each sub-topology is treated the same, i.e., the available resources are distributed evenly across them, if you wish.
A 1:1 ratio is only useful if you have enough cores. I would recommend monitoring your CPU utilization; if it is too high (above 80%), you should add more cores/threads.
Kafka Streams handles this for you automatically.
A couple of general comments (a configuration sketch follows after this list):
you might consider increasing the max.poll.interval.ms config to avoid a consumer dropping out of the group
you might consider decreasing max.poll.records to get fewer records per poll() call, and thus decrease the time between two consecutive calls to poll()
note that a smaller max.poll.records does not increase network/broker communication -- if a single fetch request returns more records than the max.poll.records config allows, the data is simply buffered within the consumer and the next poll() is served from the buffered data, avoiding a broker round trip
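As a rough illustration only (the application id, bootstrap servers and the concrete values are assumptions, not from the question), these consumer settings can be passed through the Streams configuration:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTuning {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);

        // Allow more time between poll() calls before the consumer is considered
        // failed, and fetch fewer records per poll().
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        return props;
    }
}

The consumerPrefix() wrapper makes explicit that these settings are forwarded to the internally created consumers rather than applied to the whole application.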

How to make the consumer know that the Producer has finished sending all the messages to the Broker?

1: We are working on near-real-time processing or batch processing using Spark Streaming. Our current design includes Kafka.
2: Every 15 minutes the Producer will send the messages.
3: We plan to use Spark Streaming to consume messages from Kafka topic.
That's a very broad question:
Basically, there is no such thing as "all messages", because it's stream processing (but I still understand your question).
One way would be to inject a control message as the last message, one that "ends a burst of data" (a sketch of this follows below).
You could also use some "side communication channel" via an RPC, such that the producer sends the consumer the last offset it wrote.
You could use a heuristic -- if poll() returns nothing for 1 minute, you just assume that all the data has been consumed.
And there might be other methods... But it's all hand-coded -- there is no support for this in Kafka (cf. (1.)).
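To make the first option concrete, here is a hedged sketch of the control-message idea with the plain Kafka Java clients (the topic name, sentinel value and single-partition assumption are mine, not from the answer): the producer appends a sentinel record after the burst, and the consumer stops, or kicks off the batch job, once it sees the sentinel.

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ControlMessageDemo {

    static final String END_OF_BURST = "END_OF_BURST"; // assumed sentinel value

    // Producer side: called once the 15-minute burst has been written.
    static void endBurst(KafkaProducer<String, String> producer) {
        producer.send(new ProducerRecord<>("input-topic", "control", END_OF_BURST));
        producer.flush();
    }

    // Consumer side: read until the sentinel shows up, then hand over to the batch job.
    static void consumeBurst(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("input-topic"));
        boolean done = false;
        while (!done) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                if (END_OF_BURST.equals(record.value())) {
                    done = true;                        // burst finished
                } else {
                    System.out.println(record.value()); // placeholder for real processing
                }
            }
        }
    }
}

With more than one partition the producer would have to write one sentinel per partition, and the consumer would have to see it on every partition it owns before declaring the burst finished.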
