Optimizing a Kafka Streams Application with Multiple Sub-Topologies - multithreading

I'm running a Kafka Streams application with three sub-topologies. The stages of activity are roughly as follows:
stream Topic A
selectKey and repartition Topic A to Topic B
stream Topic B
foreach Topic B to Topic C Producer
stream Topic C
Topic C to Topic D
Topics A, B, and C are each materialized, which means that if each topic has 40 partitions, my maximum parallelism is 120.
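For reference, a minimal sketch of that topology in the Streams DSL might look like the following (the topic names, serdes, the re-keying function, and the separate producer used in the foreach stage are assumptions based on the description above, not the actual code):
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class ThreeStageTopology {

    static Topology build(KafkaProducer<String, String> producer) {
        StreamsBuilder builder = new StreamsBuilder();

        // Sub-topology 1: re-key Topic A and write the repartitioned stream to Topic B
        builder.stream("topic-a", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((key, value) -> deriveNewKey(value))
               .to("topic-b", Produced.with(Serdes.String(), Serdes.String()));

        // Sub-topology 2: consume Topic B and forward each record to Topic C via a plain producer
        builder.stream("topic-b", Consumed.with(Serdes.String(), Serdes.String()))
               .foreach((key, value) -> producer.send(new ProducerRecord<>("topic-c", key, value)));

        // Sub-topology 3: consume Topic C and write to Topic D
        builder.stream("topic-c", Consumed.with(Serdes.String(), Serdes.String()))
               .to("topic-d", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }

    // Placeholder for the actual re-keying logic
    private static String deriveNewKey(String value) {
        return value;
    }
}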
At first I was running 5 streams applications with 8 threads apiece. With this setup I was experiencing inconsistent performance. It seems like some sub-topologies sharing the same thread were hungrier for CPU than others, and after a while I'd get this error: Member [client_id] in group [consumer_group] has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator). Everything would then get rebalanced, which could lead to decreased performance until the next failure and rebalance.
My questions are as follows:
How are multiple sub-topologies able to run on a single thread? A poll queue?
How does each thread decide how to allocate compute resources to each of its sub-topologies?
How do you optimize your thread-to-topic-partition ratio in such cases to avoid periodic consumer failures? e.g., will a 1:1 ratio ensure more consistent performance?
If you use a 1:1 ratio, how do you ensure that every thread gets assigned its own topic-partition and some threads aren't left idle?

The thread will poll() for the topics of all sub-topologies and check each record's topic metadata to feed it into the correct task.
Each sub-topology is treated the same, i.e., available resources are evenly distributed, if you wish.
A 1:1 ratio is only useful if you have enough cores. I would recommend monitoring your CPU utilization. If it's too high (above roughly 80%), you should add more cores/threads.
Kafka Streams handles this for you automatically.
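If you want to verify the assignment, newer Kafka Streams versions (2.8+) expose per-thread task metadata; the following is a small sketch, assuming a running KafkaStreams instance is passed in:
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.ThreadMetadata;

public class ThreadAssignmentLogger {
    // Prints which tasks (and hence topic-partitions) each stream thread currently owns.
    static void logAssignment(KafkaStreams streams) {
        for (ThreadMetadata thread : streams.metadataForLocalThreads()) {
            System.out.println(thread.threadName() + " -> " + thread.activeTasks().size() + " active tasks");
            thread.activeTasks().forEach(task ->
                    System.out.println("  " + task.taskId() + " partitions: " + task.topicPartitions()));
        }
    }
}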
Couple of general comments:
you might consider increasing the max.poll.interval.ms config to avoid a consumer dropping out of the group
you might consider decreasing max.poll.records to get fewer records per poll() call, and thus decrease the time between two consecutive calls to poll() (see the sketch after these comments)
note that max.poll.records does not imply increased network/broker communication -- if a single fetch request returns more records than the max.poll.records config, the data is just buffered within the consumer and the next poll() will be served from the buffered data, avoiding a broker round trip
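As a minimal sketch of the two suggestions above (the values are placeholders, not tuning recommendations), these settings can be passed through the StreamsConfig properties, using the consumer prefix for consumer-level configs:
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    static Properties buildProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);

        // Allow more time between poll() calls before the consumer is kicked out of the group...
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        // ...and/or hand fewer records to the tasks per poll() so each processing loop finishes sooner.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        return props;
    }
}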

Related

How does Hazelcast Jet assign task-to-CPU priority?

If I have the following code and let's say I'm running on 10 nodes of 32 cores each:
IList<...> ds = ....; //large collection, eg 1e6 elements
ds
.map() //expensive computation
.flatMap()//generates 10,000x more elements for every 1 incoming element
.rebalance()
.map() //expensive computation
....//other transformations (ie can be a sink, keyby, flatmap, map etc)
What will Hazelcast do with respect to task-to-CPU assignment priority when the SECOND map operation wants to process the 10,000 elements that were generated from the 1st original element? Will it devote the 320 CPU cores (from 10 nodes) to processing the 1st original element's 10,000 generated elements? If so, will it "boot off" already running tasks? Or will it wait for already running tasks to complete, and then give priority to the 10,000 elements resulting from the output of the flatmap-rebalance operations? Or would the 10,000 elements be forced to run on a single core, since the remaining 319 cores are already being consumed by the output of the ds operation (i.e. the input of the 1st map)? Or is there some random competition for who gets access to the CPU cores?
What I would ideally like to happen is that Hazelcast does NOT boot off running tasks (it lets them complete), but when deciding which task gets priority to run on a core, it chooses the path that would lead to the lowest latency, i.e. it would process all 10,000 elements resulting from the output of the flatmap-rebalance operation on all 320 cores.
Note: I asked a virtually identical question about Flink a few weeks ago, but have since switched to trying out Hazelcast: How does Flink (in streaming mode) assign task-to-CPU priority?
First, IList is a non-distributed data structure; all its data are stored on a single node. The IList source therefore produces all data on that node. So the 1st expensive map will be done entirely on that member, but map is backed, by default, by as many workers as there are cores, so 32 workers in your case.
The rebalance stage will cause the 2nd map to run on all members. Each of the 10,000 elements produced by the flatMap is handled separately, so even if you have just 1 element in your IList, the 10k elements produced from it will be processed concurrently by 320 workers.
The workers backing different stages of the pipeline compete for cores normally. There will be a total of 96 workers for the 1st map, the flatMap, and the 2nd map together. Jet uses cooperative scheduling for these workers, which means it cannot preempt a computation that is taking too long. This means that one item taking a long time to process will block other workers.
Also keep in mind that the map and flatMap functions must be cooperative, which means they must not block (by waiting on I/O, sleeping, or waiting on monitors). If they block, you'll see less than 100% CPU utilization. Check out the documentation for more information.
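For reference, a rough, self-contained sketch of the pipeline above with the Jet Pipeline API (API names as in recent Hazelcast releases; the IList name, the transform bodies, and the 10,000-fold fan-out are placeholders mirroring the question) might look like this:
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import java.util.stream.IntStream;

public class RebalanceSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<Integer>list("ds"))              // IList source: all items read on the owning member
         .map(RebalanceSketch::expensive)                    // 1st expensive map, 32 cooperative workers on that member
         .flatMap(x -> Traversers.traverseStream(
                 IntStream.range(0, 10_000).boxed()))        // fan-out: 10,000 items per input item
         .rebalance()                                        // redistribute items across all members
         .map(RebalanceSketch::expensive)                    // 2nd expensive map, now runs cluster-wide
         .writeTo(Sinks.logger());

        hz.getJet().newJob(p).join();
    }

    // Placeholder for a CPU-bound, non-blocking (cooperative) transform
    private static int expensive(int x) {
        return x * x;
    }
}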

Azure Function with Event Hub trigger receives weird amount of events

I have an Event Hub and an Azure Function connected to it. With small amounts of data all works well, but when I tested it with 10 000 events, I got very peculiar results.
For test purposes I send the numbers 0 to 9999 into the Event Hub and log the data in Application Insights and in Service Bus. For the first test I can see in Azure that the hub received exactly 10 000 events, but Service Bus and AI got all messages between 0 and 4500, and only every second message after 4500 (so about 30% was lost). In the second test, I got all messages from 0 to 9999, but every second message between 3200 and 3500 was duplicated. I would like to get every message exactly once; what did I do wrong?
public async Task Run([EventHubTrigger("%EventHubName%", Connection = "AzureEventHubConnectionString")] EventData[] events, ILogger log)
{
    int id = _random.Next(1, 100000);
    _context.Log.TraceInfo("Started. Count: " + events.Length + ". " + id); //AI log
    foreach (var message in events)
    {
        //log with ASB
        var mess = new Message();
        mess.Body = message.EventBody.ToArray();
        await queueClient.SendAsync(mess);
    }
    _context.Log.TraceInfo("Completed. " + id); //AI log
}
By using EventData[] events, you are reading events from the hub in batch mode; that's why you see X events being processed at a time and the next batch processed shortly afterwards.
Instead of EventData[], simply use EventData.
When you send events to the hub, check that all events are sent with the same partition key if you want to try batch processing; otherwise they can be split across several partitions depending on TUs (throughput units), PUs (processing units), and CUs (capacity units). (A small sending sketch follows below.)
Throughput limits for the Basic, Standard, and Premium tiers, per TU: Egress: up to 2 MB per second or 4096 events per second. Refer to this document.
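As an illustration of sending with a fixed partition key (shown here with the Java Event Hubs client rather than the .NET client used in the question; the connection string, hub name, and key are placeholders):
import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import com.azure.messaging.eventhubs.models.CreateBatchOptions;

public class PartitionKeySender {
    public static void main(String[] args) {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<connection-string>", "<event-hub-name>")
                .buildProducerClient();

        // Events sharing a partition key always hash to the same partition.
        CreateBatchOptions options = new CreateBatchOptions().setPartitionKey("test-run-1");
        EventDataBatch batch = producer.createBatch(options);
        for (int i = 0; i < 10_000; i++) {
            EventData event = new EventData(Integer.toString(i));
            if (!batch.tryAdd(event)) {
                producer.send(batch);                  // batch full: send it and start a new one
                batch = producer.createBatch(options);
                batch.tryAdd(event);
            }
        }
        producer.send(batch);
        producer.close();
    }
}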
There are a couple of things likely happening, though I can only speculate with the limited context that we have. Knowing more about the testing methodology, tier of your Event Hubs namespace, and the number of partitions in your Event Hub would help.
The first thing to be aware of is that the timing between when an event is published and when it is available in a partition to be read is non-deterministic. When a publish operation completes, the Event Hubs broker has acknowledged receipt of the events and taken responsibility for ensuring they are persisted to multiple replicas and made available in a specific partition. However, it is not a guarantee that the event can immediately be read.
Depending on how you sent the events, the broker may also need to route events from a gateway by performing a round-robin or applying a hash algorithm. If you're looking to optimize the time from publish to availability, taking ownership of partition distribution and publishing directly to a partition can help, as can ensuring that you're publishing with the right degree of concurrency for your host environment and scenario.
With respect to duplication, it's important to be aware that Event Hubs offers an "at least once" guarantee; your consuming application should expect some duplicates and needs to be able to handle them in the way that is appropriate for your application scenario.
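Because duplicates are expected under at-least-once delivery, consumers typically de-duplicate on some unique identifier that the publisher stamps on each event. A minimal in-memory sketch (the ID scheme and the unbounded set are simplifying assumptions, not a production pattern):
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-memory duplicate filter. The unique event ID is assumed to be stamped on each
// event by the publisher; a real consumer would bound and/or persist this state.
public class SeenEventFilter {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true the first time an event ID is observed, false for duplicates.
    public boolean firstTimeSeen(String eventId) {
        return seen.add(eventId);
    }
}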
Azure Functions uses a set of event processors in its infrastructure to read events. The processors collaborate with one another to share work and distribute the responsibility for partitions between them. Because collaboration takes place using storage as an intermediary to synchronize, there is an overlap of partition ownership when instances are scaled up or scaled down, during which time the potential for duplication is increased.
Functions makes the decision to scale based on the number of events that it sees waiting in partitions to be read. In the case of your test, if your publication pattern increases rapidly and Functions sees "the event backlog" grow to the point that it feels the need to scale by multiple instances, you'll see more duplication than you otherwise would for a period of 10-30 seconds until partition ownership normalizes. To mitigate this, using an approach of gradually increasing speed of publishing over a 1-2 minute period can help to smooth out the scaling and reduce (but not eliminate) duplication.

How to define a good partition plan to ensure CPU balance in JSR 352?

JSR 352 (Batch Applications for the Java Platform) provides a parallelism feature using partitions. The batch runtime can execute a step in different partitions in order to accelerate progress. JSR 352 also introduces a threads attribute: we can define the number of threads to use, for example:
<step id="Step1">
    <chunk .../>
    <partition>
        <plan partitions="3" threads="2"/>
    </partition>
</step>
This leaves me confused: how do I define an appropriate partition plan so that each thread is occupied and the CPU load stays balanced?
For example, there are tables A, B, and C to process, with 1 billion, 1 million, and 1 thousand rows respectively. The step processes these entities into documents, one entity per document. The order of document production is not important. The CPU time per entity for these tables is 1 s, 2 s, and 5 s respectively. The number of threads is 4.
If there are 3 partitions, one per table, then the step will take 1 * 10^9 seconds to finish, because:
Partition A will take 1 * 10^9 * 1s = 1 * 10^9s, run on thread 2
Partition B will take 1 * 10^6 * 2s = 2 * 10^6s, run on thread 3
Partition C will take 1 * 10^3 * 5s = 5 * 10^3s, run on thread 4
However, while thread 2 stays occupied, thread 3 is idle after 2 * 10^6 s and thread 4 is idle after 5 * 10^3 s. So obviously, this is not a good partition plan.
My questions are:
Is there a better partition plan for the above example?
Can I think of the partitions as a queue that the threads consume?
In general, how many threads can/should I use? Is it the same as the number of CPU cores?
In general, how do I define an appropriate partition plan so that each thread is occupied and the CPU load stays balanced?
Answers...
Is there a better partition plan for the above example?
Yes, there is. See answer 4...
Can I think of the partitions as a queue that the threads consume?
That is exactly what happens!
In general, how many threads can/should I use? Is it the same as the number of CPU cores?
It depends. This question has many perspectives... From the JSR 352 specification view, "threads":
Specifies the maximum number of threads on which to execute the partitions of this step. Note the batch runtime cannot guarantee the requested number of threads are available; it will use as many as it can up to the requested maximum. This is an optional attribute. The default is the number of partitions.
So, based only on this perspective, you can set this value as high as you want (the batch runtime will set the real limit, according to its resources!).
From the batch runtime perspective (the JSR 352 implementation): any decent implementation will use a thread pool to execute the partitioned steps. So, if that pool has a fixed size of N, no matter how high you set your threads value, you will never execute more than N partitions concurrently.
JBeret is an implementation of the JSR 352 specification, used by the WildFly server (it is the implementation I've used). In WildFly, it has a default thread pool setting of at most 10 threads. This pool is not only shared between partitioned steps, it is also shared between batch jobs. So, if you're running 2 jobs at the same time, you will have 2 fewer threads available. In addition, when you partition, one thread takes the role of coordinator, assigning partitions to the other threads and waiting for results... so if your partition plan says it uses 2 threads, it will in fact use 3 (two as workers, one as coordinator)... and all these resources (threads) are taken from the same pool!
Anyway, the important point is: investigate which implementation of JSR 352 you are using and configure it accordingly.
From the hardware view, your CPU has a maximum thread count. Under this perspective (and as a rule of thumb), set the threads value equal to that limit.
From the performance view, analyze the work you are doing. If you're accessing a shared resource (like a DB) from many threads, you can produce a bottleneck that causes thread blocking. If you face that kind of problem, consider lowering the threads value.
In summary, set the threads value as high as the CPU's maximum thread limit. Then check whether that value causes blocking issues; if it does, reduce it. Also, verify that the batch runtime is configured accordingly and allows you to execute as many threads as you desire.
In general, how do I define an appropriate partition plan so that each thread is occupied and the CPU load stays balanced?
Avoid the use of static partition plans (at least for your case). Instead, use a partition mapper. A partition mapper is a class that implements the javax.batch.api.partition.PartitionMapper interface and allows you to define a partition plan (how many partitions, how many threads, the properties of each partition) programmatically. So for your case, take your tables (A, B, C) and split them into blocks of N (where N = 1000) ... each block will be a partition. You should start with the partitions of type C and round-robin between your entity partitions (tables): C0, B0, A0, B1, A1, ..., B999, A999, A1000, ..., A999999 ... Using this scheme, table C will finish first, leaving one thread open to handle more A and B partitions. Later, B will finish, leaving more resources to attack the remaining A partitions.
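A rough sketch of such a mapper follows (the table sizes mirror the question; the block size, property names, and round-robin ordering follow the scheme described above, and in practice you would likely use much larger blocks for table A):
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;
import javax.inject.Named;

@Named
public class TableBlockPartitionMapper implements PartitionMapper {

    private static final int BLOCK_SIZE = 1000;

    @Override
    public PartitionPlan mapPartitions() throws Exception {
        List<Properties> c = blocks("C", 1_000L);
        List<Properties> b = blocks("B", 1_000_000L);
        List<Properties> a = blocks("A", 1_000_000_000L);

        // Round-robin the tables (C0, B0, A0, B1, A1, ...) so the small tables drain first
        // and their threads become free for the remaining A partitions.
        List<Properties> ordered = new ArrayList<>();
        int longest = Math.max(a.size(), Math.max(b.size(), c.size()));
        for (int i = 0; i < longest; i++) {
            if (i < c.size()) ordered.add(c.get(i));
            if (i < b.size()) ordered.add(b.get(i));
            if (i < a.size()) ordered.add(a.get(i));
        }

        PartitionPlanImpl plan = new PartitionPlanImpl();
        plan.setPartitions(ordered.size());
        plan.setThreads(4); // worker threads; the runtime may cap this
        plan.setPartitionProperties(ordered.toArray(new Properties[0]));
        return plan;
    }

    // One partition (a Properties object) per block of BLOCK_SIZE rows of the given table.
    private static List<Properties> blocks(String table, long rowCount) {
        List<Properties> result = new ArrayList<>();
        for (long start = 0; start < rowCount; start += BLOCK_SIZE) {
            Properties p = new Properties();
            p.setProperty("table", table);
            p.setProperty("firstRow", String.valueOf(start));
            p.setProperty("lastRow", String.valueOf(Math.min(start + BLOCK_SIZE, rowCount) - 1));
            result.add(p);
        }
        return result;
    }
}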
Hope this helps...

Should the event hub have the same number of partitions as throughput units?

For Azure Event Hubs, 1 throughput unit equals 1 MB/sec ingress, so it can take 1000 messages of 1 KB. If I select 5 or more throughput units, would I be able to ingest 5000 messages/second of 1 KB size with 4 partitions? What would the egress be in that case? I am not sure about the limit on an Event Hub partition; I read that it is also 1 MB/sec. But does that mean that to use an event hub effectively I need the same number of partitions as throughput units?
Great question.
1 Throughput Unit (TU) means an ingress limit of 1 MB/sec or 1000 msgs/sec - whichever happens first. You pay for TUs and you can change TUs as per your load requirements. This is your knob to control the bill. And TUs are set on a given Event Hubs Namespace!
When you buy 1 TU for an Event Hubs namespace and create a number of EventHubs in it, the limit of 1 MB/sec or 1000 msgs/sec applies cumulatively across them. The limit also applies to each partition individually, although sometimes you might get lucky in some regions where the load is low.
Consider these principles when deciding on the number of partitions in an Event Hub for your service:
The intent of partitions is to offer high availability. If you are sending to Event Hubs and you want the sends to succeed NO MATTER WHAT, you should create multiple partitions and send using EventHubClient.Send (which doesn't confine the send to a particular partition).
The number of partitions determines how fat the event pipe is and how fast/parallel you can receive and process the events. If you have 10 partitions on your EventHub, its capacity is effectively capped at 10 TUs. You can create 10 epoch receivers in parallel and consume and process events. If you envision that the EventHub you are creating now could quickly grow 10-fold, create that many partitions and keep the TUs matching the current load. The analogy here is having multiple lanes on a freeway!
Another thing to note: a TU is configured at the namespace level, and one Event Hubs namespace can have multiple EventHubs in it, each with a different number of partitions.
Answers:
If you select 5 or more TUs on the namespace and have only 1 EventHub with 4 partitions, you will get a max of 4 MB/sec or 4K msgs/sec.
Max egress will be 2X the ingress (8 MB/sec or 8K msgs/sec). In other words, you could create 2 patterns of receives (e.g. slow and fast) by creating 2 consumer groups. If you need more than 2X parallel receives, then you will need to buy more TUs.
Yes, ideally you will need more partitions than TUs. First model your partition count as mentioned above. Start with 1 TU while you are developing your solution. Once done, when you are doing load testing or going live, increase TUs in tune with your load. Remember, you could have multiple EventHubs in a Namespace. So, having 20 TUs at Namespace level and 10 EventHubs with 4 partitions each can deliver 20 MB/sec across the Namespace.
More on EventHubs
One partition maps to one TU. Think of a TU as a processing engine. You can't take advantage of more TUs than you have partitions. If you have 4 partitions, you can't use more than 4 TUs.
It's typical to have more partitions than TUs, for the following reasons:
You can scale the number of TUs up if you have a lot of traffic, but you can't change the number of partitions.
You can't have more concurrent readers than you have partitions. If you want to have 5 concurrent readers, you need 5 partitions.
As for throughput, the limits are 1 MB/sec ingress and 2 MB/sec egress per TU. This covers the typical scenario where each event is sent both to cold storage (e.g. a database) and to Stream Analytics or an event processor for analysis, monitoring, etc.

Apache Spark Streaming: Why is the amount of Events per Second increasing?

I found that the number of events/sec in my streaming application suddenly increases (see screenshot). I made sure that the sender doesn't send more data at these moments (it always sends 800-900 messages per TCP). I guess it has to do with the other timings below (processing / delay), but why does it show up in the events-count graph? Can someone explain what exactly happens there? Thanks!
The number of events is directly related to the producer. This means your producer is increasing its send rate over time. At the moment of the screenshot, your mean received message throughput was 953 msgs/s, which is higher than the expected ~850, taking the 800-900 range mentioned in the question as roughly normally distributed.
Check the producer to find the specific reason for this increase.
