I have been working with kafka-node lately and have a few questions.
Here is the scenario: I have, say, 10 topics, each receiving a high volume of data; each message is around 300KB, at 5 messages a second.
I want to create one or more high-level consumers to read the data efficiently.
I tried creating one high-level consumer on all 10 topics with a single group id. It worked fine with small volumes, but it behaves strangely when the volume increases.
So, I am planning to do one of the following:
1. Create 10 consumers, one for each topic, each with a different group id.
2. Create 10 consumers, one for each topic, all with the same group id.
I would like to understand the significance of the group id. How would the behavior differ in the two cases above?
Also, is there a maximum size that Kafka can handle?
We are exploring Kafka for coordination across multiple tasks in a Spark job. Each Spark task acts as both a producer AND consumer of messages on the SAME topic. So far we are seeing decent performance, but I am wondering if there is a way to improve it, considering that we are getting the best performance by doing things CONTRARY to what the docs suggest. At the moment we use only a single Broker machine with multiple CPUs, but we can use more if needed.
So far we have tried the following setups:
1. Single topic, single partition, multiple consumers, NOT using Group ID: BEST PERFORMANCE
2. Single topic, single partition, multiple consumers each using its own Group ID: 2x slower than (1)
3. Single topic, single partition, multiple consumers, all using the same Group ID: stuck or dead slow
4. Single topic, as many partitions as consumers, single Group ID: stuck or dead slow
5. Single topic, as many partitions as consumers, each using its own Group ID or no Group ID: works, but a lot slower than (1) or (2)
I don't understand why we are getting best performance by doing things against what the docs suggest.
My questions are:
There's a lot written out there about the benefits of having multiple partitions, even on a single broker, but clearly here we are seeing performance deterioration.
Apart from resilience considerations, what's the benefit of adding additional Brokers? We see that our single Broker CPU utilization never goes above 50% even in times of stress. And it's easier to simply increase the CPU count on a single VM rather than manage multiple VMs. Is there any merit in getting more Brokers? (for speed considerations, not resilience)
If the above is YES, then clearly we can't have a broker per each consumer. Right now we are running 30-60 Spark tasks, but it can go up to hundreds. So almost inevitably we will be in a situation that each Broker is responsible for tens of partitions, if each task were to have a partition. So based on the above tests, we are still going to see worse performance?
Note that we are setting up the producer to not wait for acknowledgment from the Brokers, as we'd seen in the docs that with many partitions that can slow things down:
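from kafka import KafkaProducer  # assuming the kafka-python client
producer = KafkaProducer(bootstrap_servers=[SERVER], acks=0)  # acks=0: do not wait for broker acknowledgment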
Thanks for your thoughts.
I think you are missing an important concept: within a consumer group, Kafka allows only one consumer per topic partition, while multiple consumer groups may read from the same partition. It seems that you have a problem with committing offsets or with too many group rebalances.
Here are my thoughts:
Single topic, single partition, multiple consumers, NOT using Group ID: BEST PERFORMANCE
What actually happens here is that one of your consumers is idle.
Single topic, single partition, multiple consumers each using its own Group ID: 2x slower than (1)
Both consumers are fetching and processing the same messages independently.
Single topic, single partition, multiple consumers, all using the same Group ID: stuck or dead slow
Only one member of a group can read from a single partition. This should not give results different from the first case.
Single topic, as many partitions as consumers, single Group ID: stuck or dead slow
This is the situation where each consumer is assigned a different partition, and it is the case where we would expect the fastest consumption.
Single topic, as many partitions as consumers, each using its own Group ID or no Group ID: works, but a lot slower than (1) or (2)
The same remarks as for the first and second cases apply.
There's a lot written out there about the benefits of having multiple partitions, even on a single broker, but clearly here we are seeing performance deterioration.
Indeed, by having multiple partitions, we can parallelize the consumers. If the consumers have the same group id, then they will consume from different partitions. Otherwise, each consumer will consume from all partitions.
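To make the group id behaviour concrete, here is a minimal sketch in plain Python. It does not talk to Kafka at all; it only models who gets which partition. The round-robin assignment and the names `assign_partitions`, `simulate`, `c1` and `c2` are my own assumptions for illustration, and a real client may use a different assignment strategy.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions round-robin over the consumers of ONE group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def simulate(partitions, groups):
    """groups maps a group id to the list of consumer names in that group."""
    result = {}
    for members in groups.values():
        # Each group gets its own, independent view of all partitions.
        result.update(assign_partitions(partitions, members))
    return result


partitions = [0, 1, 2, 3]
same_group = simulate(partitions, {"g": ["c1", "c2"]})
own_groups = simulate(partitions, {"g1": ["c1"], "g2": ["c2"]})
single_partition = simulate([0], {"g": ["c1", "c2"]})

print(same_group)        # c1 and c2 split the four partitions
print(own_groups)        # c1 and c2 each read all four partitions
print(single_partition)  # only one member of the group gets the partition
```

With a single partition and a shared group id, the second consumer ends up with nothing to read, which is why adding consumers beyond the partition count does not help.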
Apart from resilience considerations, what's the benefit of adding additional Brokers? We see that our single Broker CPU utilization never goes above 50% even in times of stress. And it's easier to simply increase the CPU count on a single VM rather than manage multiple VMs. Is there any merit in getting more Brokers? (for speed considerations, not resilience)
If the above is YES, then clearly we can't have a broker per each consumer. Right now we are running 30-60 Spark tasks, but it can go up to hundreds. So almost inevitably we will be in a situation that each Broker is responsible for tens of partitions, if each task were to have a partition. So based on the above tests, we are still going to see worse performance?
When a new topic is created, one of the brokers in the cluster is selected as the leader for each of its partitions, and all read/write operations for that partition are handled there. So, when you have many topics, the workload is automatically distributed across the brokers. If you have a single broker with many topics, all producers and consumers will be producing to and consuming from that same broker.
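As a rough illustration of that distribution, the sketch below hands out partition leaders to brokers in turn and counts how many each broker ends up leading. This is a simplification and an assumption on my part: real leader placement is decided by the cluster and also depends on replication, and the names `assign_leaders`, `load_per_broker` and `broker-N` are hypothetical.

```python
from collections import Counter


def assign_leaders(topics, partitions_per_topic, brokers):
    """Give each (topic, partition) a leader, cycling through the brokers."""
    leaders = {}
    slot = 0
    for topic in topics:
        for partition in range(partitions_per_topic):
            leaders[(topic, partition)] = brokers[slot % len(brokers)]
            slot += 1
    return leaders


def load_per_broker(leaders):
    """Count how many partitions each broker leads."""
    return dict(Counter(leaders.values()))


topics = [f"topic-{i}" for i in range(10)]
single = load_per_broker(assign_leaders(topics, 3, ["broker-0"]))
three = load_per_broker(
    assign_leaders(topics, 3, ["broker-0", "broker-1", "broker-2"])
)

print(single)  # one broker leads every partition
print(three)   # leadership is spread over the three brokers
```

With one broker, all 30 partitions are served from the same machine; with three, each broker serves a third of them.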
I have a Spark application, written in Java, which is required to read from two different topics using one consumer.
The Kafka message key and value schema is the same for both topics.
Below is the workflow:
1. Read messages from both topics, with the same group id, using JavaInputDStream<ConsumerRecord<String, String>> and iterate using foreachRDD.
2. Inside the loop, read the offsets, filter messages based on the message key and create a JavaRDD<String>.
3. Iterate over the JavaRDD<String> using mapPartitions.
4. Inside the mapPartitions loop, iterate over the rows using forEachRemaining.
5. Perform data enrichment, transformation, etc. on the rows inside the forEachRemaining loop.
6. Commit.
I would like to understand the questions below. Please provide your answers or share any documentation that can help me find them.
1. How are messages received/consumed from two topics (one common group id, same key/value schema) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, OR is it going to read 50 messages first, process them, and then read 1000 messages and process them?
3. What parameter should I use to control the number of messages read in one batch per second?
4. Will using the same group id create any issue while consuming?
The official Spark Streaming documentation already explains how to consume multiple topics with one group id.
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
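import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.*;

// streamingContext and kafkaParams are assumed to be defined elsewhere.
Collection<String> topics = Arrays.asList("topicA", "topicB");
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );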
This uses one group id and the same schema for both topics.
I am not sure about this; however, from my understanding it would consume all the messages, depending on the batch size.
Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value; based on these, Spark limits the number of messages consumed from Kafka per batch. Also set the batch duration accordingly. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
This depends entirely on how your application is used.
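As a rough worked example of how the rate limit and the batch duration interact: maxRatePerPartition is a per-partition records-per-second limit, so the most a batch can pull from a topic is roughly rate x partitions x batch seconds. The sketch below is plain arithmetic, not Spark code. The partition counts and the rate of 200 are hypothetical values I picked, and it assumes the backlog is spread evenly over a topic's partitions.

```python
import math


def max_records_per_batch(max_rate_per_partition, partitions, batch_seconds):
    """Upper bound on the records one batch can pull from a topic."""
    return max_rate_per_partition * partitions * batch_seconds


def batches_needed(backlog, per_batch_cap):
    """Number of batches needed to drain a backlog at the given cap."""
    return math.ceil(backlog / per_batch_cap)


# Hypothetical settings, not taken from the question.
MAX_RATE = 200          # spark.streaming.kafka.maxRatePerPartition
BATCH_SECONDS = 1       # batch duration
PARTITIONS = {"Topic1": 1, "Topic2": 4}
BACKLOG = {"Topic1": 50, "Topic2": 1000}

caps = {
    topic: max_records_per_batch(MAX_RATE, count, BATCH_SECONDS)
    for topic, count in PARTITIONS.items()
}
batches = {topic: batches_needed(BACKLOG[topic], caps[topic]) for topic in PARTITIONS}

print(caps)     # per-batch ceiling for each topic
print(batches)  # batches needed to drain each topic's backlog
```

Under these assumed numbers, the 50 messages fit in one batch, while the 1000 messages are capped at 800 per batch and spill into a second one.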
1. How are messages received/consumed from two topics (one common group id, same key/value schema) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there is no constraint here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, OR is it going to read 50 messages first, process them, and then read 1000 messages and process them?
3. What parameter should I use to control the number of messages read in one batch per second?
Answer to both questions 2 and 3:
It will receive all the messages together (1050), or even more, depending on your configuration.
To allow the consumer to receive batches of 1050 or more, raise max.poll.records (default 500) to 1050 or higher. Other settings may also become a bottleneck, but the defaults should be fine for the rest.
4. Will using the same group id create any issue while consuming?
The same group id only affects you if you create more than one consumer: the consumers in the group will then split the partitions of the topics among themselves.
Moreover, if your consumer dies or stops for some reason, you have to bring it back up with the same group id. That way the consumer "remembers" the last offset consumed and continues from the point where it stopped.
If you have any more problems with your consumer, I suggest reading this article; it is chapter 4 of Kafka: The Definitive Guide, which explains consumers in depth and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.
All of our 30 topics are created with 10 partitions in our Kafka cluster. We are monitoring the lag by partition for all the topics/group ids.
We are using a Fluentd plugin to read and route logs from Kafka. The plugin is implemented using a high-level consumer. For the plugin, we have configured some consumers for individual topics and some for multiple topics. Overall, the data is flowing through with no problem except for 3 of the topics.
The problem is that for 3 out of the 30 topics being processed, the partition lag values are inconsistent, i.e. looking at the lag values for a specific topic/group id, the lag for some partitions is much higher than for other partitions, sometimes by as much as 30k. For the other 27 topics, however, the lag numbers for all partitions stay uniform: all partitions of one topic/group id stay within a close range of each other (e.g. all between 12 and 18).
Almost every time we restart the Fluentd agent (which restarts the high-level consumers), the lag starts to smooth out for those 3 topics. Sometimes it stays consistent for a little while, and then the lag numbers start to zigzag again. This only happens for the 3 topics. But when we check the distribution for those 3 topics, everything looks normal.
We are at a loss as to the reason for this. High-level consumer code does not manage the retrieval of data from the partitions; it's the Kafka library that handles that part. All that the consumer code specifies is the number of threads. We have tried 10, 5 and in all cases (especially with 10 and 5 threads) the lag inconsistency keeps showing up for these 3 topics. The data volume is less than 30k per hour for each of these topics.
Any suggestion as to what could be the reason? What can be done about it?
Really appreciate your help in advance.
Based on the details provided, I would start by looking at the points below, though I expect you have already looked at them.
Compare the trend of messages being produced for the 3 topics versus the rest of the topics. Also check the sizes of the messages being published to these topics versus the other topics.
Move the 3 problematic topics to another Fluentd instance to verify the lag behaviour.
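While comparing topics, it may help to reduce the per-partition lag to a single "spread" number per topic, so that the 3 odd topics stand out from the other 27. The sketch below assumes lag is computed as the log end offset minus the committed offset. The offsets, topic names and the threshold of 1000 are made up for illustration, and `partition_lags`, `lag_spread` and `skewed_topics` are hypothetical helpers, not part of any Kafka or Fluentd API.

```python
def partition_lags(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus committed offset."""
    return {p: end_offsets[p] - committed_offsets[p] for p in end_offsets}


def lag_spread(lags):
    """Difference between the most and least lagging partition."""
    return max(lags.values()) - min(lags.values())


def skewed_topics(lags_by_topic, threshold):
    """Topics whose partitions differ in lag by more than the threshold."""
    return sorted(
        topic for topic, lags in lags_by_topic.items()
        if lag_spread(lags) > threshold
    )


# Made-up offsets for two topics with three partitions each.
end_offsets = {
    "healthy": {0: 1015, 1: 1012, 2: 1018},
    "suspect": {0: 40000, 1: 1014, 2: 1016},
}
committed = {
    "healthy": {0: 1000, 1: 1000, 2: 1000},
    "suspect": {0: 10000, 1: 1000, 2: 1000},
}

lags_by_topic = {
    topic: partition_lags(end_offsets[topic], committed[topic])
    for topic in end_offsets
}
flagged = skewed_topics(lags_by_topic, threshold=1000)
print(flagged)
```

Tracking this spread over time, next to the produce rate and message size per partition, should show whether the skew follows the producers or the consumers.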
Please do let me know in case you find anything more or solve the issue with some fine-tuning.
I have just started with Apache Kafka and am trying to figure out how to design my system to use it properly.
I'm building a system that processes data, where each chunk of data is a task (an object) that needs to be processed. The object knows how it should be processed, so that's not a problem.
My system is split into 3 main components: a Publisher (code that spawns tasks), a transport (Kafka), and a set of Consumers, which are workers that pull data from the queue and process it. It's important to note that a Consumer can itself be a publisher if its task needs a 2-step computation (the Consumer creates tasks and sends them back to the transport).
So we could start with the idea that I have 3 servers: 1 single root publisher (with the Kafka server also running there) and 2 consumer servers that actually handle the tasks. The data workflow is as follows: the Publisher creates a task and puts it on the transport, then one of the consumers takes the task from the queue and handles it. It would be nice if each consumer handled the same amount of tasks as the others (so the workload is spread equally between consumers).
Which Kafka configuration pattern do I need for this case? Does Kafka have message balancing features, or do I need to create 2 partitions, with each consumer bound to a single partition and able to consume data only from that partition?
In Kafka, the number of partitions roughly translates to the parallelism of the system.
A general tip is to create more partitions per topic (e.g. 10) and, when creating the consumer, specify a number of consumer threads corresponding to the number of partitions.
In the high-level consumer API, when creating the consumer you can provide the number of streams (threads) to create per topic. Assuming that you create 10 partitions and run the consumer process on a single machine, you can set topicCount to 10. If you run the consumer process on 2 servers, you could set topicCount to 5.
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
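The stream-count arithmetic above (10 partitions, one or two consumer processes) can be written down as a small helper. This is only a sketch of the sizing rule; `streams_per_process` and `idle_streams` are names I made up and are not part of the consumer API.

```python
import math


def streams_per_process(partitions, processes):
    """Streams each process needs so that every partition has a reader."""
    return math.ceil(partitions / processes)


def idle_streams(partitions, processes, streams_each):
    """Streams that would have no partition to read from."""
    return max(0, processes * streams_each - partitions)


print(streams_per_process(10, 1))  # single machine
print(streams_per_process(10, 2))  # two servers
print(idle_streams(10, 2, 10))     # over-provisioned: some streams sit idle
```

Asking for more streams in total than there are partitions does not add parallelism; the extra streams simply have nothing to consume.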
You can also dynamically increase the number of partitions using the kafka-add-partitions.sh command under kafka/bin. After increasing the partitions, you can restart the consumer process with an increased topicCount.
Also, when producing, you should use the KeyedMessage class with some random key taken from your message object, so that the messages are evenly distributed across the different partitions.
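To see why a varied key spreads messages out, here is a minimal sketch of hash-based partitioning in plain Python. The use of `zlib.crc32` is a stand-in I chose because it is deterministic; it is not the hash Kafka's own partitioner uses, and the `task-N` keys are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition: hash the key, then take it modulo the count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 10
keys = [f"task-{i}" for i in range(10_000)]
counts = Counter(partition_for(key, NUM_PARTITIONS) for key in keys)

# Print how many of the 10,000 keys landed on each partition.
for partition in range(NUM_PARTITIONS):
    print(partition, counts[partition])
```

The same key always lands on the same partition, so if many messages share one key they all pile up on a single partition; that is the reason for choosing a key that varies from message to message.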
As I understand it, Event Hubs can ingest millions of messages per second, and ingestion can be tuned using throughput units.
More throughput units = more ingestion capacity.
But on the receiving/consuming side, you can create up to 32 receivers (since we can create 32 partitions, and one partition can be consumed by one receiver).
Based on the above, if a single message takes 100 milliseconds to process, one consumer can process 10 messages per second, and 32 consumers can process 32*10 = 320 messages per second.
How can I make my receivers consume more messages (e.g. 5-10k per second)?
1) Either I have to process messages asynchronously inside ProcessEventsAsync, but in that case I would not be able to maintain ordering.
2) Or I have to ask Microsoft to allow me to create more partitions.
Please advise.
TL;DR: You will need to ask Microsoft to increase the number of partitions you are allowed. Remember that there is currently no way to increase the number of partitions on an existing Event Hub.
You are correct that your unit of consumption parallelism is the partition. If your consumers can only do 10/second in order, or even 100/second in order, then you will need more partitions to consume millions of events. While 100 ms/event certainly seems slow to me, and I think you should look for optimizations there (i.e. farm out work you don't need to wait for, commit less often, etc.), you will reach the point of needing more partitions at scale.
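The arithmetic behind this can be sketched as follows, assuming strictly in-order processing with one receiver per partition, as in the question. The function names are hypothetical; the 100 ms and 5-10k/second figures come from the question.

```python
import math


def per_partition_rate(processing_ms):
    """Messages per second one in-order receiver can handle."""
    return 1000 / processing_ms


def total_rate(partitions, processing_ms):
    """Messages per second across all partitions, one receiver each."""
    return partitions * per_partition_rate(processing_ms)


def partitions_needed(target_per_second, processing_ms):
    """Partitions required to sustain the target rate in order."""
    return math.ceil(target_per_second * processing_ms / 1000)


print(total_rate(32, 100))             # what 32 partitions give at 100 ms/message
print(partitions_needed(5_000, 100))   # partitions for 5k/second at 100 ms/message
print(partitions_needed(10_000, 100))  # partitions for 10k/second at 100 ms/message
print(partitions_needed(5_000, 5))     # the same target if processing drops to 5 ms
```

The last line shows why optimizing the per-event processing time is worth trying first: the number of partitions needed scales directly with it.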
Some things to keep in mind: 32 partitions give you only 32 MB/s of ingress and 64 MB/s of egress. Both matter, since the egress throughput is shared by all the consumer groups you use. So if you have 4 consumer groups reading the data (16 MB/s each), you will need twice as many partitions (or at least throughput units) as your data ingress alone would require, because otherwise you would fall behind.
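The "twice as many" figure follows from simple arithmetic, sketched below using the numbers from this answer. The rates are in whatever unit the figures above use, and the helper names are my own; the sketch assumes every consumer group reads the full stream.

```python
def egress_share(egress_capacity, consumer_groups):
    """Egress available to each consumer group when shared equally."""
    return egress_capacity / consumer_groups


def capacity_multiplier(ingress_rate, egress_capacity, consumer_groups):
    """How much capacity must grow so every group can keep up with ingress."""
    required_egress = ingress_rate * consumer_groups
    return max(1.0, required_egress / egress_capacity)


print(egress_share(64, 4))             # each of 4 groups gets a quarter of the egress
print(capacity_multiplier(32, 64, 4))  # 4 groups each reading the full ingress
print(capacity_multiplier(32, 64, 2))  # 2 groups fit within the existing egress
```

With 4 groups each needing the full ingress rate, the required egress is 4 x 32 = 128, twice the 64 available, hence the doubling.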
Regarding your comment about multitenancy: will you have one 'database consumer' group that handles all your tenants, all of whose data flows through the same hub? If so, that sounds like a sensible use. What would not be so sensible is having one consumer group per tenant, each consuming the entire stream.