Apache Kafka message dispatching and load balancing - multithreading

I've just started with Apache Kafka and am trying to figure out how to design my system to use it properly.
I'm building a system that processes data, where my unit of data is a task (an object) that needs to be processed. The object knows how it should be processed, so that's not a problem.
My system is split into 3 main components: a Publisher (code which spawns tasks), a transport (Kafka), and a set of Consumers, which are workers that pull data from the queue and process it. It's important to note that a Consumer can be a publisher itself if its task needs a 2-step computation (the Consumer just creates new tasks and sends them back to the transport).
So we can start with the idea that I have 3 servers: 1 root publisher (the Kafka server also runs there) and 2 consumer servers which actually handle the tasks. The data workflow is: the Publisher creates a task and puts it on the transport, then one of the consumers takes the task from the queue and handles it. It would be nice if each consumer handled the same amount of tasks as the others (so that the workload is spread equally between consumers).
Which Kafka configuration pattern do I need for this case? Does Kafka have message balancing features, or do I need to create 2 partitions and bind each consumer to a single partition so that it consumes data only from that partition?

In Kafka, the number of partitions roughly translates to the parallelism of the system.
A general tip is to create more partitions per topic (e.g. 10) and, when creating the consumer, to specify a number of consumer threads corresponding to the number of partitions.
In the high-level consumer API, you can provide the number of streams (threads) to create per topic when creating the consumer. Assume that you create 10 partitions: if you run the consumer process on a single machine, you can give topicCount as 10; if you run the consumer process on 2 servers, you could specify topicCount as 5 on each.
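The arithmetic behind choosing topicCount can be sketched as follows. This is only an illustration in plain Python, not Kafka API code: the function name streams_per_server is hypothetical, and handing the leftover streams to the first servers is an assumption made here for the uneven case.

```python
from typing import List


def streams_per_server(num_partitions: int, num_servers: int) -> List[int]:
    """Split partitions across servers so that each partition gets one consumer thread."""
    base, extra = divmod(num_partitions, num_servers)
    # The first `extra` servers take one additional stream when the split is uneven.
    return [base + 1 if i < extra else base for i in range(num_servers)]


print(streams_per_server(10, 1))  # [10]
print(streams_per_server(10, 2))  # [5, 5]
print(streams_per_server(10, 3))  # [4, 3, 3]
```

Each number in the result would be the topicCount passed on the corresponding server.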
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
You can also dynamically increase the number of partitions using the kafka-add-partitions.sh command under kafka/bin. After increasing the partitions, restart the consumer processes with an increased topicCount.
Also, while producing, you should use the KeyedMessage class with a random key derived from your message object, so that the messages are evenly distributed across the different partitions.
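To see why a random key spreads messages evenly, here is a small simulation. It is a sketch under stated assumptions: CRC32 modulo the partition count stands in for the producer's real partitioner (which uses a different hash), and partition_for and simulate are hypothetical helper names.

```python
import random
import zlib
from collections import Counter


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (assumption): CRC32 of the key modulo the partition count.
    return zlib.crc32(key) % num_partitions


def simulate(num_messages: int, num_partitions: int, seed: int = 42) -> Counter:
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_messages):
        key = rng.getrandbits(64).to_bytes(8, "big")  # a fresh random key per message
        counts[partition_for(key, num_partitions)] += 1
    return counts


counts = simulate(10_000, 10)
for partition in sorted(counts):
    print(partition, counts[partition])  # each partition receives roughly 1000 messages
```

A constant key would do the opposite: every message would hash to the same partition, and only one consumer thread would receive work.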

Related

Using different threads to read from a consumer group in Kafka using reactor-kafka

I need to consume from a Kafka topic that will hold millions of records. Once I read from the topic, I need to transform the data and write it to another topic. I am able to consume messages from the topic, process the data with multiple threads, and write to another topic.
I followed the example from here https://projectreactor.io/docs/kafka/1.3.5-SNAPSHOT/reference/index.html#concurrent-ordered
Here is my code:
public Flux<?> flux() {
    KafkaSender<Integer, Person> sender = sender(senderOptions());
    return KafkaReceiver.create(receiverOptions(Collections.singleton(sourceTopic)))
        .receive()
        .map(m -> SenderRecord.create(transform(m.value()), m.receiverOffset()))
        .as(sender::send)
        .doOnNext(m -> m.correlationMetadata().acknowledge())
        .doOnCancel(() -> close());
}
I have multiple consumers to read from and was looking into adding different reader threads to read from the topic due to the volume of data. However, the reactor-kafka documentation mentions KafkaReceiver is not thread-safe since the underlying KafkaConsumer cannot be accessed concurrently by multiple threads.
I am looking for suggestions on reading from a topic concurrently.
What you are looking for is called a consumer group. The maximum parallel consumption you can achieve is limited by the number of partitions your topic has.
Kafka's consumer group mechanism lets you split the work of consuming a topic between different "readers" that belong to the same group. The work is divided so that each partition is read by exactly one consumer in the group, and each consumer is responsible for one or more partitions, depending on the number of consumers in the group and the number of partitions in the topic.
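The division of partitions inside one group can be illustrated with a toy round-robin assignment. This is a sketch, not Kafka's actual assignor (the real range, round-robin and sticky strategies differ in detail), and assign_round_robin is a hypothetical name.

```python
from typing import Dict, List


def assign_round_robin(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition to exactly one consumer of the group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Fewer consumers than partitions: each consumer owns several partitions.
print(assign_round_robin(list(range(6)), ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}

# More consumers than partitions: the extra consumer stays idle.
print(assign_round_robin(list(range(2)), ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows the limit mentioned above: adding readers beyond the partition count adds no parallelism.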

Why single Broker setup performs better with single topic partition rather than multiple partitions

We are exploring Kafka for coordination across multiple tasks in a Spark job. Each Spark task acts as both a producer AND consumer of messages on the SAME topic. So far we are seeing decent performance, but I am wondering if there is a way to improve it, considering that we are getting the best performance by doing things CONTRARY to what the docs suggest. At the moment we use only a single Broker machine with multiple CPUs, but we can use more if needed.
So far we have tried the following setups:
1. Single topic, single partition, multiple consumers, NOT using Group ID: BEST PERFORMANCE
2. Single topic, single partition, multiple consumers each using its own Group ID: 2x slower than (1)
3. Single topic, single partition, multiple consumers, all using the same Group ID: stuck or dead slow
4. Single topic, as many partitions as consumers, single Group ID: stuck or dead slow
5. Single topic, as many partitions as consumers, each using its own Group ID or no Group ID: works, but a lot slower than (1) or (2)
I don't understand why we are getting best performance by doing things against what the docs suggest.
My questions are:
1. There's a lot written out there about the benefits of having multiple partitions, even on a single broker, but clearly here we are seeing performance deterioration.
2. Apart from resilience considerations, what's the benefit of adding additional Brokers? We see that our single Broker CPU utilization never goes above 50% even in times of stress. And it's easier to simply increase the CPU count on a single VM rather than manage multiple VMs. Is there any merit in getting more Brokers? (for speed considerations, not resilience)
3. If the above is YES, then clearly we can't have a broker per each consumer. Right now we are running 30-60 Spark tasks, but it can go up to hundreds. So almost inevitably we will be in a situation that each Broker is responsible for tens of partitions, if each task were to have a partition. So based on the above tests, we are still going to see worse performance?
Note that we are setting up the producer to not wait for acknowledgment from the Brokers, as we'd seen in the docs that with many partitions that can slow things down:
from kafka import KafkaProducer  # assuming the kafka-python client
producer = KafkaProducer(bootstrap_servers=[SERVER], acks=0)
Thanks for your thoughts.
I think you are missing an important concept: within a consumer group, Kafka allows only one consumer per topic partition, while multiple consumer groups may read from the same partition. It seems that you have a problem with committing offsets, or too many group rebalances.
Here are my thoughts:
Single topic, single partition, multiple consumers, NOT using Group ID: BEST PERFORMANCE
What actually happens here is -> one of your consumers is idle.
Single topic, single partition, multiple consumers each using its own Group ID: 2x slower than (1)
Both consumers are fetching and processing the same messages independently.
Single topic, single partition, multiple consumers, all using the same Group ID: stuck or dead slow
Only one member of the same group can read from a single partition. This should not give results different than the first case.
Single topic, as many partitions as consumers, single Group ID: stuck or dead slow
This is the situation where each consumer is assigned a different partition, and it is the case where we would expect consumption to be fastest.
Single topic, as many partitions as consumers, each using its own Group ID or no Group ID: works, but a lot slower than (1) or (2)
The same remarks as for the first and second setups apply.
There's a lot written out there about the benefits of having multiple partitions, even on a single broker, but clearly here we are seeing performance deterioration.
Indeed, by having multiple partitions, we can parallelize the consumers. If the consumers have the same group id, then they will consume from different partitions. Otherwise, each consumer will consume from all partitions.
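The difference between sharing a group id and using separate group ids can be shown with a toy delivery model. It is only a sketch: deliver is a hypothetical function, partition ownership inside a group is assigned by a simple modulo, and no real broker or client is involved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def deliver(messages: List[Tuple[int, str]], consumers: Dict[str, str],
            num_partitions: int) -> Dict[str, List[str]]:
    """messages are (partition, payload) pairs; consumers maps a name to its group id."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, group in consumers.items():
        groups[group].append(name)
    received: Dict[str, List[str]] = {name: [] for name in consumers}
    for members in groups.values():
        # Inside one group, each partition is owned by exactly one member.
        owner = {p: members[p % len(members)] for p in range(num_partitions)}
        # Every group sees every message, through the owner of its partition.
        for partition, payload in messages:
            received[owner[partition]].append(payload)
    return received


messages = [(i % 2, f"m{i}") for i in range(4)]  # two partitions, four messages
same_group = deliver(messages, {"a": "g1", "b": "g1"}, 2)
own_groups = deliver(messages, {"a": "g1", "b": "g2"}, 2)
print(same_group)  # {'a': ['m0', 'm2'], 'b': ['m1', 'm3']}
print(own_groups)  # {'a': ['m0', 'm1', 'm2', 'm3'], 'b': ['m0', 'm1', 'm2', 'm3']}
```

With separate group ids every consumer does the full amount of work, which is consistent with those setups being slower in the question.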
Apart from resilience considerations, what's the benefit of adding additional Brokers? We see that our single Broker CPU utilization never goes above 50% even in times of stress. And it's easier to simply increase the CPU count on a single VM rather than manage multiple VMs. Is there any merit in getting more Brokers? (for speed considerations, not resilience)
If the above is YES, then clearly we can't have a broker per each consumer. Right now we are running 30-60 Spark tasks, but it can go up to hundreds. So almost inevitably we will be in a situation that each Broker is responsible for tens of partitions, if each task were to have a partition. So based on the above tests, we are still going to see worse performance?
When a new topic is created, one of the brokers in the cluster is selected as leader for each of its partitions, and all read/write operations for a partition are handled by its leader. So, when you have many topics and partitions, the workload is automatically distributed between the brokers. If you have a single broker, all producers and consumers will be producing to and consuming from that same broker.
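How leadership spreads the load can be sketched with a simple round-robin placement. This is an assumption-laden illustration: Kafka's real placement also considers replicas, racks and a randomized starting broker, and assign_leaders is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def assign_leaders(num_partitions: int, brokers: List[int], start: int = 0) -> Dict[int, int]:
    """Map each partition to a leader broker, round-robin from index `start`."""
    return {p: brokers[(start + p) % len(brokers)] for p in range(num_partitions)}


single = Counter(assign_leaders(12, [0]).values())
three = Counter(assign_leaders(12, [0, 1, 2]).values())
print(single)  # Counter({0: 12}): one broker serves every partition
print(three)   # each of the three brokers leads 4 partitions
```

With one broker, adding partitions only adds bookkeeping on the same machine; with several brokers, the same partitions split the read/write traffic.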

Spark Streaming in Java: Reading from two Kafka Topics using One Consumer using JavaInputDStream

I have a Spark application which is required to read from two different topics using one consumer, in Java.
The Kafka message key and value schema is the same for both topics.
Below is the workflow:
1. Read messages from both the topics, same groupID, using JavaInputDStream<ConsumerRecord<String, String>> and iterate using foreachRDD
2. Inside the loop, read offsets, filter messages based on the message key and create JavaRDD<String>
3. Iterate on JavaRDD<String> using mapPartitions
4. Inside mapPartitions loop, iterate over them using forEachRemaining.
5. Perform data enrichment, transformation, etc on the rows inside forEachRemaining loop.
6. commit
I would like to understand the questions below. Please provide your answers or share any documentation which can help me find them.
1. How are the messages received/consumed from two topics (one common group id, same schema for both key and value) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, or is it going to read 50 messages first, process them, and then read 1000 messages and process them?
3. What parameter should I use to control the number of messages being read in one batch per second?
4. Will the same group id create any issue while consuming?
The official Spark Streaming documentation already explains how to consume multiple topics with one group id.
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Collection<String> topics = Arrays.asList("topicA", "topicB");
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
1. One group id is used, and both topics follow the same schema.
2. I am not sure about this, but from my understanding it would consume all the messages, depending on the batch size.
3. Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value; based on these, Spark limits the number of messages consumed from Kafka per batch. Also set the batch duration accordingly. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
4. This depends entirely on your application usage.
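Since spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition rate, the cap on one batch can be estimated with simple arithmetic. The sketch below uses hypothetical numbers: the question does not state partition counts, so 3 partitions per topic and a rate of 100 are assumptions, and both function names are made up for illustration.

```python
import math


def max_records_per_batch(max_rate_per_partition: int, num_partitions: int,
                          batch_seconds: float) -> int:
    """Upper bound on records in one batch: rate x partitions x batch duration."""
    return int(max_rate_per_partition * num_partitions * batch_seconds)


def batches_needed(backlog: int, cap: int) -> int:
    """Number of batches required to drain a backlog under the given cap."""
    return math.ceil(backlog / cap)


# Assumed setup: Topic1 and Topic2 with 3 partitions each, 1-second batches,
# maxRatePerPartition = 100.
cap = max_records_per_batch(100, 6, 1)
print(cap)                        # 600
print(batches_needed(1050, cap))  # 2
```

Under these assumed numbers the 1050 messages from the question would not fit in one batch, so the remainder would be picked up by the next one.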
1. How the messages are received/consumed from two topics(one common group id, same schema both key/value) in one consumer.
Let say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer 2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there is no constraint here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all msgs(1000+50) in one batch and process together in the workflow, OR is it going to read 50 msgs first, process them and then read 1000 msgs and process them.
3. What parameter should I use to control the number of messages being read in one batch per second.
Answer to both questions 2 and 3:
It will receive all the messages together (1050), or even more, depending on your configuration.
To allow the consumer to receive batches of 1050 or more, raise max.poll.records (default 500) to 1050 or higher. Other settings may become a bottleneck, but the defaults should be fine for the rest.
4. Will same group id create any issue while consuming.
The same group id will affect you if you create more than one consumer: the consumers will then split the partitions of the topics between them.
Moreover, if your consumer dies or stops for some reason, you have to bring it back up with the same group id. That way the consumer "remembers" the last offset consumed and continues from the point where it stopped.
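The "remembering" is done through offsets committed per group id, which a toy model can make concrete. Everything here is a stand-in: OffsetStore and consume are hypothetical, and a group with no committed offset starts at 0 in this sketch, whereas in Kafka the starting point depends on the auto.offset.reset setting.

```python
from typing import Dict, List, Tuple


class OffsetStore:
    """Toy stand-in for committed offsets, keyed by (group id, partition)."""

    def __init__(self) -> None:
        self._committed: Dict[Tuple[str, int], int] = {}

    def commit(self, group: str, partition: int, next_offset: int) -> None:
        self._committed[(group, partition)] = next_offset

    def position(self, group: str, partition: int) -> int:
        # Assumption: an unknown group starts from the beginning of the log.
        return self._committed.get((group, partition), 0)


def consume(log: List[str], store: OffsetStore, group: str,
            partition: int, max_records: int) -> List[str]:
    start = store.position(group, partition)
    batch = log[start:start + max_records]
    store.commit(group, partition, start + len(batch))
    return batch


log = [f"m{i}" for i in range(5)]
store = OffsetStore()
first = consume(log, store, "app", 0, 3)    # consumer runs, then stops
resumed = consume(log, store, "app", 0, 3)  # restarted with the same group id
fresh = consume(log, store, "other", 0, 3)  # a different group id has no history
print(first, resumed, fresh)
# ['m0', 'm1', 'm2'] ['m3', 'm4'] ['m0', 'm1', 'm2']
```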
If you have any more problems with your consumer, I suggest reading this article. It is chapter 4 of Kafka: The Definitive Guide, which explains consumers in depth and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.

Can we use multiple Kafka producers in a multi-threaded environment in high traffic?

We have a front layer which just receives messages and writes to the Kafka topics for back-end processing. We send the messages at a very high rate; per day we process 1 billion messages. We have a thread pool which accepts the messages and writes to the Kafka producer instance. Here I have created only one producer (single instance) which is shared among multiple threads.
Recently, I have been observing that 90% of the threads are in blocked state. I found out that Kafka is sending the data sequentially. There was a synchronized block in the producer.send() method in the Kafka Java driver:
def send(messages: KeyedMessage[K,V]*) {
  lock synchronized {
    if (hasShutdown.get)
      throw new ProducerClosedException
    recordStats(messages)
    sync match {
      case true => eventHandler.handle(messages)
      case false => asyncSend(messages)
    }
  }
}
The documentation says that we don't need to create multiple producer instances; one instance can be shared in a multi-threaded environment. But how can we do that? Or should we better create a pool of producer instances?
The reason it is recommended to share the producer client across threads is that it leads to better batching, since messages are batched at the partition level. Better batching leads to better compression (if enabled) and better throughput. Consider tuning parameters such as buffer.memory, linger.ms and batch.size to optimize throughput.
Once this is done, you can consider adding multiple producers.
Also, consider increasing the number of partitions for the topic, if the incoming rate for the topic is quite high.
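The batching argument can be illustrated with a toy sender shared between threads. This is a simplified sketch, not the Kafka producer: BatchingSender and run are hypothetical, there is a single buffer instead of one per partition, and batches are cut by size only, with nothing like linger.ms.

```python
import threading
from typing import List


class BatchingSender:
    """Toy producer: buffers messages and emits a batch once batch_size is reached."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.batches: List[List[str]] = []
        self._buffer: List[str] = []
        self._lock = threading.Lock()

    def send(self, message: str) -> None:
        with self._lock:  # only the cheap append is serialized
            self._buffer.append(message)
            if len(self._buffer) >= self.batch_size:
                self.batches.append(self._buffer)
                self._buffer = []

    def flush(self) -> None:
        with self._lock:
            if self._buffer:
                self.batches.append(self._buffer)
                self._buffer = []


def run(senders: List[BatchingSender], threads: int = 4, per_thread: int = 10) -> int:
    """Send per_thread messages from each thread; return the number of batches emitted."""
    def work(i: int) -> None:
        sender = senders[i % len(senders)]
        for n in range(per_thread):
            sender.send(f"t{i}-m{n}")

    workers = [threading.Thread(target=work, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    for s in senders:
        s.flush()
    return sum(len(s.batches) for s in senders)


shared = run([BatchingSender(8)])                    # one sender for all threads
pooled = run([BatchingSender(8) for _ in range(4)])  # one sender per thread
print(shared, pooled)  # 5 8
```

The shared sender packs the 40 messages into 5 full batches, while the pool of four ends up with 8 batches, half of them nearly empty, which is the effect the answer describes.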

Kafka high-level consumer: Can partitions have multiple threads consuming it?

Can messages from a given partition ever be divided on multiple threads? Let's say that I have a single partition and a hundred processes with a hundred threads each - will the messages from my single partition be given to only one of those 10000 threads?
Multiple threads cannot consume the same partition unless those threads are in different consumer groups. Only a single thread will consume the messages from the single partition, even though you have lots of idle consumers.
The partition is the unit of parallelism in Kafka. To let multiple consumers read in parallel, you must either increase the number of partitions of the topic up to the parallelism you want to achieve, or put every single thread into a separate consumer group, but I think the latter is not desirable.
If you have multiple consumers consuming from the same topic under the same consumer group, then the messages in the topic are distributed among those consumers. In other words, each consumer will get a non-overlapping subset of the messages. The following lines are taken from the Kafka FAQ page:
Should I choose multiple group ids or a single one for the consumers?
If all consumers use the same group id, messages in a topic are distributed among those consumers. In other words, each consumer will get a non-overlapping subset of the messages. Having more consumers in the same group increases the degree of parallelism and the overall throughput of consumption. See the next question for the choice of the number of consumer instances. On the other hand, if each consumer is in its own group, each consumer will get a full copy of all messages.
Why some of the consumers in a consumer group never receive any message?
Currently, a topic partition is the smallest unit that we distribute messages among consumers in the same consumer group. So, if the number of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. The solution is to increase the number of partitions on the broker
Not in extreme cases.
The Kafka high-level consumer makes sure that a message is consumed only once, and that a partition is consumed by only one thread most of the time.
This is because there is a local queue in the Kafka high-level consumer.
The consumer considers a message consumed as soon as you have polled it out of the local queue.
So let me tell a story:
Thread 1 consumes partition 0.
Thread 1 polls a message m0. Messages m1, m2, ... are already in the local queue.
A rebalance happens; Kafka clears the local queue and re-registers the consumers.
Thread 2 now consumes partition 0, but thread 1 is still processing m0.
Thread 2 can now poll m1, m2, ...
You can see that two threads are consuming the same partition at this time.
Instead of using threads, it is better to increase the number of consumers and partitions to get better throughput and better control.

Resources