I'm working with a Kafka Consumer using KAFKAJS.
I have a requirement to know all the partitions assigned to a specific consumer.
For example, let's say Topic 1 has 5 partitions and there are 2 consumers in the same consumer group.
One will be assigned 3 partitions and the other 2. I want each consumer to know which partitions it has been assigned.
We can use consumer.describeGroup to query the state of the consumer group. https://kafka.js.org/docs/consuming#a-name-describe-group-a-describe-group
The memberAssignment field describes what topic-partitions are assigned to each member, but it's a Buffer, so you'll need to decode that using
AssignerProtocol.MemberAssignment.decode:
https://github.com/tulios/kafkajs/blob/master/index.js#L18
We can also listen to the GROUP_JOIN event. It already contains the member assignment, so you don't actually need to call describeGroup:
https://kafka.js.org/docs/instrumentation-events#consumer
I have multiple consumers subscribed to the same Pulsar topic. How do I make sure certain messages go to specific consumers?
The closest thing I've found is the key-shared subscription type. However, it seems to group the messages by hash range and then pick a consumer at random to send each range's messages to.
If you use the Key_Shared subscription type, you can specify which hash ranges a particular consumer is assigned by using the KeySharedPolicySticky policy:
Consumer<String> consumer = getClient().newConsumer(Schema.STRING)
        .topic("topicName")
        .subscriptionName("sub1")
        .subscriptionType(SubscriptionType.Key_Shared)
        .keySharedPolicy(KeySharedPolicy.KeySharedPolicySticky
                .stickyHashRange()
                .ranges(Range.of(0, 100), Range.of(1000, 2000)))
        .subscribe();
I need to consume from a Kafka topic that will have millions of records. Once I read from the topic, I need to transform the data and write it to another topic. I am able to consume messages from the topic, process the data with multiple threads, and write to another topic.
I followed the example from here https://projectreactor.io/docs/kafka/1.3.5-SNAPSHOT/reference/index.html#concurrent-ordered
Here is my code:
public Flux<?> flux() {
    KafkaSender<Integer, Person> sender = sender(senderOptions());

    return KafkaReceiver.create(receiverOptions(Collections.singleton(sourceTopic)))
        .receive()
        // transform each inbound record and carry its offset along as correlation metadata
        .map(m -> SenderRecord.create(transform(m.value()), m.receiverOffset()))
        .as(sender::send)
        // acknowledge the source offset only after the send has completed
        .doOnNext(m -> m.correlationMetadata().acknowledge())
        .doOnCancel(() -> close());
}
I want to run multiple consumers, and was looking into adding different reader threads to read from the topic due to the volume of data. However, the reactor-kafka documentation mentions that KafkaReceiver is not thread-safe, since the underlying KafkaConsumer cannot be accessed concurrently by multiple threads.
I am looking for suggestions on reading from a topic concurrently.
So basically what you are looking for is called a consumer group; the maximum parallel consumption you can run is limited by the number of partitions your topic has.
Kafka's consumer group mechanism lets you split the work of consuming a topic across different "readers" that belong to the same group. The work is divided so that each consumer in the group is solely responsible for one or more partitions, depending on the number of consumers in the group and the number of partitions in the topic.
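As a rough sketch of the consumer-group idea (using plain Kafka consumers rather than the asker's reactor-kafka pipeline; the broker address, topic and group names are placeholders), several workers share one group.id and Kafka assigns each of them a disjoint subset of the topic's partitions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupWorker implements Runnable {
    private final Properties props = new Properties();

    public GroupWorker(String bootstrapServers) {
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "transform-workers");   // same group.id for every worker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
    }

    @Override
    public void run() {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("source-topic"));   // placeholder topic name
            while (!Thread.currentThread().isInterrupted()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // only the partitions assigned to this group member show up here
                records.forEach(r -> {
                    // transform r.value() and produce it to the destination topic
                });
            }
        }
    }

    public static void main(String[] args) {
        // one consumer per thread; with e.g. 4 partitions, at most 4 workers do useful work
        for (int i = 0; i < 4; i++) {
            new Thread(new GroupWorker("localhost:9092")).start();
        }
    }
}

With reactor-kafka the same principle applies: run one KafkaReceiver pipeline per consumer (each with its own underlying KafkaConsumer) rather than sharing a single receiver across threads.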
I have a Spark application which is required to read from two different topics using one consumer, using Spark with Java.
The Kafka message key and value schema is the same for both topics.
Below is the workflow:
1. Read messages from both topics, with the same group id, using JavaInputDStream<ConsumerRecord<String, String>>, and iterate using foreachRDD.
2. Inside the loop, read the offsets, filter messages based on the message key, and create a JavaRDD<String>.
3. Iterate over the JavaRDD<String> using mapPartitions.
4. Inside the mapPartitions loop, iterate over the records using forEachRemaining.
5. Perform data enrichment, transformation, etc. on the rows inside the forEachRemaining loop.
6. Commit the offsets.
I want to understand the questions below. Please provide your answers or share any documentation that can help me find the answers.
1. How are the messages received/consumed from the two topics (one common group id, same schema for both key and value) in one consumer?
Let's say the consumer reads data every second; Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all the messages (1000 + 50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read and process the 1000 messages?
3. What parameter should I use to control the number of messages being read in one batch per second?
4. Will the same group id create any issue while consuming?
The official Spark Streaming documentation already explains how to consume multiple topics with one group id:
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Collection<String> topics = Arrays.asList("topicA", "topicB");

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
One group id is fine, and the same schema is used for both topics.
Not sure about this one; however, from my understanding it would consume all the messages, depending on the batch size.
Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value; based on these, Spark limits the number of messages consumed from Kafka per batch. Also set the batch duration accordingly. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
This totally depends on your application usage.
1. How are the messages received/consumed from the two topics (one common group id, same schema for both key and value) in one consumer?
Let's say the consumer reads data every second; Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there is no constraint here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all the messages (1000 + 50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read and process the 1000 messages?
3. What parameter should I use to control the number of messages being read in one batch per second?
Answer to both questions 2 and 3:
It will receive all the messages together (1050), or even more, depending on your configuration.
To allow the consumer to receive batches of 1050 or more, raise max.poll.records (default 500) to 1050 or higher; other configuration may become a bottleneck, but with the defaults you should otherwise be fine.
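A minimal sketch of that change, assuming a plain Kafka consumer; the broker address and group name are placeholders:

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BigBatchConsumer {
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "two-topic-group");         // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // allow a single poll() to return the full 1050-message backlog (default is 500)
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1050);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("Topic1", "Topic2"));
        return consumer;
    }
}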
4. Will the same group id create any issue while consuming?
The same group id matters once you create more than one consumer: the consumers in the group split the partitions of both topics among themselves.
Moreover, if your consumer dies or stops for some reason, you should bring it back up with the same group id; that way the consumer "remembers" the last offset it committed and continues from the point where it stopped.
If you have any more questions about your consumer, I suggest reading this article, chapter 4 of Kafka: The Definitive Guide, which explains consumers in depth and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.
I have been working with kafka-node lately and have a few doubts.
Here is the scenario: I have, say, 10 topics, each receiving high-volume data; each message is around 300 KB, at about 5 messages per second.
Now I want to create high-level consumers so as to read the data efficiently.
I tried creating one high-level consumer on all 10 topics with one group id. It worked fine with small volumes, but behaved strangely when the volume increased.
So, I am planning to do the following:
1. Create 10 consumers, one for each topic, all with different group ids.
2. Create 10 consumers with the same group id, one for each topic.
I would like to understand the significance of the group id and how the behavior would differ in the aforementioned cases.
Also, is there a maximum message size Kafka can handle?
I've just started with Apache Kafka and am trying to figure out how to design my system to use it properly.
I'm building a system that processes data, and my chunk of data is actually a task (an object) that needs to be processed. The object knows how it should be processed, so that's not a problem.
My system is split into 3 main components: a Publisher (code which spawns tasks), the transport (Kafka itself), and a set of Consumers, which are workers that just pull data from the queue and process it somehow. It's important to note that a Consumer can itself be a publisher if its task needs a 2-step computation (the Consumer just creates new tasks and sends them back to the transport).
So we can start with the idea that I have 3 servers: 1 single root publisher (the Kafka server also runs there) and 2 consumer servers which actually handle the tasks. The data workflow is like this: the Publisher creates a task and puts it on the transport, then one of the consumers takes this task from the queue and handles it. It would be nice if each consumer handled the same amount of tasks as the others (so the workload is spread equally between consumers).
Which Kafka configuration pattern do I need to use for this case? Does Kafka have some message-balancing feature, or do I need to create 2 partitions, with each consumer bound to a single partition and able to consume data only from that partition?
In Kafka, the number of partitions roughly translates to the parallelism of the system.
The general tip is to create more partitions per topic (e.g. 10) and, while creating the consumer, to specify a number of consumer threads corresponding to the number of partitions.
In the high-level consumer API, while creating the consumer you can provide the number of streams (threads) to create per topic. Assume that you create 10 partitions and run the consumer process from a single machine: you can give topicCount as 10. If you run the consumer process from 2 servers, you could specify topicCount as 5 on each.
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
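For reference, here is a sketch of that old high-level consumer API (pre-0.9); the Zookeeper address, group id, and topic name are placeholders, and topicCount is set to 10 as in the single-machine example above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");   // placeholder
        props.put("group.id", "my-group");                   // placeholder

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // ask for 10 streams on this topic: one per partition when run from one machine
        Map<String, Integer> topicCountMap = new HashMap<>();
        topicCountMap.put("my-topic", 10);

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(topicCountMap);

        // consume each stream from its own thread
        for (final KafkaStream<byte[], byte[]> stream : streams.get("my-topic")) {
            new Thread(() -> {
                ConsumerIterator<byte[], byte[]> it = stream.iterator();
                while (it.hasNext()) {
                    System.out.println(new String(it.next().message()));
                }
            }).start();
        }
    }
}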
Also, you can dynamically increase the number of partitions using the kafka-add-partitions.sh script under kafka/bin. After increasing the partitions you can restart the consumer process with an increased topicCount.
Also, while producing, you should use the KeyedMessage class with some random key within your message object so that the messages are evenly distributed across the different partitions.
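KeyedMessage belongs to the old Scala producer API; as a sketch, the same idea with the current Java producer looks like this, where the record key determines the partition (the broker and topic names are placeholders):

import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the key is hashed to pick a partition, so varied keys spread
            // the messages roughly evenly across the partitions
            String key = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("my-topic", key, "payload"));
        }
    }
}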