Multiple Spark Kafka consumers with same groupId - apache-spark

I am trying to have multiple consumers for the partitions of a Kafka topic, all with the same groupId, so that I can scale the consumption of messages.
The Kafka documentation says:
If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances.
Having consumers as part of the same consumer group means providing the "competing consumers" pattern, whereby the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions ("automatically" assigned to it), and the same messages won't be received by the other consumers (assigned to different partitions). In this way, we can scale the number of consumers up to the number of partitions (having one consumer read only one partition);
But when I deploy multiple Spark applications with the same groupId, I get the following exception:
java.lang.IllegalStateException: Previously tracked partitions [cpq.cluster-1] been revoked by Kafka because of consumer rebalance. This is mostly due to another stream with same group id joined, please check if there're different streaming application misconfigure to use same group id. Fundamentally different stream should use different group id
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:200)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
According to this exception, I cannot have multiple consumers with the same groupId. Hence I am unable to load-balance in my Spark application; I can only assign one consumer per topic partition, and this contradicts the Kafka documentation.
How can I have multiple consumers with the same consumer groupId to have load balancing?

Here, you don't need to run multiple Spark applications to consume from multiple partitions; a single Spark application handles this internally. Spark Streaming uses 1:1 parallelism between Kafka partitions and Spark partitions. If you run multiple Spark applications with the same group id, you will get this error. Please refer to this question for more details: 2 spark stream job with same consumer group id
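As a minimal sketch (the broker addresses and group id below are made up; the topic name is taken from the error above), a single DStream application built with spark-streaming-kafka-0-10 reads every partition of the topic in parallel through one direct stream:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object SingleAppConsumer {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("single-kafka-consumer"), Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092,broker2:9092",  // hypothetical brokers
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "cpq-consumer-group",                  // one group id, one application
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // One direct stream: Spark creates one RDD partition per Kafka partition,
        // so all partitions of the topic are consumed in parallel by this single application.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("cpq.cluster"), kafkaParams)
        )

        stream.foreachRDD { rdd =>
          println(s"Kafka partitions in this batch: ${rdd.getNumPartitions}")
          // filter / aggregate / write here
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Running one such application and scaling the number of executors, rather than the number of applications, gives the load balancing the question asks about.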

Related

How do multiple Kafka Consumers in the same consumer group read messages from one partition in the topic?

I would like to know how consumers in the same consumer group read messages from a topic that has only one partition.
For example, I have 3 consumers in one consumer group, and that group is polling messages from Topic A, which has a single partition. If 1000 messages arrive one by one in Topic A, how will they be delivered to the 3 consumers?
Will 3 messages be delivered to the 3 consumers in parallel, with the next message delivered once each consumer has finished processing? In other words, will they receive messages in parallel?
Or will just one consumer fetch all the messages, since there is only one partition?
Please also suggest the best architectural approach for the above scenario.
Thanks,
I want to process multiple messages in parallel from one topic that has one partition, using 4 consumers.
I am using Kafka with NodeJS microservices and the kafkajs package.
In your scenario, only one consumer of that consumer group will read the data, most probably the first one you started. I'm not 100% sure as I never tried it out, but I assume the additional consumers will just idle without workload.
This question is essentially the same as yours.
If you want consumers to work in parallel, you cannot avoid having multiple partitions; that is the main purpose of the whole partitioning concept.
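To make the assignment behaviour concrete, here is a sketch using the JVM Kafka client from Scala rather than kafkajs (broker, topic and group names are made up). If you start this program three times against a one-partition topic, only one instance is assigned partition 0 and receives messages; the other two stay idle until it dies:

    import java.time.Duration
    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.StringDeserializer

    object SinglePartitionWorker {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("group.id", "topic-a-workers")  // same group for every instance
        props.put("key.deserializer", classOf[StringDeserializer].getName)
        props.put("value.deserializer", classOf[StringDeserializer].getName)

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(List("topic-a").asJava)

        while (true) {
          val records = consumer.poll(Duration.ofSeconds(1))
          // With a single partition, all records arrive in only one of the running instances.
          records.asScala.foreach { r =>
            println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
          }
        }
      }
    }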

How do I set up a Spark application to pull from a single Kafka topic on multiple Spark nodes?

My application has a Kafka input stream for a single topic, it does some filtering and aggregating of the data, and then writes to Elasticsearch. What I'm seeing is that while the application is distributed to all of the spark nodes and processing the data properly, only one node is pulling data, and the rest are idle.
Also, I am using an R53 hostname for the Kafka nodes. Should I use a comma-separated list of the Kafka nodes instead?
The topic has 20 partitions. I am running Spark 3.2.1 using only Spark Streaming (no DFS).
The topic has 20 partitions
Then up to 20 executors should be able to consume in parallel.
using an R53 hostname for the Kafka nodes
Any Kafka client, including Spark, will need to communicate with the brokers individually. This means you'll need to expose each broker's advertised.listeners setting such that Spark can communicate with each broker directly, and not via a single DNS name / load balancer address. If only one broker is resolvable, then you'll only be able to consume from (or produce to) just that one.
Should I use a comma-separated list of the Kafka nodes instead
It's recommended, but not strictly necessary. For example, what if the broker at the one address provided is not responding? The bootstrap protocol returns all advertised.listeners addresses back to the client, based on the associated listener protocol.
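As a sketch of the bootstrap configuration (shown with the Structured Streaming Kafka source; the host names and topic are placeholders), listing several brokers protects the initial metadata fetch, while day-to-day traffic still goes to each broker's advertised.listeners address directly:

    import org.apache.spark.sql.SparkSession

    object KafkaReader {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-es").getOrCreate()

        val df = spark.readStream
          .format("kafka")
          // Several brokers listed: if one bootstrap address is down, metadata can still be fetched.
          .option("kafka.bootstrap.servers",
            "kafka-1.example.com:9092,kafka-2.example.com:9092,kafka-3.example.com:9092")
          .option("subscribe", "my-topic")  // hypothetical 20-partition topic
          .load()

        // With 20 Kafka partitions the source plans 20 Spark partitions per micro-batch,
        // so up to 20 executor cores can pull data concurrently.
        val query = df.selectExpr("CAST(value AS STRING) AS value")
          .writeStream
          .format("console")
          .start()

        query.awaitTermination()
      }
    }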

Spark Streaming job fails with ReceiverDisconnectedException Class

I have a Spark Streaming job which captures near-real-time data from Azure Event Hub and runs 24/7.
More interestingly, my job fails at least 2 times a day with the error below. If I google the error, the Microsoft docs tell me 'This exception is thrown if two or more PartitionReceiver instances connect to the same partition with different epoch values'. I'm not worried about data loss because Spark checkpointing will automatically take care of the data when I restart the job, but my question is why the Spark Streaming job fails 2-3 times a day with the same error.
Has anybody faced the same issue? Is there any solution/workaround available for this? Any help would be much appreciated.
error:
This exception is thrown if two or more Partitions Receiver instances connect to the same partition with different epoch values.
What is Partition Receiver?
This is a logical representation of receiving from a EventHub partition.
A PartitionReceiver is tied to a ConsumerGroup + Partition combination. If you are creating an epoch based PartitionReceiver (i.e. PartitionReceiver.Epoch != 0) you cannot have more than one active receiver per ConsumerGroup + Partition combo. You can have multiple receivers per ConsumerGroup + Partition combination with non-epoch receivers.
It sounds like you are running two instances of the application, two concurrent classes, or two applications that use the same event hub consumer group. Event hub consumer groups are effectively pointers to a point in time on the event stream. If you try to use one consumer group with two instances of code, then you get a conflict like the one you are seeing.
Either:
Ensure you only have a single instance reading the consumer group at a time.
Use two consumer groups when you need two separate programs or sets of functionality to process the event hub at the same time (see the sketch after this list).
If you are looking to parallelize for performance, look in to event hub Partitioning and how to take advantage of processing each partition independently.
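As a sketch of the second option, assuming the azure-eventhubs-spark connector with Structured Streaming (the namespace, hub name, and consumer group names are placeholders), each application gets its own consumer group so their receivers never compete for the same ConsumerGroup + Partition epoch:

    import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
    import org.apache.spark.sql.SparkSession

    object EventHubJobA {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("eventhub-job-a").getOrCreate()

        val connectionString = ConnectionStringBuilder(
            "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=listen;SharedAccessKey=<key>")
          .setEventHubName("telemetry")
          .build

        // This job reads through its own consumer group; a second job would use a different
        // group (e.g. "spark-job-b") instead of sharing "spark-job-a".
        val ehConf = EventHubsConf(connectionString)
          .setConsumerGroup("spark-job-a")
          .setStartingPosition(EventPosition.fromEndOfStream)

        val stream = spark.readStream
          .format("eventhubs")
          .options(ehConf.toMap)
          .load()

        stream.writeStream.format("console").start().awaitTermination()
      }
    }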
There is also an alternative scenario where an event hub partition is switched over to another host as part of the event hub's internal load balancing. In this case you may see the error you are receiving. In this case, just log it and continue on.
For more details, refer "Features and terminology in Azure Event Hubs" and "Event Hubs Receiver Epoch".
Hope this helps.

2 spark applications can't consume from the same Kafka topic in parallel using the same Group ID

I have a Kafka topic with multiple partitions. I have a Spark application subscribed to that topic using a DStream. When I start another instance of that application, the first application throws an exception
Exception in thread "main" java.lang.IllegalStateException: No current assignment for partition my-topic-0
and exits.
In a normal scenario, when not using Spark, if we start two Kafka consumers with the same group id and the topic has only one partition, the second consumer becomes idle/stale. In order to consume the same messages from a topic, the consumers have to be started with different group ids. The same applies in the case of Spark as well.
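For example (the group ids and broker address are made up), each Spark application can be given its own group id, e.g. as a command-line argument, so the two DStream jobs form separate consumer groups and each receives every message of the topic:

    import org.apache.kafka.common.serialization.StringDeserializer

    // Inside main(args: Array[String]) of the streaming job:
    val groupId = args(0)  // "my-app-1" for the first instance, "my-app-2" for the second

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> groupId,  // must differ between the two applications
      "auto.offset.reset"  -> "latest"
    )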

Worker Queue option in Kafka

We are developing an application which will receive time series sensor data as byte arrays from a set of devices via UDP. This data needs to be parsed and stored in a Cassandra database...
We were using RabbitMQ as the message broker, with Work Queue based consumers to parse the data and push it into Cassandra... Because of increasing traffic, we are concerned about RabbitMQ performance and are planning to move to Kafka... Our understanding is that the same can be implemented using a consumer group in Kafka. Is our understanding correct?
With Apache Kafka, you can scale a topic relatively easily. In order to process more data in the same time you'll need to:
Have multiple consumers in the same consumer group; you'll be able to consume multiple messages at the same time. You are limited to the number of partitions of the topic.
Increase the number of partitions for a topic, and increase the number of consumers.
Increase the number of brokers if you still need to process more data.
I would approach scalability in the order described above, but Kafka can handle a lot. In a setup with 2 brokers, 4 partitions per topic and 2 consumers (each consumer using one thread per partition), where each consumer decodes JSON to a Java object, enriches it and stores it to Cassandra, it can handle 30k messages/s (data is batched in batches of 200 insert statements).
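As a sketch of the first two steps (broker addresses and topic name are made up), the topic is created with enough partitions for the planned consumer-group size using the Kafka AdminClient; up to that many workers in one consumer group can then share the parsing and Cassandra writes, one partition each:

    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    object CreateSensorTopic {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092")

        val admin = AdminClient.create(props)
        // 4 partitions, replication factor 2: mirrors the 2-broker / 4-partition setup above,
        // so up to 4 consumers with the same group.id can process sensor data in parallel.
        admin.createTopics(List(new NewTopic("sensor-data", 4, 2.toShort)).asJava).all().get()
        admin.close()
      }
    }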
