Kafka consumer needs to know all the messages received from the topic

We need to sort the consumed records in the Spark Streaming part of our Kafka consumer. Is it possible to know whether all the published records have been consumed by the Kafka consumer?

You can use KafkaConsumer#endOffsets(...) to get the current end-of-log offset per partition. Of course, keep in mind that the end of the log moves as long as new data is written by a producer. Thus, for those "end offsets" to be meaningful you must be sure that there is no running producer...
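A minimal sketch of that check, assuming a hypothetical broker address, group id and topic name; it creates a plain KafkaConsumer just to ask for the end offsets of every partition:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")                    // hypothetical broker
props.put("group.id", "offset-checker")                             // hypothetical group id
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("my-topic").asScala         // hypothetical topic
  .map(info => new TopicPartition(info.topic, info.partition))

// endOffsets returns, per partition, the offset of the next record that would be written,
// i.e. the current end of the log
val endOffsets = consumer.endOffsets(partitions.asJava).asScala
endOffsets.foreach { case (tp, offset) => println(s"$tp -> $offset") }
consumer.close()

Comparing these end offsets with the offsets your streaming job has already processed tells you whether everything published so far has been consumed.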

Related

Apache Spark and Kafka "exactly once" semantics

I have a DataFrame that I want to output to Kafka. This can be done manually with a foreach using a Kafka producer, or I can use a Kafka sink (if I start using Spark Structured Streaming).
I'd like to achieve exactly-once semantics in this whole process, so I want to be sure that I'll never have the same message committed twice.
If I use a Kafka producer I can enable idempotency through Kafka properties; from what I've seen this is implemented using sequence numbers and a producer id. But I believe that in case of stage/task failures the Spark retry mechanism might create duplicates on Kafka: for example, if a worker node fails, the entire stage will be retried and an entirely new producer will be pushing messages, causing duplicates?
Looking at the fault-tolerance table for the Kafka sink here, I can see that:
the Kafka sink supports at-least-once semantics, so the same output can be written to the sink more than once.
Is it possible to achieve exactly-once semantics with Spark + a Kafka producer or a Kafka sink?
If it is possible, how?
Kafka doesn't support exactly-once semantics. It guarantees only at-least-once semantics, and just proposes how to avoid duplicate messages. If your data has a unique key and is stored in a database or filesystem etc., you can avoid duplicate messages.
For example, say you sink your data into HBase and each message uses its unique key as the HBase row key. When a message with the same key arrives again, it will simply overwrite the existing row.
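As a rough illustration of that idempotent-write idea (not the poster's actual code), here is a sketch that writes a Kafka direct stream into HBase using the record key as row key; the table and column names are made up:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// `stream` is assumed to be a DStream[ConsumerRecord[String, String]] from createDirectStream
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))          // hypothetical table name
    records.foreach { record =>
      // Unique message key as row key: a replayed duplicate overwrites the same row
      val put = new Put(Bytes.toBytes(record.key))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(record.value))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}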
I hope this article will be helpful:
https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/

How to consume Kafka messages not from the beginning after a failed Spark job

I'm a newbie to Kafka and Spark, wondering how to recover the offset from Kafka after a Spark job fails.
Conditions:
say 5 GB/s of Kafka stream; it's not practical to consume from the beginning
the stream data has already been consumed, so how do I tell Spark to re-consume the messages / redo the failed task smoothly?
I'm not too sure which area to search in; maybe someone can point me in the right direction.
When we are dealing with Kafka, we can have two different topics: one for success and one for failure.
Let's say I have two topics, Topic-Success and Topic-Failed.
When a data stream is processed successfully, we can mark it and store it in the Topic-Success topic, and when a data stream cannot be processed, we store it in the Topic-Failed topic.
That way, when you want to re-consume the failed data stream, you can process just the failed records from the Topic-Failed topic. This lets you avoid re-consuming all the data from the beginning.
Hope this helps you.
In Kafka 0.10.x there is the concept of a consumer group, which is used to track the offsets of the messages.
If you have set enable.auto.commit=true and auto.offset.reset=latest, it will not consume from the beginning. With this approach you might also need to track your offsets yourself, since the process might fail after consumption but before the commit. I would suggest using the method from the Spark docs to commit the offsets once your output has completed:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
With CanCommitOffsets it lies in your hands to commit those offsets only once your end-to-end pipeline has executed.
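For context, a minimal sketch of how such a stream is typically created with the 0.10 direct API; the broker address, group id and topic are placeholders, and auto commit is disabled so that the commitAsync call above is the only thing writing offsets:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",                 // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-group",             // hypothetical group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)        // commit manually via CanCommitOffsets
)

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,                                         // an existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)   // hypothetical topic
)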

How does KafkaUtils.createDirectStream() achieve exactly-once semantics?

I'm not sure if my understanding of this is correct:
Using KafkaUtils.createStream(), the program itself is a consumer that receives data passively. Since Kafka only maintains its own offset, Kafka does not know how far the program has actually consumed.
So if a Kafka failure occurs, it may resend data that had already been sent to the receiver, and this leads to duplicate data.
With KafkaUtils.createDirectStream(), the program itself directly consumes from the Kafka partitions, so it knows where it has consumed up to; no matter whether it or Kafka fails, it can re-consume from the correct position.
I want to confirm whether my understanding is correct. Any help is appreciated.

Challenges while processing Kafka Messages with Spark Streaming

I want to process the messages reported at a web server in real time. The messages reported at the web server belong to different sessions, and I want to do some session-level aggregations. For this purpose I plan to use Spark Streaming fronted by Kafka. Even before I start, I have listed a few challenges which this architecture is going to throw up. Can someone familiar with this ecosystem help me out with these questions:
If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
When should session state be checkpointed? How is state resurrected from the last checkpoint in case of an executor node crash? How is state resurrected from the last checkpoint in case of a driver node crash?
How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
Kafka divides topics into partitions, and every partition can only be read by one consumer at a time, so you need to make sure that all messages belonging to one session go into the same partition. Partition assignment is controlled via the key that you assign to every message, so the easiest way to achieve this would probably be to use the session id as key when sending data. That way the same consumer will get all messages for one session.
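For illustration only (broker address, topic and values below are made up), sending with the session id as the record key is enough for Kafka's default partitioner to route a whole session to one partition:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")               // hypothetical broker
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
val sessionId = "session-42"                                    // hypothetical session id
val payload   = "{\"event\":\"click\"}"                         // hypothetical message body

// Same key => same partition => the same consumer sees every message of the session
producer.send(new ProducerRecord[String, String]("sessions", sessionId, payload))
producer.close()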
There is one caveat though: Kafka will rebalance the assignment of partitions to consumers when a consumer joins or leaves the consumer group. If this happens mid-session, it can (and will) happen that half the messages for that session go to one consumer and the other half go to a different consumer after the rebalance. To avoid this, you'll need to manually subscribe to specific partitions in your code so that every processor has its specific set of partitions and does not change those. Have a look at ConsumerStrategies.Assign in the Spark-Kafka integration code for this.
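A minimal sketch of that manual assignment, assuming the hypothetical topic "sessions", broker address and group id below; this processor is pinned to partitions 0 and 1 and keeps them regardless of rebalances:

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",                 // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "session-processor-1"             // hypothetical group id
)

// This job owns partitions 0 and 1 of "sessions" and nothing else
val myPartitions = Seq(new TopicPartition("sessions", 0), new TopicPartition("sessions", 1))

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,                                         // an existing StreamingContext
  PreferConsistent,
  Assign[String, String](myPartitions, kafkaParams)
)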
How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
Kafka preserves ordering per partition, so there is not much you need to do here. The only thing is to avoid having multiple requests from the producer to the broker at the same time, which you can configure via the producer parameter max.in.flight.requests.per.connection. As long as you keep this at 1, you should be safe if I understand your setup correctly.
When should session state be checkpointed? How is state resurrected from the last checkpoint in case of an executor node crash? How is state resurrected from the last checkpoint in case of a driver node crash?
I'd suggest reading the offset storage section of the Spark Streaming + Kafka Integration Guide, which should answer a lot of questions already.
The short version is, you can persist your last read offset into Kafka and should definitely do this whenever you checkpoint your executors. That way, whenever a new executor picks up processing, no matter whether it was restored from a checkpoint or not, it will know where to read from in Kafka.
How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
My Spark knowledge here is a bit shaky, but I would say that this is not something that is done by Kafka/Spark, but rather something that you actively need to influence with your code.
By default, if a new Kafka stream is started up and finds no previously committed offset, it will simply start reading from the end of the topic, so it would only get messages produced after the consumer started. If you need to resurrect state, then you'd either need to know from what exact offset you want to start re-reading messages, or just start reading from the beginning again. You can pass the offsets to read from into the above-mentioned Assign() method when distributing partitions, as sketched below.
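A sketch of that variant, reusing the same kafkaParams map and hypothetical topic as the earlier Assign example; the starting offsets here are invented values that you would normally load from wherever you stored them:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

// Hypothetical starting offsets, e.g. loaded from your own offset store
val startOffsets = Map(
  new TopicPartition("sessions", 0) -> 42000L,
  new TopicPartition("sessions", 1) -> 39500L
)

// Same kafkaParams as before, but resume each partition from an explicit offset
val strategy = Assign[String, String](startOffsets.keys.toList, kafkaParams, startOffsets)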
I hope this helps a bit. I am sure it is by no means a complete answer to all the questions, but it is a fairly wide field to work in; let me know if I can be of further help.

Spark cannot get messages from Kafka with a new groupId

I'm using Spark Streaming to read messages from Kafka, and it works fine. But I now have a requirement to re-read the messages. I was thinking I might just need to change Spark's consumer groupId and restart the Spark Streaming app, and it should then re-read the Kafka messages from the beginning. But the result was that Spark could not get any messages, and I'm confused. According to the Kafka documentation, if you change the consumer groupId then it should get messages from the beginning, because Kafka treats you as a new consumer. Thanks in advance!
Kafka consumers have a property called auto.offset.reset (see the Kafka docs). This tells the consumer what to do when it starts consuming but hasn't committed an offset yet. This is your case: the topic has messages, but there is no start offset stored because you haven't read anything under that new group id yet. In this situation the auto.offset.reset property is used. If the value is "largest" (and this is the default), then the start position is set to the largest (most recent) offset and you get the behavior you're seeing. If the value is "smallest", then the offset is set to the beginning and the consumer reads the entire partition. This is what you want.
So I'm not exactly sure how you'd set that Kafka property in your Spark app, but you definitely want that property set to "smallest" if you want the new group id to result in a read of the entire topic.
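For what it's worth, with the direct stream API the property is just another entry in the kafkaParams map passed to createDirectStream; the new consumer API used by the 0.10 integration spells the value "earliest" instead of "smallest". The broker, group id and value below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",             // hypothetical broker
  "group.id"           -> "brand-new-group",            // the new, never-seen group id
  "auto.offset.reset"  -> "earliest",                   // "smallest" in the old consumer API
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer]
)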
Sounds like you are using Spark Streaming's receiver-based API for Kafka. For that API, auto.offset.reset only applies if there aren't offsets in ZK, as you noticed.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
If you want to be able to specify the exact offsets to start from, see the version of the createDirectStream call that takes fromOffsets as an argument.
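A rough sketch of that call against the 0.8 direct API; the topic, partitions and offsets below are invented, and in practice you would load them from ZK or your own offset store:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")  // hypothetical broker

// Start each partition of "my-topic" from an explicit offset instead of smallest/largest
val fromOffsets = Map(
  TopicAndPartition("my-topic", 0) -> 1000L,
  TopicAndPartition("my-topic", 1) -> 2500L
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc,                                                   // an existing StreamingContext
  kafkaParams,
  fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)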
