How does KafkaUtils.createDirectStream() achieve exactly-once semantics? - apache-spark

I'm not sure whether my understanding of this is correct:
With KafkaUtils.createStream(), the program acts as a receiver-based consumer that receives data passively. Since Kafka only maintains its own offsets, it does not know how far the program has actually consumed.
So if a Kafka failure occurs, it may resend data that had already been delivered to the receiver, and this leads to duplicate data.
With KafkaUtils.createDirectStream(), the program consumes Kafka partitions directly and tracks the offsets itself, so whether the program or Kafka fails, it can re-consume from the correct position.
I want to confirm whether my understanding is correct. Any help is appreciated.

Related

Apache Spark and Kafka "exactly once" semantics

I have a DataFrame that I want to write out to Kafka. This can be done manually with a foreach using a Kafka producer, or I can use a Kafka sink (if I start using Spark Structured Streaming).
I'd like to achieve exactly-once semantics for this whole process, so I want to be sure that the same message is never committed twice.
If I use a Kafka producer, I can enable idempotence through Kafka properties; from what I've seen, this is implemented using sequence numbers and producer ids. But I believe that in the case of stage/task failures, the Spark retry mechanism might create duplicates on Kafka: for example, if a worker node fails, the entire stage will be retried, and an entirely new producer will push the messages, causing duplicates.
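As a hedged illustration of the producer-side idempotence settings mentioned above (the class name and broker address are made up for this sketch; the property keys are the standard Kafka producer configuration names), the setup looks roughly like this:

```java
import java.util.Properties;

// Sketch of producer settings for idempotent writes: the broker deduplicates
// retries from a single producer session using the producer id plus a
// per-partition sequence number. Note this guards against network-level
// retries, not against a whole Spark stage being re-run with a *new* producer.
public class IdempotentProducerConfig {
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("enable.idempotence", "true"); // dedup via producer id + sequence numbers
        props.put("acks", "all");                // required for idempotence
        props.put("max.in.flight.requests.per.connection", "5"); // must be <= 5 with idempotence
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
```

These properties would be passed to a KafkaProducer constructor; the snippet only builds the configuration, which is exactly the part the retry concern above is about.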
Looking at the fault-tolerance table for the Kafka sink here, I can see that:
the Kafka sink supports at-least-once semantics, so the same output can be written more than once.
Is it possible to achieve exactly-once semantics with Spark + a Kafka producer, or with the Kafka sink?
If it is possible, how?
The Spark Kafka sink guarantees only at-least-once semantics, so the practical approach is to make duplicates harmless rather than trying to prevent them. If your data has a unique key and is stored in a database or filesystem etc., you can avoid the effects of duplicate messages.
For example, if you sink your data into HBase with each message's unique key as the HBase row key, then when a message with the same key arrives again, it simply overwrites the existing row.
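A minimal sketch of that unique-key overwrite idea, with a plain in-memory map standing in for the HBase table (class and method names here are illustrative, not any real sink API):

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent-sink sketch: every message carries a unique key (simulating an
// HBase row key). A redelivered duplicate overwrites the same row, so
// at-least-once delivery still produces exactly-once *results*.
public class IdempotentSink {
    private final Map<String, String> store = new HashMap<>();

    public void write(String rowKey, String value) {
        store.put(rowKey, value); // duplicate delivery leaves the state unchanged
    }

    public Map<String, String> snapshot() {
        return new HashMap<>(store);
    }
}
```

Writing the same (key, value) pair twice leaves exactly one row behind, which is the whole point of the pattern.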
I hope this article will be helpful:
https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/

how to consume kafka message not from the beginning after a failed spark job

I'm a newbie to Kafka and Spark, and I'm wondering how to recover the offset from Kafka after a Spark job has failed.
Conditions:
at roughly 5 GB/s of Kafka stream, it's impractical to re-consume from the beginning
part of the stream data has already been consumed, so how do I tell Spark to re-consume the messages / redo the failed task smoothly?
I'm not sure which area to search in; maybe someone can point me in the right direction.
When dealing with Kafka like this, we can use two different topics: one for successes and one for failures.
Let's say I have two topics, Topic-Success and Topic-Failed.
When a record from the stream is processed successfully, we store it in Topic-Success, and when processing fails, we store the record in Topic-Failed.
That way, when you want to re-consume the failed records, you can process just those from Topic-Failed. This eliminates re-consuming all the data from the beginning.
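A rough sketch of that success/failed routing, with in-memory lists standing in for the two Kafka topics (class and field names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Dead-letter-style routing sketch: records that process cleanly go to the
// "success topic", records that throw go to the "failed topic", so a retry
// job can later re-read only the failed ones instead of the whole stream.
public class DeadLetterRouter {
    public final List<String> successTopic = new ArrayList<>();
    public final List<String> failedTopic = new ArrayList<>();

    public void route(String record, Function<String, String> process) {
        try {
            successTopic.add(process.apply(record));
        } catch (RuntimeException e) {
            failedTopic.add(record); // park the raw record for later re-consumption
        }
    }
}
```

In a real deployment the two lists would be Kafka producers writing to the two topics; the control flow is the same.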
Hope this helps you.
In Kafka 0.10.x there is a concept of a consumer group, which is used to track the offsets of the messages.
If you have set enable.auto.commit=true and auto.offset.reset=latest, it will not consume from the beginning. With this approach you may also need to track your offsets yourself, because the process might fail after consumption but before the output was produced. I would suggest using the method recommended in the Spark docs:
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // capture this batch's offset ranges before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch's output here ...
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
With CanCommitOffsets it lies in your hands to commit those offsets only once your end-to-end pipeline has executed successfully.

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automated mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing, and task allocation interact? In other words:
Does the Spark task that corresponds to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning a number of task slots of your executors were reserved for data ingestion while the others did the processing. This had an impact on how you size your executors in terms of cores.
Take for example the advice on how to launch Spark Streaming with --master local. Everyone says that for Spark Streaming one should use at least local[2], because one core will be dedicated to running the long-lived receiver task that never ends, and the other core will do the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, the follow-up question is: is that really smart? It sounds like it should be asynchronous: we want to be able to fetch while we process, so that by the next processing step the data is already there. But if only one core has to both read the data and process it, how can both happen in parallel, and how does that make things faster in general?
My original understanding was that things would remain somewhat the same, in the sense that a task would be launched to read, but the processing would be done in another task. That would mean that if the processing task is not done yet, we could still keep reading, up to a certain memory limit.
Can someone outline clearly what is actually going on here?
EDIT 1
We don't even need that memory-limit control, just the mere ability to fetch while the processing is going on, and to stop right there. In other words, the two processes should be asynchronous, and the limit is simply being one step ahead. If somehow this is not happening, I would find it extremely strange that Spark implemented something that breaks performance like that.
Does the Spark task that corresponds to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by "task" we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads the offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, Spark will likely optimize the entire execution of the partition onto the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.

Challenges while processing Kafka Messages with Spark Streaming

I want to process the messages reported at a web server in real time. The messages belong to different sessions, and I want to do some session-level aggregations. For this purpose I plan to use Spark Streaming fronted by Kafka. Even before I start, I have listed a few challenges that this architecture is going to raise. Can someone familiar with this ecosystem help me out with these questions:
If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
When should session state be checkpointed? How is state resurrected from the last checkpoint if an executor node crashes? How is it resurrected if the driver node crashes?
How is state resurrected if a node (executor or driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or all the messages needed to recreate the partition? Can Spark Streaming resurrect state across multiple streaming batches, or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
Kafka divides topics into partitions, and every partition can only be read by one consumer at a time, so you need to make sure that all messages belonging to one session go into the same partition. Partition assignment is controlled via the key that you assign to every message, so the easiest way to achieve this would probably be to use the session id as key when sending data. That way the same consumer will get all messages for one session.
There is one caveat though: Kafka will rebalance the assignment of partitions to consumers when a consumer joins or leaves the consumer group. If this happens mid-session, it can (and will) happen that half the messages for that session go to one consumer and the other half go to a different consumer after the rebalance. To avoid this, you'll need to manually subscribe to specific partitions in your code, so that every processor has its own fixed set of partitions that does not change. Have a look at ConsumerStrategies.Assign in the Spark-Kafka integration code for this.
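A minimal sketch of the key-to-partition routing described above (Kafka's default partitioner actually uses murmur2 over the serialized key bytes; hashCode here is only an illustrative stand-in, and the class name is made up):

```java
// Keying every message by session id means all messages of one session land in
// the same partition, and hence reach the same consumer. Any deterministic
// hash of the key gives this property as long as the partition count is fixed.
public class SessionRouter {
    public static int partitionFor(String sessionId, int numPartitions) {
        // floorMod avoids a negative result when hashCode() is negative
        return Math.floorMod(sessionId.hashCode(), numPartitions);
    }
}
```

Note the caveat in the text: this affinity only survives as long as partition assignment itself is stable, which is why ConsumerStrategies.Assign matters.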
How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
Kafka preserves ordering per partition, so there is not much you need to do here. The only thing is to avoid having multiple requests from the producer to the broker at the same time, which you can configure via the producer parameter max.in.flight.requests.per.connection. As long as you keep this at 1, you should be safe if I understand your setup correctly.
When to checkpoint session state? How is state resurrected from the last checkpoint in case of an executor node crash? How is state resurrected from the last checkpoint in case of a driver node crash?
I'd suggest reading the offset storage section of the Spark Streaming + Kafka Integration Guide, which should answer a lot of questions already.
The short version is, you can persist your last read offset into Kafka and should definitely do this whenever you checkpoint your executors. That way, whenever a new executor picks up processing, no matter whether it was restored from a checkpoint or not, it will know where to read from in Kafka.
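A toy sketch of why a persisted offset gives a restarted consumer a clean resume point; the partition log is just an in-memory list, and all names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Resume-from-committed-offset sketch: a replacement consumer reads from the
// last committed offset onwards, so records processed before the crash are
// not re-read, and nothing after the commit point is skipped.
public class OffsetResume {
    public static List<String> consumeFrom(List<String> partitionLog, long committedOffset) {
        List<String> out = new ArrayList<>();
        for (long i = committedOffset; i < partitionLog.size(); i++) {
            out.add(partitionLog.get((int) i)); // records at or after the committed offset
        }
        return out;
    }
}
```

If the commit happens only after the output is written (as in the foreachRDD/commitAsync pattern shown earlier in this page), a crash between output and commit means some records are replayed, which is exactly the at-least-once behavior being discussed.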
How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, then where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches, or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
My Spark knowledge here is a bit shaky, but I would say that this is not something that Kafka/Spark does for you automatically, but rather something that you actively need to influence with your code.
By default, if a new Kafka Stream is started up and finds no previous committed offset, it will simply start reading from the end of the topic, so it would get any message that is produced after the consumer is started. If you need to resurrect state, then you'd either need to know from what exact offset you want to start re-reading messages, or just start reading from the beginning again. You can pass an offset to read from into the above mentioned .Assign() method, when distributing partitions.
I hope this helps a bit. I'm sure it is by no means a complete answer to all your questions, but it is a fairly wide field to work in; let me know if I can be of further help.

Kafka consumer need to know all the messages received from the topic

We need to sort the consumed records in the Spark Streaming part of our Kafka consumer. Is it possible to know when all the published records have been consumed on the Kafka consumer side?
You can use KafkaConsumer#endOffsets(...) to get the offsets of the current end-of-log per partition. Of course, keep in mind that the end-of-log moves as long as new data is written by a producer. Thus, to get stable "end offsets" you must be sure that there is no running producer...
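A small sketch of the catch-up check this answer implies, comparing per-partition consumer positions against the end offsets returned by endOffsets() (the maps here are plain stand-ins for the Kafka API types, and the class name is illustrative):

```java
import java.util.Map;

// "Have we consumed everything?" check: the consumer has seen all published
// records for a partition when its position has reached that partition's end
// offset. With a live producer the end offsets keep moving, so this is only
// a point-in-time answer.
public class CaughtUpCheck {
    public static boolean caughtUp(Map<Integer, Long> positions, Map<Integer, Long> endOffsets) {
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long pos = positions.getOrDefault(e.getKey(), 0L);
            if (pos < e.getValue()) return false; // still lagging on this partition
        }
        return true;
    }
}
```

With a real consumer, positions would come from KafkaConsumer#position(...) per assigned partition.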
