I'm using Spark Streaming to read messages from Kafka, and it works fine. But I have a requirement to re-read the messages. I thought I could just change the Spark app's consumer group ID and restart the Spark Streaming app, and it would re-read the Kafka messages from the beginning. But the result was that Spark did not get any messages, and I'm confused. According to the Kafka documentation, if you change the consumer group ID you should get messages from the beginning, because Kafka treats you as a new consumer. Thanks in advance!
Kafka consumers have a property called auto.offset.reset (see the Kafka docs). This tells the consumer what to do when it starts consuming but hasn't committed an offset yet. This is your case. The topic has messages, but there's no start offset stored because you haven't read anything under that new group ID yet. In this situation, the auto.offset.reset property is used. If the value is "largest" (and this is the default), then the start position is set to the largest offset (the last) and you get the behavior you're seeing. If the value is "smallest", then the offset is set to the beginning offset and the consumer reads the entire partition. This is what you want. (Newer consumer clients call these values "latest" and "earliest".)
So I'm not exactly sure how you'd set that Kafka property in your Spark app, but you definitely want that property set to "smallest" if you want the new group id to result in a read of the entire topic.
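As a rough sketch of how that property could be passed along: with the receiver-based API, consumer settings are handed to Spark as a plain Map[String, String] of Kafka parameters. The object and function names below (KafkaParamsSketch, freshGroupParams) are made up for illustration, and the ZooKeeper address and group ID are placeholders.

```scala
object KafkaParamsSketch {
  // Builds consumer properties for a brand-new consumer group that should
  // start from the beginning of every partition.
  // "smallest" is the old-consumer spelling; newer clients use "earliest".
  def freshGroupParams(zkQuorum: String, groupId: String): Map[String, String] =
    Map(
      "zookeeper.connect" -> zkQuorum,  // placeholder address supplied by the caller
      "group.id"          -> groupId,   // a group ID with no stored offsets yet
      "auto.offset.reset" -> "smallest" // read from the start when no offset is stored
    )
}
```

The resulting map would then be given to the kafkaParams argument of the stream-creation call in your app.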
Sounds like you are using Spark Streaming's receiver-based API for Kafka. For that API, auto.offset.reset only applies if there aren't offsets in ZK, as you noticed.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
If you want to be able to specify the exact offsets, see the version of the createDirectStream call that takes fromOffsets as an argument.
Related
I have a question regarding Spark Structured Streaming with Kafka.
Suppose that I am running a Spark job and everything is working perfectly.
One fine day, my Spark job fails because of inconsistencies in the data that is fed to Kafka. The inconsistencies may be anything, like data format issues or junk characters that Spark cannot process. In such a case, how do we fix the issue? Is there a way we can get into the Kafka topic and make changes to the data manually?
If we don't fix the data issue and restart the Spark job, it will read the same old row that caused the failure, since we have not yet committed the checkpoint. So how do we get out of this loop? How do we fix the data issue in the Kafka topic in order to resume the aborted Spark job?
I would avoid trying to manually change one single message within a Kafka topic unless you really know what you are doing.
To prevent this from happening in the future, you might want to consider using a schema for your data (in combination with a schema registry).
For mitigating the problem you described I see the following options:
Manually change the offset of the Consumer Group of your structured streaming application
Create a "new" streaming job that starts reading from a particular offset
Manually change offset
When using Spark's Structured Streaming, the consumer group is set automatically by Spark. According to the code, the consumer group is defined as:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
You can change the offset by using the kafka-consumer-groups tool. First, identify the actual name of the consumer group with:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
and then set the offset for that consumer group for a particular topic (e.g. offset 100):
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group spark-kafka-source-1337 --topic topic1 --to-offset 100
If you need to change the offset only for a particular partition, have a look at the tool's help output to see how to do this.
Create new Streaming Job
You could make use of the Spark option startingOffsets, as described in the Spark + Kafka integration guide:
Option: startingOffsets
value: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
default: "latest" for streaming, "earliest" for batch
meaning: The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
For this to work, it is important to have a "new" query. That means you need to delete the checkpoint files of your existing job or create a completely new application.
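To make the JSON form of startingOffsets a bit more concrete, here is a minimal sketch that builds that string from a map of topic to partition to offset. The object name, the toJson helper and the two constants are hypothetical names for this example; only the JSON shape and the -2/-1 markers come from the guide quoted above.

```scala
object StartingOffsetsSketch {
  // Special markers described in the integration guide.
  val Earliest: Long = -2L
  val Latest: Long   = -1L

  // Renders topic -> (partition -> offset) in the JSON shape that the
  // startingOffsets option expects. Topics and partitions are sorted so the
  // output is deterministic. Topic names are assumed not to need JSON escaping.
  def toJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val body = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + body + "}"
    }.mkString("{", ",", "}")
}
```

The resulting string would be passed as .option("startingOffsets", json) on a query that has no existing checkpoint.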
I have a DataFrame that I want to output to Kafka. This can be done manually with a forEach and a Kafka producer, or I can use a Kafka sink (if I start using Spark Structured Streaming).
I'd like to achieve exactly-once semantics in this whole process, so I want to be sure that I'll never have the same message committed twice.
If I use a Kafka producer, I can enable idempotency through Kafka properties. From what I've seen, this is implemented using sequence numbers and producer IDs. But I believe that in case of stage/task failures the Spark retry mechanism might create duplicates in Kafka. For example, if a worker node fails, will the entire stage be retried with an entirely new producer pushing the messages, causing duplicates?
Seeing the fault tolerance table for kafka sink here I can see that:
Kafka Sink supports at-least-once semantic, so the same output can be sinked more than once.
Is it possible to achieve exactly-once semantics with Spark + Kafka producers or the Kafka sink?
If it is possible, how?
Spark's Kafka sink only guarantees at-least-once delivery, so duplicates are possible after a retry. (Kafka itself has offered idempotent producers and transactions since version 0.11, but that does not remove duplicates caused by Spark re-running a task with a new producer.) The usual way around this is to make duplicates harmless downstream: if your data has a unique key and is stored in a database, filesystem, etc., you can deduplicate on that key.
For example, if you sink your data into HBase and each message has a unique key that is used as the HBase row key, then a message that arrives with an existing key simply overwrites the previous one.
I hope this article will be helpful:
https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/
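The overwrite-by-key idea can be sketched without any HBase dependency. In the snippet below, an in-memory map stands in for the keyed table; Message, KeyedStore and upsert are hypothetical names for this illustration, not part of any Spark or HBase API.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  final case class Message(key: String, value: String)

  // Stand-in for a keyed store such as an HBase table:
  // writing the same key twice overwrites instead of appending.
  final class KeyedStore {
    private val rows = mutable.Map.empty[String, String]

    def upsert(message: Message): Unit = rows.update(message.key, message.value)
    def get(key: String): Option[String] = rows.get(key)
    def size: Int = rows.size
  }

  // Simulates an at-least-once delivery in which the whole batch is replayed.
  def writeWithRetry(store: KeyedStore, batch: Seq[Message]): Unit = {
    batch.foreach(store.upsert) // first attempt
    batch.foreach(store.upsert) // replay after a simulated task failure
  }
}
```

Because each write is keyed, replaying the batch leaves the store with one row per key, which is what makes at-least-once delivery look like exactly-once to readers of the store.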
How do I set a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic starting 6 hours ago.
Kafka does not work that way. You are treating Kafka as something you can query by a parameter other than the offset. Also keep in mind that a topic can have more than one partition, so each partition has its own offsets. Maybe you could use separate relational storage to map offset/partition to timestamp, though that is a little risky. Thinking of an Akka Streams Kafka consumer, for example, each of your timestamp requests would be sent via another topic to activate your consumers (each with one or more partitions assigned), which would then look up the specific offset, produce, and merge. With Spark, you can adjust the consumer strategy for each job, but the process should be the same.
Another thing: if your Kafka cluster has to recover, you may need to read the whole topic again to rebuild your (timestamp, offset) pairs. All of this may sound a little odd, and it might be better to store your topic's data in Cassandra (for example) so that you can query it later.
The answers provided here seem to be dated. According to the latest API documentation for Spark 3.x.x (Structured Streaming Kafka Integration), there are quite a few flexible ways to retrieve the messages within a specified window from Kafka.
Here is example code for the batch API that gets messages from all partitions within the window specified by startingTimestamp and endingTimestamp, both given as epoch time with millisecond precision.
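// spark is an existing SparkSession; bootstrapServers and topic are Strings defined elsewhere
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("startingTimestamp", 1650418006000L) // L suffix: the value does not fit in an Int
  .option("endingTimestamp", 1650418008000L)
  .load()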
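For the "starting 6 hours ago" case from the question, the two epoch-millisecond values can be derived rather than hard-coded. This is only a sketch; TimestampWindowSketch and windowEndingAt are hypothetical names, and the code relies on nothing beyond java.time.

```scala
import java.time.{Duration, Instant}

object TimestampWindowSketch {
  // Returns (startMillis, endMillis) for a window of length `lookback`
  // that ends at `end`, both as epoch milliseconds.
  def windowEndingAt(end: Instant, lookback: Duration): (Long, Long) =
    (end.minus(lookback).toEpochMilli, end.toEpochMilli)
}
```

Calling windowEndingAt(Instant.now(), Duration.ofHours(6)) would give the pair of values to use for startingTimestamp and endingTimestamp.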
Kafka is an append-only log store. You can start consuming from a particular offset in a partition, provided that you know the offset. Consumption is very fast, so you can have a design where you start from the smallest offset and only start doing your logic once you come across a message of interest (which could have a timestamp field to check).
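A minimal sketch of that "scan from the start, act once you reach the timestamp" design, using plain Scala collections in place of a real consumer. Record and fromFirstAtOrAfter are hypothetical names, and the records are assumed to arrive in offset order for a single partition.

```scala
object SkipUntilTimestampSketch {
  final case class Record(offset: Long, timestampMs: Long, value: String)

  // Skips records until the first one whose timestamp is at or after the
  // cutoff, then yields that record and everything that follows it.
  def fromFirstAtOrAfter(records: Iterator[Record], cutoffMs: Long): Iterator[Record] =
    records.dropWhile(_.timestampMs < cutoffMs)
}
```

Note that this keeps every record after the first match, even if a later record happens to carry an older timestamp.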
I am using the Java Spark API. With KafkaUtils.createDirectStream, I want to track the offset.
There is a parameter called fromOffset, which records the offset for each partition of the Kafka topic. For the first run, I have no idea how many partitions I will have, so how can I set this parameter?
And will I need to set "auto.offset.reset" in the Kafka parameters?
If yes, will it affect my code that recovers from a known offset?
You have two options:
If you don't have any information about the partitions, do not provide that parameter to createDirectStream. There are several overloads of the createDirectStream method. In that case, either the earliest or the latest offset of each TopicPartition will be used (based on the auto.offset.reset parameter).
You can find the partitions and offsets using the usual Kafka API. For example, see "How to find the offset range for a topic-partition in Kafka 0.10?"
We need to sort the consumed records in the Kafka consumer part of a Spark Streaming job. Is it possible to know whether all the published records have been consumed by the Kafka consumer?
You can use KafkaConsumer#endOffsets(...) to get the offsets of the current end-of-log per partition. Of course, keep in mind that the end-of-log moves as long as new data is written by a producer. Thus, for getting "end offsets" you must be sure that there is no running producer...
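Once you have the end offsets and the consumer's current positions, the comparison itself is simple. The sketch below works on plain maps so it runs without a Kafka client; ConsumptionProgressSketch, lag and caughtUp are hypothetical names, and a partition is represented as a (topic, partition) pair.

```scala
object ConsumptionProgressSketch {
  type Partition = (String, Int)

  // Remaining records per partition: end-of-log offset minus the consumer's
  // next position. A partition with no known position is treated as position 0,
  // which can overstate the lag if the log starts at a later offset.
  def lag(endOffsets: Map[Partition, Long], positions: Map[Partition, Long]): Map[Partition, Long] =
    endOffsets.map { case (tp, end) =>
      tp -> math.max(0L, end - positions.getOrElse(tp, 0L))
    }

  // True when every partition has been read up to its end-of-log offset.
  def caughtUp(endOffsets: Map[Partition, Long], positions: Map[Partition, Long]): Boolean =
    lag(endOffsets, positions).values.forall(_ == 0L)
}
```

The result is only meaningful for the snapshot of end offsets you pass in; if a producer is still writing, the end offsets have to be fetched again.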