Spark Streaming Kafka initial offset - apache-spark

I am using the Java Spark API, for the KafkaUtils.createDirectStream, I want to track the offset.
There is a parameter called fromOffset, which records the offset in partitions of the Kafka topic. for the first run, I have no idea of how many partitions I will have, then how can I set this parameter?
And will I need set "auto.offset.reset" in Kafka parameters?
If yes, will it affect my code to recover from an known offset?

you have two options:
in case you don't have any information about partions, do not provide that param to createDirectStream. There are several implmentations of createDirectStream method. In that case or earliest, or latest offset per each topicPartition will be used (based on the auto.offset.reset param)
you can find the partitions, offsets using usual kafka API. For example look How to find the offset range for a topic-partition in Kafka 0.10?

Related

What happens when you restart a spark job if it encounters unexpected format in the data fed to kafka

I have a question regarding Spark Structured Streaming with Kafka.
Suppose that I am running a spark job and every thing is working perfectly.
One fine day, my spark job fails because of inconsistencies in data that is fed to kafka. Inconsistencies may be anything like data format issues or junk characters which spark couldn't have processed. In such case, how do we fix the issue? Is there a way we can get into the kafka topic and make changes to the data manually?
If we don't fix the data issue and restart the spark job, it will read the same old row which contributed to failure since we have not yet committed the checkpoint. so how do we get out of this loop. How to fix the data issue in Kafka topic for resuming the aborted spark job?
I would avoid trying to manually change one single message within a Kafka topic unless you really know what you are doing.
To prevent this from happening in the future, you might want to consider using a schema for your data (in combination with a schema registry).
For mitigating the problem you described I see the following options:
Manually change the offset of the Consumer Group of your structured streaming application
create a "new" streaming job that starts reading from a particular offset
Manually change offset
When using Sparks structured streaming the consumer group is automatically set by Spark. According to the code the Consumer Group will be defined as:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
You can change the offset by using the kafka-consumer-groups tool. First identify the actual name of the consumer group by
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
and then set the offset for that consumer group for a particular topic (e.g. offset 100)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group spark-kafka-source-1337 --topic topic1 --to-offset 100
If you need to change the offset only for a particular partition you can have a look at the help function of the tool on how to do this.
Create new Streaming Job
You could make use of the Spark option startingOffsets as describe in the Spark + Kafka integration guide:
Option: startingOffsets
value: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
default: "latest" for streaming, "earliest" for batch
meaning: The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
For this to work, it is important to have a "new" query. That means you need to delete your checkpoint files of your existing job or create complete new application.

Best deduplication strategy to be used with spark

What is the best de-duplication strategy to be used with spark?
I have a Kafka source that is continuously fed with structured information (say JSON) from various producers continuously.
I am having an HDInsight spark cluster that can pick messages in real time for this Kafka source, process them and put it into a destination Kafka source in real time.
My use case demands that the information received from the source may have duplicates which need to be eliminated. The duplicates have to be be checked against say last 24 hours.
My attempt :
I tried using the .dropduplicate method in spark along with watermarking , but I think it's not the best thing to do since the data for a single day window may exceed 50 GB in my use case.
I also looked for bloom filter implementation which can be used with spark but couldn't find a good one.
My question:
What are the possible approaches to eliminate duplication in general for large scale spark streaming application.?
Which of these features can be used along with HDInsight clusters on Azure ?
What are the fault tolerance capability in such services ?

Spark + Read kafka topic from a specific offset based on timestamp

How do I set a spark job to pick up a kafka topic from a specific offset based on a timestamp ? Let's say that I need to get all data from a kafka topic starting 6 hours ago.
Kafka does not work in that way. You are seeing Kafka like something you can query with another different parameter than offset, besides keep in mind that topic can have more than one partition so each one has a different one. Maybe you can use another relational storage to map offset/partition with timestamp, a little bit risky. Thinking in akka stream kafka consumer, for example, each of your request by timestamp should be send via another topic to activate your consumers(each of them with one ore more partitions assigned) and query for the specific offset, produce and merge. With Spark, you can adjust your consumer strategies for each job but the process should be the same.
Another thing is if your Kafka recovers it´s possible that you need to read the whole topic to update your pair (timestamp/offset). All of this can sound a little bit weird and maybe it should be better to store your topic in Cassandra (for example) and you can query it later.
The answers provided here seems to be dated. As with the latest API documentation for Spark 3.x.x given here, Structured Streaming Kafka Integration
There are quite few flexible ways in which the messages can be retrieved between a specified window from Kafka.
An example code for the batch api that get messages from all the partitions, that are between the window specified via startingTimestamp and endingTimestamp, which is in epoch time with millisecond precision.
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingTimestamp", 1650418006000)
.option("endingTimestamp", 1650418008000)
.load()
Kafka is an append-only log storage. You can start consuming from a particular offset in a partition given that you know the offset. Consumption is super fast, you can have a design where you start from the smallest offset and start doing some logic only once you have come across a message (which could probably have a timestamp field to check for).

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka's structured streaming source. As I understand, it's relying on HDFS checkpoint directory to store offsets and guarantee an "exactly-once" message delivery.
But old docks (like https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/) says that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable. As a solution, there is a practice to support storing offsets in external storage that supports transactions like MySQL or RedshiftDB.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
Previously, it can be done by casting RDD to HasOffsetRanges:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
But with new Streaming API, I have an Dataset of InternalRow and I can't find an easy way to fetch offsets. The Sink API has only addBatch(batchId: Long, data: DataFrame) method and how can I suppose to get an offset for given batch id?
Spark 2.2 introduced a Kafka's structured streaming source. As I understand, it's relying on HDFS checkpoint dir to store offsets and guarantee an "exactly-once" message delivery.
Correct.
Every trigger Spark Structured Streaming will save offsets to offset directory in the checkpoint location (defined using checkpointLocation option or spark.sql.streaming.checkpointLocation Spark property or randomly assigned) that is supposed to guarantee that offsets are processed at most once. The feature is called Write Ahead Logs.
The other directory in the checkpoint location is commits directory for completed streaming batches with a single file per batch (with a file name being the batch id).
Quoting the official documentation in Fault Tolerance Semantics:
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Every time a trigger is executed StreamExecution checks the directories and "computes" what offsets have been processed already. That gives you at least once semantics and exactly once in total.
But old docs (...) says that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable.
There was a reason why you called them "old", wasn't there?
They refer to the old and (in my opinion) dead Spark Streaming that kept not only offsets but the entire query code that led to situations where the checkpointing were almost unusable, e.g. when you change the code.
The times are over now and Structured Streaming is more cautious what and when is checkpointed.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
A solution could be to implement or somehow use MetadataLog interface that is used to deal with offset checkpointing. That could work.
how can I suppose to get an offset for given batch id?
It is not currently possible.
My understanding is that you will not be able to do it as the semantics of streaming are hidden from you. You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.
Quoting Michael Armbrust from his talk at Spark Summit Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark:
you should not have to reason about streaming
and further in the talk (on the next slide):
you should write simple queries & Spark should continuously update the answer
There is a way to get offsets (from any source, Kafka including) using StreamingQueryProgress that you can intercept using StreamingQueryListener and onQueryProgress callback.
onQueryProgress(event: QueryProgressEvent): Unit Called when there is some status update (ingestion rate updated, etc.)
With StreamingQueryProgress you can access sources property with SourceProgress that gives you what you want.
Relevant Spark DEV mailing list discussion thread is here.
Summary from it:
Spark Streaming will support getting offsets in future versions (> 2.2.0). JIRA ticket to follow - https://issues-test.apache.org/jira/browse/SPARK-18258
For Spark <= 2.2.0, you can get offsets for the given batch by reading a json from checkpoint directory (the API is not stable, so be cautious):
val checkpointRoot = // read 'checkpointLocation' from custom sink params
val checkpointDir = new Path(new Path(checkpointRoot), "offsets").toUri.toString
val offsetSeqLog = new OffsetSeqLog(sparkSession, checkpointDir)
val endOffset: Map[TopicPartition, Long] = offsetSeqLog.get(batchId).map { endOffset =>
endOffset.offsets.filter(_.isDefined).map { str =>
JsonUtilsWrapper.jsonToOffsets(str.get.json)
}
}
/**
* Hack to access private API
* Put this class into org.apache.spark.sql.kafka010 package
*/
object JsonUtilsWrapper {
def offsetsToJson(partitionOffsets: Map[TopicPartition, Long]): String = {
JsonUtils.partitionOffsets(partitionOffsets)
}
def jsonToOffsets(str: String): Map[TopicPartition, Long] = {
JsonUtils.partitionOffsets(str)
}
}
This endOffset will contain the until offset for each topic/partition.
Getting the start offsets is problematic, cause you have to read the 'commit' checkpoint dir. But usually, you don't care about start offsets, because storing end offsets is enough for reliable Spark job re-start.
Please, note that you have to store the processed batch id in your storage as well. Spark can re-run failed batch with the same batch id in some cases, so make sure to initialize a Custom Sink with latest processed batch id (which you should read from external storage) and ignore any batch with id < latestProcessedBatchId. Btw, batch id is not unique across queries, so you have to store batch id for each query separately.
Streaming Dataset with Kafka source has offset as one of the field. You can simply query for all offsets in query and save them into JDBC Sink

spark can not get message from Kafka with new groupId

I'm using spark streaming to read message from Kafka, it works fine. But I had one requirement which needs to re-read the messages. I was thinking I may just need to change the spark's customer groupId and restart spark streaming app, it should reread the kafka message from beginning. But the result was that Spark could not get any messages, I'm confused. By Kafka document if you change the customer groupId then it should get message from beginning, because kafka treat you as a new customer. Thanks in advance!
Kafka consumers have a property called auto.offset.reset (See Kafka Doc). This tells the consumer what to do when it starts consuming but it hasn't committed an offset, yet. This is your case. The topic has messages, but there's no start offset stored because you haven't read anything under that new group id, yet. In this situation, the auto.offset.reset property is used. If the value is "largest", and this is the default), then the start position is set to the largest offset (the last) and you get the behavior you're seeing. If the value is "smallest" then the offset is set to the beginning offset and the consumer would read the entire partition. This is what you want.
So I'm not exactly sure how you'd set that Kafka property in your Spark app, but you definitely want that property set to "smallest" if you want the new group id to result in a read of the entire topic.
Sounds like you are using spark streaming's receiver based api for Kafka. For that api auto.offset.reset only applies if there aren't offsets in ZK, as you noticed.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
If you want to be able to specify the exact offsets, see the version of the createDirectStream call that takes fromOffsets as an argument.

Resources