Spark Structured Streaming Batch Read Checkpointing - apache-spark

I am fairly new to Spark and am still learning. One of the more difficult concepts I have come across is checkpointing and how Spark uses it to recover from failures. I am doing batch reads from Kafka using Structured Streaming and writing them to S3 as Parquet files like this:
dataset
.write()
.mode(SaveMode.Append)
.option("checkpointLocation", checkpointLocation)
.partitionBy("date_hour")
.parquet(getS3PathForTopic(topicName));
The checkpoint location is an S3 filesystem path. However, as the job runs, I see no checkpoint files. In subsequent runs, I see the following log:
21/10/14 12:20:51 INFO ConsumerCoordinator: [Consumer clientId=consumer-spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0-5, groupId=spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0] Found no committed offset for partition topic-1
This indicates that the previous run did not checkpoint any offsets for this run to pick them up from. So it keeps consuming from the earliest offset.
How can I make my job pick up new offsets? Note that this is a batch query as described here.
This is how I read:
sparkSession
.read()
.format("kafka")
.option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
.option("subscribe", topic)
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.truststore.location", sslConfig.truststoreLocation())
.option("kafka.ssl.truststore.password", sslConfig.truststorePassword())
.option("kafka.ssl.keystore.location", sslConfig.keystoreLocation())
.option("kafka.ssl.keystore.password", sslConfig.keystorePassword())
.option("kafka.ssl.endpoint.identification.algorithm", "")
.option("failOnDataLoss", "true")
.load();

I am not sure why batch reads from Kafka are still offered alongside Structured Streaming. If you want to use them, you must write your own offset management: a batch write ignores the checkpointLocation option, which is why no checkpoint files appear. See the guide, although it explains this poorly.
Trigger.Once is probably a better fit for you. The query is then a streaming query rather than a batch one, so Spark manages the offsets for you.
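If you do stay with batch reads, one way to manage offsets yourself is to record the highest offset written per partition at the end of each run and pass the following offsets as the startingOffsets option of the next batch read. The sketch below only builds that JSON string in the format the Kafka integration guide documents. StartingOffsets and nextStartingOffsets are hypothetical names, and where the map is persisted between runs (S3, a database) is left open.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper: turns the last offsets written per partition into a startingOffsets JSON string. */
final class StartingOffsets {

    /**
     * @param topic         topic name (Kafka topic names need no JSON escaping)
     * @param lastProcessed partition -> highest offset already written to the sink
     * @return JSON such as {"topic":{"0":42,"1":17}}, where each value is the next offset to read
     */
    static String nextStartingOffsets(String topic, Map<Integer, Long> lastProcessed) {
        // Sort by partition so the output is stable between runs
        String partitions = new TreeMap<Integer, Long>(lastProcessed).entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\":" + (e.getValue() + 1))
            .collect(Collectors.joining(","));
        return "{\"" + topic + "\":{" + partitions + "}}";
    }

    public static void main(String[] args) {
        Map<Integer, Long> lastProcessed = Map.of(0, 41L, 1, 16L);
        // prints {"topic-1":{"0":42,"1":17}}
        System.out.println(nextStartingOffsets("topic-1", lastProcessed));
    }
}
```

The resulting string would be passed as .option("startingOffsets", json) on the batch reader. Partitions added to the topic later are not in the map, so they would need a default such as -2 (earliest).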

Related

How we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata )

Background:
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. To support its exactly-once guarantee, Spark creates a _spark_metadata folder. When the streaming app runs for a long time, this folder grows so large that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Streaming. I would like to know how to get the offsets and other metadata so that we can manage checkpointing ourselves in Spark Structured Streaming. Do you have a sample program that implements checkpointing?
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA, https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like offsets are not provided. How should we go about it?
The issue is that the metadata grew to 45 MB within 6 hours and keeps growing until it reaches nearly 13 GB. The driver has 5 GB of memory allocated, and at that point the application crashes with an OOM error. How can we keep the metadata from growing so large, or make it log less information?
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("startingOffsets", "earliest")
.load();
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")
Dataset dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
.select("event.metadata", "event.data", "event.connection", "event.registration_event","event.version_event"
);
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
Dataset flatDf = sqlContext
.sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");
StreamingQuery query = flatDf
.writeStream()
.outputMode("append")
.option("compression", "snappy")
.format("parquet")
.option("checkpointLocation", checkpointLocation)
.option("path", outputPath)
.partitionBy("date", "time", "id")
.trigger(Trigger.ProcessingTime(triggerProcessingTime))
.start();
query.awaitTermination();
For the non-batch Spark Structured Streaming Kafka integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka.
Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and
for checkpointing them at the end of the processing round (epoch or
micro-batch).
You need not worry about offsets if you follow the Spark Kafka integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch queries the situation is different: you need to manage and store the offsets yourself.
UPDATE
Based on the comments, the question seems to be slightly different, and I advise you to look at Spark Structured Streaming Checkpoint Cleanup. Given your updated comments and the fact that there is no error, I also suggest you consult this article on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Your code is written in a different style from mine, but I cannot see any obvious error in it.

Spark Structured Streaming with Kafka source, change number of topic partitions while query is running

I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.
Is there a way to tell Spark to check for new partitions in the same topic, other than stopping the query and restarting it?
EDIT:
I'm using Spark 2.4.4. I read from kafka as follows:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaURL)
.option("startingOffsets", "earliest")
.option("subscribe", topic)
.option("failOnDataLoss", value = false)
.load()
After some processing, I write to a Delta Lake table on HDFS.
Answering my own question. Kafka consumers check for new partitions/topic (in case of subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 (5 minutes).
Since my test was lasting far less than that, I wouldn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option of the DataStreamReader:
.option("kafka.metadata.max.age.ms", 100)

How can I handle old data in the kafka topic?

I have just started using Spark Structured Streaming.
I read a stream from a Kafka topic (startingOffsets: latest), apply a watermark, group by event time with a window duration, and write the result to a Kafka topic.
My question is: how can I handle the data that was written to the Kafka topic before the Spark Structured Streaming job started?
I first tried to run with startingOffsets: earliest, but the data in the Kafka topic is too large, so the streaming job does not start because of a YARN timeout (even after I increased the timeout value).
If I simply create a batch job and filter by a specific data range, the result is not reflected in the current state of the streaming query, so the consistency and accuracy of the results seem to be a problem.
I tried to reset the checkpoint directory, but it did not work.
How can I handle the old, large data?
Please help.
You can try the maxOffsetsPerTrigger parameter of the Kafka source in Structured Streaming to read the old data from Kafka in bounded chunks. Set it to the maximum number of records you want to receive from Kafka per trigger.
Use:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test-name")
.option("startingOffsets", "earliest")
// upper bound on the offsets read per trigger; 1 is only for illustration
.option("maxOffsetsPerTrigger", 1)
// group.id and auto.offset.reset are omitted: the Kafka source manages its own consumer group and uses startingOffsets instead
.load()
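To get a feel for how this option paces a large backlog, the following sketch estimates how many micro-batches a query would need to drain a given number of pending records. It is plain arithmetic rather than Spark API. BacklogEstimate and the figures are made up for illustration, and it assumes that no new data arrives in the meantime.

```java
/** Hypothetical helper: rough count of micro-batches needed to drain a Kafka backlog. */
final class BacklogEstimate {

    /** Ceiling of backlogRecords / maxOffsetsPerTrigger, assuming nothing new arrives. */
    static long batchesToCatchUp(long backlogRecords, long maxOffsetsPerTrigger) {
        if (backlogRecords < 0 || maxOffsetsPerTrigger <= 0) {
            throw new IllegalArgumentException("backlog must be >= 0 and the limit must be > 0");
        }
        return (backlogRecords + maxOffsetsPerTrigger - 1) / maxOffsetsPerTrigger;
    }

    public static void main(String[] args) {
        // 10 million pending records, at most 250,000 offsets per trigger
        System.out.println(batchesToCatchUp(10_000_000L, 250_000L)); // prints 40
    }
}
```

Multiplying the result by the time one micro-batch takes gives a rough catch-up time, which can help in choosing a limit that keeps each batch small enough to start promptly.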

How to manually set group.id and commit kafka offsets in spark structured streaming?

I was going through the Spark structured streaming - Kafka integration guide here.
The guide states that
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my spark application has successfully processed each record?
tl;dr
It is not possible to commit any offsets to Kafka. Starting with Spark version 3.x you can define the name of the Kafka consumer group; however, this still does not allow you to commit any offsets.
Since Spark 3.0.0
According to the Structured Kafka Integration Guide you can provide the ConsumerGroup as an option kafka.group.id:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("kafka.group.id", "myConsumerGroup")
.load()
However, Spark still will not commit any offsets back so you will not be able to "manually" commit offsets to Kafka. This feature is meant to deal with Kafka's latest feature Authorization using Role-Based Access Control for which your ConsumerGroup usually needs to follow naming conventions.
A full example of a Spark 3.x application is discussed and solved here.
Until Spark 2.4.x
The Spark Structured Streaming + Kafka Integration Guide clearly states how it manages Kafka offsets. Spark will not commit any offsets back to Kafka, as it relies on internal offset management for fault tolerance.
The most important Kafka configurations for managing offsets are:
group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
auto.offset.reset: Set the source option startingOffsets to specify where to start instead.
Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it.
enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming up to 2.4.x it is not possible to define a custom group.id for the Kafka consumer. Structured Streaming manages the offsets internally and does not commit them back to Kafka, not even automatically.
2.4.x in Action
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
// create SparkSession
val spark = SparkSession.builder()
.appName("ListenerTester")
.master("local[*]")
.getOrCreate()
// read from Kafka topic
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testingKafkaProducer")
.option("failOnDataLoss", "false")
.load()
// write to Kafka topic and set checkpoint directory for this stream
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "testingKafkaProducerOut")
.option("checkpointLocation", "/home/.../sparkCheckpoint/")
.start()
Offset Management by Spark
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset of partition 0 to be consumed is 1. It implies that the application has already processed offset 0 from partition 0 of the topic named testingKafkaProducer.
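As a rough illustration of how such an entry can be read, the sketch below pulls the partition and next-offset pairs out of a single-topic line with a regular expression. CheckpointOffsets is a hypothetical helper, not part of Spark, and it assumes the flat one-topic format shown above. A real JSON parser would be the safer choice for anything beyond inspection.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads partition -> next offset from a single-topic line such as {"topic":{"0":1}}. */
final class CheckpointOffsets {

    // Matches "partition":offset pairs; the topic key is skipped because it is followed by '{', not digits
    private static final Pattern PAIR = Pattern.compile("\"(\\d+)\":(\\d+)");

    static Map<Integer, Long> nextOffsets(String line) {
        Map<Integer, Long> result = new TreeMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            result.put(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> next = nextOffsets("{\"testingKafkaProducer\":{\"0\":1}}");
        // The last offset already processed is one less than the recorded value
        next.forEach((partition, offset) ->
            System.out.println("partition " + partition + ": next=" + offset + ", last processed=" + (offset - 1)));
    }
}
```

For the entry above this reports a next offset of 1 and a last processed offset of 0 for partition 0, matching the reading given in the text.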
More on the fault-tolerance-semantics are given in the Spark Documentation.
Offset Management by Kafka
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh of the Kafka installation.
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC                 PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID       HOST        CLIENT-ID
testingKafkaProducer  0          -               1               -    consumer-1-[...]  /127.0.0.1  consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Possible Workaround
Please carefully read the comments below from Spark committer #JungtaekLim about the workaround: "Spark's fault tolerance guarantee is based on the fact Spark has a full control of offset management, and they're voiding the guarantee if they're trying to modify it. (e.g. If they change to commit offset to Kafka, then there's no batch information and if Spark needs to move back to the specific batch "behind" guarantee is no longer valid.)"
What I have seen doing some research on the web is that you could commit offsets in the callback function of the onQueryProgress method in a customized StreamingQueryListener of Spark. That way, you could have a consumer group that keeps track of the current progress. However, its progress is not necessarily aligned with the actual consumer group.
Here are some links you may find helpful:
Code Example for Listener
Discussion on SO around offset management
General description on the StreamingQueryListener

How to rewind Kafka Offsets in spark structured streaming readstream

I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() with parameters to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
.option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
.option("failOnDataLoss", "false")
.option("startingOffsets","earliest")
.load()
.selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets by """{"topicA":{"0":0,"1":0}}"""
Following is my writestream
val query = kafkaStreamingDF
.writeStream
.format("console")
.start()
Every time I restart my job in the IntelliJ IDE, the logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way to clean my checkpoint? I don't know where the checkpoint directory is, because I don't specify any checkpoint location in the code above.
Kafka relies on the property auto.offset.reset to take care of the Offset Management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question, you want to read all the data in the topic, so setting "startingOffsets" to "earliest" should work. Also make sure that you set enable.auto.commit to false.
Setting enable.auto.commit to true means that offsets are committed automatically at a frequency controlled by the config auto.commit.interval.ms.
Setting this to true commits the offsets to Kafka automatically when messages are read from Kafka which doesn’t necessarily mean that Spark has finished processing those messages. To enable precise control for committing offsets, set Kafka parameter enable.auto.commit to false.
Try to set up .option("kafka.client.id", "XX"), to use a different client.id.
