How to share state between runs of streaming jobs? - apache-spark

I have a Spark streaming job triggered once every day using the Trigger.Once trigger, due to business requirements.
StreamingQuery query = processed
    .writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocationPathForDate)
    .trigger(Trigger.Once())
    .start();
I am using flatMapGroupsWithState so that I can store state (GroupState) for the grouped data.
Somewhere I read checkpointLocation should be different for every StreamingQuery. Therefore I use a checkpointLocation like this: /path/to/nfs/checkpoint/<current date in format: yyyyMMdd>
Every day, Spark job processes files in the folder /path/to/data/<current date in format: yyyyMMdd>
I want to access the state of yesterday's Spark job, since yesterday's data may contain state that is relevant to today's data.
However, Spark stores the state data under the checkpointLocation, i.e. /path/to/nfs/checkpoint/<current date in format: yyyyMMdd>/<queryName>/state, so when a different checkpointLocation is used each day, that state cannot be accessed.
So, how can I access the GroupState data stored at the checkpointLocation of the previous Spark job? Is it OK to use the same checkpointLocation for different StreamingQueries?
Edit:
I tried using the same checkpointLocation for yesterday's StreamingQuery and today's StreamingQuery, and Spark restored the state of yesterday's batch, which is what I want. However, is this documented anywhere? Is this expected behaviour, or can things misbehave when the same checkpointLocation is used between daily batches?
Edit2:
Data is stored at S3, in parquet format, path: s3a://bucket/batchdata/year=2022/month=01/day=19/
Sample data for 2022-01-19:
s3a://bucket/batchdata/year=2022/month=01/day=19/a.parquet
s3a://bucket/batchdata/year=2022/month=01/day=19/b.parquet
s3a://bucket/batchdata/year=2022/month=01/day=19/c.parquet
Data is read using Spark parquet readStream method:
// .parquet(...) is called for 2022-01-19
Dataset<Row> dataset = spark
    .readStream()
    .schema(PARQUET_SCHEMA)
    .parquet("s3a://bucket/batchdata/year=2022/month=01/day=19/");
Dataset<Row> processed = dataset
    .groupByKey(keyFunction, encoder)
    .flatMapGroupsWithState(flatMapStateFunc,
        OutputMode.Append(),
        stateEncoder,
        outputEncoder,
        GroupStateTimeout.ProcessingTimeTimeout());
StreamingQuery query = processed
    .writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocation)
    .trigger(Trigger.Once())
    .start();
query.awaitTermination();
The next day, the same job is run on the next day's parquet files, stored under:
s3a://bucket/batchdata/year=2022/month=01/day=20/
Sample data for 2022-01-20:
s3a://bucket/batchdata/year=2022/month=01/day=20/d.parquet
s3a://bucket/batchdata/year=2022/month=01/day=20/e.parquet
// .parquet(...) is called for 2022-01-20
Dataset<Row> dataset = spark
    .readStream()
    .schema(PARQUET_SCHEMA)
    .parquet("s3a://bucket/batchdata/year=2022/month=01/day=20/");
Dataset<Row> processed = dataset
    .groupByKey(keyFunction, encoder)
    .flatMapGroupsWithState(flatMapStateFunc,
        OutputMode.Append(),
        stateEncoder,
        outputEncoder,
        GroupStateTimeout.ProcessingTimeTimeout());
StreamingQuery query = processed
    .writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocation)
    .trigger(Trigger.Once())
    .start();
query.awaitTermination();

How can I access the GroupState data stored at the checkpointLocation of the previous Spark job?
You should not. Technically you could (with some extra coding), but there are so many things specific to the other query (e.g., stateful operator IDs) that you would have to take into account. Use at your own risk.
Is it OK to use the same checkpointLocation for different StreamingQueries?
No. You should not share the same checkpointLocation between different streaming queries. For one, different queries differ in their operators, so the stateful operator IDs may not match; even if they did, the sinks could be different, and hence some data could get skipped (treated as already processed).
I tried using the same checkpointLocation for yesterday's StreamingQuery and today's StreamingQuery, and Spark restored the state of yesterday's batch, which is what I want. However, is this documented anywhere? Is this expected behaviour, or can things misbehave when the same checkpointLocation is used between daily batches?
That's documented and that's exactly how checkpointLocation is supposed to work. It's the directory with the state of a streaming query at a given time.
Quoting Recovering from Failures with Checkpointing:
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
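For the daily Trigger.Once job in the question, the practical consequence is to treat each day's run as a restart of the same logical query. A minimal sketch of that pattern, reusing the names from the question's code (checkpointPath and inputPathForToday are placeholders introduced here); the caveat from the answer above still applies, so the query shape (stateful operator, sink) must stay identical between runs:
// Fixed, date-independent checkpoint directory shared by every daily run,
// so each Trigger.Once run restores the GroupState left by the previous run.
String checkpointPath = "/path/to/nfs/checkpoint/daily-job"; // placeholder

Dataset<Row> dataset = spark
    .readStream()
    .schema(PARQUET_SCHEMA)
    .parquet(inputPathForToday); // the yyyyMMdd folder for the current date

Dataset<Row> processed = dataset
    .groupByKey(keyFunction, encoder)
    .flatMapGroupsWithState(flatMapStateFunc,
        OutputMode.Append(),
        stateEncoder,
        outputEncoder,
        GroupStateTimeout.ProcessingTimeTimeout());

StreamingQuery query = processed
    .writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointPath) // same path every day
    .trigger(Trigger.Once())
    .start();
query.awaitTermination();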

Related

Spark Structured Streaming Batch Read Checkpointing

I am fairly new to Spark and am still learning. One of the more difficult concepts I have come across is checkpointing and how Spark uses it to recover from failures. I am doing batch reads from Kafka using Structured Streaming and writing them to S3 as Parquet files as:
dataset
    .write()
    .mode(SaveMode.Append)
    .option("checkpointLocation", checkpointLocation)
    .partitionBy("date_hour")
    .parquet(getS3PathForTopic(topicName));
The checkpoint location is an S3 filesystem path. However, as the job runs, I see no checkpointing files. In subsequent runs, I see the following log:
21/10/14 12:20:51 INFO ConsumerCoordinator: [Consumer clientId=consumer-spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0-5, groupId=spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0] Found no committed offset for partition topic-1
This indicates that the previous run did not checkpoint any offsets for this run to pick them up from. So it keeps consuming from the earliest offset.
How can I make my job pick up new offsets? Note that this is a batch query as described here.
This is how I read:
sparkSession
    .read()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
    .option("subscribe", topic)
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", sslConfig.truststoreLocation())
    .option("kafka.ssl.truststore.password", sslConfig.truststorePassword())
    .option("kafka.ssl.keystore.location", sslConfig.keystoreLocation())
    .option("kafka.ssl.keystore.password", sslConfig.keystorePassword())
    .option("kafka.ssl.endpoint.identification.algorithm", "")
    .option("failOnDataLoss", "true");
I am not sure why batch Spark Structured Streaming with Kafka still exists. If you wish to use it, then you must code your own offset management. See the guide, but it is badly explained.
I would say Trigger.Once is a better fit for your use case; offset management is then provided by Spark, since it is not batch mode.

Consume a Kafka Topic every hour using Spark

I want to consume a Kafka topic as a batch: read the topic hourly and pick up only the latest hour's data.
val readStream = existingSparkSession
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "kafka.raw")
  .load()
But this always reads the first 20 rows, and those rows start from the very beginning of the topic, so it never picks up the latest rows.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode, you need to take care of the bookkeeping yourself, i.e. which data is new and which is not. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from the beginning (or based on the setting startingOffsets, which defaults to earliest for batch queries).
For your scenario where you want to run the job once every hour and only process the new data that arrived to Kafka in the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a blog post from Databricks that nicely explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
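A minimal sketch of that setup, in Java for consistency with the rest of this page (hostAddress, outputPath and checkpointLocation are placeholders; the options are the standard ones from the Kafka source and file sink documentation):
// Hourly job submitted by cron: Trigger.Once processes everything that arrived
// since the last run (as tracked in the checkpoint), writes it out, and stops.
Dataset<Row> kafkaDf = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", hostAddress)
    .option("subscribe", "kafka.raw")
    .load();

StreamingQuery query = kafkaDf
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream()
    .format("parquet")
    .option("path", outputPath)
    .option("checkpointLocation", checkpointLocation) // required: this is where the offsets live
    .trigger(Trigger.Once())
    .start();
query.awaitTermination();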

How we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata )

Background:
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that, in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder, which ends up growing too large; when the streaming app runs for a long time, the metadata folder grows so big that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Streaming. But I want to know how to get the offsets and other metadata to manage checkpointing ourselves in Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like offsets are not provided. How should we go about it?
The issue is that in 6 hours the size of the metadata increased to 45 MB, and it keeps growing until it reaches nearly 13 GB. The driver memory allocated is 5 GB; at that point the system crashes with OOM. How can we avoid making this metadata grow so large? How can we make the metadata log less information?
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .option("startingOffsets", "earliest")
    .load();
2. Use the from_json API from Spark to extract the data for further transformation into a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
    ....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
    .select("event.metadata", "event.data", "event.connection", "event.registration_event", "event.version_event");
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
Dataset<Row> flatDf = sqlContext
    .sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");
StreamingQuery query = flatDf
    .writeStream()
    .outputMode("append")
    .option("compression", "snappy")
    .format("parquet")
    .option("checkpointLocation", checkpointLocation)
    .option("path", outputPath)
    .partitionBy("date", "time", "id")
    .trigger(Trigger.ProcessingTime(triggerProcessingTime))
    .start();
query.awaitTermination();
For non-batch Spark Structured Streaming Kafka integration:
Quote:
Structured Streaming ignores the offset commits in Apache Kafka. Instead, it relies on its own offset management on the driver side, which is responsible for distributing offsets to executors and for checkpointing them at the end of the processing round (epoch or micro-batch).
You need not worry if you follow the Spark Kafka integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch the situation is different, you need to manage that yourself and store the offsets.
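For the batch case, a rough sketch of that manual bookkeeping (loadOffsetsJson and saveOffsetsJson are hypothetical helpers you would have to implement; the topic, partition and offset columns are real columns exposed by the Kafka source):
import static org.apache.spark.sql.functions.*;

// 1. Read from the offsets saved by the previous run ("earliest" on the first run).
Dataset<Row> batch = spark
    .read()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .option("startingOffsets", loadOffsetsJson()) // hypothetical, e.g. {"topic1":{"0":23,"1":42}}
    .option("endingOffsets", "latest")
    .load();

// 2. ... process and write the batch ...

// 3. Compute the next starting offsets (max consumed offset + 1 per partition)
//    and persist them for the next run (HDFS/S3 file, ZooKeeper, a DB table, ...).
Dataset<Row> nextOffsets = batch
    .groupBy(col("topic"), col("partition"))
    .agg(max(col("offset")).plus(1).alias("nextStartingOffset"));
saveOffsetsJson(nextOffsets); // hypothetical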
UPDATE
Based on the comments, I suggest the question is slightly different, and advise you to look at Spark Structured Streaming Checkpoint Cleanup. In addition, given your updated comments and the fact that there is no error, I suggest you consult this piece on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, it is different from my style, but I cannot see any obvious error.
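On the _spark_metadata growth specifically, a hedged sketch: recent Spark versions expose housekeeping options for the file sink's metadata log (option names as used by Spark's FileStreamSink; verify they exist in your version before relying on them):
// Assumption: these spark.sql.streaming.fileSink.log.* options are available in your Spark version.
spark.conf().set("spark.sql.streaming.fileSink.log.deletion", "true");       // delete obsolete log files
spark.conf().set("spark.sql.streaming.fileSink.log.compactInterval", "10");  // compact every N batches
spark.conf().set("spark.sql.streaming.fileSink.log.cleanupDelay", "600000"); // ms to keep old files after compaction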

Read latest records from Kafka using pyspark batch job

I am executing a batch job in pyspark, where Spark reads data from a Kafka topic every 5 minutes.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1") \
    .option("subscribePattern", "test") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
Whenever Spark reads data from Kafka, it reads all the data, including previous batches.
I want to read only the data for the current batch, i.e. the latest records that have not been read before.
Please suggest! Thank you.
From https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
For batch queries, latest (either implicitly or by using -1 in json) is not allowed.
Using earliest means all the data is obtained again.
You will need to define the offsets explicitly every time you run, e.g.:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
That implies you need to save the offsets processed per partition. I am looking into this myself for a project in the near future. Some items below to help:
https://medium.com/datakaresolutions/structured-streaming-kafka-integration-6ab1b6a56dd1 states what you observe:
Create a Kafka Batch Query
Spark also provides a feature to fetch the data from Kafka in batch mode. In batch mode Spark will consume all the messages at once. Kafka in batch mode requires two important parameters, starting offsets and ending offsets; if not specified, Spark will consider the default configuration, which is:
startingOffsets — earliest
endingOffsets — latest
https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark also alludes to what you should do, with the following:
And, finally, save these Kafka topic endOffsets to the file system – local or HDFS (or commit them to ZooKeeper). They will be used as the starting offsets for the next run of the Kafka topic. Here we are making sure the job's next run will read from the offset where the previous run left off.
This blog, https://dataengi.com/2019/06/06/spark-structured-streaming/, I think has the answer for saving offsets.
Did you use a checkpoint location while writing the stream data?

How to rewind Kafka Offsets in spark structured streaming readstream

I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() parameters used to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
  .option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets by """{"topicA":{"0":0,"1":0}}"""
Following is my writeStream:
val query = kafkaStreamingDF
  .writeStream
  .format("console")
  .start()
Every time I restart my job in the IntelliJ IDE, the logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way I can clean my checkpoint? In that case, I don't know where the checkpoint directory is, because in the code above I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of offset management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question, you want to read the entire data from the topic, so setting startingOffsets to earliest should work. But also make sure that you are setting enable.auto.commit to false.
Setting enable.auto.commit to true means that offsets are committed automatically with a frequency controlled by the config auto.commit.interval.ms.
Setting this to true commits the offsets to Kafka automatically when messages are read from Kafka, which doesn't necessarily mean that Spark has finished processing those messages. To enable precise control over committing offsets, set the Kafka parameter enable.auto.commit to false.
Try setting .option("kafka.client.id", "XX") to use a different client.id.
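To make the rewind reproducible, one hedged approach (Java here for consistency with the rest of the page; the checkpoint path is a placeholder): give the query an explicit checkpointLocation so you control where the offsets live, and delete that directory whenever you want the query to be treated as new, which makes startingOffsets take effect again:
// Explicit checkpoint directory: deleting it before a run makes the query "new",
// so .option("startingOffsets", "earliest") is honoured again and the topic is re-read.
Dataset<Row> kafkaStreamingDF = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topics)
    .option("startingOffsets", "earliest") // only consulted when no checkpoint exists
    .option("failOnDataLoss", "false")
    .load();

StreamingQuery query = kafkaStreamingDF
    .selectExpr("CAST(value AS STRING)", "CAST(topic AS STRING)")
    .writeStream()
    .format("console")
    .option("checkpointLocation", "/tmp/rewind-demo-checkpoint") // delete this dir to rewind
    .start();
query.awaitTermination();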
