Why does an Apache Spark Structured Streaming job get stuck writing to the checkpoint when using DBFS / S3 - apache-spark

I have a Spark Structured Streaming job that reads data from Kafka, performs a simple transform, and saves the result in Delta format. However, I found that the job gets stuck, without any error, when the executors perform the following task:
23/02/07 01:34:46 INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef#4552b2e8
23/02/07 01:34:46 INFO HDFSBackedStateStoreProvider: Deleted files older than 6042 for HDFSStateStoreProvider[id = (op=0,part=59),dir = s3a://bucket/path/to/table]:
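For reference, the query is roughly of this shape (a simplified sketch only; the broker, topic, and bucket names are placeholders, not the real job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-delta").getOrCreate()

// Read from Kafka (placeholder broker and topic)
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Simple transform: cast key/value to strings
val transformed = source.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

// Write to Delta, with the checkpoint on S3 via DBFS / s3a (placeholder paths)
val query = transformed.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "s3a://bucket/path/to/table/_checkpoint")
  .start("s3a://bucket/path/to/table")

query.awaitTermination()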
I found some docs that mention this question:
https://learn.microsoft.com/en-us/azure/databricks/kb/streaming/streaming-job-stuck-writing-checkpoint
https://kb.databricks.com/en_US/streaming/streaming-job-stuck-writing-checkpoint
Both docs say the solution should be:
Solution
You should use persistent storage for streaming checkpoints.
You should not use DBFS for streaming checkpoint storage.
I would like to know:
Why is DBFS not good for streaming checkpoint storage?
Isn't DBFS on S3 persistent?
Is it because DBFS on S3 has heavy network and IO overhead compared to a local disk?
Thanks for your help!

Related

Spark-Streaming Checkpoints

I am trying to implement Spark Streaming checkpoints, using GCS as the storage for checkpoints. Enabling checkpointing causes the performance of the job to degrade. I am wondering whether checkpointing can be done to SQL or some other storage that would be faster than writing to HDFS or GCS.
Spark 3.x (and previous versions) does not provide native support for checkpointing data directly to a SQL database. You have to checkpoint to a file system or a distributed file system like HDFS/GCS/S3.
Having said that, you can write (and later read back) your own custom checkpointing mechanism to a different destination.
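As a rough sketch of that second point (not a drop-in replacement for Spark's own checkpoint; the saveOffsets helper and its storage are placeholders you would implement yourself): a StreamingQueryListener can mirror each micro-batch's source offsets into a database, and on restart you could read them back and pass them to the Kafka source as startingOffsets.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Placeholder: persist (queryId, batchId, offsetsJson) to e.g. a JDBC table.
def saveOffsets(queryId: String, batchId: Long, offsetsJson: String): Unit = ???

// Assumes an active SparkSession named `spark`.
// This only records progress; it does not replace the built-in checkpoint.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // For the Kafka source, endOffset is a JSON string like {"topic":{"0":123}}
    p.sources.foreach(s => saveOffsets(p.id.toString, p.batchId, s.endOffset))
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})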

Does S3 Strong Consistency mean it is safe to use S3 as a checkpointing location for Spark Structured Streaming applications?

In the past, the general consensus was such that you should not use S3 as checkpointing location for Spark Structured Streaming applications.
However, now that S3 offers strong read after write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?
In my experiments, I continue to see checkpointing related exceptions in my Spark Structured streaming applications, but I am uncertain where the problem actually lies.
Not really. You get consistency of listings and updates, but rename is still mimicked with copy and delete... and I think the standard checkpoint algorithm depends on it.
Hadoop 3.3.1 added a new API, Abortable, to aid with a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination, but abort the write when aborting the checkpoint; a normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
AFAIK nobody has done the actual committer. Opportunity for you to contribute there...
You really answer your own question. You do not state whether you are on Databricks or EMR, so I am going to assume EC2.
Use HDFS as the checkpoint location on local EC2 disk.
Where I am now, we have HDFS (using HDP) and IBM S3; HDFS is still used for checkpointing.
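In other words, keep the data on the object store but point the checkpoint at HDFS, along these lines (a sketch; the namenode address, paths, and df are placeholders):

// df is the transformed streaming DataFrame of your query.
// Data still lands in S3, but the checkpoint lives on HDFS backed by local disks.
val query = df.writeStream
  .format("delta")
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/my-query")
  .start("s3a://bucket/path/to/table")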

spark structure streaming with efs is causing delay in job

I'm using Spark Structured Streaming 2.4.4 with Kubernetes. When I enable local checkpointing in a /tmp/ folder, jobs finish in 7-8 s. If EFS is mounted and the checkpoint location is placed on it, jobs take more than 5 minutes and are quite unstable.
Please find the screenshot from the Spark SQL tab.

Spark checkpointing has a lot of tmp.crc files

I am using Spark Structured Streaming, where I read a stream from Kafka and, after some transformation, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files within my checkpoint directory. These files are not getting cleaned up and are ever growing in number.
Am I missing some configuration?
I am not running Spark on Hadoop. I am using an EBS-based volume for checkpointing.

Spark structured streaming over google cloud storage

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to update some pipelines to be more real-time, and I am wondering whether Spark Structured Streaming can directly consume files from GCS in a streaming way, i.e. whether sparkContext.readstream.from(...) can be applied to Avro files that are continuously generated under a bucket by external sources.
Apache Beam already has constructs like File.MatchAll().continuously(), Watch, and watchnewFiles that allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?
As the GCS connector exposes a Hadoop-Compatible FileSystem (HCFS), "gs://" URIs should be valid targets for SparkSession.readStream.from.
Avro file handling is implemented by spark-avro. Using it with readStream should be accomplished the same way as generic reading (e.g., .format("com.databricks.spark.avro"))
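A minimal sketch of what that could look like (the bucket, prefix, and schema below are placeholders; note the file streaming source needs an explicit schema unless schema inference is turned on):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("gcs-avro-stream").getOrCreate()

// The file streaming source requires a schema up front (placeholder fields).
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val stream = spark.readStream
  .schema(schema)
  .format("avro")                   // built-in spark-avro in Spark 2.4+; older setups use "com.databricks.spark.avro"
  .load("gs://my-bucket/incoming/") // new Avro files under this prefix are picked up as they appear

stream.writeStream
  .format("console")
  .option("checkpointLocation", "gs://my-bucket/checkpoints/avro-stream")
  .start()
  .awaitTermination()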
