cleanSource option does not delete any files - apache-spark

I have a Structured Streaming job with Trigger.Once() enabled which I run every 20 minutes. After each run, I want to remove my processed parquet files from S3, so I enabled the cleanSource delete option, but it does not work and I don't know why!
Before showing my code, I have to comment on it. I'm running multiple structured streaming queries in parallel: I have 5 buckets and I submit the queries for them in parallel. The job works perfectly, but it does not delete any processed files.
val tables = Seq("table1", "table2", "table3", "table4", "table5")
tables.par.map(table => {
  ReplicationTables.run(table)
})
object ReplicationTables {
  def run(table: String): Unit = {
    val dataFrame = spark.readStream
      .option("mergeSchema", "true")
      .schema(dfSchema)
      .option("cleanSource", "delete")
      .parquet(s"s3a://my-bucket/${table}/*")

    // I do some transformations and then write the resulting DataFrame, df, to S3 in Delta format
    df.writeStream
      .format("delta")
      .outputMode("append")
      .queryName(s"Delta/${table}")
      .trigger(Trigger.Once())
      .option("checkpointLocation", s"s3a://my-bucket/checkpoints/${table}")
      .start(s"s3a://my-bucket/Delta_Tables/${table}/")
      .awaitTermination()
  }
}
PS: Even with the log level set to INFO, I don't see any log messages about cleanSource.
PS 2: The Structured Streaming docs cover cleanSource here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

Try setting spark.sql.streaming.fileSource.cleaner.numThreads (for example to 10) to speed up cleanup. If files are being generated faster than the cleaner can remove them, Spark may not delete them; increasing the number of cleaner threads can help.
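For reference, a minimal sketch of how that could be wired up, assuming the cleaner thread count is set as a session-level SQL configuration before the query starts; the schema and bucket path below are placeholders, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("cleanSource-tuning").getOrCreate()

// Raise the file-source cleaner thread count before any streaming query starts.
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "10")

val schema = new StructType().add("id", StringType)  // placeholder schema
val input = spark.readStream
  .schema(schema)
  .option("cleanSource", "delete")                   // ask Spark to delete files once they are processed
  .parquet("s3a://my-bucket/table1/*")               // placeholder path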

Related

How to share state between runs of streaming jobs?

I have a Spark streaming job triggered every day using the Trigger.Once() method due to business requirements.
StreamingQuery query = processed
    .writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocationPathForDate)
    .trigger(Trigger.Once())
    .start();
I am using flatMapGroupsWithState so that we can store state (GroupState) for the grouped data.
Somewhere I read that the checkpointLocation should be different for every StreamingQuery. Therefore I use a checkpointLocation like this: /path/to/nfs/checkpoint/<current date in format: yyyyMMdd>
Every day, the Spark job processes files in the folder /path/to/data/<current date in format: yyyyMMdd>
I want to access the state of yesterday's Spark job, since yesterday's data may contain state that is relevant to today's data.
However, Spark stores the state data under checkpointLocation, i.e. /path/to/nfs/checkpoint/<current date in format: yyyyMMdd>/<queryName>/state, so when a different checkpointLocation is used, it is not possible to access it.
So, how can I access the GroupState data stored at the checkpointLocation of the previous Spark job? Is it OK to use the same checkpointLocation for different StreamingQueries?
Edit:
I tried to use the same checkpointLocation for yesterday's StreamingQuery and today's StreamingQuery, and Spark restored the state of yesterday's batch, which is what I want. However, is this documented anywhere? Is this expected behaviour, or can things misbehave when the same checkpointLocation is used between daily batches?
Edit2:
Data is stored at S3, in parquet format, path: s3a://bucket/batchdata/year=2022/month=01/day=19/
Sample data for 2022-01-19:
s3a://bucket/batchdata/year=2022/month=01/day=19/a.parquet
s3a://bucket/batchdata/year=2022/month=01/day=19/b.parquet
s3a://bucket/batchdata/year=2022/month=01/day=19/c.parquet
Data is read using Spark parquet readStream method:
// .parquet(...) is called for 2022-01-19
Dataset<Row> dataset = spark
    .readStream()
    .schema(PARQUET_SCHEMA)
    .parquet("s3a://bucket/batchdata/year=2022/month=01/day=19/");

Dataset<Row> processed = dataset.groupByKey(keyFuncion, encoder)
    .flatMapGroupsWithState(flatMapStateFunc,
        OutputMode.Append(),
        stateEncoder,
        outputEncoder,
        GroupStateTimeout.ProcessingTimeTimeout());

StreamingQuery query = processed.writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocation)
    .trigger(Trigger.Once())
    .start();

query.awaitTermination();
Next day same job is run with next day's parquet files stored under:
s3a://bucket/batchdata/year=2022/month=01/day=20/
Sample data for 2022-01-20:
s3a://bucket/batchdata/year=2022/month=01/day=20/d.parquet
s3a://bucket/batchdata/year=2022/month=01/day=20/e.parquet
// .parquet(...) is called for 2022-01-20
Dataset<Row> dataset = spark
    .readStream()
    .schema(PARQUET_SCHEMA)
    .parquet("s3a://bucket/batchdata/year=2022/month=01/day=20/");

Dataset<Row> processed = dataset.groupByKey(keyFuncion, encoder)
    .flatMapGroupsWithState(flatMapStateFunc,
        OutputMode.Append(),
        stateEncoder,
        outputEncoder,
        GroupStateTimeout.ProcessingTimeTimeout());

StreamingQuery query = processed.writeStream()
    .outputMode("append")
    .format("parquet")
    .option("path", resultPath)
    .option("checkpointLocation", checkpointLocation)
    .trigger(Trigger.Once())
    .start();

query.awaitTermination();
How can I access the GroupState data stored at the checkpointLocation of the previous Spark job?
You should not. Technically, you could (with some extra coding), but there are so many things specific to the other query (e.g., stateful operator IDs) that you would have to take into account. Use at your own risk.
Is it OK to use the same checkpointLocation for different StreamingQueries?
No. You should not share the same checkpointLocation between different streaming queries. For one thing, the queries differ in their operators, so the operator numbers may not match; and even if they did, the sinks could be different, so some data could get skipped (treated as already processed).
I tried to use the same checkpointLocation for yesterday's StreamingQuery and today's StreamingQuery, and Spark restored the state of yesterday's batch, which is what I want. However, is this documented anywhere? Is this expected behaviour, or can things misbehave when the same checkpointLocation is used between daily batches?
That's documented and that's exactly how checkpointLocation is supposed to work. It's the directory with the state of a streaming query at a given time.
Quoting Recovering from Failures with Checkpointing:
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
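To make that concrete, here is a minimal sketch (in Scala, while the question uses Java) of a daily Trigger.Once() job that mirrors the setup above: the checkpointLocation stays fixed across days while only the input directory changes, so the flatMapGroupsWithState state is restored on each run. The paths, the schema and the trivial counting state function are placeholders, not the asker's code:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("daily-stateful-batch").getOrCreate()
import spark.implicits._

// Placeholder state function: counts rows per key, accumulating across daily runs.
def countPerKey(key: String, rows: Iterator[Row], state: GroupState[Long]): Iterator[(String, Long)] = {
  val total = state.getOption.getOrElse(0L) + rows.size
  state.update(total)
  Iterator((key, total))
}

val inputPath  = "s3a://bucket/batchdata/year=2022/month=01/day=20/"  // changes every day
val checkpoint = "s3a://bucket/checkpoints/daily-job"                 // fixed across days

val schema = new StructType().add("id", StringType)                   // placeholder schema

spark.readStream
  .schema(schema)
  .parquet(inputPath)
  .groupByKey((row: Row) => row.getString(0))
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout())(countPerKey _)
  .writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "s3a://bucket/results/daily-job")                   // placeholder result path
  .option("checkpointLocation", checkpoint)
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()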

Fixed-interval micro-batch and one-time micro-batch trigger modes don't work with Parquet file sink

I'm trying to consume data from a Kafka topic and push the consumed messages to HDFS in Parquet format.
I'm using pyspark (2.4.5) to create a Spark structured streaming process. The problem is that my Spark job is endless and no data is pushed to HDFS.
process = (
    # connect to kafka brokers
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "brokers_list")
    .option("subscribe", "kafka_topic")
    .option("startingOffsets", "earliest")
    .option("includeHeaders", "true")
    .load()
    .writeStream.format("parquet")
    .trigger(once=True)  # tried with processingTime argument and got the same result
    .option("path", "hdfs://hadoop.local/draft")
    .option("checkpointLocation", "hdfs://hadoop.local/draft_checkpoint")
    .start()
)
My Spark session's UI looks like this (screenshots of the UI and stage details not included). When I check the query status from my notebook, I get:
{
'message': 'Processing new data',
'isDataAvailable': True,
'isTriggerActive': True
}
When I check my folder on HDFS, no data has been loaded; only a directory named _spark_metadata has been created in the output_location folder.
I don't face this problem if I remove the trigger line trigger(processingTime="1 minute"). When I use the default trigger mode, Spark creates a lot of small parquet files in the output location, which is inconvenient.
Are the processingTime and once trigger modes supported by the Parquet file sink?
If I have to use the default trigger mode, how can I handle the gigantic number of tiny files created in my HDFS system?

Resuming Structured Streaming from latest offsets

I would like to create Spark Structured Streaming job reading messages from Kafka source, writing to Kafka sink, which after failure will resume reading only current, newest messages. For that reason I don't need to keep checkpoints for my job.
But it looks like there is no option to disable checkpointing while writing to Kafka sink in Structured Streaming. To my understanding, even if I specify on the source:
.option("startingOffsets", "latest")
it will be taken into account only when the stream is first run, and after failure stream will resume from the checkpoint. Is there some workaround? And is there a way to disable checkpointing?
A workaround is to delete the existing checkpoint location from your code, so that each run starts fetching data from the latest offsets.
import org.apache.hadoop.fs.{FileSystem, Path}

val checkPointLocation = "/path/in/hdfs/location"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Delete the checkpoint location if it exists.
fs.delete(new Path(checkPointLocation), true)

val options = Map(
  "kafka.bootstrap.servers" -> "localhost:9092",
  "topic" -> "topic_name",
  "checkpointLocation" -> checkPointLocation,
  "startingOffsets" -> "latest"
)

df
  .writeStream
  .format("kafka")
  .outputMode("append")
  .options(options)
  .start()
  .awaitTermination()

How we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata )

Background:
I have written a simple Spark structured streaming app to move data from Kafka to S3. I found that, in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder, which ends up growing too large: when the streaming app runs for a long time, the metadata folder grows so big that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage the offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get the offsets in Spark Streaming. But I want to know how to get the offsets and other metadata needed to manage checkpointing ourselves in Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA, https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like the offsets are not provided. How should we go about it?
The issue is that within 6 hours the metadata grew to 45 MB, and it keeps growing until it reaches nearly 13 GB. The allocated driver memory is 5 GB, and at that point the system crashes with an OOM error. I'm wondering how to keep this metadata from growing so large, and how to make it log less information.
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .option("startingOffsets", "earliest")
    .load();
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
    // ...
    .withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")
Dataset dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
    .select("event.metadata", "event.data", "event.connection", "event.registration_event", "event.version_event");

SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");

Dataset flatDf = sqlContext
    .sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");

StreamingQuery query = flatDf
    .writeStream()
    .outputMode("append")
    .option("compression", "snappy")
    .format("parquet")
    .option("checkpointLocation", checkpointLocation)
    .option("path", outputPath)
    .partitionBy("date", "time", "id")
    .trigger(Trigger.ProcessingTime(triggerProcessingTime))
    .start();
query.awaitTermination();
For non-batch Spark Structured Streaming KAFKA integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka. Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and for checkpointing them at the end of the processing round (epoch or micro-batch).
You need not worry if you follow the Spark KAFKA integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
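If the goal is simply to observe the offsets that Spark itself tracks per micro-batch (for example, to copy them into your own store), a minimal sketch using the StreamingQueryListener API could look like this; the println is a placeholder for whatever persistence you choose:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

val spark = SparkSession.builder().appName("offset-observer").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Each source reports its start/end offsets (as JSON strings) for the finished micro-batch.
    event.progress.sources.foreach { s =>
      // Placeholder: replace println with a write to your own offset store.
      println(s"batch=${event.progress.batchId} source=${s.description} " +
        s"start=${s.startOffset} end=${s.endOffset}")
    }
  }
})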
For batch queries the situation is different: you need to manage the offsets yourself and store them.
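As a rough sketch of that batch approach (loadStartingOffsets and saveNextOffsets are hypothetical helpers you would implement against your own storage, and the broker and topic names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("batch-kafka-manual-offsets").getOrCreate()

// Hypothetical helpers: read/write an offsets JSON string such as {"topic1":{"0":43,"1":18}}
// from your own storage (an S3 object, a database row, ...).
def loadStartingOffsets(): String = "earliest"       // placeholder implementation
def saveNextOffsets(offsets: String): Unit = ()      // placeholder implementation

val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")  // placeholder brokers
  .option("subscribe", "topic1")                                 // placeholder topic
  .option("startingOffsets", loadStartingOffsets())
  .option("endingOffsets", "latest")
  .load()

// ... transform and write `batch` here ...

// Record the highest offset seen per partition; the next run should start at offset + 1.
val nextOffsets = batch
  .groupBy("topic", "partition")
  .agg((max("offset") + 1).as("nextOffset"))
  .collect()
saveNextOffsets(nextOffsets.mkString(","))           // placeholder serialization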
UPDATE
Based on the comments, I suggest the question is slightly different and advise you to look at Spark Structured Streaming Checkpoint Cleanup. In addition, given your updated comments and the fact that there is no error, I suggest you consult this article on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, it is different to my style, but I cannot see any obvious error.

Spark Structured Streaming writing to parquet creates so many files

I used Structured Streaming to load messages from Kafka, do some aggregation and then write to parquet files. The problem is that so many parquet files are created (800 files) for only 100 messages from Kafka.
The aggregation part is:
return model
    .withColumn("timeStamp", col("timeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "30 seconds")
    .groupBy(window(col("timeStamp"), "5 minutes"))
    .agg(count("*").alias("total"));
The query:
StreamingQuery query = result //.orderBy("window")
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("parquet")
    .option("checkpointLocation", "c:\\bigdata\\checkpoints")
    .start("c:\\bigdata\\parquet");
When loading one of the parquet files using Spark, it shows up empty:
+------+-----+
|window|total|
+------+-----+
+------+-----+
How can I save the dataset to only one parquet file?
Thanks
My idea was to use Spark Structured Streaming to consume events from Azure Event Hub and then store them in storage in Parquet format.
I finally figured out how to deal with the many small files being created.
Spark version 2.4.0.
This is how my query looks:
(dfInput
    .repartition(1, col('column_name'))
    .select("*")
    .writeStream
    .format("parquet")
    .option("path", "adl://storage_name.azuredatalakestore.net/streaming")
    .option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
    .trigger(processingTime='480 seconds')
    .start())
As a result, I have one file created on a storage location every 480 seconds.
To find the right balance between file size and number of files, and to avoid OOM errors, just play with two parameters: the number of partitions and processingTime, i.e. the batch interval.
I hope you can adjust the solution to your use case.
