Spark Structured Streaming writing to parquet creates so many files

I used Structured Streaming to load messages from Kafka, do some aggregation, and then write to Parquet files. The problem is that so many Parquet files are created (800 files) for only 100 messages from Kafka.
The aggregation part is:
return model
    .withColumn("timeStamp", col("timeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "30 seconds")
    .groupBy(window(col("timeStamp"), "5 minutes"))
    .agg(count("*").alias("total"));
The query:
StreamingQuery query = result // .orderBy("window")
    .writeStream()
    .outputMode(OutputMode.Append())
    .format("parquet")
    .option("checkpointLocation", "c:\\bigdata\\checkpoints")
    .start("c:\\bigdata\\parquet");
When I load one of the Parquet files with Spark, it shows as empty:
+------+-----+
|window|total|
+------+-----+
+------+-----+
How can I save the dataset to only one parquet file?
Thanks

My idea was to use Spark Structured Streaming to consume events from Azure Event Hub and store them on storage in Parquet format.
I finally figured out how to deal with the many small files being created.
Spark version 2.4.0.
This is how my query looks:
from pyspark.sql.functions import col

(dfInput
    .repartition(1, col('column_name'))  # one partition -> one file per micro-batch
    .select("*")
    .writeStream
    .format("parquet")
    .option("path", "adl://storage_name.azuredatalakestore.net/streaming")
    .option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
    .trigger(processingTime='480 seconds')
    .start())
As a result, I have one file created at the storage location every 480 seconds.
To balance file size against the number of files (and avoid OOM errors), tune two parameters: the number of partitions and processingTime, i.e. the batch interval.
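For illustration, a hedged variant of the same query (paths reused from above; the exact numbers are placeholders to experiment with): more partitions and a shorter trigger produce more, smaller files per batch.

# A sketch of the trade-off, not a recommendation: 4 partitions every
# 120 seconds yields up to 4 smaller files per 2-minute batch, versus
# 1 larger file per 8-minute batch in the query above.
(dfInput
    .repartition(4, col('column_name'))
    .writeStream
    .format("parquet")
    .option("path", "adl://storage_name.azuredatalakestore.net/streaming")
    .option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
    .trigger(processingTime='120 seconds')
    .start())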
I hope you can adjust the solution to your use case.

Related

Fixed-interval micro-batch and one-time micro-batch trigger modes don't work with the Parquet file sink

I'm trying to consume data from a Kafka topic and push the consumed messages to HDFS in Parquet format.
I'm using PySpark (2.4.5) to create the Spark Structured Streaming process. The problem is that my Spark job runs endlessly and no data is pushed to HDFS.
process = (
    # connect to Kafka brokers
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "brokers_list")
    .option("subscribe", "kafka_topic")
    .option("startingOffsets", "earliest")  # note: the option name is plural
    .option("includeHeaders", "true")
    .load()
    .writeStream.format("parquet")
    .trigger(once=True)  # tried with the processingTime argument and got the same result
    .option("path", "hdfs://hadoop.local/draft")
    .option("checkpointLocation", "hdfs://hadoop.local/draft_checkpoint")
    .start()
)
My Spark session's UI looks like this:
More details on the stage:
I checked the query status in my notebook and got this:
{
    'message': 'Processing new data',
    'isDataAvailable': True,
    'isTriggerActive': True
}
When I check my folder on HDFS, no data has been loaded; only a directory named _spark_metadata is created in the output location.
I don't face this problem if I remove the trigger line trigger(processingTime="1 minute"). With the default trigger mode, Spark creates a lot of small Parquet files in the output location, which is inconvenient.
Do the two trigger modes, processingTime and once, support the Parquet file sink?
If I have to use the default trigger mode, how can I handle the gigantic number of tiny files created on my HDFS system?
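One common way to deal with the small-files problem, sketched here as a hedged assumption rather than an accepted answer: run a separate periodic batch job that compacts the sink directory into fewer, larger files.

# A sketch of a compaction batch job (pyspark; paths reused from the question,
# the target file count is a placeholder). Caveat: the file sink keeps a
# _spark_metadata log, so compact into a separate location rather than
# rewriting the directory the stream is actively writing to.
compacted = spark.read.parquet("hdfs://hadoop.local/draft")
(compacted
    .repartition(4)  # collapse many tiny files into ~4 larger ones
    .write
    .mode("overwrite")
    .parquet("hdfs://hadoop.local/draft_compacted"))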

cleanSource option does not delete any files

I have a Structured Streaming job with Trigger.Once() enabled which I run every 20 minutes. After each run, I want to remove my processed Parquet files from S3, so I enabled the cleanSource delete option, but it does not work and I don't know why!
Before showing my code, a note about it: I'm running multiple Structured Streaming queries in parallel; I have 5 buckets and I submit these in parallel. The job works perfectly but does not delete any processed files.
val tables = Seq("table1", "table2", "table3", "table4", "table5")
tables.par.map(table => {
  ReplicationTables.run(table)
})
object ReplicationTables {
  def run(table: String): Unit = {
    val dataFrame = spark.readStream
      .option("mergeSchema", "true")
      .schema(dfSchema)
      .option("cleanSource", "delete")
      .parquet(s"s3a://my-bucket/${table}/*")

    // I do some transformations and then write my new DataFrame, called df, to S3 in Delta format
    df.writeStream
      .format("delta")
      .outputMode("append")
      .queryName(s"Delta/${table}")
      .trigger(Trigger.Once())
      .option("checkpointLocation", s"s3a://my-bucket/checkpoints/${table}")
      .start(s"s3a://my-bucket/Delta_Tables/${table}/")
      .awaitTermination()
  }
}
PS: Even with INFO log level I don't get any logs about cleanSource.
PS 2: See the Structured Streaming docs on cleanSource: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Try using option("spark.sql.streaming.fileSource.cleaner.numThreads", "10") to speed up the cleanup. If more files are being generated than can be cleaned in the available time, Spark doesn't delete them; increasing the number of cleaner threads may help.
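A minimal sketch of that suggestion in PySpark (the cleaner thread count is a session-level SQL conf available since Spark 3.0; bucket and schema names are placeholders mirroring the question):

# Set the cleaner thread count once on the session before starting the stream.
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "10")

df = (spark.readStream
    .schema(df_schema)                # df_schema: assumed, mirrors dfSchema above
    .option("cleanSource", "delete")  # delete each source file after it is processed
    .parquet("s3a://my-bucket/table1/*"))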

Spark Structured Streaming custom partition directory name

I'm porting a streaming job (Kafka topic -> AWS S3 Parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
    .withColumn("month", functions.date_format(col("createdAt"), "MM"))
    .withColumn("day", functions.date_format(col("createdAt"), "dd"))
    .writeStream()
    .trigger(Trigger.ProcessingTime("15 seconds"))
    .outputMode(OutputMode.Append())
    .format("parquet")
    .option("checkpointLocation", "/some/checkpoint/directory/")
    .option("path", "/some/directory/")
    .option("truncate", "false")
    .partitionBy("year", "month", "day")
    .start()
    .awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
Question:
Is there a way to customize the output directory name? For backward-compatibility reasons, I need it to be:
/s3-bucket/some/directory/2021/01/02/
No, there is no way to customize the output directory names into that format from within your Spark Structured Streaming application.
Partition directories are named after the values of particular columns, and without the column names in the path it would be ambiguous which column each value belongs to. You need to write a separate application that transforms those directories into the desired format.
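A hedged sketch of such a separate application (assuming boto3, and that the bucket and prefix match the paths above; S3 has no rename, so each object is copied to its new key and the original deleted):

# Strips "year=", "month=" and "day=" from every object key under the prefix,
# e.g. year=2021/month=01/day=02/ -> 2021/01/02/. Names are placeholders.
import re
import boto3

s3 = boto3.client("s3")
bucket = "s3-bucket"          # assumed bucket name
prefix = "some/directory/"    # assumed output prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = re.sub(r"(year|month|day)=", "", old_key)
        if new_key != old_key:
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": old_key},
                           Key=new_key)
            s3.delete_object(Bucket=bucket, Key=old_key)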

Spark Structured Streaming: reading files written to HDFS fails if the data is read back immediately

I'd like to load a Hive table (target_table) as a DataFrame after writing a new batch out to HDFS (target_table_dir) using Spark Structured Streaming as follows:
(df.writeStream
    .trigger(processingTime='5 seconds')
    .foreachBatch(lambda batch_df, batch_id:
        batch_df.write
            .option("path", target_table_dir)
            .format("parquet")
            .mode("append")
            .saveAsTable(target_table))
    .start())
When we immediately read the same data back from the Hive table, we get a "partition not found" exception. If we read with some delay, the data is correct.
It seems that although execution has moved on and the Hive metastore is already updated, Spark is still writing data out to HDFS.
How can we know when the write of the data to the Hive table (into HDFS) is complete?
Note:
We have found that if we use processAllAvailable() after writing out, a subsequent read works fine. But processAllAvailable() will block execution forever if we are dealing with continuous streams.
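One hedged alternative, using only the names from the question: do the follow-up read inside foreachBatch itself, since saveAsTable() is synchronous and the batch's files are committed by the time it returns.

# A sketch under the question's assumptions (target_table and target_table_dir
# defined elsewhere): saveAsTable() blocks until this batch is committed, so a
# read placed after it sees the new partition.
def write_then_read(batch_df, batch_id):
    (batch_df.write
        .option("path", target_table_dir)
        .format("parquet")
        .mode("append")
        .saveAsTable(target_table))
    # The write above has completed for this batch; reading here is safe.
    spark.table(target_table).count()

query = (df.writeStream
    .trigger(processingTime='5 seconds')
    .foreachBatch(write_then_read)
    .start())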

Structured Streaming 2.1.0 stream to Parquet creates many small files

I am running Structured Streaming (2.1.0) on a 3-node yarn cluster and stream json records to parquet. My code fragment looks like this:
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .start("/data/kafka_streaming.parquet")
I notice it quickly creates thousands of small files for only 1,000 records. I suspect it has to do with the frequency of the trigger, so I changed it:
val query = ds.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
  .trigger(ProcessingTime("60 seconds"))
  .start("/data/kafka_streaming.parquet")
The difference is very obvious: now I can see a much smaller number of files created for the same number of records.
My question: is there any way to keep the trigger latency low and still produce a smaller number of larger output files?
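One common approach, sketched here as a hedged assumption rather than a confirmed answer: keep the short trigger for latency, but repartition each micro-batch down to a single file, and compact older data asynchronously (as in the compaction sketch earlier) if files still accumulate.

# A sketch in pyspark mirroring the Scala query above (ds and paths assumed):
# a short trigger keeps latency low, while repartition(1) caps each
# micro-batch at one output file.
query = (ds.repartition(1)
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/kafka_streaming.checkpoint")
    .trigger(processingTime="10 seconds")
    .start("/data/kafka_streaming.parquet"))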
