Structured Streaming 2.1.0 stream to Parquet creates many small files - apache-spark

I am running Structured Streaming (2.1.0) on a 3-node YARN cluster, streaming JSON records to Parquet. My code fragment looks like this:
val query = ds.writeStream
.format("parquet")
.option("checkpointLocation", "/data/kafka_streaming.checkpoint")
.start("/data/kafka_streaming.parquet")
I noticed it quickly creates thousands of small files for only 1,000 records. I suspected it had to do with the frequency of the trigger, so I changed it:
val query = ds.writeStream
.format("parquet")
.option("checkpointLocation", "/data/kafka_streaming.checkpoint")
.trigger(ProcessingTime("60 seconds"))
.start("/data/kafka_streaming.parquet")
The difference is obvious: now far fewer files are created for the same number of records.
My question: is there any way to keep the trigger latency low while still producing a smaller number of larger output files?

Related

How to generate one hour length parquet files with Spark Structured Streaming (Python)?

I want to generate parquet files every hour with all the information received during that hour for further processing using Spark NLP.
I have streaming data coming from Kafka, and when I set the writeStream trigger processing time to one hour it still generates many small Parquet files. I've read that setting coalesce to 1 together with the trigger should produce one big Parquet file, but it still gives me many small ones.
I've also read that one can set a minimum number of rows, but the number of rows I receive changes from hour to hour.
This is how I'm writing my stream:
df.writeStream \
.format("parquet") \
.option("checkpointLocation", "s3a://datalake-twitter-app/spark_checkpoints/") \
.option("path", "s3a://datalake-twitter-app/raw_datalake/") \
.trigger(processingTime='60 minutes') \
.start()
Any idea how I can write hour-long Parquet files containing all the information received during the given hour using Spark Structured Streaming? Maybe I should use something different?
I also wondered whether it would be better to have the other Spark process focus on reading the files from the S3 bucket, reading only the files from the previous hour.
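For reference, this is roughly what the coalesce(1) variant mentioned above looks like, reusing the paths from the question; it is only a sketch of that attempt, not a guaranteed fix:
# coalesce(1): write each hourly micro-batch through a single task,
# which normally yields one file per trigger for modest hourly volumes
df.coalesce(1) \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", "s3a://datalake-twitter-app/spark_checkpoints/") \
    .option("path", "s3a://datalake-twitter-app/raw_datalake/") \
    .trigger(processingTime='60 minutes') \
    .start()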

How to slow down the write speed of Kafka Producer?

I use Spark to write data to Kafka in this way:
df.write().format("kafka").save()
Can I control the write speed to Kafka to avoid putting pressure on Kafka?
Are there any options that help to slow down the speed?
I think setting linger.ms to a non-zero value would help, as it controls the amount of time to wait for additional messages before sending the current batch. With Spark's Kafka sink, producer properties are passed with a kafka. prefix, so the code can look like the following:
df.write.format("kafka").option("kafka.linger.ms", "100").save()
But this really depends on a lot of things. If your Kafka cluster is 'big' enough and configured properly, I wouldn't worry too much about the speed. After all, Kafka is designed to cope with this kind of situation (traffic spikes).
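A slightly fuller sketch of that idea; the broker address and topic name are placeholders, and the kafka.-prefixed options are standard Kafka producer properties passed through Spark's Kafka sink:
# The DataFrame written to Kafka needs a string (or binary) "value" column.
df.selectExpr("CAST(value AS STRING)") \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092") \
    .option("topic", "output-topic") \
    .option("kafka.linger.ms", "100") \
    .option("kafka.batch.size", "65536") \
    .save()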
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source that let you control the processing rate, such as maxFilesPerTrigger in the file source and maxOffsetsPerTrigger in the Kafka source.
val streamingETLQuery = cloudtrailEvents
.withColumn("date", $"timestamp".cast("date") // derive the date
.writeStream
.trigger(ProcessingTime("10 seconds")) // check for files every 10s
.format("parquet") // write as Parquet partitioned by date
.partitionBy("date")
.option("path", "/cloudtrail")
.option("checkpointLocation", "/cloudtrail.checkpoint/")
.start()
val df = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("text-logs")
Read the following links for more details:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

How to avoid writing empty json files in Spark [duplicate]

I am reading from a Kafka queue using Spark Structured Streaming. After reading from Kafka I apply a filter on the dataframe, and I save this filtered dataframe to a parquet file. This is generating many empty parquet files. Is there any way I can stop writing empty files?
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KafkaServer) \
.option("subscribe", KafkaTopics) \
.load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decompDF.filter(.....)
query = filterDF.writeStream \
.option("path", outputpath) \
.option("checkpointLocation", RawXMLCheckpoint) \
.start()
Is there any way I can stop writing empty files?
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur performance degradation due to the extra step of shuffling the data between partitions (and possibly nodes in your Spark cluster). That can be expensive and not worth it (hence my saying that you would rather not do it).
You may then be asking yourself how to know the right number of partitions, and that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what Spark does and how): "know your data" well enough to calculate exactly how many partitions are right.
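As a rough sketch of that trade-off using the question's own stream, coalescing to a small fixed number of partitions before the sink; the number 4 is only a placeholder you would derive from your per-batch volume:
# coalesce(4): collapse each filtered micro-batch into a few non-empty partitions
query = filterDF.coalesce(4) \
    .writeStream \
    .format("parquet") \
    .option("path", outputpath) \
    .option("checkpointLocation", RawXMLCheckpoint) \
    .start()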
I recommend using repartition(partitioningColumns) on the DataFrame (or Dataset) and after that partitionBy(partitioningColumns) on the writeStream operation to avoid writing empty files.
Reason:
If you have a lot of data, the bottleneck is often Spark's read performance when there are many small (or even empty) files and no partitioning. So you should definitely make use of file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitioning columns should fit your common queries when reading the data, e.g. timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable to all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
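A minimal PySpark sketch of that recommendation, reusing the question's filterDF and assuming a derived partitioning column named event_date (the column name is only an example):
from pyspark.sql.functions import col

# Repartition by the same column that partitionBy uses, so each micro-batch
# writes roughly one file per event_date directory instead of many tiny ones.
query = filterDF.repartition(col("event_date")) \
    .writeStream \
    .format("parquet") \
    .partitionBy("event_date") \
    .option("path", outputpath) \
    .option("checkpointLocation", RawXMLCheckpoint) \
    .start()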
You can try repartitionByRange(column).
I used this while writing a dataframe to HDFS, and it solved my empty-file creation issue.
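For completeness, a small sketch of that batch usage; the column name, partition count, and HDFS path are placeholders:
# Range-partition the batch DataFrame before writing, so each output file
# covers a contiguous range of event_date values.
df.repartitionByRange(8, "event_date") \
    .write \
    .mode("append") \
    .parquet("hdfs:///data/events")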
If you are using YARN client mode, setting the number of executor cores to 1 will solve the problem. This means that only one task will run at any time per executor.

Spark Structured Streaming writing to parquet creates so many files

I used Structured Streaming to load messages from Kafka, do some aggregation, and then write to a Parquet file. The problem is that so many Parquet files (800) are created for only 100 messages from Kafka.
The aggregation part is:
return model
.withColumn("timeStamp", col("timeStamp").cast("timestamp"))
.withWatermark("timeStamp", "30 seconds")
.groupBy(window(col("timeStamp"), "5 minutes"))
.agg(
count("*").alias("total"));
The query:
StreamingQuery query = result //.orderBy("window")
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "c:\\bigdata\\checkpoints")
.start("c:\\bigdata\\parquet");
When loading one of the parquet files with Spark, it shows as empty:
+------+-----+
|window|total|
+------+-----+
+------+-----+
How can I save the dataset to only one parquet file?
Thanks
My idea was to use Spark Structured Streaming to consume events from Azure Event Hubs and then store them in storage in Parquet format.
I finally figured out how to deal with many small files created.
Spark version 2.4.0.
This is how my query looks:
dfInput
.repartition(1, col('column_name'))
.select("*")
.writeStream
.format("parquet")
.option("path", "adl://storage_name.azuredatalakestore.net/streaming")
.option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
.trigger(processingTime='480 seconds')
.start()
As a result, I have one file created on a storage location every 480 seconds.
To find the balance between file size and number of files and avoid OOM errors, just play with two parameters: the number of partitions and processingTime, i.e. the batch interval.
I hope you can adjust the solution to your use case.

Resources