spark parquet too many small files [duplicate]

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12 thousand files.
Job works as follows:
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF()
val df = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect { case List(date, y, "Event") => MyEvent(date, y, "Event") }
  .toDF()
df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within the partition 'date'.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).

You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.
Try this:
df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
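To see why this cuts the file count, it can help to model the write: each write task emits one file per distinct partitionBy value among its rows. The sketch below is plain Python rather than Spark. The helper names (count_output_files, repartition_by_date) and the toy data are hypothetical, and Python's built-in hash only stands in for Spark's hash partitioning.

```python
# Toy model (not Spark): each in-memory partition is a list of rows.

def count_output_files(partitions):
    """One file per (write task, distinct 'date' value) pair."""
    return sum(len({row["date"] for row in part}) for part in partitions)


def repartition_by_date(partitions, num_partitions):
    """Move every row of a given date into the same in-memory partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[hash(row["date"]) % num_partitions].append(row)
    return out


dates = ["2016-01-0%d" % d for d in range(1, 6)]  # 5 distinct dates
# 50 input splits, each holding one row for every date
input_parts = [[{"date": d, "y": i} for d in dates] for i in range(50)]

files_before = count_output_files(input_parts)
files_after = count_output_files(repartition_by_date(input_parts, 200))
print(files_before, files_after)  # 250 5
```

In this model the price of the fix is one shuffle, and each date is then written by a single task.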

In Python you can rewrite Raphael Roth's answer as:
(df
.repartition("date")
.write.mode("append")
.partitionBy("date")
.parquet("{path}".format(path=path)))
You might also consider adding more columns to .repartition to avoid problems with very large partitions:
(df
  .repartition("date", another_column, yet_another_column)
  .write.mode("append")
  .partitionBy("date")
  .parquet("{path}".format(path=path)))

The simplest solution would be to replace your current partitioning with:
df
.repartition(to_date($"date"))
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
You can also use more precise partitioning for your DataFrame, i.e. the day and maybe the hour or an hour range, and then be less precise for the writer.
That actually depends on the amount of data.
You can reduce entropy by partitioning the DataFrame and then writing with a partitionBy clause.

I came across the same issue, and using coalesce solved my problem.
df
.coalesce(3) // number of parts/files
.write.mode(SaveMode.Append)
.parquet(s"$path")
For more information on choosing between coalesce and repartition, see the question "Spark: coalesce or repartition".

Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916
This is working for me very well:
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during writes) and, once it's all settled, using Hadoop's FileUtil (or just the aws cli) to copy everything over:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
// ...
def copy(
  in: String,
  out: String,
  sparkSession: SparkSession
) = {
  val conf = sparkSession.sparkContext.hadoopConfiguration
  FileUtil.copy(
    FileSystem.get(new URI(in), conf),
    new Path(in),
    FileSystem.get(new URI(out), conf),
    new Path(out),
    false, // do not delete the source after copying
    conf
  )
}

How about running a script like this as a MapReduce job to consolidate all the parquet files into one:
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframes - target_df and reference_df. I need to remove the account_ids in target_df that are present in reference_df.
target_df is created from a Hive table and will have hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need to apply the same numPartitions value to each partition, but it is being applied to the entire dataframe. For example, if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitioning by "date", but this causes a huge data shuffle.
Can anyone please provide an optimized solution? Thanks!

I am afraid that you cannot skip the shuffle in this case. repartition, coalesce and partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried coalesce, and it is true that it does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it is not going to help you.
You can try to achieve what you want by combining repartition and partitionBy. Here is a description of both functions (the same applies to Scala; source: https://sparkbyexamples.com):
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory and when written to disk, it create
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy, Spark is going to create a sub-directory for each value and one part file for each in-memory partition that contains the given key.
So if, after the repartition, date X is present in 500 partitions out of 1000, you will find 500 files in the sub-directory for this date.
In the article I mentioned previously you can find a simple example of this behaviour; check chapter 1.3, partitionBy(colNames : String*) Example:
#Use repartition() and partitionBy() together
dfRepart.repartition(2) \
.write.option("header",True) \
.partitionBy("state") \
.mode("overwrite") \
.csv("c:/tmp/zipcodes-state-more")
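The relationship described in this answer (files per sub-directory = number of in-memory partitions that contain that key) can be checked with a small plain-Python model. It is only a sketch: round-robin assignment stands in for repartition(n), and the function names and sample dates are made up for illustration.

```python
from collections import defaultdict


def round_robin_repartition(rows, num_partitions):
    """Spread rows evenly over num_partitions, ignoring their keys."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def files_per_key(partitions, key):
    """Count, per key value, how many partitions hold at least one such row."""
    counts = defaultdict(int)
    for part in partitions:
        for value in {row[key] for row in part}:
            counts[value] += 1
    return dict(counts)


# A large date (20 rows) and a small one (3 rows), spread over 10 partitions
rows = [{"date": "20220101"}] * 20 + [{"date": "20220102"}] * 3
result = files_per_key(round_robin_repartition(rows, 10), "date")
print(result)
```

Here the large date ends up with one file per in-memory partition (10), while the small date gets only 3, so the per-date file count is bounded by the repartition number but not fixed by it.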

In Apache Spark's `bucketBy`, how do you generate 1 file per bucket instead of 1 file per bucket per partition?

I am trying to use Spark's bucketBy feature on a pretty large dataset.
dataframe.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
The problem is that my Spark cluster has about 500 partitions/tasks/executors (not sure of the terminology), so I end up with files that look like:
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet
part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet
part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
That's 500x500=250000 bucketed parquet files! It takes forever for the FileOutputCommitter to commit that to S3.
Is there a way to generate one file per bucket, like in Hive? Or is there a better way to deal with this problem? As of now it seems like I have to choose between lowering the parallelism of my cluster (reduce number of writers) or reducing the parallelism of my parquet files (reduce number of buckets).
Thanks

To get 1 file per final bucket, do the following. Right before writing the dataframe as a table, repartition it using exactly the same columns as the ones you are using for bucketing, and set the number of new partitions equal to the number of buckets you will use in bucketBy (or a smaller number that is a divisor of the number of buckets, though I don't see a reason to use a smaller number here).
In your case that would probably look like this:
dataframe.repartition(500, bucketColumn1, bucketColumn2)
.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
When you're saving to an existing table, you need to make sure the column types match exactly (e.g. if your column X is INT in the dataframe but BIGINT in the table you're inserting into, your repartitioning by X into 500 buckets won't match the repartitioning by X treated as BIGINT, and you'll end up with each of the 500 executors writing 500 files again).
Just to be 100% clear - this repartitioning will add another step to your execution, which is to gather the data for each bucket on 1 executor (so one full data reshuffle if the data was not partitioned the same way before). I'm assuming that is exactly what you want.
It was also mentioned in comments to another answer that you'll need to be prepared for possible issues if your bucketing keys are skewed. That is true, but the default Spark behavior doesn't help you much if the first thing you do after loading the table is to aggregate/join on the same columns you bucketed by (which seems like a very likely scenario for someone who chose to bucket by these columns). Instead you will get a delayed issue and only see the skew when you try to load the data after writing it.
In my opinion it would be really nice if Spark offered a setting to always repartition your data before writing a bucketed table (especially when inserting into existing tables).
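The file counts discussed in this answer can be reproduced with a toy model in plain Python. This is a sketch under assumptions: key % n stands in for Spark's real bucket hash, the function names are hypothetical, and 20 tasks with 20 buckets replace the 500 x 500 case to keep the numbers small.

```python
def bucket_id(key, num_buckets):
    """Stand-in for the bucket hash: any deterministic hash mod num_buckets."""
    return key % num_buckets


def count_bucket_files(partitions, num_buckets):
    """Each write task emits one file per bucket it holds rows for."""
    return sum(len({bucket_id(k, num_buckets) for k in part}) for part in partitions)


def repartition_by_key(partitions, num_partitions):
    """Hash-partition keys with the same hash that bucket_id uses."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for k in part:
            out[k % num_partitions].append(k)
    return out


NUM_BUCKETS = 20
# 20 tasks, each holding a contiguous range of 20 keys (so every bucket appears)
tasks = [list(range(t * 20, (t + 1) * 20)) for t in range(20)]

unaligned = count_bucket_files(tasks, NUM_BUCKETS)                        # tasks x buckets
aligned = count_bucket_files(repartition_by_key(tasks, 20), NUM_BUCKETS)  # one per bucket
divisor = count_bucket_files(repartition_by_key(tasks, 10), NUM_BUCKETS)  # still one per bucket
mismatch = count_bucket_files(repartition_by_key(tasks, 7), NUM_BUCKETS)  # buckets split again
print(unaligned, aligned, divisor, mismatch)  # 400 20 20 140
```

The last case mirrors the INT vs BIGINT caveat: once the repartitioning no longer lines up with the bucket assignment, each bucket is again spread over several writers.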

This should solve it.
dataframe.write()
.format("parquet")
.bucketBy(1, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
Change the input parameter of the bucketBy function to 1.
You can look at the code of bucketBy from spark's git repository - https://github.com/apache/spark/blob/f8d59572b014e5254b0c574b26e101c2e4157bdd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
The first split (part-00001, part-00002) is based on the number of parallel tasks running when you save the bucketed table. In your case you had 500 parallel tasks running. The number of files under each part prefix is determined by the value you pass to the bucketBy function.
To learn more about Spark tasks, partitions, executors, view my Medium articles - https://medium.com/#tharun026

How to avoid writing empty json files in Spark [duplicate]

I am reading from a Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying a filter on the dataframe. I am saving this filtered dataframe into a parquet file. This is generating many empty parquet files. Is there any way I can stop writing an empty file?
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KafkaServer) \
.option("subscribe", KafkaTopics) \
.load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decompDF.filter(.....)
query = filterDF.writeStream \
.option("path", outputpath) \
.option("checkpointLocation", RawXMLCheckpoint) \
.start()

Is there any way I can stop writing an empty file.
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur performance degradation due to the extra step of shuffling the data between partitions (and possibly nodes in your Spark cluster). That can be expensive and not worth doing (hence I said that you would rather not do it).
You may then be asking yourself, how to know the right number of partitions? And that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what and how Spark does the processing): "Know your data" so you can calculate how many is exactly right.
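A rough way to picture the explanation above is a plain-Python model in which every in-memory partition is saved as one file, as the answer describes. The sketch below is not Spark code: the function names, the filter condition and the partition counts are all made up for illustration, and coalesce is modelled as simply merging neighbouring partitions.

```python
def empty_file_count(partitions):
    """Model from the answer: every in-memory partition is saved as one file."""
    return sum(1 for part in partitions if not part)


def coalesce(partitions, n):
    """Merge neighbouring partitions into n groups without redistributing rows."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].append(part)
    # Flatten each group of merged partitions into a single list of records
    return [[x for part in group for x in part] for group in merged]


# 8 partitions of 10 records each, then a selective filter
batch = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]
filtered = [[x for x in part if x % 40 == 0] for part in batch]

empty_before = empty_file_count(filtered)
empty_after = empty_file_count(coalesce(filtered, 2))
print(empty_before, empty_after)  # 6 0
```

Only two records survive the filter, so six of the eight partitions are empty; picking a partition count that matches the surviving data removes the empty files in this model.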

I recommend using repartition(partitioningColumns) on the DataFrame or Dataset, and after that partitionBy(partitioningColumns) on the writeStream operation, to avoid writing empty files.
Reason:
With a lot of data, the bottleneck is often Spark's read performance when there are many small (or even empty) files and no partitioning. So you should definitely make use of file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitionColumns should fit your common queries when reading the data like timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
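The Hive-style layout quoted above can be illustrated without Spark. The following is a minimal sketch in plain Python; partition_dir and the sample rows are hypothetical and only show how partition column values map to sub-directories.

```python
def partition_dir(row, partition_cols):
    """Build the Hive-style sub-directory for one row, e.g. year=2016/month=01/."""
    return "".join("%s=%s/" % (col, row[col]) for col in partition_cols)


rows = [
    {"year": "2016", "month": "01", "value": 1},
    {"year": "2016", "month": "02", "value": 2},
    {"year": "2016", "month": "01", "value": 3},
]
# Rows sharing the same partition values land in the same directory
dirs = sorted({partition_dir(r, ["year", "month"]) for r in rows})
print(dirs)  # ['year=2016/month=01/', 'year=2016/month=02/']
```

A query with a predicate on year and month then only needs to read the matching directories.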

You can try repartitionByRange(column).
I used this while writing a dataframe to HDFS, and it solved my empty file creation issue.

If you are using YARN client mode, then setting the number of executor cores to 1 will solve the problem. This means that only 1 task will run at any time per executor.

Spark : Modify CSV file and write to other folder

Folks,
We have a requirement to do a minor transformation on a CSV file and write the result to another HDFS folder using Spark.
e.g /input/csv1.txt (at least 4 GB file)
ID,Name,Address
100,john,some street
The output should be in a file (output/csv1.txt). Basically, two new columns will be added after analyzing the address (the order of records should be the same as in the input file).
ID,Name,Address,Country,ZipCode
100,Name,Address,India,560001
Looks like there is no easy way to do this with Spark.

Ehm, I don't know what you mean by no easy way - the spark-csv package makes it very easy IMHO. Depending on which version of Spark you are running, you need to do one of the following:
Spark 2.x
val df = spark.read.csv("/path/to/files/")
df
.withColumn("country", ...)
.withColumn("zip_code", ...)
.write
.csv("/my/output/path/")
Spark 1.x
val df = sqlContext.read.format("com.databricks.spark.csv").load("/path/to/my/files/")
df
.withColumn("country", ...)
.withColumn("zip_code", ...)
.write
.format("com.databricks.spark.csv")
.save("/my/output/path/")
Note that I just put withColumn here - you are probably joining with some other dataframe containing the country and zip code, but my example is just to illustrate how you read and write it with the spark-csv package (which has been built into Spark 2.x).
