I saved my dataframe in parquet format:
df.write.parquet('/my/path')
When checking on HDFS, I can see that there are 10 part-xxx.snappy.parquet files under the parquet directory /my/path.
My question is: does one part-xxx.snappy.parquet file correspond to a partition of my dataframe?
Yes, part-** files are created based on the number of partitions in the dataframe at the time it is written to HDFS.
To check the number of partitions in the dataframe:
df.rdd.getNumPartitions()
To control the number of files written to the filesystem, you can use .repartition() or .coalesce(), or pick the partition count dynamically based on your requirements.
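For example, a minimal sketch of either option (the target of 4 files is illustrative; the path is the one from the question):
// Full shuffle into an explicit number of partitions before writing:
df.repartition(4).write.parquet("/my/path")
// Or merge existing partitions down without a full shuffle:
df.coalesce(4).write.parquet("/my/path")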
Yes, this creates one file per Spark partition.
Note that you can also partition the files by some attribute:
df.write.partitionBy("key").parquet("/my/path")
In that case Spark will create, for each parquet partition, up to as many files as there are Spark partitions. A common way to reduce the number of files here is to repartition the data by the key before writing (this effectively creates one file per partition).
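A minimal sketch of that pattern, reusing the "key" column and path from the example above:
import org.apache.spark.sql.functions.col

// One shuffle so that all rows with the same "key" land in the same Spark partition;
// partitionBy then writes roughly one file per distinct value of "key".
df.repartition(col("key"))
  .write
  .partitionBy("key")
  .parquet("/my/path")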
Related
Hi, I want to save my Spark dataframe to a file with a custom file format,
such that it partitions the data into different files while writing.
I also need a single part file for each partition key.
I have tried extending TextBasedFileFormat and changing the writer to suit my needs.
The data is getting partitioned while writing, without a shuffle,
but I suspect that each RDD partition will write its data to a different part file.
When you write the dataframe, each partition of the underlying RDD is written by a separate task. Each of these RDD partitions may contain data belonging to several different partition keys, so each task can end up creating multiple part files.
To solve this, you have to repartition your dataframe by the partitionKey. This involves a shuffle, and all the data corresponding to the same partitionKey ends up in the same RDD partition. This can be done with:
import org.apache.spark.sql.functions.col
val newDf = df.repartition(col("partitionKey"))
Now this dataframe can be written out in any file format (e.g. parquet, csv), and there should be 1 file per partition. If the files grow too big, multiple files may be created per partition; this can be controlled with the config "spark.sql.files.maxRecordsPerFile".
val newDf = df.repartition(col("partitionKey"))
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")
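If the per-partition files grow too large, here is a hedged sketch of capping them with that config (the 1,000,000-record threshold is just an illustrative value):
// Limit how many records Spark writes into any single file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)
newDf.write.partitionBy("partitionKey").parquet("<directory_path>")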
I would like to take small parquet files that are spread out through multiple partition layers on S3 and compact them into larger files, written back out to S3 under a single partition layer.
So in this example, I have 3 partition layers (part1, part2, part3). I would like to take this data and write it back out partitioned only by part2.
For my first run through I used:
df = (spark.read
    .option("basePath", "s3://some_bucket/base/location/in/s3/")
    .parquet("s3://some_bucket/base/location/in/s3/part1=*/part2=*/part3=*/"))
df.write.partitionBy("part2").parquet("s3://some_bucket/different/location/")
This worked for the most part, but it still seems to create small files, since I'm not running a coalesce or repartition. This brings me to my question: is there a way I can easily compact these files into larger files based on size/row counts?
Thanks in advance!
Is there a way I can easily compact these files into larger files based on size/row counts?
Not really. Spark doesn't provide any utility that can be used to limit the size of the output files, as each file generally corresponds to a single partition.
So repartitioning by the same column as used for partitionBy is your best bet.
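A rough sketch of that, using the part2 column and the output path from the question (the size of each output file will still depend on how much data each part2 value holds):
import org.apache.spark.sql.functions.col

// Shuffle so each value of part2 is collected into a single Spark partition,
// which yields roughly one larger file per part2 directory.
df.repartition(col("part2"))
  .write
  .partitionBy("part2")
  .parquet("s3://some_bucket/different/location/")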
option("maxRecordsPerFile", 400000)
Use this option when writing the file.
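For example, a sketch of where that option fits in the write path (the 400,000 value comes from the snippet above; the path is the question's output location, and this can be combined with the repartition shown earlier):
df.write
  .option("maxRecordsPerFile", 400000)   // split a partition's output once it exceeds 400k rows
  .partitionBy("part2")
  .parquet("s3://some_bucket/different/location/")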
Our Spark program is reading in parquet files. These files are partitioned by date in a directory-like structure (e.g. month=201703/day=20170313/). The filenames themselves contain a number which reflects the Kafka partition they originally came from (e.g. data.232.parquet). The data of a specific user always ends up in the same partition, so it would make sense for one Spark executor to read all the parquet files of the same partition (to avoid shuffling down the road) across all the dates.
How can we accomplish this? Maybe we also have to put the partition number in the directory hierarchy? But even then it's not clear to me how we can tell Spark to use this information.
I am doing the following:
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition there are too many parquet files and each of them is very small, which makes my subsequent steps very slow as they load all the parquet files. Is there a better way to end up with fewer, larger parquet files under each partition?
You can repartition before saving:
import org.apache.spark.sql.functions.col
rdd.toDF.repartition(col("Some Column")).write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
I used to have this problem.
Actually, you can't directly control how the files are partitioned, because it depends on what the executors do.
The way to work around it is to use coalesce (or repartition, which shuffles) to get however many partitions you want, but it's not an efficient approach, and you also need to give the driver enough memory to handle the operation.
df.coalesce(numPartitions).write.partitionBy("yyyyy").parquet("xxxx")
I also faced this issue. The problem is that if you use coalesce, each partition gets the same number of parquet files. Since different partitions have different sizes, ideally I would need a different coalesce setting for each partition.
It's going to be quite expensive if you open a lot of small files. Let's say you open 1k files and each file's size is far below the value of your parquet.block.size.
Here are my suggestions:
Create a job that first merges your input parquet files into a smaller number of files whose sizes are near or equal to parquet.block.size. The default block size is 128 MB, though it's configurable via parquet.block.size. Spark works best when your parquet files are close to (but not over) the value of parquet.block.size; the block size is the size of a row group being buffered in memory. See the sketch after these suggestions.
Or update your Spark job to read only a limited number of files.
Or, if you have a big machine and/or enough resources, just do the right tuning.
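A hedged sketch of the first suggestion, assuming a Scala job; the 128 MB target, the partition count of 16, and the paths are placeholders:
// The Parquet writer reads the row-group (block) size from the Hadoop configuration.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

// Re-read the many small files and rewrite them as fewer, larger files.
val merged = spark.read.parquet("/path/to/small/files")
merged.repartition(16).write.parquet("/path/to/merged/files")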
Hive queries have a way to merge small files into larger ones; this is not available in Spark SQL. Also, reducing spark.sql.shuffle.partitions won't help with the Dataframe API.
I tried the solution below and it generated a smaller number of parquet files (from 800 parquet files down to 29).
Suppose the data is loaded into a dataframe df.
Create a temporary table in Hive:
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The test_temp table will contain the small parquet files.
Populate the final Hive table from the temporary table:
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain fewer files. Drop the temporary table after populating the final table.
I have a job that needs to save its result in parquet/avro format from all the worker nodes. Can I write a separate parquet file for each individual partition and read all the resulting files as a single table? Or is there a better way of going about this?
The input is divided into 96 partitions and the result needs to be saved on HDFS. When I tried to save it as a file, it created over a million small files.
You can do a repartition (or a coalesce, if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written into that number of files. When you want to read the data back, you simply point to the folder containing the files rather than to a specific file, like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")
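For completeness, a rough sketch of the write side this refers to (df stands for the result dataframe from the question; 96 matches its input partition count and the path mirrors the read example above):
// Collapse the result into 96 files (one per partition) before writing;
// coalesce(96) would also work and avoids a full shuffle.
df.repartition(96)
  .write
  .parquet("s3://my-bucket/path/to/files/")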