Resolving small file issue in pyspark - apache-spark

I am reading from a partitioned table that has close to 4 billion records.
The files that I am reading from is my source, and I have no control over it to alter the records.
While reading the files through dataframes, for each partition I am creating 2000 files of size less than 2KB. This is because of shuffle partition being set to 2000, to increase the execution speed.
Approach followed to resolve this issue:
I have looped over the HDFS path of the table, post its execution is completed as has created a list with data paths [/dv/hdfs/..../table_name/partition_value=01,/dv/hdfs/..../table_name/partition_value=02..]
For each such path, I have calculated
disk usage and block size from cluster and got the appropriate number of partitions as
no_of_partitions = ceil[disk_usage / block size], and then written the data into another location with the same partition_id such as [/dv/hdfs/..../table2_name/partition_value=01].
Now though this works in reducing the small files to avg block size of 82 MB from 2KB, it is taking about 2.5 mins per partition. With 256 such partitions being available, it is taking more than 10hrs to finish the execution.
Kindly suggest any other method where this could be achieved in less than 2 hrs of time.

Although you have 2000 shuffle partitions you can and should control the output files.
Generating small files in spark is itself a performance degradation for the next read operations.
Now to control small files issue you can do the following:
While writing the dataframe to hdfs repartition it based on the number of partitions and controlling the number of output files per partition
df.repartition(partition_col).write.option("maxRecordsPerFile", 100000).partition_by(partition_col).parquet(path)
This will generate files having 100000 records each in every partition. Hence solving your small files issue. This will improve overall read and write performance of your job.
Hope it helps.

Related

How to specify file size using repartition() in spark

Im using pyspark and I have a large data source that I want to repartition specifying the files size per partition explicitly.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB to 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = //...
estimated_records_with_desired_size = 2000
result_df.write.option(
"maxRecordsPerFile",
estimated_records_with_desired_size) \
.parquet(storage_location, compression="snappy")

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, perform a transformation and then write it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
frame = someDateFrame,
connection_type = "s3",
connection_options = {
"path": "s3://the-bucket/some-data-set"
},
format = "parquet"
)
The result is 12 Parquet files with an average size of about 3MB.
First of all, I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files. If somebody has an insight here, I'd appreciate to hear about it.
But practically speaking I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Using coalesce(1) will deteriorate the performance of Glue in the long run. While, it may work for small files, it will take ridiculously long amounts of time for larger files.
coalesce(1) makes only 1 spark executor to write the file which without coalesce() would have used all the spark executors to write the file.
Also, using coalesce(1) will have bigger cost. 1 executor running for long run time will have bigger cost than all executors running for fraction of the time taken by 1 executor.
Coalesce(1) took 4 hrs 48 minutes to process 1GB of Parquet Snappy Compressed Data.
Coalesce(9) took 48 minutes for the same.
No Coalesce() did the same job in 25 minutes.
I haven't tried yet. But you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass value.
Alternatively, you can use pyspark DF with 1 partition before write in order to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file will not take advantage of parallel writing and hence will increase time to write.
You could try repartition(1) before writing the dynamic dataframe to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause Out Of Memory(OOM) exceptions if a single node cannot hold all the data to be written.
I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files.
The number of the output files is directly linked to the number of partitions. Spark cannot assume a default size for output files as it is application depended. The only way you control the size of output files is to act on your partitions numbers.
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using AWS Glue API [1], you can control how to group small files into a single partition while you read data:
glueContext.write_dynamic_frame.from_options(
frame = someDateFrame,
connection_type = "s3",
connection_options = {
"path": "s3://the-bucket/some-data-set",
"groupFiles": "inPartition",
"groupSize": "10485760" # 10485760 bytes (10 MB)
}
format = "parquet"
)
If your transformation code does not impact too much the data distribution (not filtering, not joining, etc), you should expect the output file to have almost the same size as the read in input (not considering compression rate) In general, Spark transformations are pretty complex with joins, aggregates, filtering. This changes the data distribution and number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/

spark write to disk with N files less than N partitions

Can we write data to say 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce number of partition. But I have seen some hadoop generated avro data with much more partitions than number of files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that when data is read back in it could be split into smaller chunks based on your configured split size but depending on format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

Spark partition by files

I have several thousand compressed CSV files on a S3 bucket, each of size approximately 30MB(around 120-160MB after decompression), which I want to process using spark.
In my spark job, I am doing simple filter select queries on each row.
While partitioning Spark is dividing the files into two or more parts and then creating tasks for each partition. Each task is taking around 1 min to complete just to process 125K records. I want to avoid this partitioning of a single file across many tasks.
Is there a way to fetch files and partition data such that each task works on one complete file, that is, Number of tasks = Number of input files.?
as well as playing with spark options, you can tell the s3a filesystem client to tell it to tell Spark that the "block size" of a file in S3 is 128 MB. The default is 32 MB, which is close enough to your "approximately 30MB" number that spark could be splitting the files in two
spark.hadoop.fs.s3a.block.size 134217728
using the wholeTextFiles() operation is safer though

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic but lets say we use a dataframe to read in a parquet file that is 10 Blocks spark will naturally create 10 partitions. But when the dataframe reads in the file to process it, won't it be processing a large data to partition ratio because if it was processing the file uncompressed the block size would have been much larger meaning the partitions would be larger as well.
So let me clarify, parquet compressed (these numbers are not fully accurate).
1GB Par = 5 Blocks = 5 Partitions which might be decompressed to 5GB making it 25 blocks/25 partitions. But unless you repartition the 1GB par file you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong.
Would make sense to repartition to increase speed? Or am I thinking about this wrong. Can anyone shed some light on this?
Assumptions:
1 Block = 1 Partition For Spark
1 Core operated on 1 Partition
Spark DataFrame doesn't load parquet files in memory. It uses Hadoop/HDFS API to read it during each operation. So the optimal number of partitions depends on HDFS block size (different from a Parquet block size!).
Spark 1.5 DataFrame partitions parquet file as follows:
1 partition per HDFS block
If HDFS block size is less than configured in Spark parquet block size a partition will be created for multiple HDFS blocks such as total size of partition is no less than parquet block size
I saw the other answer but I thought I can clarify more on this. If you are reading Parquet from posix filesystem then you can increase number of partitioning readings by just having more workers in Spark.
But in order to control the balance of data that comes into workers one may use the hierarchical data structure of the Parquet files, and later in the workers you may point to different partitions or parts of the Parquet file. This will give you control over how much of data should go to each worker according to the domain of your dataset (if by balancing data in workers you mean equal batch of data per worker is not efficient).

Resources