How to control size of Parquet files in Glue? - apache-spark

I'm loading a data set into a DynamicFrame, performing a transformation, and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set"
    },
    format = "parquet"
)
The result is 12 Parquet files with an average size of about 3MB.
First of all, I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files. If somebody has insight here, I'd appreciate hearing about it.
But practically speaking I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?

Using coalesce(1) will degrade the performance of Glue in the long run. While it may work for small files, it will take a ridiculously long time for larger ones.
coalesce(1) makes a single Spark executor write the file, whereas without coalesce() all the Spark executors would have shared the write.
Using coalesce(1) also costs more: one executor running for a long time costs more than all executors running for a fraction of that time.
coalesce(1) took 4 hrs 48 minutes to process 1GB of Snappy-compressed Parquet data.
coalesce(9) took 48 minutes for the same data.
No coalesce() did the same job in 25 minutes.
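To put the three timings above side by side, here is a small sketch that converts them to minutes and computes how much longer each run took than the run without coalesce(). Reading the ratio as a cost ratio rests on an assumption that is not stated in the answer: that the job is billed for all allocated workers for its whole duration, so cost scales with wall-clock time. The function and variable names are made up for illustration.

```python
def slowdown(minutes: float, baseline_minutes: float) -> float:
    """Return how many times longer a run took than the baseline run."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return minutes / baseline_minutes


# Timings quoted in the answer, converted to minutes.
timings_minutes = {
    "coalesce(1)": 4 * 60 + 48,
    "coalesce(9)": 48,
    "no coalesce": 25,
}

baseline = timings_minutes["no coalesce"]
for label, minutes in timings_minutes.items():
    ratio = slowdown(minutes, baseline)
    print(f"{label}: {minutes} min, {ratio:.2f}x the baseline")
```

Under that billing assumption, the coalesce(1) run would cost roughly as many times more as it took longer.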

I haven't tried it yet, but you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass the value.
Alternatively, you can use a PySpark DataFrame and reduce it to 1 partition before the write, to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file does not take advantage of parallel writing and hence will increase the time to write.

You could try repartition(1) before writing the DynamicFrame to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause Out Of Memory (OOM) exceptions if a single node cannot hold all the data to be written.

I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files, as it is application dependent. The only way to control the size of output files is to act on the number of partitions.
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using the AWS Glue API [1], you can control how small files are grouped into a single partition while you read data:
glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set",
        "groupFiles": "inPartition",
        "groupSize": "10485760"  # 10485760 bytes (10 MB)
    },
    format = "parquet"
)
If your transformation code does not change the data distribution much (no filtering, no joining, etc.), you should expect the output files to have almost the same size as the input files (not considering the compression rate). In general, though, Spark transformations are fairly complex, with joins, aggregations, and filtering. This changes the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/
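Since the number of output files follows the number of partitions, one way to pick the argument for coalesce() or repartition() is to divide the expected output size by the file size you want. The sketch below only does that arithmetic; the helper name is hypothetical, and the total size has to be estimated by you (for example from a previous run), which is an assumption of this example.

```python
import math

MB = 1024 * 1024


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Number of partitions (one output file each) so that every file is
    roughly target_file_bytes large. Always returns at least 1."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# The data set from the question: about 36 MB of output, 100 MB target files.
n = partitions_for_target_size(36 * MB, 100 * MB)
print(n)

# With a Spark DataFrame `df` (not defined here), n would then be used as:
#   df.repartition(n).write.parquet("s3://the-bucket/some-data-set")
```

For the 36MB data set in the question this yields a single partition, and therefore a single file.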

Related

How to specify file size using repartition() in spark

I'm using PySpark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.
I know that using repartition(500) will split my Parquet output into 500 files of almost equal size.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB and 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = ...  # destination path, elided in the original
estimated_records_with_desired_size = 2000
result_df.write.option(
    "maxRecordsPerFile",
    estimated_records_with_desired_size
).parquet(storage_location, compression="snappy")
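The value passed to maxRecordsPerFile has to be estimated somehow. One rough way, sketched below, is to take an existing output file, divide its size by its record count, and scale that to the file size you want. The helper name and the sample numbers are hypothetical, and the estimate assumes that records in future data compress about as well as those in the sample.

```python
def estimate_records_per_file(sample_file_bytes: int,
                              sample_record_count: int,
                              target_file_bytes: int) -> int:
    """Estimate how many records fit in a file of target_file_bytes,
    based on the size and record count of one existing sample file."""
    if min(sample_file_bytes, sample_record_count, target_file_bytes) <= 0:
        raise ValueError("all arguments must be positive")
    # Integer arithmetic: target / (sample_bytes / sample_records), rounded down.
    return max(1, target_file_bytes * sample_record_count // sample_file_bytes)


# Hypothetical sample: a 512000-byte file holding 2000 records (256 bytes each),
# and a desired output file size of 128 MB.
estimated = estimate_records_per_file(512_000, 2_000, 128 * 1024 * 1024)
print(estimated)
```

The result would then take the place of the hard-coded 2000 in the snippet above.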

Optimising Spark read and write performance

I have around 12K binary files, each 100 MB in size and containing multiple compressed records of variable length. I am trying to find the most efficient way to read them, uncompress them, and write them back in Parquet format. My cluster has 6 nodes with 4 cores each.
At the moment, with the pseudocode below, it takes around 8 hrs to read all the files, and writing back to Parquet is very slow.
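import os

def reader(file_name):
    keyMsgList = []
    with open(file_name, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                break
            keyBytes = header[0:8]
            msgLenBytes = header[8:12]
            # convert keyBytes & msgLenBytes to int
            # (big-endian is an assumption; the question does not state the byte order)
            key = int.from_bytes(keyBytes, "big")
            msgLen = int.from_bytes(msgLenBytes, "big")
            message = f.read(msgLen)
            # decode() is the asker's own decompression routine (not shown)
            keyMsgList.append((key, decode(message)))
    return keyMsgList

base_dir = "/path/to/binary/files"
# os.listdir returns bare names, so join them with the directory before opening
files = [os.path.join(base_dir, name) for name in os.listdir(base_dir)]
rddFiles = sc.parallelize(files, 6000)
df = spark.createDataFrame(rddFiles.flatMap(reader), schema)
df.repartition(6000).write.mode("append").partitionBy("key").parquet("/directory")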
The rationale behind choosing 6000 in sc.parallelize(files, 6000) is to create partitions of 200 MB each, i.e. (12K files * 100 MB) / 200 MB. Because the file content is sequential and has to be read byte by byte, I am not sure whether the read can be optimised further.
Similarly, when writing back to Parquet, the number in repartition(6000) is there to make sure data is distributed uniformly and all executors can write in parallel. However, it turns out to be a very slow operation.
One solution is to increase the number of executors, which will improve read performance, but I am not sure whether it will improve writes.
I am looking for any suggestions on how to improve performance here.
Suggestion 1: do not use repartition but coalesce.
See here. You identified the repartition operation as the bottleneck; this is because it launches a full shuffle. With coalesce you won't do that. You will still end up with N partitions. They won't be as balanced as those you would get with repartition, but does that matter?
I would recommend favoring coalesce over repartition.
Suggestion 2: 6000 partitions is maybe not optimal
Your application runs on 6 nodes with 4 cores each, and you have 6000 partitions. This means you have around 250 partitions per core (not even counting what is given to your master). That is, in my opinion, too many.
Since your partitions are small (around 200 MB), your master probably spends more time awaiting answers from the executors than executing the queries.
I would recommend reducing the number of partitions.
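The 250-partitions-per-core figure in Suggestion 2 can be checked with a couple of lines. This is only the arithmetic from the answer; the function name is made up, and it assumes every core on every node is available to executors.

```python
def partitions_per_core(num_partitions: int, nodes: int, cores_per_node: int) -> float:
    """Average number of partitions each core has to work through."""
    total_cores = nodes * cores_per_node
    if total_cores <= 0:
        raise ValueError("the cluster must have at least one core")
    return num_partitions / total_cores


# The cluster from the question: 6 nodes with 4 cores each, 6000 partitions.
print(partitions_per_core(6000, 6, 4))
```

Plugging in a smaller partition count shows how quickly the per-core queue shrinks.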
Suggestion 3: can you use the DataFrame API?
DataFrame API operations are generally faster and better than a hand-coded solution.
Maybe have a look at pyspark.sql.functions to see if you can find something there (see here). I don't know if it's relevant, since I have not seen your data, but that's a general recommendation from my experience.

Optimal file size and parquet block size

I have around 100 GB of data per day which I write to S3 using Spark. The write format is Parquet. The application which writes this runs Spark 2.3.
The 100 GB data is further partitioned, where the largest partition is 30 GB. For this case, let's just consider that 30 GB partition.
We are planning to migrate this whole data and rewrite to S3, in Spark 2.4. Initially we didn't decide on file size and block size when writing to S3. Now that we are going to rewrite everything, we want to take into consideration the optimal file size and parquet block size.
1. What is the optimal file size to write to S3 in Parquet?
2. Can we write 1 file with 30 GB size and a Parquet block size of 512 MB? How will reading work in this case?
3. Same as #2 but with a Parquet block size of 1 GB?
Before talking about the parquet side of the equation, one thing to consider is how the data will be used after you save it to parquet.
If it's going to be read/processed often, you may want to consider the access patterns and partition it accordingly.
One common pattern is partitioning by date, because most of our queries have a time range.
Partitioning your data appropriately will have a much bigger impact on the performance of whatever uses that data after it is written.
Now, onto Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.
Again, the consideration for the Parquet block size is how you're reading the data.
Since a Parquet block basically has to be reconstructed in memory, the larger it is, the more memory is needed downstream. You will also need fewer workers, so if your downstream workers have plenty of memory you can have larger Parquet blocks, as it will be slightly more efficient.
However, for better scalability, it's usually better having several smaller objects - especially according to some partitioning scheme - versus one large object, which may act as a performance bottleneck, depending on your use case.
To sum it up:
- a larger Parquet block size means a slightly smaller file size (since compression works better on larger chunks of data) but a larger memory footprint when serializing/deserializing
- the optimal file size depends on your setup
- if you store 30GB with a 512MB Parquet block size, then, since Parquet is a splittable file format and Spark relies on HDFS getSplits(), the first step in your Spark job will have 60 tasks. They will use byte-range fetches to get different parts of the same S3 object in parallel. However, you'll get better performance if you break it down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will most likely also have better read performance when accessed by a large number of readers.
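The 60-task figure is just the file size divided by the block size, and the same arithmetic answers the 1 GB variant of the question. The sketch below assumes, as the answer does, one read task per block-sized split; the actual split size in a given job depends on Spark and Hadoop settings, and the function name is made up for illustration.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def initial_read_tasks(file_bytes: int, block_bytes: int) -> int:
    """Number of read tasks if each task handles one block-sized split."""
    if file_bytes < 0 or block_bytes <= 0:
        raise ValueError("file_bytes must be >= 0 and block_bytes > 0")
    return math.ceil(file_bytes / block_bytes)


print(initial_read_tasks(30 * GB, 512 * MB))  # the 512 MB case from the answer
print(initial_read_tasks(30 * GB, 1 * GB))    # the 1 GB case from the question
```

Doubling the block size halves the number of initial tasks, while each task needs about twice the memory to rebuild its block.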

Does size of part files play a role for Spark SQL performance

I am trying to query HDFS data that consists of a lot of part files (Avro). Recently we made a change to reduce parallelism, and thus the size of the part files has increased: each of them is now in the range of 750MB to 2GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on the amount of data we are processing from upstream). The number of part files is around 500. I was wondering whether the size or the number of these part files plays any role in Spark SQL performance.
I can provide more information if required.
HDFS, MapReduce and Spark prefer files that are larger in size, as opposed to many small files. S3 also has issues. I am not sure if you mean HDFS or S3 here.
Repartitioning smaller files into a smaller number of larger files will, without getting into all the details, allow Spark or MR to process fewer but bigger blocks of data, thereby improving the speed of jobs by decreasing the number of map tasks needed to read them in, and reducing the cost of storage due to less wastage and fewer name node contention issues.
All in all, this is the small files problem, on which there is much to read, e.g. https://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html. Just to be clear, I am a Spark fan.
Generally, fewer, larger files are better.
One issue is whether the file can be split, and how.
- Files compressed with .gz cannot be split: you have to read from start to finish, so at most one worker at a time gets assigned a single file (except near the end of a query, when speculation can trigger a second). Use a compression like snappy and all is well.
- Very small files are inefficient, as startup/commit overhead dominates.
- On HDFS, small files put load on the namenode, so the ops team may be unhappy too.

Spark write Parquet to S3 the last task takes forever

I'm writing a parquet file from DataFrame to S3.
When I look at the Spark UI, I can see that all tasks of the writing stage but one complete swiftly (e.g. 199/200). The last task appears to take forever to complete, and very often it fails because it exceeds the executor memory limit.
I'd like to know what is happening in this last task, and how to optimize it.
Thanks.
I tried Glemmie Helles Sindholt's solution and it works very well.
Here is the code:
path = 's3://...'
n = 2 # number of repartitions, try 2 to test
spark_df = spark_df.repartition(n)
spark_df.write.mode("overwrite").parquet(path)
It sounds like you have a data skew. You can fix this by calling repartition on your DataFrame before writing to S3.
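A quick way to confirm a skew suspicion is to compare the largest partition with the average one. The sketch below does this on a plain list of per-partition row counts; the helper name and the sample numbers are hypothetical. In PySpark such a list could be collected with df.rdd.glom().map(len).collect(), which is mentioned here only as a comment because the DataFrame from the question is not available.

```python
from statistics import mean


def skew_ratio(partition_row_counts: list) -> float:
    """Ratio of the largest partition to the average partition.
    About 1.0 means balanced; a large value means one partition dominates."""
    if not partition_row_counts:
        raise ValueError("need at least one partition")
    average = mean(partition_row_counts)
    if average == 0:
        return 0.0
    return max(partition_row_counts) / average


# Hypothetical counts resembling the 199/200 situation in the question:
# 199 small partitions and one very large one.
# In PySpark: counts = df.rdd.glom().map(len).collect()
counts = [1000] * 199 + [500000]
print(round(skew_ratio(counts), 1))
```

A ratio far above 1 suggests that a single task is doing most of the work, which matches the symptom of one task that never finishes.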
As others have noted, data skew is likely at play.
Besides that, I noticed that your task count is 200.
The configuration parameter spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
200 is the default for this setting, but generally it is far from an optimal value.
For small data, 200 could be overkill and you would waste time in the overhead of multiple partitions.
For large data, 200 can result in large partitions, which should be broken down into more, smaller partitions.
The really rough rules of thumb are:
- have 2-3x as many partitions as CPUs.
- or aim for ~128MB per partition.
2GB is the maximum partition size. Also, if you are hovering just below 2000 partitions, note that Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000 [1]:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
...
You can try playing with this parameter at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "300")
[1] What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
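The two rough rules of thumb from this answer (2-3x the number of CPUs, or about 128MB per partition) can be turned into a starting value for spark.sql.shuffle.partitions. The sketch below computes both candidates and takes the larger one; picking the larger is my own choice for illustration, not something the answer prescribes, and the function name and default values are hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def suggest_shuffle_partitions(total_cores: int,
                               shuffle_bytes: int,
                               cores_multiplier: int = 3,
                               target_partition_bytes: int = 128 * MB) -> int:
    """Starting point for spark.sql.shuffle.partitions: the larger of
    (cores * multiplier) and (data size / target partition size)."""
    if total_cores <= 0 or shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("invalid cluster or data size")
    by_cores = total_cores * cores_multiplier
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(1, by_cores, by_size)


# Hypothetical job: 24 cores shuffling about 100 GB.
n = suggest_shuffle_partitions(24, 100 * GB)
print(n)

# With a running SparkSession `spark` (not defined here) this would be applied as:
#   spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

Treat the result as a first guess to refine against the Spark UI rather than a fixed answer.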
This article, The Bleeding Edge: Spark, Parquet and S3, has a lot of useful information about Spark, S3 and Parquet. In particular, it talks about how the driver ends up writing out the _common_metadata_ files, which can take quite a bit of time. There is a way to turn it off.
Unfortunately, the authors say that they go on to generate the common metadata themselves, but don't really explain how.

Resources