Optimal file size and parquet block size - apache-spark

I have around 100 GB of data per day which I write to S3 using Spark. The write format is Parquet. The application that writes this runs Spark 2.3.
The 100 GB data is further partitioned, where the largest partition is 30 GB. For this case, let's just consider that 30 GB partition.
We are planning to migrate this whole data and rewrite to S3, in Spark 2.4. Initially we didn't decide on file size and block size when writing to S3. Now that we are going to rewrite everything, we want to take into consideration the optimal file size and parquet block size.
1. What is the optimal file size to write to S3 in Parquet?
2. Can we write one 30 GB file with a Parquet block size of 512 MB? How will reading work in this case?
3. Same as #2, but with a Parquet block size of 1 GB?

Before talking about the parquet side of the equation, one thing to consider is how the data will be used after you save it to parquet.
If it's going to be read/processed often, you may want to consider what the access patterns are and partition the data accordingly.
One common pattern is partitioning by date, because most of our queries have a time range.
Partitioning your data appropriately will have a much bigger impact on the performance of anything that uses that data after it is written.
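As a minimal sketch of that idea (the DataFrame df, the column name and the S3 path are placeholders, not from the question):

# Write the data partitioned by a date column so that time-range queries only scan
# the partitions they need; "event_date" and the path are placeholder names.
df.write \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .parquet("s3://my-bucket/events/")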
Now, onto Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.
Again, the consideration for the Parquet block size, is how you're reading the data.
Since a Parquet block basically has to be reconstructed in memory, the larger it is, the more memory is needed downstream. On the other hand you will need fewer workers, so if your downstream workers have plenty of memory you can use larger Parquet blocks, as that is slightly more efficient.
However, for better scalability, it's usually better to have several smaller objects - especially following some partitioning scheme - rather than one large object, which may act as a performance bottleneck, depending on your use case.
To sum it up:
a larger Parquet block size means slightly smaller files (since compression and encoding work better over larger blocks) but a larger memory footprint when serializing/deserializing
the optimal file size depends on your setup
if you store 30 GB with a 512 MB Parquet block size, then since Parquet is a splittable file format and Spark relies on HDFS getSplits(), the first step in your Spark job will have 60 tasks. They will use byte-range fetches to get different parts of the same S3 object in parallel. However, you'll get better performance if you break it down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will also most likely give better reading performance when accessed by a large number of readers.
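For reference, a rough sketch of requesting a specific Parquet block (row-group) size when writing; the DataFrame and path are placeholders, and parquet.block.size can alternatively be set directly on the Hadoop configuration:

# Ask the Parquet writer for ~512 MB row groups ("blocks"); the value is in bytes.
# Whether 512 MB or 1 GB is better depends on downstream memory, as discussed above.
(df.write
    .option("parquet.block.size", 512 * 1024 * 1024)
    .parquet("s3://my-bucket/migrated-data/"))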

Related

Is it allowed to merge small files (which will be large when merged) in HDFS by using coalesce or repartition?

I'm using an hdfs-sink-connector to consume Kafka's data into HDFS.
The Kafka connector writes data every 10 minutes, and sometimes the written file's size is really small; it varies from 2MB to 100MB. So, the written files actually waste my HDFS storage since each block size is 256MB.
The directory is created per date, so I thought it would be great to merge the many small files into one big file in a daily batch. (I expected that HDFS would automatically divide one large file into block-sized chunks as a result.)
I know there are many answers which say we could use Spark's coalesce(1) or repartition(1), but I'm worried about OOM errors if I read the whole directory and use those functions; it might be more than 90GB~100GB if I read every file.
Will 90~100GB in HDFS be allowed? Do I need to be worried about it?
Could anyone let me know if there is a best practice for merging small HDFS files? Thanks!
So, the written files actually waste my HDFS storage since each block size is 256MB.
HDFS doesn't "fill out" the unused parts of the block. So a 2MB file only uses 2MB on disk (well, 6MB if you account for 3x replication). The main concern with small files on HDFS is that billions of small files can cause problems.
I'm worried about OOM errors if I read the whole directory and use those functions
Spark may be an in-memory processing framework, but it still works if the data doesn't fit into memory. In such situations processing spills over onto disk and will be a bit slower.
Will 90~100GB in HDFS be allowed?
That is absolutely fine - this is big data after all. As you noted, the actual file will be split into smaller blocks in the background (but you won't see this unless you use hadoop fsck).
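If you do go the daily-compaction route, a rough sketch could look like the following, assuming a SparkSession named spark; the paths, the input format and the day's size estimate are assumptions, not taken from your setup:

import math

# One date directory of small files written by the HDFS sink connector (placeholder path).
day_path = "hdfs:///kafka-sink/topic/2023-01-01"
df = spark.read.parquet(day_path)                     # adjust to the actual file format (e.g. Avro)

# Aim for roughly one 256 MB HDFS block per output file.
target_bytes = 256 * 1024 * 1024
estimated_day_bytes = 90 * 1024 ** 3                  # e.g. ~90 GB; estimate it however you like
num_files = max(1, math.ceil(estimated_day_bytes / target_bytes))

# repartition() spreads the data evenly across num_files output files; write to a new
# location and swap directories once you have verified the result.
df.repartition(num_files).write.mode("overwrite").parquet(day_path + "_compacted")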

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, performing a transformation and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set"
    },
    format = "parquet"
)
The result is 12 Parquet files with an average size of about 3MB.
First of all, I don't get why Glue/Spark won't by default instead create a single file about 36 MB large, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100 MB rather than a pile of small files. If somebody has an insight here, I'd appreciate hearing about it.
But practically speaking I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Using coalesce(1) will deteriorate the performance of Glue in the long run. While it may work for small files, it will take ridiculously long amounts of time for larger files.
coalesce(1) makes only one Spark executor write the file, whereas without coalesce() all of the executors would write the file in parallel.
Also, using coalesce(1) has a bigger cost: one executor running for a long time costs more than all executors running for a fraction of the time taken by that one executor.
Coalesce(1) took 4 hrs 48 minutes to process 1GB of Parquet Snappy Compressed Data.
Coalesce(9) took 48 minutes for the same.
No Coalesce() did the same job in 25 minutes.
I haven't tried it yet, but you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass the value.
Alternatively, you can use a PySpark DataFrame with one partition before the write in order to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to one file will not take advantage of parallel writing and hence will increase the time to write.
You could try repartition(1) before writing the dynamic frame to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause Out Of Memory (OOM) exceptions if a single node cannot hold all the data to be written.
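A minimal sketch of that repartition(1) route, reusing the names from the question and assuming a standard Glue job setup (the DataFrame round-trip shown here is one common way to do it, not the only one):

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame, repartition to a single partition, convert back,
# then write as before. One partition means one output file (and no parallel write).
repartitioned = DynamicFrame.fromDF(
    someDateFrame.toDF().repartition(1),
    glueContext,
    "repartitioned"
)
glueContext.write_dynamic_frame.from_options(
    frame = repartitioned,
    connection_type = "s3",
    connection_options = {"path": "s3://the-bucket/some-data-set"},
    format = "parquet"
)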
I don't get why Glue/Spark won't by default instead create a single file about 36 MB large, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100 MB rather than a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files, as that is application-dependent. The only way to control the size of the output files is to act on the number of partitions.
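A quick way to see that link, assuming the DynamicFrame from the question:

# Each partition of the underlying DataFrame becomes (at least) one output file.
print(someDateFrame.toDF().rdd.getNumPartitions())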
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using the AWS Glue API [1], you can control how small files are grouped into a single partition while you read the data:
glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://the-bucket/some-data-set"],
        "groupFiles": "inPartition",
        "groupSize": "10485760"  # 10485760 bytes (10 MB)
    },
    format = "parquet"
)
If your transformation code does not impact the data distribution too much (no filtering, no joins, etc.), you should expect the output files to have almost the same total size as the input that was read (not considering the compression rate). In general, though, Spark transformations are pretty complex, with joins, aggregates and filtering; these change the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/
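As a rough sketch of the coalesce()/repartition() route: pick the partition count from a target file size. The 100 MB target and the size estimate are assumptions, not Glue defaults:

import math

target_file_bytes = 100 * 1024 * 1024            # desired output file size
estimated_output_bytes = 36 * 1024 * 1024        # e.g. the ~36 MB total from the question
num_partitions = max(1, math.ceil(estimated_output_bytes / target_file_bytes))

# Repartition the underlying DataFrame, then write; with ~36 MB of data and a
# 100 MB target this collapses to a single output file.
(someDateFrame.toDF()
    .repartition(num_partitions)
    .write.mode("append")
    .parquet("s3://the-bucket/some-data-set"))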

Does size of part files play a role for Spark SQL performance

I am trying to query HDFS, which has a lot of part files (Avro). Recently we made a change to reduce parallelism, and thus the size of the part files has increased; each of these part files is in the range of 750 MB to 2 GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on the amount of data we are processing from the upstream). The number of part files would be around 500. I was wondering if the size of these part files / the number of part files would play any role in Spark SQL performance?
I can provide more information if required.
HDFS, MapReduce and Spark prefer files that are larger in size, as opposed to many small files. S3 also has issues with many small files. I am not sure if you mean HDFS or S3 here.
Repartitioning smaller files into a lesser number of larger files will - without getting into all the details - allow Spark or MR to process fewer but bigger blocks of data, thereby improving the speed of jobs by decreasing the number of map tasks needed to read them in, and reducing the cost of storage due to less wastage and fewer name node contention issues.
All in all, this is the small files problem, about which there is much to read, e.g. https://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html. Just to be clear, I am a Spark fan.
Generally, fewer, larger files are better.
One issue is whether the file can be split, and how.
Files compressed with .gz cannot be split: you have to read them from start to finish, so at most one worker at a time gets assigned a single file (except near the end of a query, when speculation can trigger a second). Use a compression codec like Snappy and all is well.
Very small files are inefficient, as startup/commit overhead dominates.
On HDFS, small files put load on the namenode, so the ops team may be unhappy too.
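A small illustration of the splittability point, assuming a SparkSession named spark and placeholder paths:

# A gzip-compressed text file cannot be split, so it shows up as a single partition
# (one task) no matter how large it is.
gz_df = spark.read.text("hdfs:///logs/big-log.gz")
print(gz_df.rdd.getNumPartitions())      # typically 1 per .gz file

# Snappy-compressed Parquet is splittable, so the same data fans out into many tasks.
pq_df = spark.read.parquet("hdfs:///logs-parquet/")
print(pq_df.rdd.getNumPartitions())      # roughly total size / max partition size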

Files larger than block size in HDFS

It is common knowledge that writing a single file which is larger than the HDFS block size is not optimal; the same goes for many very small files.
However, when performing a repartition('myColumn) operation in Spark, it will create a single partition per item (let's assume day) which contains all of the records as a single file, and that file might be several GB in size (assume 20 GB), whereas the HDFS block size is configured to be 256 MB.
Is it actually bad that the file is too large? When reading the file back in (assuming it is a splittable format like Parquet or ORC with gzip or zlib compression), Spark creates >> 1 task per file. Does this mean I do not need to worry about specifying maxRecordsPerFile / a file size larger than the HDFS block size?
Having a single large file in a splittable format is a good thing in HDFS. The namenode has to maintain fewer file references, and there are more blocks to parallelize processing over.
In fact, 20 GB still isn't large in Hadoop terms considering it'll fit on a cheap flash drive
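As a rough illustration (the path and sizes are assumptions): for a splittable format, the number of read tasks is bounded by spark.sql.files.maxPartitionBytes rather than by file or HDFS block boundaries:

# Let each input split be at most ~256 MB (the default is 128 MB); the value is in bytes.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)

# A single ~20 GB Parquet file then becomes roughly 20 GB / 256 MB ≈ 80 read tasks.
df = spark.read.parquet("hdfs:///warehouse/events/day=2021-06-01/")
print(df.rdd.getNumPartitions())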

spark mechanism of accessing files larger than (or lesser) than HDFS block size

This is most of a theoretical query per se, but directly linked to how I should create my files in HDFS. So, please bear with me for a bit.
I'm recently stuck on creating DataFrames for a set of data stored in Parquet (Snappy) files sitting on HDFS. Each Parquet file is approximately 250+ MB in size, but the total number of files is around 6k, which I see as the reason around 10K tasks are created while building the DF, and it obviously runs longer than expected.
I went through some posts which explain that the optimal Parquet file size should be at least 1 GB (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html),
(Is it better to have one large parquet file or lots of smaller parquet files?).
I wanted to understand how Spark's processing is affected by the size of the files it is reading. More specifically, do the HDFS block size, and a file size greater or smaller than the HDFS block size, literally affect how Spark partitions get created? If yes, then how? I need to understand the granular-level details. If anyone has any specific and precise links on this, it would be a great help in understanding.
