I'm tuning a Spark (v2.2.3) SQL job that reads from Parquet (about 1 TB of data). I'm reading more Parquet parts into a single Spark partition by increasing spark.sql.files.maxPartitionBytes from the default 128 MB to 1280 MB.
The effect is fantastic (about 30% reduction in total time across all tasks).
What I'm struggling to understand is why the total input size is so greatly reduced.
With the default 128 MB, data is read into 12033 partitions, with a total input of 61.9 GB. With the altered configuration, data is read into 1651 partitions, with less than half the input size - only 26.5 GB.
No doubt this is a good outcome, but I'm just trying to understand it; in the end, the same number of records and the same exact columns are read.
If it matters, I'm using all the other defaults - HDFS block size is 128 MB, Parquet block size (row group) is 128 MB, Parquet page size is 1 MB.
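For reference, a minimal sketch of changing that setting, either when building the session or on an existing one (the path and app name are hypothetical, and 1280 MB is just the value described above, not a general recommendation):

    import org.apache.spark.sql.SparkSession

    // Raise the maximum number of bytes packed into a single file-based partition.
    // 1280 MB is the value from the question above, not a general recommendation.
    val spark = SparkSession.builder()
      .appName("parquet-read-tuning")                              // hypothetical app name
      .config("spark.sql.files.maxPartitionBytes", 1280L * 1024 * 1024)
      .getOrCreate()

    // The setting can also be changed on an existing session before the read:
    spark.conf.set("spark.sql.files.maxPartitionBytes", 1280L * 1024 * 1024)
    val df = spark.read.parquet("hdfs:///path/to/table")           // hypothetical path
    println(df.rdd.getNumPartitions)                               // fewer, larger partitions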
Related
There are 200 files in a non-formatted table in ORC format. Each file is around 170 KB, and the total size is around 33 MB.
I'm wondering why the Spark stage that reads the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are 2 main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is used as the estimated cost of opening a new file; this basically means that Spark packs small files together into the same partition, counting each file as at least 4 MB.
If you have lots of small splittable files, you will end up with partitions holding roughly 4 MB of data each by default, which is what happens in your case (33 MB spread over 7 tasks is about 4.7 MB per task).
If you have non-splittable sources, such as gzipped files, each file will always end up in a single partition, regardless of its size.
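A minimal sketch of the packing arithmetic that leads to 7 tasks, assuming the formula Spark uses when planning file scans (maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, totalPaddedBytes / defaultParallelism))); the file counts and sizes are the ones from the question:

    // Rough reproduction of Spark's file-packing arithmetic for the 200 x ~170 KB ORC case.
    // This mirrors the logic used when planning file scans; treat it as an illustration,
    // not the exact implementation.
    val maxPartitionBytes  = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes
    val openCostInBytes    = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes
    val numFiles           = 200
    val fileSize           = 170L * 1024          // ~170 KB per file
    val defaultParallelism = 5                    // 1 executor x 5 cores

    val totalPaddedBytes = numFiles * (fileSize + openCostInBytes)
    val bytesPerCore     = totalPaddedBytes / defaultParallelism
    val maxSplitBytes    = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

    // Each file "costs" fileSize + openCostInBytes when packed into a partition.
    val filesPerPartition = (maxSplitBytes / (fileSize + openCostInBytes)).toInt
    val numPartitions     = math.ceil(numFiles.toDouble / filesPerPartition).toInt
    println(s"maxSplitBytes=$maxSplitBytes, partitions=$numPartitions")   // prints 7 partitions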
I have a use case in which I sometimes receive 400 GB of data and sometimes 1 MB. I have set the number of partitions to a hard-coded value, let's say 300. When I receive 1 MB, the script makes 300 partitions of very small size. I want to avoid this; somehow I want to partition the dataframe based on its size. Let's say I want to make each partition 2 GB in size.
Use
spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes
to get the input size. You can then convert it to GB and compute the number of partitions by dividing it by the desired partition size (like 2 GB).
Please refer to my answer for other approaches to getting the input size - https://stackoverflow.com/a/62463009/4758823
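A minimal sketch of that calculation (spark and df are assumed to come from the surrounding context, the 2 GB target is just illustrative, and sizeInBytes is an optimizer estimate, so treat the result as approximate):

    // Estimate the input size from the optimized logical plan and derive a partition count.
    // sizeInBytes is a BigInt estimate produced by the optimizer, not an exact byte count.
    val sizeInBytes: BigInt =
      spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes

    val targetPartitionBytes = BigInt(2L * 1024 * 1024 * 1024)    // desired ~2 GB per partition
    val numPartitions = ((sizeInBytes + targetPartitionBytes - 1) / targetPartitionBytes)
      .max(BigInt(1))                                             // never ask for 0 partitions
      .toInt

    val resized = df.repartition(numPartitions)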
My understanding is that spark.sql.files.maxPartitionBytes is used to control the partition size when Spark reads data from HDFS.
However, I used Spark SQL to read the data for a specific date in HDFS. It contains 768 files. The largest file is 4.7 GB and the smallest file is 17.8 MB.
The HDFS block size is 128 MB.
The value of spark.sql.files.maxPartitionBytes is 128 MB.
I expected that Spark would split a large file into several partitions and make each partition no larger than 128 MB. However, it doesn't work like that.
I know we can use repartition(), but it is an expensive operation.
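A quick way to see what is actually happening is to print the effective setting and inspect how the read was partitioned; a minimal sketch (the path and the Parquet format are assumptions, and counting rows per partition is only a rough proxy for partition size):

    // Print the effective setting and inspect how the read was actually partitioned.
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    val df = spark.read.parquet("hdfs:///data/dt=2019-01-01")   // hypothetical path and format
    println(df.rdd.getNumPartitions)

    // Rows per partition, as a rough proxy for how the input was split.
    df.rdd
      .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }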
I have around 100 GB of data per day which I write to S3 using Spark. The write format is Parquet. The application which writes this runs Spark 2.3.
The 100 GB data is further partitioned, where the largest partition is 30 GB. For this case, let's just consider that 30 GB partition.
We are planning to migrate this whole data and rewrite to S3, in Spark 2.4. Initially we didn't decide on file size and block size when writing to S3. Now that we are going to rewrite everything, we want to take into consideration the optimal file size and parquet block size.
1. What is the optimal file size to write to S3 in Parquet?
2. Can we write 1 file of 30 GB with a Parquet block size of 512 MB? How will reading work in this case?
3. Same as #2, but with a Parquet block size of 1 GB?
Before talking about the parquet side of the equation, one thing to consider is how the data will be used after you save it to parquet.
If it's going to be read/processed often, you may want to consider what the access patterns are and decide to partition it accordingly.
One common pattern is partitioning by date, because most of our queries have a time range.
Partitioning your data appropriately will have a much bigger impact on the performance of anything that uses that data after it is written.
Now, onto Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.
Again, the consideration for the Parquet block size is how you're reading the data.
Since a Parquet block has to be basically reconstructed in memory, the larger it is, the more memory is needed downstream. You will also need fewer workers, so if your downstream workers have plenty of memory, you can use larger Parquet blocks, as it will be slightly more efficient.
However, for better scalability, it's usually better to have several smaller objects - especially following some partitioning scheme - rather than one large object, which may act as a performance bottleneck, depending on your use case.
To sum it up:
- a larger Parquet block size means a slightly smaller file size (since compression works better on large files) but a larger memory footprint when serializing/deserializing
- the optimal file size depends on your setup
- if you store 30 GB with a 512 MB Parquet block size, then since Parquet is a splittable format and Spark relies on HDFS getSplits(), the first step of your Spark job will have 60 tasks. They will use byte-range fetches to get different parts of the same S3 object in parallel. However, you'll get better performance if you break it down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will also most likely give better read performance when accessed by a large number of readers.
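A minimal sketch of that kind of write, assuming date-based partitioning and S3A paths (the bucket, column name and the 512 MB value are purely illustrative; the Parquet block size is picked up from the Hadoop configuration by the Parquet writer):

    import org.apache.spark.sql.functions.col

    // Illustrative write: partition by date and set the Parquet block (row group) size
    // via the Hadoop configuration. All values here are examples, not recommendations.
    spark.sparkContext.hadoopConfiguration
      .setLong("parquet.block.size", 512L * 1024 * 1024)    // 512 MB row groups

    df.repartition(col("event_date"))       // hypothetical partition column
      .write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("s3a://my-bucket/events/")   // hypothetical bucket/prefix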
I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not Spark...
There are currently some fundamental limitations related to block size, both for partitions in use and for shuffle blocks - both are limited to 2 GB, which is the maximum size of a ByteBuffer (because it is indexed with an int, it is limited to Integer.MAX_VALUE bytes).
The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows might not be evenly distributed among partitions - if by chance a partition contains an unusually large number of big rows, this may push it over the 2GB limit, crashing the job.
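If you want a rough feel for how close individual rows get to that limit, one option is a sketch along these lines using Spark's SizeEstimator (this measures in-memory object size, which differs from the serialized size that actually hits the 2 GB block limit; df and the sampling fraction are assumptions):

    import org.apache.spark.util.SizeEstimator

    // Approximate in-memory size of the largest row in a sample of the data.
    // Serialized/shuffle sizes will differ, so treat this as a rough indicator only.
    val sampleMaxBytes = df.rdd
      .sample(withReplacement = false, fraction = 0.01)   // sample to keep this cheap
      .map(row => SizeEstimator.estimate(row))
      .max()

    println(s"largest sampled row ~ $sampleMaxBytes bytes")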
See:
Why does Spark RDD partition has 2GB limit for HDFS?
Related Jira tickets for these Spark issues:
https://issues.apache.org/jira/browse/SPARK-1476
https://issues.apache.org/jira/browse/SPARK-5928
https://issues.apache.org/jira/browse/SPARK-6235