Spark 2.0+: spark.sql.files.maxPartitionBytes is not working? - apache-spark

My understanding is that spark.sql.files.maxPartitionBytes controls the partition size when Spark reads data from HDFS.
However, I used Spark SQL to read the data for a specific date in HDFS. It contains 768 files. The largest file is 4.7 GB and the smallest is 17.8 MB.
The HDFS block size is 128 MB.
The value of spark.sql.files.maxPartitionBytes is 128 MB.
I expected Spark to split each large file into several partitions, none larger than 128 MB. However, it doesn't work like that.
I know we can use repartition(), but it is an expensive operation.

Related

Number of tasks while reading HDFS in Spark

There are 200 files in a non-formatted table in ORC format. Each file is around 170 KB. The total size is around 33 MB.
I am wondering why the Spark stage that reads the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are 2 main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is the estimated cost of opening a file, expressed as the number of bytes that could be scanned in the same time. Spark adds it to each file's size when packing files into partitions, so several small files end up in the same partition.
If you have lots of small splittable files, each one is counted as its own size plus 4 MB, so only about 31 files of 170 KB fit under the 128 MB cap. That is what happens in your case: 200 files at about 31 per partition gives 7 partitions, hence 7 tasks.
If you have non-splittable sources, such as gzipped files, each file will always end up in a single partition, regardless of its size.
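To make the packing above concrete, here is a rough pure-Python model of how I understand Spark's file-source planning to pick a split size and pack files into partitions. It is a sketch, not Spark's actual code: the function names are made up for illustration, and it assumes the parallelism Spark plans against equals the 5 executor cores from the question.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Approximate split size: capped by maxPartitionBytes, floored by openCostInBytes."""
    # Every file is charged its own size plus the open cost.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def plan_partitions(file_sizes, parallelism, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, splittable=True):
    """Return a list of partitions, each a list of chunk sizes in bytes."""
    split = max_split_bytes(file_sizes, parallelism, max_partition_bytes, open_cost)

    # Cut splittable files into chunks of at most `split` bytes;
    # non-splittable files (e.g. gzip) stay in one piece.
    chunks = []
    for size in file_sizes:
        if splittable:
            offset = 0
            while offset < size:
                chunks.append(min(split, size - offset))
                offset += split
        else:
            chunks.append(size)

    # Pack the largest chunks first; start a new partition when the next
    # chunk would push the running total past the split size.
    chunks.sort(reverse=True)
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > split:
            partitions.append(current)
            current, current_size = [], 0
        current_size += chunk + open_cost  # open cost counts toward the total
        current.append(chunk)
    if current:
        partitions.append(current)
    return partitions


# The case from the question: 200 ORC files of about 170 KB, 5 cores.
print(len(plan_partitions([170 * 1024] * 200, parallelism=5)))  # -> 7
```

Under the same assumptions, a single splittable 2 GB file comes out as 16 partitions, and the same file marked non-splittable comes out as 1.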

How does Spark repartitioning work w.r.t to the input file partitioning?

I have 2 questions:
Can we set fewer partitions in a call to coalesce than the number of HDFS blocks? E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB, can I do coalesce(1)?
As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition, or change parallelism?
E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB. Can I do coalesce(1)?
Yes, you can coalesce to a single file and write that to an external file system (at least with EMRFS).
Does Spark further split the data (physically) when we repartition or change parallelism?
repartition() slices the data into partitions independently of the partitioning of the original input files.

Apache Spark: how Spark reads large partitions from the source when there is not enough memory

Suppose my data source contains data in 5 partitions of 10 GB each, so the total data size is 50 GB. My question is: when my Spark cluster doesn't have 50 GB of main memory, how does Spark handle out-of-memory exceptions, and what is the best practice to avoid these scenarios in Spark?
50GB is data that can fit in memory and you probably don't need Spark for this kind of data - it would run slower than other solutions.
Also, depending on the job and data format, often not all of the data needs to be read into memory (e.g. reading just the needed columns from a columnar storage format like Parquet).
Generally speaking, when the data can't fit in memory, Spark will write temporary files to disk. You may need to tune the job to use more, smaller partitions so that each individual partition fits in memory. See Spark Memory Tuning.
Arnon

Get uncompressed size of the dataset on HDFS after being read by Spark

I am trying to improve the performance of my Spark application. To this end, I am trying to determine the optimal number of shuffle partitions for a dataset. I read from multiple sources that each partition should be about 128 MB.
So, if I have a 1 GB file, I'll need around 8 partitions. But my question is: how do I find the file size? I know I can find the file size on HDFS using the following:
hdfs dfs -du -s {data_path}
But from what I understand, this is the compressed size, and the actual size of the file is different. (Spark uses a compression codec when writing Parquet files, snappy by default.) This leads me to two questions:
How do I find the actual uncompressed size of the file?
What should the number of shuffle partitions be based on: the compressed size or the actual size?
The number of shuffle partitions is independent of the data size.
The data is uncompressed and then shuffled based on the number of shuffle partitions (using a hash partitioner, range partitioner, etc.).
Generally, the shuffle partitions are tuned:
1. To increase the parallelism available in the reducer stage.
2. To reduce the amount of data processed by each shuffle partition (if we observe spills or if the reduce stage is memory intensive).
I read from multiple sources that each partition should be about 128 MB.
This is applicable only to mapper stages. The split sizes in the mapper are computed based on the size of the compressed data. You can tune the size of the mapper splits using spark.sql.files.maxPartitionBytes.
The shuffle partitions (configured using spark.sql.shuffle.partitions, defaulting to 200) relate to reducer stages.
In short, the compressed size matters only for mapper stages, not for reducer stages.
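The difference between the two stages can be sketched with a little arithmetic. This is only an illustration under simplifying assumptions: it ignores spark.sql.files.openCostInBytes and any parallelism-based adjustment of the split size, the helper names are made up, and the 5 GB shuffle volume is a hypothetical figure, not something from the question.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def input_splits(compressed_bytes, max_partition_bytes=128 * MB):
    """Mapper side: the split count follows the on-disk (compressed) size."""
    return math.ceil(compressed_bytes / max_partition_bytes)


def avg_shuffle_partition_bytes(shuffled_bytes, shuffle_partitions=200):
    """Reducer side: the partition count is fixed by configuration,
    so only the average size per partition changes with the data volume."""
    return shuffled_bytes / shuffle_partitions


# A 1 GB compressed file read with the default 128 MB split size.
print(input_splits(1 * GB))  # -> 8

# Hypothetical: the same data shuffles as 5 GB uncompressed across
# the default 200 shuffle partitions.
print(avg_shuffle_partition_bytes(5 * GB) / MB)  # about 25.6 MB each
```

So the value from hdfs dfs -du is the one that matters for the read stage, while for the shuffle you would look at how much data each of the configured partitions ends up holding.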

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large amount of data per partition? If it were processing the file uncompressed, the file would span many more blocks, meaning there would be more partitions as well.
So let me clarify with compressed Parquet (these numbers are not fully accurate).
1 GB Parquet = 5 blocks = 5 partitions, which might be decompressed to 5 GB, making it 25 blocks/25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 Block = 1 Partition For Spark
1 Core operated on 1 Partition
A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrame partitions parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size
I saw the other answer, but I thought I could clarify more on this. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel partition reads just by having more workers in Spark.
But in order to control the balance of data that comes into the workers, one may use the hierarchical data structure of the Parquet files, and later in the workers you may point to different partitions or parts of the Parquet file. This will give you control over how much data should go to each worker according to the domain of your dataset (assuming that by balancing data across workers you mean that an equal batch of data per worker is not efficient).

Resources