How does Spark repartitioning work with respect to the input file partitioning? - apache-spark

I have 2 questions:
Can we set fewer partitions in a call to coalesce than the number of HDFS blocks? E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB; can I do coalesce(1)?
As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition or change the parallelism?

E.g., suppose I have a 1 GB file and the HDFS block size is 128 MB. Can I do coalesce(1)?
Yes, you can coalesce down to a single partition and write that out as one file to an external file system (at least with EMRFS).
Does Spark further split the data (physically) when we repartition or change parallelism?
repartition slices the data into partitions independently of the partitioning of the original input files.
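A minimal sketch of both operations in spark-shell (the paths are hypothetical and shown only to illustrate the behaviour):
// Hypothetical input: a 1 GB text file stored as eight 128 MB HDFS blocks.
val df = spark.read.text("hdfs:///data/input/file-1gb.txt")
println(df.rdd.getNumPartitions)          // typically 8, roughly one per block/split

// coalesce(1) merges the existing partitions without a shuffle, so the
// whole dataset is written out by a single task as one output file.
df.coalesce(1).write.text("hdfs:///data/output/single-file")

// repartition(16) does a full shuffle and slices the data into 16 roughly
// equal partitions, independently of the original HDFS block layout.
val reshuffled = df.repartition(16)
println(reshuffled.rdd.getNumPartitions)  // 16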


How is the number of partitions decided by Spark when a file is read?
Suppose we have a single 10 GB file in an HDFS directory, and multiple part files totaling 10 GB at another HDFS location.
If these two are read into two separate Spark DataFrames, what would their number of partitions be, and based on what logic?
Found the information in How to: determine partition
It says:
How is this number determined? The way Spark groups RDDs into stages is described in the previous post. (As a quick reminder, transformations like repartition and reduceByKey induce stage boundaries.) The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.
What about RDDs with no parents? RDDs produced by textFile or hadoopFile have their partitions determined by the underlying MapReduce InputFormat that’s used. Typically there will be a partition for each HDFS block being read. Partitions for RDDs produced by parallelize come from the parameter given by the user, or spark.default.parallelism if none is given.
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which would return a single partition for a single HDFS block (but the split between partitions would be made on line boundaries, not the exact block boundary), unless you have a compressed text file. In the case of a compressed file you would get a single partition for a single file (as compressed text files are not splittable).
If you have a 10 GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128 MB) it would be stored in 80 blocks (10 × 1024 MB / 128 MB), which means that the RDD you read from this file would have 80 partitions.
Also, we can pass the number of partitions we want if we are not satisfied with the number of partitions Spark provides by default, as shown below:
>>> rdd1 = sc.textFile("statePopulations.csv", 10)  # 10 is the (minimum) number of partitions
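For reference, a rough Scala equivalent (the file name is hypothetical); note that this second argument to textFile is a minimum number of partitions, so Spark may create more than that but not fewer:
val rdd1 = sc.textFile("hdfs:///data/statePopulations.csv", 10)  // 10 = minPartitions
println(rdd1.getNumPartitions)  // at least 10 for an uncompressed, splittable file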

Spark 2.0+: spark.sql.files.maxPartitionBytes is not working?

My understanding is that spark.sql.files.maxPartitionBytes is used to control the partition size when Spark reads data from HDFS.
However, when I used Spark SQL to read data for a specific date from HDFS, it contained 768 files. The largest file is 4.7 GB and the smallest is 17.8 MB.
The HDFS block size is 128 MB.
The value of spark.sql.files.maxPartitionBytes is 128 MB.
I expected that Spark would split a large file into several partitions, each no larger than 128 MB. However, it doesn't work like that.
I know we can use repartition(), but it is an expensive operation.
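For reference, a sketch of the configuration being described (the path is hypothetical). One thing to keep in mind: spark.sql.files.maxPartitionBytes can only split files whose format is splittable; a non-splittable file (for example gzip-compressed text) always ends up in a single partition no matter how large it is.
// 128 MB expressed in bytes; set before the read is planned.
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
val df = spark.read.parquet("hdfs:///data/events/dt=2019-06-01")  // hypothetical path
println(df.rdd.getNumPartitions)  // roughly total input size / 128 MB for splittable files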

Get uncompressed size of the dataset on HDFS after being read by Spark

I am trying to improve the performance of my Spark application. To this end, I am trying to determine the optimal number of shuffle partitions for a dataset. I read from multiple sources that each partition should be about 128 MB.
So, if I have a 1 GB file, I'll need around 8 partitions. But my question is: how do I find the file size? I know I can find the file size on HDFS using the following:
hdfs dfs -du -s {data_path}
But from what I understand, this is the compressed size, and the actual size of the file is different (Spark uses a compression codec when writing Parquet files, Snappy by default). This leads me to two questions:
How do I find the actual uncompressed size of the file?
What should the number of shuffle partitions be based on: the compressed size or the actual size?
Shuffle partitions are independent of the data size.
The data is uncompressed and then shuffled based on the number of shuffle partitions (using a hash partitioner, range partitioner, etc.).
Generally, the shuffle partitions are tuned:
1. To increase the parallelism available in the reducer stage.
2. To reduce the amount of data processed per shuffle partition (if we observe spills or if the reduce stage is memory-intensive).
I read from multiple sources that each partition should be about 128 MB.
This is applicable only to mapper stages. The split sizes in the mapper are computed based on the size of the compressed data. You can tune the size of the mapper splits using spark.sql.files.maxPartitionBytes.
The shuffle partitions (configured using spark.sql.shuffle.partitions, defaulting to 200) relate to reducer stages.
In short, compression comes into play only in mapper stages and not reducer stages.
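A minimal sketch of tuning the reducer side (the values are illustrative only; the path and column name are hypothetical):
// Number of partitions produced by shuffles (joins, groupBy, ...); default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "400")

val counts = spark.read.parquet("hdfs:///data/events")  // hypothetical path
  .groupBy("event_type")                                // hypothetical column
  .count()
println(counts.rdd.getNumPartitions)  // usually 400 here, unless AQE coalesces small partitions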

How partitions are created in spark RDD

Let's say I am reading a file from HDFS using Spark (Scala). The HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD:
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided based on the block size?
In the above case, is the number of partitions 3?
Here is a good article that describes the partition computation logic for input.
The HDFS block size is the maximum size of a partition. So in your example the minimum number of partitions will be 3.
partitions = ceiling(input size/block size)
You can further increase the number of partitions by passing it as a parameter to sc.textFile, as in sc.textFile(inputPath, numPartitions).
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role. You can set it to increase the size of the partitions (and reduce their number), so if you set mapreduce.input.fileinputformat.split.minsize to, say, 130 MB, then you will get only 1 partition.
You can run the following to check the number of partitions:
distFile.partitions.size
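A hedged sketch of the two knobs mentioned above, in spark-shell (the numbers follow the 130 MB file / 64 MB block example):
// Ask for more partitions than the block count (the second argument is a minimum).
val distFile = sc.textFile("hdfs://user/cloudera/data.txt", 6)
println(distFile.partitions.size)   // at least 6 for an uncompressed text file

// Or raise the minimum split size to get fewer, larger partitions.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize",
  (130L * 1024 * 1024).toString)
val onePartition = sc.textFile("hdfs://user/cloudera/data.txt")
println(onePartition.partitions.size)  // 1 for the 130 MB file, as described above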

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the block size would have been much larger, meaning the partitions would be larger as well.
So let me clarify with compressed Parquet (these numbers are not fully accurate):
1 GB Parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks / 25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions, when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 Block = 1 Partition For Spark
1 Core operates on 1 Partition
A Spark DataFrame doesn't load Parquet files into memory; it uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrame partitions parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created from multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size.
I saw the other answer, but I thought I could clarify more on this. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel reads just by having more workers in Spark.
But in order to control the balance of data that comes into the workers, you may use the hierarchical structure of the Parquet files, and later have each worker point to a different partition or part of the Parquet file. This gives you control over how much data should go to each worker, according to the domain of your dataset (if by balancing data across workers you mean that an equal batch of data per worker is not efficient).
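A small sketch of the trade-off discussed in this thread (the path and numbers are illustrative): check how many partitions the Parquet scan actually produced, and repartition only if the downstream work needs more parallelism than that, since the repartition itself costs a full shuffle.
val df = spark.read.parquet("hdfs:///data/events.parquet")  // hypothetical path
println(df.rdd.getNumPartitions)   // e.g. 5 for a ~1 GB compressed file

// Full shuffle: spread the (larger) decompressed data over 25 partitions
// so that up to 25 cores can work on it in parallel.
val wider = df.repartition(25)
println(wider.rdd.getNumPartitions)  // 25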

Resources