How do we calculate the input data size and feed the number of partitions to re-partition/coalesce? - apache-spark

Example: assume we have an input RDD that is then filtered in a second step. I want to calculate the data size of the filtered RDD and work out how many partitions are needed for a repartition, assuming a block size of 128 MB.
That would let me pass the right number of partitions to the repartition method.
InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(<some filter condition>)
FilteredRDD.repartition(XX)
Q1. How do I calculate the value of XX?
Q2. What is the equivalent approach for Spark SQL/DataFrames?

The 128 MB block size only comes into the picture when reading/writing data from/to HDFS. Once the RDD is created, the data sits in memory or spills to disk depending on executor RAM size.
You can't calculate the data size without calling an action such as collect() on the filtered RDD, and that is not recommended.
The maximum partition size is 2 GB; you can choose the number of partitions based on cluster size or data model.
For Spark SQL/DataFrames (Q2): df.repartition(col)
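
A minimal PySpark sketch of one way to approach Q1, assuming the input lives on HDFS (or another Hadoop-compatible filesystem): size the partition count from the input's on-disk size, since the filtered RDD's in-memory size can't be read without an action. spark and sc are the usual SparkSession/SparkContext, spark._jvm and spark._jsc are internal PySpark handles, and the 128 MB target is just the block size from the question.

import math

target_partition_bytes = 128 * 1024 * 1024   # aim for ~128 MB per partition

# Ask the Hadoop FileSystem for the on-disk size of the input (internal PySpark handles).
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
input_bytes = fs.getContentSummary(jvm.org.apache.hadoop.fs.Path("sample.txt")).getLength()

num_partitions = max(1, math.ceil(input_bytes / target_partition_bytes))

inputRDD = sc.textFile("sample.txt")
filteredRDD = inputRDD.filter(lambda line: "error" in line)   # placeholder filter condition
filteredRDD = filteredRDD.repartition(num_partitions)

Note this sizes partitions from the raw input, not the (smaller) filtered data, so it over-estimates the count in proportion to how selective the filter is.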

Related

When is the optimal time to repartition a PySpark DataFrame after filter?

I have a large DataFrame (billions of records) to which I am applying a filter that reduces it considerably (to tens of millions of records). Of course this leaves highly skewed partitions, which is destroying the database write performance.
I know the data needs to be repartitioned. I created a column of random integers between 1 and 36; I have 36 cores available and 36 corresponding partitions. The idea was that repartitioning on this random-number column would end up with 36 relatively even partitions. The issue is that the repartition alone takes over 90 minutes. I call the repartition operation right before the database write. Am I missing anything here?
Thank you!
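
For reference, a sketch of the approach described in this question; filtered_df stands in for the already-filtered DataFrame and the JDBC connection details are placeholders, not from the post.

from pyspark.sql import functions as F

# Add a random salt column in the 1..36 range after the filter.
salted = filtered_df.withColumn("salt", (F.rand(seed=7) * 36).cast("int") + 1)

# Repartition on the salt right before the database write.
(salted
    .repartition(36, "salt")
    .write
    .jdbc(url="jdbc:postgresql://host/db",                      # placeholder connection
          table="target_table",
          mode="append",
          properties={"user": "user", "password": "password"}))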

How to check size of each partition of a dataframe in Databricks using pyspark?

df = spark.read.parquet('path')
df.rdd.getNumPartitions() gives me the number of partitions, but I would like information such as the size of each partition of this dataframe.
How do I get it?
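
A minimal sketch (not from the original post): Spark doesn't expose per-partition byte sizes directly, but you can get the row count of each partition with mapPartitionsWithIndex and use that as a proxy for skew.

df = spark.read.parquet('path')

# Count the rows in each partition without collecting the data itself.
counts = (df.rdd
            .mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))])
            .collect())

for idx, n in counts:
    print("partition %d: %d rows" % (idx, n))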

How many partitions Spark creates when loading a Hive table

Whether it is a Hive table or an HDFS file, I was assuming that when Spark reads the data and creates a dataframe, the number of partitions in the RDD/dataframe would equal the number of part-files in HDFS. But when I tested with a Hive external table, the number came out different from the number of part-files: the dataframe had 119 partitions, while the Hive partitioned table had 150 part-files, with a minimum file size of 30 MB and a maximum of 118 MB. So what decides the number of partitions?
You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB; see the Spark SQL tuning guide.
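
For illustration, a hedged example of tuning that setting before the read; the 64 MB value and the path are arbitrary placeholders.

spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")   # pack at most ~64 MB per read partition
df = spark.read.parquet("/path/to/table/files")               # placeholder path
print(df.rdd.getNumPartitions())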
I think this link answers my question: the number of partitions depends on the number of splits, and the splits depend on the Hadoop InputFormat.
https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs
Spark will read the data with a block size of 128 MB per block.
Say your Hive table is approximately 14.8 GB: Spark will divide the table data into 128 MB blocks, which results in 119 partitions.
Your Hive table, on the other hand, is partitioned, so the partition column has 150 unique values.
So the number of part-files in Hive and the number of partitions in Spark are not linked.
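
A quick back-of-the-envelope check of the 119 figure quoted above (sizes are approximate):

import math
table_mb = 14.8 * 1024            # ~14.8 GB of table data, in MB
print(math.ceil(table_mb / 128))  # -> 119 read splits at ~128 MB each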

Specify a max file size while writing a dataframe as parquet

When I write a dataframe as parquet, the file sizes come out non-uniform. I don't need the files to be uniform, but I do want to set a max size for each file.
I can't afford to repartition the data because the dataframe is sorted (as I understand it, repartitioning a sorted dataframe can distort the ordering).
Any help would be appreciated.
I have come across maxRecordsPerFile, but I don't want to limit the number of rows, and I might not have full information about the columns (total number of columns and their types), so it's difficult to estimate file size from a row count.
I have read about the parquet block size as well, and I don't think that helps.
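
One possible workaround, sketched under assumptions rather than a true max-file-size setting: derive a maxRecordsPerFile value from a sampled average row size. The sampling fraction, the 128 MB target, and the use of an in-memory pandas estimate (Parquet on disk is compressed, so actual files will usually come out smaller) are all assumptions; df is the dataframe being written.

target_bytes = 128 * 1024 * 1024                        # desired upper bound per output file

sample = df.sample(fraction=0.001, seed=1).toPandas()   # small sample pulled to the driver
avg_row_bytes = max(1, int(sample.memory_usage(deep=True).sum()) // max(1, len(sample)))
records_per_file = max(1, target_bytes // avg_row_bytes)

(df.write
   .option("maxRecordsPerFile", int(records_per_file))
   .parquet("/path/to/output"))                         # placeholder path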

How does Spark's RDD.randomSplit actually split the RDD

So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2.
The RDD is partitioned across 100 partitions.
When calling RDD.randomSplit([0.8, 0.2]):
Does the function also shuffle the RDD? Or does the splitting simply sample a continuous 20% of the RDD? Or does it select 20% of the partitions at random?
Ideally, does the resulting split have the same class distribution as the original RDD (i.e. 2:1)?
Thanks
For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.
Each partition is sampled using a set of BernoulliCellSamplers. For each split it iterates over the elements of a given partition and selects an item if the value of the next random Double falls within the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. This means it:
- doesn't shuffle the RDD
- doesn't take continuous blocks other than by chance
- takes a random sample from each partition
- takes non-overlapping samples
- requires n-splits passes over the data
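
An illustrative check (not from the original answer), reproducing the 2000/1000 setup from the question to show that the 2:1 class ratio is roughly preserved in both splits because sampling happens within each partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# 2000 rows of class 1 followed by 1000 rows of class 2, spread over 100 partitions.
rdd = sc.parallelize([1] * 2000 + [2] * 1000, numSlices=100)
train, test = rdd.randomSplit([0.8, 0.2], seed=42)

# Both splits should show roughly a 2:1 ratio of class 1 to class 2.
print(train.filter(lambda c: c == 1).count(), train.filter(lambda c: c == 2).count())
print(test.filter(lambda c: c == 1).count(), test.filter(lambda c: c == 2).count())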
