Size of a partition or RDD - apache-spark

How can we calculate the size of a partition in an RDD? Is calculating the partition size not recommended? I want to set the number of shuffle partitions dynamically before I call any action, so I need to calculate the partition size and then set the shuffle partition count based on the number of executors.

"I want to dynamically set the number of shuffle partition before I call any action"
Unfortunately, that's challenging to do in Spark without diving deep into the low-level code. In fact, this is something that adaptive query execution in Spark 3.0 brings to the table: it over-partitions the dataset and then dynamically combines small partitions until they reach a certain threshold.
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html

You can get the number of partitions of an RDD (not their size in bytes) with the command below:
someRDD.partitions.size
You can choose the number of partitions in different ways, for example:
based on the columns
based on (dataset size)/(block size)
based on the cores available
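As a rough illustration of the last two options, the sketch below derives a partition count from the dataset size and a target partition size, then rounds it up to a multiple of the available cores. The helper name estimate_partition_count and the 128 MB target are assumptions made for this example, not Spark APIs or recommended values.

```python
def estimate_partition_count(dataset_bytes, target_partition_bytes, total_cores):
    """Hypothetical helper: pick a partition count near the target partition
    size that is also a multiple of the number of cores."""
    # Ceiling division without floats: partitions needed to stay under the target size.
    by_size = -(-dataset_bytes // target_partition_bytes)
    # Round up to whole "waves" of tasks so every core is busy in the last wave.
    waves = max(1, -(-by_size // total_cores))
    return waves * total_cores


MB = 1024 * 1024

# 10 GB of data, 128 MB target partitions, 16 cores in total.
print(estimate_partition_count(10 * 1024 * MB, 128 * MB, 16))  # 80
```

The resulting number could then be passed to repartition or used as the shuffle partition setting; whether a multiple of the core count is the right rounding depends on the workload.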

Related

Difference between shuffle partition and repartition

I am a newbie in Spark and I am trying to understand shuffle partitions and the repartition function, but I still don't understand how they differ. Do both reduce the number of partitions?
Thank you
The biggest difference between shuffle partitions and repartition is when things are defined.
The configuration spark.sql.shuffle.partitions is a property which, according to the documentation,
Configures the number of partitions to use when shuffling data for joins or aggregations.
That means that every time you run a join or any type of aggregation in Spark, the data is shuffled into the number of partitions given by this configuration, whose default value is 200. So if you join two datasets, the shuffle will produce 200 partitions.
The repartition(numPartitions, *cols) function is called explicitly during execution, where you define how many partitions the result will have, by number, by partition columns, or both. It is usually used to control output writing. The example in the documentation shows this well.
So in general, the shuffle partitions setting applies to joins and aggregations during execution, while repartition controls the number of output files, by number or by partition column.

How to distribute data into X partitions on read with Spark?

I’m trying to read data from Hive with a Spark DataFrame and distribute it into a specific, configurable number of partitions (in proportion to the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that configuring this property would be irrelevant for me? Or is the read operation considered a shuffle? If not, what is the alternative? Repartition and coalesce seem a bit like overkill for this.
To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.
In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum number of bytes packed into a single partition when reading files (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html).
The default value is 128 MB; lowering it produces more partitions and so improves parallelism.
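To make the effect of this setting concrete, here is a small sketch of the split-size arithmetic as I understand it from Spark's file source: the split size is the smaller of maxPartitionBytes and the bytes available per core, but never below the file open cost. The function names are hypothetical, the 4 MB open cost is the documented default of spark.sql.files.openCostInBytes, and the resulting partition count is only an upper bound because Spark may pack several small splits into one partition.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Approximate the split size used when planning a file scan (assumed formula)."""
    # Every file is charged an extra "open cost" on top of its length.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_read_partitions(file_sizes, default_parallelism, **conf):
    """Upper bound on read partitions: each file is cut into chunks of at most
    the split size; small chunks packed together would lower the real count."""
    split = max_split_bytes(file_sizes, default_parallelism, **conf)
    return sum(-(-size // split) for size in file_sizes)


one_gb_file = [1024 * MB]
print(estimate_read_partitions(one_gb_file, 8))                                 # 8
print(estimate_read_partitions(one_gb_file, 8, max_partition_bytes=32 * MB))    # 32
```

Under these assumptions, dropping the setting from 128 MB to 32 MB quadruples the number of read partitions for a single 1 GB splittable file.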
A read is not a shuffle as such. You need to get the data in at some stage.
You can either use the answer below or let Spark's own algorithm set the number of partitions on read.
You do not state whether you are using RDDs or DataFrames. With an RDD you can set the number of partitions on read. With a DataFrame you generally need to repartition after the read.
Your point on controlling parallelism is less relevant when joining or aggregating as you note.

How does Spark decide the partitions number of the next stage when shuffle in SparkSQL?

Of course I know the spark.sql.shuffle.partitions config,
but, for example, when I set this config to 300 on a small dataset which has just 200 rows, the config does not take effect; the actual partition number is just 2.
Another example: I set this config to 3000 on a dataset which has 30 billion rows, and the config does not take effect either; the actual partition number is just 600.
We see that when we set a large partition value on a small dataset, the config does not take effect.
So I just want to know: how does Spark decide the number of partitions of the next stage when shuffling in Spark SQL? Or how can I force this config to take effect?
My Spark SQL is just like below:
set spark.sql.shuffle.partitions=3000;
with base_data as (
    select
        device_id
    from
        table_name
    where
        dt = '20210621'
    distribute by
        rand()
)
select count(1) from base_data
In general, a narrow transformation does not change the number of partitions.
A wide transformation can change the number of partitions.
Narrow transformation: all the elements required to compute the records in a single partition live in a single partition of the parent RDD, so only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
Wide transformation: the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey and reduceByKey.
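The distinction can be mimicked without Spark. The sketch below models an RDD as a list of partitions: a map-style step works partition by partition and keeps the partition count, while a groupByKey-style step redistributes records by key into however many output partitions are requested. The function names and the crc32-based partitioner are stand-ins chosen for this illustration, not Spark's actual implementation.

```python
import zlib


def hash_partition(key, num_partitions):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one parent partition."""
    return [[fn(record) for record in part] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide: an output partition may need records from every parent partition."""
    out = [dict() for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            target = hash_partition(key, num_output_partitions)
            out[target].setdefault(key, []).append(value)
    return out


parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
mapped = narrow_map(parent, lambda kv: (kv[0], kv[1] * 10))
grouped = wide_group_by_key(parent, 2)

print(len(parent), len(mapped), len(grouped))  # 3 3 2
```

In this toy model the output partition count of the wide step is simply a parameter, which is the role spark.sql.shuffle.partitions plays for joins and aggregations in Spark SQL.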
Update after question change:
You can think of "spark.sql.shuffle.partitions" as a query hint that forces the executors to create that number of partitions for joins or aggregations. In my view, we should not play with this value unless we are very sure what the number of grouping keys will be.
Changing it without that knowledge can cause unnecessary shuffling of data over the network.

Spark group by Key and partitioning the data

I have a large csv file with data in following format.
cityId1,name,address,.......,zip
cityId2,name,address,.......,zip
cityId1,name,address,.......,zip
........
cityIdN,name,address,.......,zip
I am performing following operation on the above csv file:
Group by cityId as key and list of resources as value
df1.groupBy($"cityId").agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")
Change it to jsonRDD
val jsonDataRdd2 = df2.toJSON.rdd
Iterate through each Partition and upload to s3 per key
I cannot use the DataFrame partitionBy write because of business logic constraints (how other services read from S3).
My Questions:
What is the default size of a spark partition?
Let's say default size of partition is X MBs and there is one large record present in the dataFrame with key having Y MBs of data (Y > X) , what would happen in this scenario?
Do I need to worry about having the same key in different partitions in that case?
In answer to your questions:
When reading from secondary storage (S3, HDFS), the partitions are equal to the block size of the file system, 128MB or 256MB; but you can repartition RDDs immediately, not DataFrames. (For JDBC and Spark Structured Streaming the partitions are dynamic in size.)
When applying 'wide transformations' and repartitioning, the number and size of partitions will most likely change. The size of a given partition has a maximum value. In Spark 2.4.x the partition size increased to 8GB. So, if any transformation (e.g. collect_list in combination with groupBy) generates more than this maximum size, you will get an error and the program aborts. So you need to partition wisely or, in your case, have a sufficient number of partitions for the aggregation; see the spark.sql.shuffle.partitions parameter.
Spark's parallel processing model relies on each key being assigned, via hash partitioning, range partitioning, etc., to one and only one partition during the shuffle. So iterating through a partition with foreachPartition or mapPartitions poses no issue.
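The last point can be checked with a small simulation. The sketch below shuffles skewed (cityId, value) records into four partitions using a crc32-based stand-in for a hash partitioner, then walks each partition the way foreachPartition would and counts in how many partitions each key shows up. All names and the sample data are made up for the illustration.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(records, num_partitions):
    """Send every record to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Skewed input: one city dominates the dataset.
records = [("city1", i) for i in range(1000)] + [("city2", 0), ("city3", 0)]
partitions = shuffle_by_key(records, 4)

# Per-partition iteration: count how many partitions contain each key.
partitions_per_key = Counter()
for part in partitions:
    for key in {k for k, _ in part}:
        partitions_per_key[key] += 1

print(dict(partitions_per_key))
print([len(p) for p in partitions])
```

Every key appears in exactly one partition, so a per-key upload done inside the partition loop would see all of that key's records. The partition sizes also show the flip side: the partition holding the dominant key is far larger than the others.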

Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge?

Consider that I have a PairRDD of, say, 10 partitions, but the keys are not evenly distributed: 9 of the partitions hold data belonging to a single key, say a, and the remaining keys, say b and c, are in the last partition only. This is represented by the figure below:
Now if I do a groupByKey on this RDD, my understanding is that all data for the same key will eventually go to a single partition, i.e. data for the same key will not be spread across multiple partitions. Please correct me if I am wrong.
If that is the case, there is a chance that the partition for key a will be too large to fit in a worker's RAM. What will Spark do in that case? My assumption is that it will spill the data to the worker's disk.
Is that correct?
Or how does Spark handle such situations?
Does spark keep all elements (...) for a particular key in a single partition after groupByKey
Yes, it does. This is the whole point of the shuffle.
the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do
The size of a particular partition is not the biggest issue here. Partitions are represented using lazy Iterators and can easily hold data which exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping.
All values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM. Even if each record separately fits in memory, you may still encounter serious GC issues.
In general:
It is safe, although not optimal performance-wise, to repartition data where the amount of data assigned to a partition exceeds the amount of available memory.
It is not safe to use PairRDDFunctions.groupByKey in the same situation.
Note: You shouldn't extrapolate this to different implementations of groupByKey though. In particular both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
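The difference between lazily streaming a partition and materialising a group can be sketched in plain Python. Below, a generator plays the role of the lazy partition iterator; the groupByKey-style function buffers every value per key (the role CompactBuffer plays in the answer above), while a reduce-style function, shown only as a commonly suggested contrast, keeps a single accumulator per key. The function names and data sizes are assumptions made for the illustration.

```python
def records():
    """Lazy source: records are produced one at a time, never held all at once."""
    for i in range(100_000):
        yield ("a", i)
    yield ("b", 1)
    yield ("c", 2)


def group_by_key(stream):
    """Buffers every value of a key in a list, so memory grows with the group."""
    groups = {}
    for key, value in stream:
        groups.setdefault(key, []).append(value)
    return groups


def reduce_by_key(stream, fn):
    """Keeps one accumulator per key, so memory grows with the number of keys."""
    acc = {}
    for key, value in stream:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


grouped = group_by_key(records())
counts = reduce_by_key(((key, 1) for key, _ in records()), lambda x, y: x + y)

print(len(grouped["a"]), counts["a"])  # 100000 100000
```

Both calls consume the same lazy stream, but only the grouping version ends up holding 100,000 values for key a in memory at once, which is the situation that can lead to an OOM at scale.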

Resources