I am new to Spark. I am trying to understand the number of partitions produced by default by a hiveContext.sql("query") statement. I know that we can repartition the dataframe after it has been created using df.repartition. But, what is the number of partitions produced by default when the dataframe is initially created?
I understand that sc.parallelize and some other transformations produce the number of partitions according to spark.default.parallelism. But what about a dataframe? I saw some answers saying that the setting spark.sql.shuffle.partitions controls the number of partitions used by shuffle operations like joins. Does this give the initial number of partitions when a dataframe is created?
Then I also saw some answers explaining the number of partitions produced by setting
mapred.min.split.size.
mapred.max.split.size and
hadoop block size
Then, when I tried it practically, I read 10 million records into a dataframe in a spark-shell launched with 2 executors and 4 cores per executor. When I did df.rdd.getNumPartitions, I got the value 1. How am I getting 1 for the number of partitions? Isn't 2 the minimum number of partitions?
When I do a count on the dataframe, I see that 200 tasks are launched. Is this due to the spark.sql.shuffle.partitions setting?
I am totally confused! Can someone please answer these questions? Any help would be appreciated. Thank you!
I am a newbie in Spark and I am trying to understand the shuffle partition setting and the repartition function, but I still don't understand how they are different. Do both reduce the number of partitions?
Thank you
The biggest difference between shuffle partitions and repartition is when each is defined.
The configuration spark.sql.shuffle.partitions is a property; according to the documentation:
Configures the number of partitions to use when shuffling data for joins or aggregations.
That means that every time you run a join or any type of aggregation, Spark shuffles the data into the number of partitions given by this configuration, whose default value is 200. So if you join two datasets, the number of partitions in the shuffle will be 200.
The repartition(numPartitions, *cols) function is applied explicitly during execution: you define how many partitions you want, usually for output writing based on partition columns or just a number. The example in the documentation shows this well.
So in general, shuffle partitions are for joins and aggregations during execution, while repartition is for the number of output files, based on a number or a partition column.
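A minimal spark-shell sketch of the difference, using made-up column and path names (not from the question):

import spark.implicits._

val df = spark.range(1000000).withColumn("key", $"id" % 100)

// Shuffle partitions: picked up implicitly by joins and aggregations.
spark.conf.set("spark.sql.shuffle.partitions", "50")
val agg = df.groupBy("key").count()
println(agg.rdd.getNumPartitions)   // typically 50 here (adaptive execution may coalesce this in Spark 3.x)

// repartition: chosen explicitly in code, typically right before writing output.
df.repartition(10, $"key")
  .write
  .mode("overwrite")
  .parquet("/tmp/repartition_example")   // placeholder output path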
This question has been asked in other threads, but it seems my problem doesn't fit any of them.
I'm using Spark 2.4.4 in local mode. I set the master to local[16] to use 16 cores, and I also see in the web UI that 16 cores have been allocated.
I create a dataframe by importing a CSV file of about 8 MB like this:
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")
finally I print the number of partitions the dataframe is made of:
df.rdd.partitions.size
res5: Int = 2
The answer is 2.
Why? From what I have read, the number of partitions depends on the number of executors, which by default is set equal to the number of cores (16).
I tried to set the number of executors using spark.default.parallelism = 4 and/or spark.executor.instances = 4 and started a new Spark session, but nothing changed in the number of partitions.
Any suggestions?
When you read a file with Spark, the number of partitions is calculated as the maximum of defaultMinPartitions and the number of splits computed from the Hadoop input split size and the block size. Since your file is small, the number of partitions you get is 2, which is the maximum of the two.
defaultMinPartitions is calculated as
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Please check https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329
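As a quick check, here is a spark-shell sketch (where spark and sc are predefined) that prints the values involved for the read in the question:

println(sc.defaultParallelism)     // e.g. 16 with local[16]
println(sc.defaultMinPartitions)   // min(defaultParallelism, 2) = 2

val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("Datasets/globalpowerplantdatabasev120/*.csv")
println(df.rdd.getNumPartitions)   // 2 for this small (~8 MB) input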
I'm using MLlib with Python (PySpark) and would like to know the number of RDDs created in memory before my code executes. I'm performing transformations and actions on RDDs, so I would just like to know the total number of RDDs created in memory.
The number of RDDs depends on your program.
But I think what you want here is the number of partitions an RDD is created with:
for that you can use: rdd.getNumPartitions()
refer to: Show partitions on a pyspark RDD
Upvote if it works.
First of all, as you asked about the number of RDDs: that depends on how you write your application code. There can be one or more RDDs in your application.
You can, however, find the number of partitions in an RDD.
For Scala:
someRDD.partitions.size
For PySpark:
someRDD.getNumPartitions()
If there is more than one RDD in your application, you can count the partitions of each RDD and sum them; that will be the total number of partitions, as in the sketch below.
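A minimal spark-shell sketch of summing partition counts across RDDs; the RDDs below are placeholders for illustration:

import org.apache.spark.rdd.RDD

val rdd1 = sc.parallelize(1 to 1000, 4)
val rdd2 = sc.parallelize(Seq("a", "b", "c"), 2)

val allRdds: Seq[RDD[_]] = Seq(rdd1, rdd2)
val totalPartitions = allRdds.map(_.partitions.length).sum
println(s"Total partitions across RDDs: $totalPartitions")   // 4 + 2 = 6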
I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 different executors and 98% of the resources are unused. Then the partitionBy happens, which again creates a lot of shuffle. Is there a way for the partitionBy to happen first and then the repartition to run on the data?
Although the question is not entirely easy to follow, the following aligns with the other answer, and this approach should avoid the issues mentioned regarding unnecessary shuffling:
val n = [... some calculation for number of partitions / executors based on cluster config and volume of data to process ...]
df.repartition(n, $"field_1", $"field_2", ...)
  .sortWithinPartitions("field_x", "field_y")
  .write
  .partitionBy("field_1", "field_2", ...)
  .format("parquet")          // or another output format
  .save("/path/to/output")    // placeholder output location
whereby [field_1, field_2, ...] are the same set of fields for repartition and partitionBy.
You can use repartition(5, col("$colName")).
Thus when you call partitionBy("$colName") you will skip the shuffle for '$colName', since the data has already been repartitioned by it.
Also consider having as many partitions as the number of executors times the number of cores per executor times 3 (though this factor may vary between 2 and 4).
So, as we know, Spark can only run one concurrent task per partition of an RDD. Assuming you have 8 cores per executor and 5 executors:
You need to have: 8 * 5 * 3 = 120 partitions
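A minimal sketch of this sizing rule combined with the column-based repartition above; the DataFrame, the column name "colName", and the paths are placeholders, not taken from the question:

import org.apache.spark.sql.functions.col

// Hypothetical cluster sizing; adjust to your deployment.
val numExecutors = 5
val coresPerExecutor = 8
val factor = 3                                                 // rule of thumb, typically between 2 and 4
val numPartitions = numExecutors * coresPerExecutor * factor   // 120

val df = spark.read.parquet("/path/to/input")                  // placeholder input

df.repartition(numPartitions, col("colName"))
  .write
  .partitionBy("colName")
  .parquet("/path/to/output")                                  // placeholder output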
I am using Spark in local mode and a simple join is taking too long. I have fetched two dataframes: A (8 columns, 2.3 million rows) and B (8 columns, 1.2 million rows), joined them using A.join(B, condition, 'left'), and called an action at the end. It creates a single job with three stages: one for extracting each of the two dataframes and one for the join. Surprisingly, the stage extracting dataframe A takes around 8 minutes, that of dataframe B takes 1 minute, and the join happens within seconds. My important configuration settings are:
spark.master local[*]
spark.driver.cores 8
spark.executor.memory 30g
spark.driver.memory 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 16
The only executor is the driver itself. While extracting the dataframes, I partitioned them into 32 parts (also tried 16, 64, 50, 100, 200). I saw about 100 MB of shuffle write for the stage extracting dataframe A. So, to avoid the shuffle, I made 16 initial partitions for both dataframes and broadcasted dataframe B (the smaller one) using the broadcast(B) syntax, but it is not helping: there is still shuffle write. Am I doing something wrong? Why is shuffling still there? Also, when I look at the event timeline it shows only four cores processing at any point in time, although I have a 2-core * 4 processor machine. Why is that?
In short, "Join"<=>Shuffling, the big question here is how uniformly are your data distributed over partitions (see for example https://0x0fff.com/spark-architecture-shuffle/ , https://www.slideshare.net/SparkSummit/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning and just Google the problem).
A few possibilities to improve efficiency:
think more about your data (A and B) and partition it wisely;
analyze whether your data is skewed;
go into the UI and look at the task timings;
choose keys for partitioning such that during the join only a few partitions of dataset A shuffle with a few partitions of B (see the sketch below);
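A minimal sketch of the last point, using placeholder DataFrames and a join key named "key" (none of these names come from the question):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Placeholder data standing in for dataframes A and B.
val dfA = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value_a")
val dfB = Seq((1, "x"), (3, "y")).toDF("key", "value_b")

// Repartition both sides by the join key so matching keys land in matching partitions.
val joined = dfA.repartition(16, $"key")
  .join(dfB.repartition(16, $"key"), Seq("key"), "left")

// If B is small enough to fit in memory, a broadcast join avoids shuffling A at all.
val broadcastJoined = dfA.join(broadcast(dfB), Seq("key"), "left")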