I wrote a small PySpark program to test how Spark AQE works, but it doesn't seem to coalesce the shuffle partitions according to the parameters I pass to it.
Here is my code:
df = spark.read.format("csv").option("header", "true").load(<path to my csv file>)
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions","50")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "60")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes","200000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst","false")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "200000")
df3 = df.groupby("Loan title").agg({"*":"count"}).withColumnRenamed('count(1)','cnt')
df3.show()
The file is ~1.8 GB and gets read into 14 partitions, and its shuffle write is ~1.8 MB. Since I set advisoryPartitionSizeInBytes and minPartitionSize to 200 KB, I expected the number of coalesced partitions to be around 9 (~1.8 MB / 200 KB).
But even though we see 8 coalesced partitions in the AQEShuffleRead of the final plan, the number of tasks in the final stage is still 1, which is confusing.
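For reference, here is the rough arithmetic behind that expectation (a back-of-the-envelope sketch in PySpark, using the approximate sizes quoted above):

# Expected coalesced partition count, assuming ~1.8 MB total shuffle write (from the UI)
shuffle_write_bytes = 1.8 * 1024 * 1024   # ~1.8 MB
advisory_bytes = 200_000                  # spark.sql.adaptive.advisoryPartitionSizeInBytes
print(round(shuffle_write_bytes / advisory_bytes))  # ~9 partitions expected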
Please find the Spark UI screenshots below:
physical plan
stages
Could anyone help me figure out this behavior? Thanks in advance!
After some trials I figured out the issue. The shuffle read of the final stage was not equal to the shuffle write of the previous stage because of the df3.show() command: show() only reads as much of the input as it needs for the rows it displays, not all of it.
Once I changed this to a .write or to df3.rdd.getNumPartitions(), I could see the expected number of tasks/partitions being created, because now all the partitions are read.
Please find the screenshots below:
Stages
Stages 18 - 20 : df3.show()
Stages 21 - 23 : df3.write.format("csv").save(..)
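If you want to reproduce the comparison without the screenshots, a minimal sketch along these lines shows the difference (the output path is just a placeholder):

# show() is a partial action: Spark reads only enough partitions to fill the 20 displayed rows,
# so the AQE-coalesced plan is not fully executed and the final stage runs a single task.
df3.show()

# A full action reads every coalesced partition, so the task count matches the AQEShuffleRead.
print(df3.rdd.getNumPartitions())
df3.write.format("csv").mode("overwrite").save("/tmp/aqe_coalesce_check")  # placeholder path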
I am repartitioning the data frame after reading the data from ORC.
Available cores: 6
df = spark.read.orc("filePath")
df.rdd.getNumPartitions()
This gives 12 partitions (which I expected: the job ran locally, so cores * 2, in my case 6 * 2 = 12).
Now I am increasing the partitions:
df = df.repartition(50)
df.rdd.getNumPartitions() ---- returning 50 partitions
When I observe the Spark UI, the job still executes 12 tasks, whereas the 50-task stage is shown as skipped.
How do I tell Spark to use 50 tasks instead of the default 12?
Even after forcing a repartition to 50, why is Spark still using 12 tasks and not 50? Could someone please help me here?
As seen in the diagram below:
Spark UI
I grant you that reading the Spark UI is not always clear; I do not fully understand it at times either. I looked at a similar case on Databricks Community Edition, but it is not a fair comparison as that is a driver with 8 cores. You do not state your Spark version.
Nonetheless: requesting repartition(n) means round-robin partitioning. The source is 12 partitions and the target is 50 partitions, so the 12 tasks are fine in all ways; Spark has to go from the 12 existing partitions to the 50 new ones. As for the 50 + 12 = 62 total, I think the skipped stage corresponds to the previous partitions not yet returned to the system.
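As a rough illustration of that round-robin shuffle (a sketch in PySpark against your ORC read; the exact plan text varies a little by Spark version):

df = spark.read.orc("filePath")      # 12 input partitions in your case

# repartition(n) inserts a shuffle exchange: the map side runs one task per existing
# partition (12 here), and the reduce side materialises the 50 new partitions.
repartitioned = df.repartition(50)
repartitioned.explain()              # look for an Exchange / RoundRobinPartitioning(50) node

# Only an action that consumes the shuffled output will show a 50-task stage in the UI.
print(repartitioned.count())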
I am new to Spark and trying to understand its internals. So:
I am reading a small 50 MB parquet file from S3, performing a group by, and then saving back to S3.
When I observe the Spark UI, I can see 3 stages created for this:
stage 0: load (1 task)
stage 1: ShuffleQueryStage for the grouping (12 tasks)
stage 2: save (CoalescedShuffleReader) (26 tasks)
Code Sample:
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
.agg(F.sum("no_of_launches").alias("no_of_launchesGroup")
df_agg.write.mode("overwrite").parquet(target_loc)
I am using an EMR instance with 1 master and 3 core nodes (each with 4 vcores), so the default parallelism is 12. I am not changing any config at runtime. But I am not able to understand why 26 tasks are created in the final stage. As I understand it, the shuffle partitions should default to 200. Screenshot of the UI attached.
I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So, the job output on the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
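A sketch of that experiment, in case it helps to reproduce it (src_loc and grp_attribute are reused from the question's code):

from pyspark.sql import functions as F

def agg_partitions():
    # Rebuild the aggregation so the current conf is picked up at planning time.
    df = spark.read.format("parquet").load(src_loc)
    df_agg = df.groupby(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
    return df_agg.rdd.getNumPartitions()

spark.conf.set("spark.sql.adaptive.enabled", "true")
print(agg_partitions())   # 2 in my Databricks run

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "75")
print(agg_partitions())   # 75 in my Databricks run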
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases this number is too high for smaller data and too small for bigger data. Selecting the right value is always tricky for the developer.
So we need the ability to coalesce the shuffle partitions by looking at the mapper output. If the map stage generates only a small amount of output, we want to reduce the overall number of shuffle partitions to improve performance.
In the latest version, Spark 3.0, this reduction of tasks is automated with Adaptive Query Execution.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this, in Spark 2.4.5 the Catalyst optimizer or EMR might have enabled this feature internally to reduce the tasks, rather than running 200 of them.
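One quick way to check whether that is what happened in your EMR session (a sketch; only standard Spark config keys are used):

# Is adaptive execution on, and what is the base shuffle partition count?
print(spark.conf.get("spark.sql.adaptive.enabled", "false"))
print(spark.conf.get("spark.sql.shuffle.partitions"))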
Note: this is not a question about the difference between coalesce and repartition; there are many questions that cover that. Mine is different.
I have a PySpark job:
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())
df1.write.parquet(output_path, mode='overwrite')
The default of 200 partitions would require too much memory, so I changed the repartition to 1000.
The job detail on the Spark web UI looked like this:
As the output is only 44 MB, I tried to use coalesce to avoid too many little files slowing down HDFS.
What I did was just add .coalesce(20) before .write.parquet(output_path, mode='overwrite'):
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())  # 1000 here
df1.coalesce(20).write.parquet(output_path, mode='overwrite')
Then the Spark web UI showed:
It looks like only 20 tasks are running.
With repartition(1000), the parallelism was determined by my number of vcores, 36 here, and I could trace the progress intuitively (the progress bar size was 1000).
After coalesce(20), the earlier repartition(1000) loses its effect, the parallelism drops to 20, and I lose that intuition too.
Adding coalesce(20) also caused the whole job to get stuck and fail without notification.
Changing coalesce(20) to repartition(20) works, but according to the documentation, coalesce(20) should be much more efficient and should not cause such a problem.
I want higher parallelism, with only the result coalesced to 20. What is the correct way?
coalesce is considered a narrow transformation by the Spark optimizer, so it creates a single WholeStageCodegen stage from your groupby through to the output, thus limiting your parallelism to 20.
repartition is a wide transformation (i.e. it forces a shuffle); when you use it instead of coalesce, it adds a new output stage but preserves the groupby/train parallelism.
repartition(20) is a very reasonable option in your use case (the shuffle is small, so the cost is pretty low).
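In code that is just a one-line change to the question's snippet (a sketch):

# repartition(20) adds a small extra shuffle after the pandas UDF stage, so the
# groupby/train_predict work still runs at the full 1000-partition parallelism
# and only the write happens on 20 partitions.
df1.repartition(20).write.parquet(output_path, mode='overwrite')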
Another option is to explicitly prevent the Spark optimizer from merging your predict and output stages, for example by using cache or persist before your coalesce:
# Your groupby code here
from pyspark.storagelevel import StorageLevel

df1.persist(StorageLevel.MEMORY_ONLY)\
   .coalesce(20)\
   .write.parquet(output_path, mode='overwrite')
Given your small output size, a MEMORY_ONLY persist + coalesce should be faster than a repartition, but this doesn't hold once the output size grows.
I am new to Spark. I am trying to understand the number of partitions produced by default by a hiveContext.sql("query") statement. I know that we can repartition the dataframe after it has been created using df.repartition. But, what is the number of partitions produced by default when the dataframe is initially created?
I understand that sc.parallelize and some other transformations produce a number of partitions according to spark.default.parallelism. But what about a dataframe? I saw some answers saying that spark.sql.shuffle.partitions sets the number of partitions for shuffle operations like join. Does this give the initial number of partitions when a dataframe is created?
Then I also saw some answers explaining the number of partitions produced by setting:
mapred.min.split.size,
mapred.max.split.size, and
the Hadoop block size.
Then, when I tried it practically, I read 10 million records into a dataframe in a spark-shell launched with 2 executors and 4 cores per executor. When I ran df.rdd.getNumPartitions, I got the value 1. How am I getting 1 for the number of partitions? Isn't 2 the minimum number of partitions?
When I do a count on the dataframe, I see that 200 tasks are launched. Is this due to the spark.sql.shuffle.partitions setting?
I am totally confused! Can someone please answer my questions? Any help would be appreciated. Thank you!
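For context, the settings mentioned above can be inspected directly in the session (a PySpark sketch; the table name is hypothetical):

# The shuffle setting that explains the 200 tasks seen on count()
print(spark.conf.get("spark.sql.shuffle.partitions"))   # defaults to 200
# What sc.parallelize() would use; DataFrame reads are driven by input splits instead
print(sc.defaultParallelism)

df = hiveContext.sql("select * from some_table")        # hypothetical table
print(df.rdd.getNumPartitions())                        # initial partitions, set by the input splits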
First, I join two dataframes. The first DF is filtered from the second DF and is about 8 MB (260,000 records); the second DF comes from a file that is roughly 2 GB (37,000,000 records). Then I call
joinedDF.javaRDD().saveAsTextFile("hdfs://xxx:9000/users/root/result");
and I also tried
joinedDF.write().mode(SaveMode.Overwrite).json("hdfs://xxx:9000/users/root/result");
I am a bit confused since I get an exception:
ERROR TaskSetManager: Total size of serialized results of 54 tasks
(1034.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
As far as I know, saveAsTextFile should write its output directly from the workers, so why do I get an exception related to the driver?
I know about the option to increase spark.driver.maxResultSize, and I set it to unlimited, but it does not help since my driver has only 4.8 GB of memory in total.
EDIT:
DataFrame df1 = table.as("A");
DataFrame df2 = table.withColumnRenamed("id", "key").filter("value = 'foo'");
joinedDF = df1.join(df2.as("B"), col("A.id").
startsWith(col("B.key")),
"right_outer");
I also tried a broadcast variable; the change is in df2:
DataFrame df2 = sc.broadcast(table.withColumnRenamed("id", "key").filter("value = 'foo'")).getValue();
I found the answer in this related post: https://stackoverflow.com/a/29602918/5957143
To summarize @kuujo's answer:
saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition, or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
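A PySpark sketch of that distinction (illustrative only; joined_df and the HDFS path mirror the question's setup):

# The write is executed on the workers: each task writes its own partition to HDFS,
# and only a small status result comes back to the driver once the save completes.
joined_df.write.mode("overwrite").json("hdfs://xxx:9000/users/root/result")

# collect() is what really ships rows to the driver, which is the kind of transfer
# spark.driver.maxResultSize guards against.
sample = joined_df.limit(10).collect()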