PySpark join iteration time increasing exponentially

I have a table named "table1" that I split based on a criterion, and then join the split parts back together one by one in a for loop. The code below is a representation of what I am trying to do.
As the loop progresses, the join time increases exponentially:
0.7423694133758545
join
0.4046192169189453
join
0.5775985717773438
join
5.664674758911133
join
1.0985417366027832
join
2.2664384841918945
join
3.833379030227661
join
12.762675762176514
join
44.14520192146301
join
124.86295890808105
join
389.46189188957214
The following are my parameters:
spark = SparkSession.builder.appName("xyz").getOrCreate()
sqlContext = HiveContext(spark)
sqlContext.setConf("spark.sql.join.preferSortMergeJoin", "true")
sqlContext.setConf("spark.serializer","org.apache.spark.serializer.KryoSerializer")
sqlContext.setConf("spark.sql.shuffle.partitions", "48")
and
--executor-memory 16G --num-executors 8 --executor-cores 8 --driver-memory 32G
Source table
Desired output table
In the join loop I also tried increasing the number of partitions to 2000 and decreasing it to 4, and caching the DataFrame with df.cache(), but nothing worked. I know I am doing something terribly wrong, but I don't know what. Can you please guide me on how to correct this?
I would really appreciate any help :)
code:
from pyspark.sql.functions import broadcast, col

# Start from an empty DataFrame and join one split part per iteration
df = spark.createDataFrame([], schema=SCHEMA)
for i, column in enumerate(columns):
    df.cache()
    # Select the rows belonging to this split and rename the value column
    df_part = df_to_transpose.where(col('key') == column)
    df_part = df_part.withColumnRenamed("value", column)
    if df_part.count() != 0 and df.count() != 0:
        df = df_part.join(broadcast(df), 'tuple')

I had the same problem a while ago. If you open the Spark web UI, go to the Stages section and look at the DAG visualization for your job, you can see that the DAG grows exponentially, and the waiting time you observe is spent building this DAG rather than doing the actual work. I don't know exactly why, but it seems that when you keep joining a DataFrame with DataFrames derived from itself, Spark can't handle the partitions well and the plan keeps getting bigger. The workaround I found at the time was to save each join result to a separate file and, at the end, after restarting the kernel, load and join all the files again. It seems that if the DataFrames you join are not derived from each other, you don't see this problem.
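A rough sketch of that workaround applied to the loop from the question; it writes each intermediate result to Parquet and reads it back so the next iteration starts from a short plan (the /tmp path is just a placeholder):

from pyspark.sql.functions import broadcast, col

df = spark.createDataFrame([], schema=SCHEMA)
for i, column in enumerate(columns):
    df_part = df_to_transpose.where(col('key') == column).withColumnRenamed("value", column)
    if df_part.count() != 0 and df.count() != 0:
        df = df_part.join(broadcast(df), 'tuple')
        # Materializing and re-reading replaces the ever-growing plan
        # with a plain Parquet scan for the next iteration.
        df.write.mode('overwrite').parquet('/tmp/join_steps/step_{}'.format(i))
        df = spark.read.parquet('/tmp/join_steps/step_{}'.format(i))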

Add a checkpoint every loop, or every so many loops, so as to break lineage.
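For instance, a minimal sketch of that approach, assuming a writable checkpoint directory (the path and the every-5-iterations cadence are arbitrary choices):

# Once, before the loop (the directory is a placeholder):
spark.sparkContext.setCheckpointDir('/tmp/spark_checkpoints')

# Inside the loop, right after the join:
df = df_part.join(broadcast(df), 'tuple')
if i % 5 == 0:
    # checkpoint() materializes the DataFrame and truncates its lineage,
    # so the query plan stops growing across iterations.
    df = df.checkpoint()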

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframes, target_df and reference_df. I need to remove the account_ids from target_df that are present in reference_df.
target_df is created from a Hive table and has hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need to apply the numPartitions value to each date partition, but it is being applied to the entire dataframe. For example, if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitionBy("date"), but that causes a huge data shuffle.
Can anyone please provide an optimized solution? Thanks!
I am afraid you cannot skip the shuffle in this case. repartition/coalesce/partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried to use coalesce, which indeed does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it is not going to help you here.
You can try to achieve what you want with a combination of repartition and partitionBy. Here is a description of both functions (the same applies to Scala); source: https://sparkbyexamples.com:
PySpark repartition() is a DataFrame method that is used to increase or reduce the number of partitions in memory; when the data is written to disk, it creates all part files in a single directory.
PySpark partitionBy() is a method of the DataFrameWriter class which is used to write the DataFrame to disk in partitions, one sub-directory for each unique value in the partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy, Spark creates a sub-directory for each value and one part file for each in-memory partition that contains the given key.
So if, after the repartition, date X is present in 500 of the 1000 partitions, you will find 500 files in the sub-directory for that date.
In the article I mentioned previously you can find a simple example of this behaviour; check chapter "1.3 partitionBy(colNames : String*) Example", shown below:
# Use repartition() and partitionBy() together
dfRepart.repartition(2) \
    .write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("c:/tmp/zipcodes-state-more")

Why does Spark crossJoin take so long for a tiny dataframe?

I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and it takes 30 seconds to complete. Any idea why this is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct() before the join; it requires a shuffle, so the data is repartitioned according to the spark.sql.shuffle.partitions property (200 by default). Thus, df.select('a').distinct() and df.select('b').distinct() each result in a new DataFrame with 200 partitions, and the cross join produces 200 x 200 = 40000 tasks.
Two things: it looks like you cannot directly control the number of partitions a DataFrame is created with, so we can first create an RDD instead (where you can specify the number of partitions) and convert it to a DataFrame. You can also set the shuffle partitions to 1. Both of these ensure you have just one partition during the whole execution and should speed things up.
Just note that this shouldn't be an issue at all for larger datasets, for which Spark is designed (for a dataset of this size it would be faster not to use Spark at all). So in the general case you won't really need to do this; instead, tune the number of partitions to your resources and data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = sc.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set only applies the configuration to the current Spark session; if you want more permanent changes, set them in the actual Spark conf file (spark-defaults.conf).

spark coalesce(20) overwrite parallelism of repartition(1000).groupby(xxx).apply(func)

Note: this is not a question asking about the difference between coalesce and repartition; there are many questions about that, and mine is different.
I have a PySpark job:
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())
df1.write.parquet(output_path, mode='overwrite')
The default of 200 partitions would require a lot of memory, so I changed the repartition to 1000.
The job detail on the Spark web UI looked like this:
As the output is only 44M, I tried to use coalesce to avoid too many small files slowing down HDFS.
All I did was add .coalesce(20) before .write.parquet(output_path, mode='overwrite'):
df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())  # 1000 here
df1.coalesce(20).write.parquet(output_path, mode='overwrite')
Then the Spark web UI showed:
It looks like only 20 tasks are running.
With repartition(1000), the parallelism was determined by my number of vcores, 36 here, and I could follow the progress intuitively (the progress bar size was 1000).
After coalesce(20), the earlier repartition(1000) stopped having any effect: parallelism dropped to 20, and I lost that visibility too.
Adding coalesce(20) also caused the whole job to get stuck and fail without notification.
Changing coalesce(20) to repartition(20) works, but according to the documentation coalesce(20) should be much more efficient and should not cause such a problem.
I want higher parallelism, and only the result coalesced to 20. What is the correct way to do this?
coalesce is considered a narrow transformation by the Spark optimizer, so it creates a single WholeStageCodegen stage from your groupby to the output, thus limiting your parallelism to 20.
repartition is a wide transformation (i.e. it forces a shuffle); when you use it instead of coalesce, it adds a new output stage but preserves the groupby/train parallelism.
repartition(20) is a very reasonable option in your use case (the shuffle is small, so the cost is pretty low).
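A sketch of that first option, reusing the names from the question:

# The groupby/apply stage keeps its 1000-partition parallelism; only the
# (small) result is shuffled down to 20 output files.
df1.repartition(20).write.parquet(output_path, mode='overwrite')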
Another option is to explicitly prevent the Spark optimizer from merging your predict and output stages, for example by using cache or persist before your coalesce:
# Your groupby code here
from pyspark.storagelevel import StorageLevel

df1.persist(StorageLevel.MEMORY_ONLY) \
    .coalesce(20) \
    .write.parquet(output_path, mode='overwrite')
Given your small output size, a MEMORY_ONLY persist + coalesce should be faster than a repartition, but this no longer holds as the output size grows.

Efficient pyspark join

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large.)
Consider using a very large cluster. (I'd rather not because of $$$.)
Use the same partitioner.
The last one is the one I'd rather try, but I can't find a way to do it in pyspark. I've tried:
df.repartition(numberOfPartitions, ['partition_col1', 'partition_col2'])
but it doesn't help; it still takes way too long until I stop it, because Spark gets stuck in the last few jobs.
So, how can I use the same partitioner in pyspark and speed up my joins, or even get rid of the shuffles that take forever? What code do I need to use?
PS: I've checked other articles, even on Stack Overflow, but I still can't see any code.
You can also use a two-pass approach, in case it suits your requirement. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
It was nicely explained by Sim; see the link below:
two pass approach to join big dataframes in pyspark
Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a Hive table.
Here is the code.
from pyspark.sql.functions import *
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number; this way you can redistribute the load across the Spark partitions, and records with the same keys will be grouped together and reside on the same partition.
You can then read and loop through each sub-partition, join both dataframes, and persist them together.
counter = 0
partition_count = 4
while counter <= partition_count:
    query1 = "SELECT * FROM UDB.temptable_1 where par_id={}".format(counter)
    query2 = "SELECT * FROM UDB.temptable_2 where par_id={}".format(counter)
    EMP_DF1 = spark.sql(query1)
    EMP_DF2 = spark.sql(query2)
    df1 = EMP_DF1.alias('df1')
    df2 = EMP_DF2.alias('df2')
    innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
    innerjoin_EMP.show()
    innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
    counter = counter + 1
I have tried this and it works fine. This is just an example to demonstrate the two-pass approach; your join conditions may vary, and the number of partitions depends on your data size.
Thank you @vikrantrana for your answer, I will try it if I ever need it. I say this because I found out the problem wasn't the 'big' joins themselves; the problem was the amount of calculation prior to the join. Imagine this scenario:
I read a table and store it in a dataframe called df1. I read another table and store it in df2. Then I perform a huge amount of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; the problem was that Spark's execution plan was huge and it couldn't keep all the intermediate tables in memory, so it started writing to disk and that took a very long time.
The solution that worked for me was to persist df1 and df2 to disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).
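A minimal sketch of that fix, assuming df1 and df2 are the results of heavy upstream transformations and 'key_col' is a placeholder join column:

from pyspark.storagelevel import StorageLevel

# Materialize both inputs on disk so the join starts from a short plan
# instead of re-evaluating the long chain of upstream calculations.
df1 = df1.persist(StorageLevel.DISK_ONLY)
df2 = df2.persist(StorageLevel.DISK_ONLY)
df1.count()  # actions force the materialization
df2.count()

result = df1.join(df2, 'key_col')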

spark cross join memory leak

I have two tables to be cross joined:
table 1: queries, 300M rows
table 2: product descriptions, 3000 rows
The following query does a cross join, calculates scores for each (query, product) pair, and picks the top 3 matches:
query_df.repartition(10000).registerTempTable('queries')
product_df.coalesce(1).registerTempTable('products')
CREATE TABLE matches AS
SELECT *
FROM
  (SELECT *,
          row_number() OVER (PARTITION BY a.query_id
                             ORDER BY 0.40 + 0.15*score_a + 0.20*score_b + 0.5*score_c DESC) AS rn
   FROM
     (SELECT /*+ MAPJOIN(b) */ a.query_id,
             b.product_id,
             func_a(a.qvec, b.pvec) AS score_a,
             func_b(a.qvec, b.pvec) AS score_b,
             func_c(a.qvec, b.pvec) AS score_c
      FROM queries a
      CROSS JOIN products b) a) a
WHERE rn <= 3
My Spark cluster configuration looks like the following:
MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 22 --executor-memory 30g --executor-cores 7 --driver-memory 10g --conf spark.yarn.executor.memoryOverhead=10000 --conf spark.akka.frameSize=2047
Now the issue is that, as expected, the job fails after a couple of stages due to memory issues caused by the extremely large temporary data produced. I'm looking for help/suggestions on optimizing the above operation so that the job runs both the match and the filter for one query_id before picking the next query_id, in a parallel fashion, similar to a sort within a for loop over the queries table. If the job is slow but successful, I'm OK with it, since I can request a bigger cluster.
The above query works fine for a smaller query table, say one with 10000 records.
In the scenario where you want to join table A (big) with table B (small), the best practice is to leverage broadcast join.
A clear overview is given in https://stackoverflow.com/a/39404486/1203837.
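For illustration, a minimal PySpark sketch of broadcasting the small side in this setup (query_df and product_df are the dataframes from the question; the scoring and ranking would still be applied on top of this):

from pyspark.sql.functions import broadcast

# products (~3000 rows) fits comfortably in each executor, so broadcasting it
# avoids shuffling the 300M-row queries table for the join.
pairs = query_df.crossJoin(broadcast(product_df))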
Hope this helps.
Cartesian (cross) joins in Spark are extremely expensive. I would suggest joining the tables with an inner join and saving the output first, then using that dataframe for further aggregation.
One small caveat: a map join or broadcast join can fail if the smaller table is not small enough. Unless you are sure about the size of the small table, refrain from using a broadcast join.
