Sequential operations in pyspark dataframes - apache-spark

I have a question about the best way to deal with dataframe (df) transformations. Suppose I have a main df and I need to join it with 3 other dfs. Which of the options below gives the best performance: creating a new df for each step, or reassigning to the existing one?
1 - One dataframe for each step
df = spark.read.orc(file)
df2 = spark.read.orc(file2)
df3 = spark.read.orc(file3)
df4 = spark.read.orc(file4)
df5 = df.join(df2, df.col==df2.col, 'inner')
df6 = df5.join(df3, df5.col==df3.col, 'inner')
df7 = df6.join(df4, df6.col==df4.col, 'inner')
df7.write.orc(file)
2 - Reassign to existing one
df = spark.read.orc(file)
df2 = spark.read.orc(file2)
df3 = spark.read.orc(file3)
df4 = spark.read.orc(file4)
df = df.join(df2, df.col==df2.col, 'inner')
df = df.join(df3, df.col==df3.col, 'inner')
df = df.join(df4, df.col==df4.col, 'inner')
df.write.orc(file)

There are a couple of points to note down first:
DataFrame is immutable, any transformation on it will create a new DataFrame.
All transformations in Spark are lazy.
Assigning or reassigning the transformed DataFrame to a variable therefore makes no difference to the query execution plan.
So when you finally materialize the resulting DataFrame (.write in this case), Spark will create (and optimize) a single query plan containing all three joins.
So it doesn't really matter** how you go about writing the transformations.
**That said, there are a few things you can consider: if you know your data well and expect one inner join to drop the number of records significantly, you may want to perform that join before the other inner joins.
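If you want to convince yourself of this, you can build the result both ways and compare the plans with explain(); a minimal, self-contained sketch (with made-up data standing in for the ORC files):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-ins for the ORC inputs, just to compare query plans.
a = spark.createDataFrame([(1, 'x'), (2, 'y')], ['k', 'va'])
b = spark.createDataFrame([(1, 'p'), (2, 'q')], ['k', 'vb'])
c = spark.createDataFrame([(1, 10), (2, 20)], ['k', 'vc'])

# Style 1: a new variable per step
r1 = a.join(b, 'k', 'inner')
r2 = r1.join(c, 'k', 'inner')

# Style 2: reassign the same variable
r = a
r = r.join(b, 'k', 'inner')
r = r.join(c, 'k', 'inner')

# Both should print the same optimized physical plan: the Python variable
# names are only labels and never appear in the plan.
r2.explain()
r.explain()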

The second one. Spark builds its DAG lazily, so it sees the same chain of joins either way.
Even better, with the second approach, instead of reassigning df multiple times you can chain the joins:
df = (df.join(df2, df.col==df2.col, 'inner')
        .join(df3, df.col==df3.col, 'inner')
        .join(df4, df.col==df4.col, 'inner'))

Related

How to efficiently join 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 in efficiently joining 2 Spark dataframes when:
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in an efficient plan: it recognizes that the 2 dataframes are already partitioned by the join key and avoids re-shuffling them. The 2 dataframes are not only repartitioned, they are also colocated.
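The plan screenshots are not reproduced here, but the same information can be read off explain(): in this single-key case there should be no Exchange beyond the explicit repartition(50, 'id').
# Inspect the physical plan of the single-key join. With both sides already
# hash-partitioned by 'id', the sort-merge join should not introduce an
# additional shuffle.
joined_df.explain()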
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This, however, results in the same inefficient plan, since the Spark query optimizer simply merges the filter back into the join.
So I finally thought that maybe I could force Spark to process the join the way I want by adding a dummy cache step, trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.
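One more experiment worth mentioning for the last question (a sketch only; whether it removes the extra shuffle depends on the Spark version and on how partitioning is propagated through the aliasing select) is to repartition both sides by the full set of join keys up front:
# Hypothetical variant: pre-partition by all join keys rather than by 'id'
# alone, so the distribution required by the two-key join may already be
# satisfied. Compare the plans with explain() to see whether the extra
# shuffle disappears on your Spark version.
df2b = df.repartition(50, 'id', 'order')
df3b = df2b.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
joined_df = df2b.join(df3b, on=[df2b.id == df3b.id, df2b.order == df3b.order_alias])
joined_df.explain()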

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
// Then some more operations on subdf1 and subdf2 which won't trigger Spark evaluation yet
// Write out subdf1
// Write out subdf2
Suppose I start off with the main dataframe df1 (which I lazily read from the CSV), do some operations on it (filter, groupBy, join), and then come to a point where I split this dataframe based on a condition (e.g. id > 0 and id < 0). I then proceed to operate on these sub-dataframes (let's name them subdf1 and subdf2) and ultimately write both of them out.
Notice that write is the only command that triggers Spark evaluation; the rest of the functions (filter, groupBy, join) are evaluated lazily.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated, starting from reading the CSV to create df1.
My question is about what happens when we start writing out subdf2. Does Spark understand the divergence in the code at df4 and store that dataframe when the command for writing out subdf1 is encountered? Or will it again start from the first line, creating df1 and re-evaluating all the intermediate dataframes?
If so, is it a good idea to cache the dataframe df4 (assuming I have sufficient memory)?
I'm using Scala Spark, if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.
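The question uses Scala, but the pattern is the same in PySpark; a minimal sketch of the cache / write / unpersist flow described above (the data, paths and output format are made up for illustration):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for df4: some dataframe produced by a chain of lazy transformations.
df4 = spark.range(100).withColumn('id', F.col('id') - 50)

subdf1 = df4.filter(F.col('id') < 0)
subdf2 = df4.filter(F.col('id') > 0)

# Without caching, both plans start from the original source:
subdf1.explain()
subdf2.explain()

# Cache the shared ancestor so the second write reuses the first evaluation.
df4.cache()
subdf1.write.mode('overwrite').parquet('/tmp/subdf1')
subdf2.write.mode('overwrite').parquet('/tmp/subdf2')

# Release the cached data once both branches have been written out.
df4.unpersist()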

How to process pyspark dataframe as group by column value

I have a huge dataframe of different item_ids and their related data. I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems it is still being processed as a whole rather than in per-item chunks.
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
result = data.repartition('ITEM_ID') \
.rdd \
.mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns))\
.mapPartitions(scan_item_best_model)\
.collect()
Also, is repartition the correct approach, or am I doing something wrong?
After looking around I found this, which addresses a similar problem; in the end I had to solve it like this:
import pandas as pd
import pyspark.sql.functions as F
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))
df = df.rdd.map(lambda big_df: (big_df['ITEM_ID'], pd.DataFrame.from_records(big_df['data'], columns=columns))) \
    .map(scan_item_best_model)
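For what it's worth, on Spark 3.x the same per-item processing can also be expressed with groupBy().applyInPandas, assuming scan_item_best_model can be adapted to take and return a pandas DataFrame with a declared output schema (a sketch, not a drop-in replacement; the schema below is made up):
import pandas as pd

def scan_item_best_model_pd(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: process one ITEM_ID group as a pandas DataFrame and return
    # a pandas DataFrame matching the schema declared below.
    return pd.DataFrame({'ITEM_ID': [pdf['ITEM_ID'].iloc[0]],
                         'n_rows': [len(pdf)]})

result = data.groupBy('ITEM_ID').applyInPandas(
    scan_item_best_model_pd, schema='ITEM_ID string, n_rows long')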

spark join raises “Detected implicit cartesian product for INNER join between logical plans”

I have a situation where I have a dataframe df
and let's say I do the following steps:
df1 = df
df2 = df
and then write a query which joins df1 and df2, e.g.
df3 = df1.join(df2, df1["column"] == df2["column"])
This is nothing but a self-join, which is widely needed in ETL. Why does Spark not handle it correctly?
I have seen many posts but none of them provide a workaround.
Update:
If I load the dataframes df1 and df2 from the same S3 location and then perform the join, the issue goes away. But when you are doing ETL it may not always be possible to persist the data and re-read it just to avoid this scenario.
Any thoughts?
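One workaround that is often suggested for this kind of trivially-true self-join condition (a self-contained sketch with made-up data; 'column' mirrors the name in the question) is to alias the two sides and refer to the aliased columns explicitly, so the condition no longer resolves to the very same column on both sides:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['column', 'value'])

# Alias both sides so the join condition compares two distinct references
# instead of the same attribute (which Spark would fold to `true` and then
# flag as an implicit cartesian product).
df3 = (df.alias('l')
         .join(df.alias('r'), col('l.column') == col('r.column'), 'inner'))
df3.show()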

Comparing two rows at a time in PySpark

I am new to Spark and am looking for help with best practices. I have a large DataFrame, and need to feed two rows at a time into a function which compares them.
actual_data is a DataFrame with an id column, and several value columns.
rows_to_compare is a DataFrame with two columns: left_id and right_id.
For each pair in rows_to_compare, I'd like to feed the two corresponding rows from actual_data into a function.
My actual data is quite large (~30GB) and has many columns, so I've reduced it to this simpler example:
import pandas as pd
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
import builtins
sqlContext = SQLContext(sc)
# Build DataFrame of Actual Data
data = {
    'id': [1,2,3,4,5],
    'value': [11,12,13,14,15]}
actual_data_df = sqlContext.createDataFrame(
    pd.DataFrame(data, columns=data.keys()))
# Build DataFrame of Rows To Compare
rows_to_compare = {
    'left_id': [1,2,3,4,5],
    'right_id': [1,1,1,1,1]}
rows_to_compare_df = sqlContext.createDataFrame(
    pd.DataFrame(rows_to_compare, columns=rows_to_compare.keys()))
result = (
    rows_to_compare_df
    .join(
        actual_data_df.alias('a'),
        col('left_id') == col('a.id'))
    .join(
        actual_data_df.alias('b'),
        col('right_id') == col('b.id'))
    .withColumn(
        'total',
        builtins.sum(
            [col('a.value'),
             col('b.value')]))
    .select('a.id', 'b.id', 'total')
    .collect())
This returns the desired output:
[Row(id=2, id=1, total=23), Row(id=5, id=1, total=26), Row(id=4, id=1, total=25), Row(id=1, id=1, total=22), Row(id=3, id=1, total=24)]
When I run this, it seems quite slow, even for this toy problem. Is this the best way of approaching this problem? The clearest alternative approach I can think of is to make each row of my DataFrame contain the values for both rows I'd like to compare. I'm concerned about this approach though since it will involve a tremendous amount of data duplication.
Any help is much appreciated, thank you.
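As a small aside on the snippet above, the 'total' column can also be written as a plain Column expression instead of builtins.sum over a list; this is only a stylistic variant and should produce the same result:
from pyspark.sql.functions import col

# Equivalent to builtins.sum([col('a.value'), col('b.value')]):
# Column arithmetic builds the a.value + b.value expression directly.
total_expr = (col('a.value') + col('b.value')).alias('total')
# Used in the pipeline above as:
# .withColumn('total', col('a.value') + col('b.value'))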
