How to avoid excessive dataframe query - apache-spark

Consider there is spark job has multiple dataframe transitions
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'")
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'")
val df3 = basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2"))
val df4 = df3.withcolumn("col3").withColumnRename("col4", "newcol4")
val df5 = df4.groupBy("groupbycol").agg(expr("coalesce(first(col5, false))"))
val df6 = df5.withColumn("level1", col("coalesce(first(col5, false))")(0))
.withColumn("level2", col("coalesce(first(col5, false))")(1))
.withColumn("level3", col("coalesce(first(col5, false))")(2))
.withColumn("level4", col("coalesce(first(col5, false))")(3))
.withColumn("level5", col("coalesce(first(col5, false))")(4))
.drop("coalesce(first(col5, false))")
I just wondering how Spark generate the spark SQL logic, is it going to generate the query-like transaction for each data frame, i.e
df1 = select * ....
df2 = select * ....
df3 = df1.join.df2 // spark takes content from df1/df2 instead run each query again for joining
....
df6 = ...
or generate large query by the end of the last dataframe
df6 = select coalesce(first(col5, false)).. from ((select * from table1) join (select * from table2 ) on blah ) group by blah 2...
All I trying to figure out, is how to avoid Spark generate huge query-like logic instead I can let Spark "Commit" somewhere to avoid huge long transaction
the reason behind the inquiry is because current spark job threw following exception
19/12/17 10:57:55 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 567, Column 28: Redefinition of parameter "agg_expr_21"

Spark has two operations - transformation and action.
Transformation happens when a DF is being built using various operations like - select, join, filter etc. It is read to be executed but has not done any work yet, it is being lazy. These transformations can be composed to make new transformation which you do while operating on predefined dataframes, like basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2")). But again nothing has run.
Action happens when certain operations are called like save, collect, show etc. This is when real work happens. Here each and every 'transformation' that was defined before with be either executed or retrieved from cache. You can save a lot of work for Spark if you can cache some of the complex steps. This can also simplify the plan.
val baseDF1 = spark.sql(s"select * from db.table1 where condition1='blah'")
val baseDF2 = spark.sql(s"select * from db.table2 where condition2='blah'")
baseDF1.cache()
baseDF2.cache()
val df3 = basedDF1.join(baseDF12, basedDF1("col1") <=> basedDF1("col2"))
val df4 = baseDF1.join(baseDF12, basedDF1("col2") === basedDF1("col3"))// different join
When df4 is executed after df3, it won't be selecting from db.table1 and db.table2 but rather reading baseDF1 and baseDF2 from cache. The plan will look simpler too.
if some reason cache is gone then Spark will recompute baseDF1 and baseDF2 as they were defined, so it knows its lineage but didn't execute it.
You can also use checkpoint to break up the lineage of overall execution, hence simplify it. I think this can help your case.
I have also saved intermediate dataframe to a temporary file and read It back as a dataframe and use it down the line. This breaks up the complexity at the cost of extra io. I won’t recommend it unless other methods didn’t work.
I am not sure about the error you are getting.

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 to efficiently join 2 Spark dataframes when
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in the following efficient plan:
This plan is efficient: it recognizes that the 2 dataframes are already partitioned by the join key and avoids to re-shuffle them. The 2 dataframes are not only repartitioned, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df.foreach(lambda x: None)
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
This however results in the same inefficient plan, since Spark query optimizer will just merge the 2nd filter with the join.
So, I finally thought that maybe I could force Spark to process the join as I want by adding a dummy cache step, by trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Is spark.read or spark.sql lazy transformations?

In Spark if the source data has changed in between two action calls why I still get previous o/p not the most recent ones. Through DAG all operations will get executed including read operation once action is called. Isn't it?
e.g.
df = spark.sql("select * from dummy.table1")
#Reading from spark table which has two records into dataframe.
df.count()
#Gives count as 2 records
Now, a record inserted into table and action is called withou re-running command1 .
df.count()
#Still gives count as 2 records.
I was expecting Spark will execute read operation again and fetch total 3 records into dataframe.
Where my understanding is wrong ?
To contrast your assertion, this below does give a difference - using Databricks Notebook (cells). The insert operation is not known that you indicate.
But the following using parquet or csv based Spark - thus not Hive table, does force a difference in results as the files making up the table change. For a DAG re-compute, the same set of files are used afaik, though.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
df2.count()
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//2nd time in another cell
df2.count()
//8 is returned
Refutes your assertion. Also tried with .enableHiveSupport(), no difference.
Even if creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
...
df.count()
...
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
df.count()
...
Still get updated counts.
However, the for a Hive created ORC Serde table, the following "hive" approach or using an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
dfX.write.format("hive").mode("append").saveAsTable("tab5")
or
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show and sometimes not show an updated count when just the 2nd df.count() is issued. This is due to Hive / Spark lack of synchronization that may depend on some internal flagging of changes. In any event not consistent. Double-checked.
This is most related to inmutability as I see it. DataFrames are inmutables, hence changes in the original table are not reflected on them.
Once a dataframe is evaluated, it will be never calculated again. So once the dataframe named df is evaluated, it is the picture of table1 at the time of evaluation, it doesn't matter if table1 changes, df won't. So the second df.count does not trigger evaluation it just return the previous result, which is 2
If you want the desired results you have to load again the DF in a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or using var instead of val (which is bad)
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
This said: yes, spark read and spark sql are lazy, those are not called until an action is found, but once that happens, evaluation won't be trigger ever again in that dataframe

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
* Then some more operations on subdf1 and subdf2 which won't trigger spark evaluation yet*
// Write out subdf1
// Write out subdf2
Suppose I start of with main dataframe df1(which I lazy read from the CSV), do some operations on this dataframe (filter, groupby, join) and then comes a point where I split this datframe based on a condition (for eg, id > 0 and id < 0). Then I further proceed to operate on these sub dataframes(let us name these subdf1, subdf2) and ultimately write out both the sub dataframes.
Notice that the write function is the only command that triggers the spark evaluation and rest of the functions(filter, groupby, join) result in lazy evaluations.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated starting from reading of CSV to create df1.
My question comes when we start writing out subdf2. Does spark understand the divergence in code at df4 and store this dataframe when command for writing out subdf1 was encountered? Or will it again start from the first line of creating df1 and re-evaluate all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4(Assuming I have sufficient memory)?
I'm using scala spark if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.

How to prevent predicate pushdown?

Recently I was working with Spark with JDBC data source. Consider following snippet:
val df = spark.read.(options).format("jdbc").load();
val newDF = df.where(PRED)
PRED is a list of predicates.
If PRED is a simple predicate, like x = 10, query will be much faster. However, if there are some non-equi conditions like date > someOtherDate or date < someOtherDate2, query is much slower than without predicate pushdown. As you may know, DB engines scans of such predicates are very slow, in my case with even 10 times slower (!).
To prevent unnecessary predicate pushdown I used:
val cachedDF = df.cache()
val newDF = cachedDF.where(PRED)
But it requires a lot of memory and - due to problem mentioned here - Spark' Dataset unpersist behaviour - I can't unpersist cachedDF.
Is there any other option to avoid pushing down predicates? Without caching and without writing own data source?
Note: Even if there is an option to turn off predicate pushdown, it's applicable only is other query may still use it. So, if I wrote:
// some fancy option set to not push down predicates
val df1 = ...
// predicate pushdown works again
val df2 = ...
df1.join(df2)// where df1 without predicate pushdown, but df2 with
A JIRA ticket has been opened for this issue. You can follow it here :
https://issues.apache.org/jira/browse/SPARK-24288

Will a persisted-dataframe be calculated repeatedly many times?

I've got the following structured query:
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
val B = A.filter('condition 1')
val C = A.filter('condition 2')
val D = A.filter('condition 3')
val E = A.filter('condition 4')
val F = A.filter('condition 5')
val G = A.filter('condition 6')
val H = A.filter('condition 7')
val I = B.union(C).union(D).union(E).union(F).union(G).union(H)
I persist the dataframe A, so that when I use B/C/D/E/F/G/H, the A dataframe should be calculated only once? But the DAG of this job is below:
From the DAG above, it seems that stage 6-12 are all executed and the dataframe A is calculated 7 times?
Why would this happen?
Maybe the DAG is just fake? I found that there are not lines on the top of stage 7-12 where stage 6 does have two lines from other stage
I didn't list all the operations. After union operation, I save the I dataframe to HDFS. Will this action on the I dataframe make the persist operation be done really? Or must I do an action operation such as count on the A dataframe to trigger the persist operation before reuse A dataframe?
Doing the following line won't persist your dataset.
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
Caching/persistence is lazy when used with Dataset API so you have to trigger the caching using count operator or similar that in turn submits a Spark job.
After that all the following operators, filter including, should use InMemoryTableScan with the green dot in the plan (as shown below).
In your case even after union the dataset I is not cached since you have not triggered the caching (but merely marked it for caching).
After union operation, I save the I dataframe to HDFS. Will this action on the I dataframe make the persist operation be done really?
Yes. Only actions (like saving to an external storage) can trigger the persistence for future reuse.
Or must I do an action operation such as count on the A dataframe to trigger the persist operation before reuse A dataframe?
That's the point! In your case, since you want to reuse A dataframe across filter operators you should persist it first, count (to trigger the caching) followed by filters.
In your case, no filter will benefit from any performance increase due to persist. That persist is practically void of any impact on the performance and just makes a code reviewer think it's otherwise.
If you want to see when and if your dataset is cached, you can check out Storage tab in web UI or ask CacheManager about it.
val nums = spark.range(5).cache
nums.count
scala> spark.sharedState.cacheManager.lookupCachedData(nums)
res0: Option[org.apache.spark.sql.execution.CachedData] =
Some(CachedData(Range (0, 5, step=1, splits=Some(8))
,InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Range (0, 5, step=1, splits=8)
))

Resources