How does pyspark perform union? - apache-spark

If I have a code like this:
def my_func(df):
df1 = trans1(df)
df2 = trans2(df1)
df3 = trans3(df1)
df4 = df2.unionAll(df3)
return df4
And I run a df.collect()on the result of the function while not having persisted anything. How many times will the operations in '''trans1''' be run? Once or twice? Thanks!

Coming into your question In your DF there is no Action .
So it will not perform anything .
But hypothetically I am taking an example you have performed cache
df.cache().storageLevel
Post that you have performed some count action .
Caching/persistence is lazy when used with Dataset API so you have to trigger the caching using count operator or similar that in turn submits a Spark job.
In your case even after union there is no action if you have used action to write into disk.
Yes. Only actions (like saving to an external storage) can trigger the persistence for future reuse.
you can check out Storage tab in web UI about it.

Twice
All transformations in spark all lazily evaluated which basically creates a lineage of instructions, and once an action is performed, it traverses the lineage from bottom to top until it finds the materialised data.
Since df1 is not materialised it will perform the same operation twice for two different linages. If df1 is persisted, first graph will perform full transformation whereas second will do only further computation by reusing it. You can see it in DAG and SQL plan tab.
Please note that it is not necessary to always cache the data. Sometimes caching can cost more than re-computation, therefore you should cache the if and only if the computation cost is higher.

Related

Spark reuse broadcast DF

I would like to reuse my DataFrame (without falling back to doing this using "Map" function in RDD/Dataset) which I marking as broadcast-eable, but seems Spark keeps broadcasting it again and again.
Having a table "bank" (test table). I perform the following:
val cachedDf = spark.sql("select * from bank").cache
cachedDf.count
val dfBroadcasted = broadcast(cachedDf)
val dfNormal = spark.sql("select * from bank")
dfNormal.join(dfBroadcasted, List("age"))
.join(dfBroadcasted, List("age")).count
I'm caching before just in case it made a difference, but its the same with or without.
If I execute the above code, I see the following SQL plan:
As you can see, my broadcasted DF gets broadcasted TWICE with also different timings (if I add more actions afterwards, they broadcast again too).
I care about this, because I actually have a long-running program which has a "big" DataFrame which I can use to filter out HUGE DataFrames, and I would like that "big" DataFrame to be reused.
Is there a way to force reusability? (not only inside the same action, but between actions, I could survive with the same action tho)
Thanks!,
Ok, updating the question.
Summarising:
INSIDE the same action, left_semis will reuse broadcasts
while normal/left joins won't. Not sure related with the fact that Spark/developers already know the columns of that DF won't affect the output at all so they can reuse it or it's just an optimization spark is missing.
My problem seems mostly-solved, although it would be great if someone knew how to keep the broadcast across actions.
If I use left_semi (which is the join i'm going to use in my real app), the broadcast is only performed once.
With:
dfNormalxx.join(dfBroadcasted, Seq("age"),"left_semi")
.join(dfBroadcasted, Seq("age"),"left_semi").count
The plan becomes (I also changed the size so it matches my real one, but this made no difference):
Also the wall total time is much better than when using "left_semi" (I set 1 executor so it doesn't get parallelized, just wanted to check if the job was really being done twice):
Even though my collect takes 10 seconds, this will speedup table reads+groupBys which are taking like 6-7minutes

Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join on some id by doing a left-outer join. Unfortunately, for some reason even after caching my actions after the join are being executed on all records even though I only want the ones from the left table. See below:
MY QUESTIONS:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
CODE:
combined = SMALL_TABLE.join(LARGE_TABLE SMALL_TABLE.id==LARGE_TABLE.id, 'left-outer')
print(combined.count())
...
...
# EXPENSIVE STUFF!
w = Window().partitionBy("id").orderBy(col("date_time"))
data = data.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
ADDITIONAL INFORMATION:
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesnt seem to be applicable ro your particular use case.
df.count has no effect on the execution plan at all. It is just expensive operation executed without any good reason.
Window function application in this requires the same logic as join. Because you join by id and partitionBy idboth stages will require the same hash partitioning and full data scan for both sides. There is no acceptable reason to separate these two.
In practice join logic should be applied before window, serving as a filter for the the downstream transformations in the same stage.

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory, And whenever that cached dataframe is called in the program it uses already computed dataframe from cache.
but that is not how my program is working.
I have a dataframe like below which I am caching it, and then immediately I run a df.count action.
val df = inputDataFrame.select().where().withColumn("newcol" , "").cache()
df.count
When I run the program. In Spark UI I see that first line runs for 4 min and
when it comes to second line it again runs 4 min basically first line is re computed twice?
Shouldn't first line computed and cached when second line triggers?
how to resolve this behavior. I am stuck, please advise.
My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory,
It is not correct. Simple cache and count (take wouldn't work on RDD either) is a valid method for RDDs but it is not the case with Datasets, which use much more advanced optimizations. With query:
df.select(...).where(...).withColumn("newcol" , "").count()
any column, which is not used in where clause can be ignored.
There is an important discussion on the developer list and quoting Sean Owen
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
df.registerAsTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager but it is no longer (Spark 2 and forward) documented and should be avoided.
No, if you call cache on a DataFrame it's not cached in this moment, it's only "marked" for potential future caching. The actual caching is only done when an action is followed later. You can also see your cached DataFrame in Spark UI under "Storage"
Another problem in your code is that count on DataFrame does not compute the entire DataFrame because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evualation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long, even if no action is called. I think this is related to the caching logic (e.g. size estimations etc) being computed when calling cache (see eg. Why is rdd.map(identity).cache slow when rdd items are big?)

How to Identify the list of available RDDs?

I am using the below command to get the list of available registered Temp tables
sqlContext.sql("show tables").collect().foreach(println)
Is there any similar command to get list of available RDDs?
Here is my requirement (using scala)
1. Need to create some RDD on the fly
2. Identify list of available RDDs
3. remove/delete/clear the unwanted RDDs and move forward
How to delete an RDD in PySpark for the purpose of releasing resources?
An additional note, I went through this link, but it doesn't answer all my questions... also i tried the below but don't find any difference before and after unpersist, so not sure how to confirm that my RDD has been released the memory
val tempRDD1 = RDD1.reduceByKey((acc,value)=> acc+value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
The RDD data is not saved until it is 1. persisted (cached) and 2. an action occurs to force the preceding transformations to occur. If either of these do not occur, no data will be stored. Any RDD that appears to be "created", will just create an action plan to produce the data if it is needed later. This model is called lazy evaluation.
In your example, no RDD is ever cached, so no data will ever be stored in memory. And the unpersist call will have no effect.

Persisting of DataFrames in transformation workflows

I am trying to run a series of transformation over 3 DataFrames. After each transformation, I want to persist DF and save it to text file. The steps I am doing is as follows.
Step0:
Create DF1
Create DF2
Create DF3
Create DF4
(no persist no save yet)
Step1:
Create RESULT-DF1 by joining DF1 and DF2
Persist it to disk and memory
Save it to text file
Step2:
Create RESULT-DF2 by joining RESULT-DF1 and DF3
Persist it to disk and memory
Save it to text file
Step3:
Create RESULT-DF3 by joining RESULT-DF2 and DF4
Persist it to disk and memory
Save it to text file
Observation:
Number of tasks created at Step1 is 601
Number of tasks created at Step2 is 1004 (Didn't skip anything)
Number of tasks created at Step3 is 1400 (Skipped 400 tasks)
As different approach, I broke above steps into three different runs. ie;
Start, Load DF1 and DF2, Do Step1, Save RESULT-DF1 & exit
Start, Load DF3, Load RESULT-DF1 from file, do Step2, save RESULT-DF2
& exit
Start, Load DF4, Load RESULT-DF2 from file, do Step3, save RESULT-DF3
& exit
Later approach runs faster.
My question is:
Am missing something on the persisting side in first approach?
Why Step2 run didn't just use result from Step1 without redoing all it's tasks even after persisting (with only 601 tasks instead of 1004)?
What are some good reads about best practices, when implementing such series of transformation workflows?
Since there is no code provided, I will assume that the join operations you are performing are different each time (on different attributes and data). Even if you have cached the data frames Spark needs to resolve each join into multiple stages and tasks. The Catalyst optimizer is responsible to create the Logical (initial and optimized) and Physical plan for your query. Given this sequence of execution each time a new plan must be computed based on your query and the respective dataset (the data frame could become smaller or bigger after each join).
Given that the tasks are increasing from data frame to data frame after each join, it is possible that your datasets are getting bigger and/or you perform the join operation on multiple attributes. However I cannot understand what you mean by exit in your second approach.
For further reading I would suggest the following:
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
In you case, please consider that how Spark SQL query optimizer works. One of the cases where Catalyst optimizer currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple work around for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration.
I encountered the exactly same issue as what you described above. And this workaround really helped.
~Erik

Resources