Persisting of DataFrames in transformation workflows - apache-spark

I am trying to run a series of transformation over 3 DataFrames. After each transformation, I want to persist DF and save it to text file. The steps I am doing is as follows.
Step0:
Create DF1
Create DF2
Create DF3
Create DF4
(no persist no save yet)
Step1:
Create RESULT-DF1 by joining DF1 and DF2
Persist it to disk and memory
Save it to text file
Step2:
Create RESULT-DF2 by joining RESULT-DF1 and DF3
Persist it to disk and memory
Save it to text file
Step3:
Create RESULT-DF3 by joining RESULT-DF2 and DF4
Persist it to disk and memory
Save it to text file
Observation:
Number of tasks created at Step1 is 601
Number of tasks created at Step2 is 1004 (Didn't skip anything)
Number of tasks created at Step3 is 1400 (Skipped 400 tasks)
As different approach, I broke above steps into three different runs. ie;
Start, Load DF1 and DF2, Do Step1, Save RESULT-DF1 & exit
Start, Load DF3, Load RESULT-DF1 from file, do Step2, save RESULT-DF2
& exit
Start, Load DF4, Load RESULT-DF2 from file, do Step3, save RESULT-DF3
& exit
Later approach runs faster.
My question is:
Am missing something on the persisting side in first approach?
Why Step2 run didn't just use result from Step1 without redoing all it's tasks even after persisting (with only 601 tasks instead of 1004)?
What are some good reads about best practices, when implementing such series of transformation workflows?

Since there is no code provided, I will assume that the join operations you are performing are different each time (on different attributes and data). Even if you have cached the data frames Spark needs to resolve each join into multiple stages and tasks. The Catalyst optimizer is responsible to create the Logical (initial and optimized) and Physical plan for your query. Given this sequence of execution each time a new plan must be computed based on your query and the respective dataset (the data frame could become smaller or bigger after each join).
Given that the tasks are increasing from data frame to data frame after each join, it is possible that your datasets are getting bigger and/or you perform the join operation on multiple attributes. However I cannot understand what you mean by exit in your second approach.
For further reading I would suggest the following:
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

In you case, please consider that how Spark SQL query optimizer works. One of the cases where Catalyst optimizer currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple work around for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration.
I encountered the exactly same issue as what you described above. And this workaround really helped.
~Erik

Related

How does pyspark perform union?

If I have a code like this:
def my_func(df):
df1 = trans1(df)
df2 = trans2(df1)
df3 = trans3(df1)
df4 = df2.unionAll(df3)
return df4
And I run a df.collect()on the result of the function while not having persisted anything. How many times will the operations in '''trans1''' be run? Once or twice? Thanks!
Coming into your question In your DF there is no Action .
So it will not perform anything .
But hypothetically I am taking an example you have performed cache
df.cache().storageLevel
Post that you have performed some count action .
Caching/persistence is lazy when used with Dataset API so you have to trigger the caching using count operator or similar that in turn submits a Spark job.
In your case even after union there is no action if you have used action to write into disk.
Yes. Only actions (like saving to an external storage) can trigger the persistence for future reuse.
you can check out Storage tab in web UI about it.
Twice
All transformations in spark all lazily evaluated which basically creates a lineage of instructions, and once an action is performed, it traverses the lineage from bottom to top until it finds the materialised data.
Since df1 is not materialised it will perform the same operation twice for two different linages. If df1 is persisted, first graph will perform full transformation whereas second will do only further computation by reusing it. You can see it in DAG and SQL plan tab.
Please note that it is not necessary to always cache the data. Sometimes caching can cost more than re-computation, therefore you should cache the if and only if the computation cost is higher.

Spark Dataset join performance

I receive a Dataset and I am required to join it with another Table. Hence the most simple solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join with out referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take etc operations try to apply them before joining the two datasets. Spark can't push down these kind of filters therefore you have to do by your own reducing as much as possible the amount of target records. Here an excellent source of information over the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using repartition function. The repartition should be based on the keys that participate in join i.e:
dogs.repartition(1024, "key_col1", "key_col2")
dogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it can fit in memory (or increase the value of spark.broadcast.blockSize). This consists a certain boost for the performance of your Spark program since it will ensure the co-existense of two datasets within the same node.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
You need to do an explain and see if predicate push down is used. Then you can judge your concern to be correct or not.
However, in general now, if no complex datatypes are used and/or datatype mismatches are not evident, then push down takes place. You can see that with simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Why is execution time of spark sql query different between first time and second time of execution?

I am using spark sql to run some aggregated query on the parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the spark command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time I ran a similar command (only change columns):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(cou
nter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36 )+sum(counter_41)+sum(counter_46) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: Why are they different? I know it is not data caching because of the parquet format. Is it about reusing something like query planning?
I did another test: The first command is
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I change the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(cou
nter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35 )+avg(counter_40)+avg(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: Why is the second command is faster than the first command but the execution time difference is slightly smaller than the first scenario?
Finally, I have a problem related to above scenarios: The are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(cou
nter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35 )+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX,formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value 2...) group by location
My third question is: Is there anyway to optimize the performance (the query used once should be faster if it is used again in the future)? Does spark optimize itself or do I have to write some code, change config?
It's called Exchange Reuse. When Spark runs shuffling (i.e. aggregation, join) it stores a copy of the shuffle data on local worker nodes for potential reuse. This is an internally controlled behavior and cannot be directly influenced by end user. If you find you're keep re-using a particular portion of data (or query outcome), you could consider explicitly CACHING it by using the cache(). However, bear in mind although this allows Spark to reuse cached result for potentially faster query performance (if, and only if the Analyzer Plan of your cached query matches your new query), over using CACHE can cause whole lot of different performance problems.
A bad example is when your dataset is very large, it may cause Disk Spill problem. That is, the dataset doesn't fit into your cluster's available memory and needs to be written to slower hard disks.
Another bad example is when your query only needs to access a subset of the cached data. By caching the entire dataset in memory, Spark is forced to perform full in-memory table scan. Not only that's waste of resource but also results in a slower query performance as oppose to not using cache at all.
The best thing to do is try & error with a few of your own example queries, look at the Spark UI and check if there is sign of Disk Spill or large amount of input data scan.
Every query/data combination is unique hence you'll need to experiment it a bit to find the best performance tuning method for your own workload.
When doing an aggregate spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files which are stored locally on the workers fs. Unfortunately you can't rely on them to always be there because eventually the file handler gets gc'd. If your going to run 10 queries on the same dataset, cache it or use databricks.

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory, And whenever that cached dataframe is called in the program it uses already computed dataframe from cache.
but that is not how my program is working.
I have a dataframe like below which I am caching it, and then immediately I run a df.count action.
val df = inputDataFrame.select().where().withColumn("newcol" , "").cache()
df.count
When I run the program. In Spark UI I see that first line runs for 4 min and
when it comes to second line it again runs 4 min basically first line is re computed twice?
Shouldn't first line computed and cached when second line triggers?
how to resolve this behavior. I am stuck, please advise.
My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory,
It is not correct. Simple cache and count (take wouldn't work on RDD either) is a valid method for RDDs but it is not the case with Datasets, which use much more advanced optimizations. With query:
df.select(...).where(...).withColumn("newcol" , "").count()
any column, which is not used in where clause can be ignored.
There is an important discussion on the developer list and quoting Sean Owen
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
df.registerAsTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager but it is no longer (Spark 2 and forward) documented and should be avoided.
No, if you call cache on a DataFrame it's not cached in this moment, it's only "marked" for potential future caching. The actual caching is only done when an action is followed later. You can also see your cached DataFrame in Spark UI under "Storage"
Another problem in your code is that count on DataFrame does not compute the entire DataFrame because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evualation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long, even if no action is called. I think this is related to the caching logic (e.g. size estimations etc) being computed when calling cache (see eg. Why is rdd.map(identity).cache slow when rdd items are big?)

Spark SQL - READ and WRITE in sequence or pipeline?

I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour I cannot understand if READ and WRITE are carried out in pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = ‘xyz’;
Each task:
Reads a data block (either locally or from a remote node)
Filter out the tuples that do not satisfy the predicate
Write to the disk the remaining tuples
Are (1), (2) and (3) carried out in sequence or in pipeline? In other words, the data block is completely read (all the disk pages composing it) first and then it is filtered and then it is rewritten to the disk or are these activities carried out in pipeline? (i.e. while reading the (n+1)-tuple, n-tuple can be processed and written).
Thanks in advance.
Whenever you submit a job, first thing spark does is create DAG (Directed acyclic graph) for your job.
After creating DAG, spark knows, which tasks it can run in parallel, which task are dependent on output of previous step and so on.
So, in your case,
Spark will read your data in parallel (which you can see in partition), filter them out (in each partition).
Now, since saving required filtering, so it will wait for filtering to finish for at least one partition, then start to save it.
After some more digging I found out that Spark SQL uses a so called "volcano style pull model".
According to such model, a simple scan-filter-write query whould be executed in pipeline and are fully distributed.
In other words, while reading the partition (HDFS block), filtering can be executed on read rows. No need to read the whole block to kick off the filtering. Writing is performed accordingly.

Resources