I've recently started working in a code base with some legacy spark jobs. The general workflow looks something like this
val intermediate = df.sql('select ... from billions b join millions m on b.id1 = m.id1 or b.id2 = m.id2')
intermediate.persist()
intermediate.count()
val final = intermediate.sql('select ... from intermediate i join dim d on i.id = d.id')
final.persist()
final.count()
My initial impression was that the persist/count was a code smell, as it forced the realization of the dataset. However, when I remove them (individually or as a pair), I start seeing OOM errors and the job fails to complete.
I know the OR in the join condition could be problematic, as the data will likely need to be reshuffled.
More broadly though, in Spark jobs, is it common to force the realization of a dataset? In what cases might I want to do that? Are there common problems (in data or the spark code) that a novice spark developer might mask with data realization?
Related
I am developing a Spark SQL analytics solutions using set of tables. Suppose there are 5 tables which i need to building my solution and finally i am creating one output table.
Here is my flow
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = datamframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table 5
// finally
dataframe4.saveAsTable
When I save final dataframe that's when all the above dataframe is evaluated.
Is my approach is good? or
Do i need to cache/persist intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of tables you would want to do broadcast hint for any of tables that are relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
This behaviour depends on the value in:
Now broadcast hint will be honoured only if Spark is able to evaluate the value of the table so you might need to cache().
Another option is via Spark checkpoints that can help to truncate local plan for optimisation (also this allows you to resume jobs from checkpoint location, it is similar to writing to HDFS but with some overhead).
In case of broadcasting few houndres of Mb tables, you might need to increase your kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends which join types you will use.
You would probably want to do filter and aggregagtion as early as possible since it will reduce the join surface.
There are many other considerations to be consider in order to properly optimise this. In case of power law distribution of join keys in any of the joins you would need to do salting and explode smaller table.
In your case, in principle, there is not really a cache or persist required Why?
As there are no reuse paths evident (for other Actions or other Transformations within the same Action), it is all sequential.
Also, lazy evaluation and Catalyst.
Try the .explain and see how Spark will process.
However, due to memory eviction possibilities on the Cluster, there may be the need to re-compute on a Worker. There are various settings that you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance. So use .explain. See here an excellent posting: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different but yours seems Ok to answer as I have. In summary: An RDD or DF that is not cached, nor check-pointed, is re-evaluated again each time an Action is invoked on that RDD or DF or if re-accessed within the current Action and no skipped stage situation applies. In your case no issue. Doing otherwise would slow your App down in fact.
Suppose we start from some data and gets some intermediate result df_intermediate. Along the pipeline from source data to df_intermediate, all transformations are lazy and nothing is actually computed.
Then I would like to perform two different transformations to df_intermediate. For example, I would like to calculate df_intermediate.agg({"col":"max"}) and df_intermediate.approxquantile("col", [0.1,0.2,0.3], 0.01) using two separate commands.
I wonder in the following scenario, does spark need to recompute df_intermediate when it is performing the second transformation? In other words, does Spark perform the calculation for the above two transformations both start from the raw data without storing the intermediate result? Obviously I can cache the intermediate result but I'm just wondering if Spark does this kind of optimization internallly.
It is somewhat disappointing. But firstly you need to see it in terms of Actions. I will not consider the caching.
If you do the following there will be optimization for sure.
val df1 = df0.withColumn(...
val df2 = df1.withColumn(...
Your example needs an Action like count to work. But the two statements are too diverse, so that there is no skipped processing evident. There is thus no sharing.
In general the Action = Job is correct way to look at it. For DFs Catalyst Optimizer can kick a Job off even though you may not realize this. For RRDs (legacy) this was a little different.
This does not get optimized either:
import org.apache.spark.sql.functions._
val df = spark.range(1,10000).toDF("c1")
val df_intermediate = df.withColumn("c2", col("c1") + 100)
val x = df_intermediate.agg(max("c2"))
val y = df_intermediate.agg(min("c2"))
val z = x.union(y).count
x and y both go back to source. One would have thought that would be easier to do and it is also 1 Action here. Need to do the .explain, but the idea is to leave it to Spark due to lazy evaluation, etc.
As an aside: Is it efficient to cache a dataframe for a single Action Spark application in which that dataframe is referenced more than once? & In which situations are the stages of DAG skipped?
I'm aware that spark is designed for large datasets for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions spark performs relatively bad compared implementation in pure scala/python/matlab/R etc.
Note that I don't want to drop spark entirely, I want to keep the framework for larger workloads without re-implementing everything.
How can I disable sparks overhead as much as possible on small datasets (say 10-1000s of records)? I'm tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1)? Even which these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given SparkSession exists.
operations in spark have same signature as the scala collections.
You could implement something like:
val useSpark = false
val rdd: RDD[String]
val list: List[String] = Nil
def mapping: String => Int = s => s.length
if (useSpark) {
rdd.map(mapping)
} else {
list.map(mapping)
}
I think this code could be abstracted even more.
We usually use Spark as processing engines for data stored on S3 or HDFS. We use Databricks and EMR platforms.
One of the issues I frequently face is when the task size grows, the job performance is degraded severely. For example, let's say I read data from five tables with different levels of transformation like (filtering, exploding, joins, etc), union subset of data from these transformations, then do further processing (ex. remove some rows based on a criteria that requires windowing functions etc) and then some other processing stages and finally save the final output to a destination s3 path. If we run this job without it takes very long time. However, if we save(stage) temporary intermediate dataframes to S3 and use this saved (on S3) dataframe for the next steps of queries, the job finishes faster. Does anyone have similar experience? Is there a better way to handle this kind of long tasks lineages other than checkpointing?
What is even more strange is for longer lineages spark throws an expected error like column not found, while the same code works if intermediate results are temporarily staged.
Writing the intermediate data by saving the dataframe, or using a checkpoint is the only way to fix it. You're probably running into an issue where the optimizer is taking a really long time to generate the plan. The quickest/most efficient way to fix this is to use localCheckpoint. This materializes a checkpoint locally.
val df = df.localCheckpoint()
My Spark program has got several table joins(using SPARKSQL) and I would like to collect the time taken to process each of those joins and save to a statistics table. The purpose is to run it continuously over a period of time and gather the performance at very granular level.
e.g
val DF1= spark.sql("select x,y from A,B ")
Val DF2 =spark.sql("select k,v from TABLE1,TABLE2 ")
finally I join DF1 and DF2 and then initiate an action like saveAsTable .
What I am looking for is to figure out
1.How much time it really took to compute DF1
2.How much time to compute DF2 and
3.How much time to persist those final Joins to Hive / HDFS
and put all these info to a RUN-STATISTICS table / file.
Any help is appreciated and thanks in advance
Spark uses Lazy Evaluation, allowing the engine to optimize RDD transformations at a very granular level.
When you execute
val DF1= spark.sql("select x,y from A,B ")
nothing happens except the transformation is added to the Directed Acyclic Graph.
Only when you perform an Action, such as DF1.count, the driver is forced to execute a physical execution plan. This is deferred as far down the chain of RDD transformations as possible.
Therefore it is not correct to ask
1.How much time it really took to compute DF1
2.How much time to compute DF2 and
at least based on the code examples you provided. Your code did not "compute" val DF1. We may not know how long processing just DF1 took, unless you somehow tricked the compiler into processing each dataframe separately.
A better way to structure the question might be "how many stages (tasks) is my job divided into overall, and how long does it take to finish those stages (tasks)"?
And this can be easily answered by looking at the log files/web GUI timeline (comes in different flavors depending on your setup)
3.How much time to persist those final Joins to Hive / HDFS
Fair question. Check out Ganglia
Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk bound, network bound, or CPU bound.
Another trick I like to use it defining every sequence of transformations that must end in an action inside a separate function, and then calling that function on the input RDD inside a "timer function" block.
For instance, my "timer" is defined as such
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/1e9 + "s")
result
}
and can be used as
val df1 = Seq((1,"a"),(2,"b")).toDF("id","letter")
scala> time{df1.count}
Elapsed time: 1.306778691s
res1: Long = 2
However don't call unnecessary actions just to break down the DAG into more stages/wide dependencies. This might lead to shuffles or slow down your execution.
Resources:
https://spark.apache.org/docs/latest/monitoring.html
http://ganglia.sourceforge.net/
https://www.youtube.com/watch?v=49Hr5xZyTEA