Does spark automatically un-cache and delete unused dataframes? - apache-spark

I have the following strategy to change a dataframe df.
df = T1(df)
df.cache()
df = T2(df)
df.cache()
.
.
.
df = Tn(df)
df.cache()
Here T1, T2, ..., Tn are n transformations that return spark dataframes. Repeated caching is used because df has to pass through a lot of transformations and used mutiple times in between; without caching lazy evaluation of the transformations might make using df in between very slow. What I am worried about is that the n dataframes that are cached one by one will gradually consume the RAM. I read that spark automatically un-caches "least recently used" items. Based on this I have the following queries -
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
Based on the answer to the above two queries, is the above strategy correct?

How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Results are cached on spark executors. A single executor runs multiple tasks and could have multiple caches in its memory at a given point in time. A single executor caches are ranked based on when it is asked. Cache just asked in some computation will have rank 1 always, and others are pushed down. Eventually when available space is full, cache with last rank is dropped to make space for new cache.
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
Dataframe is an execution expression and unless an action is called, no computation is materialised. Moreover, everything will be cleared once the executor is done with computation for that task. Only when dataframe is cached (before calling action), results are kept aside in executor memory for further use. And these result caches are cleared based on LRU.
Based on the answer to the above two queries, is the above strategy correct?
Your example seems like transformation are done in sequence and reference for previous dataframe is not used further (no idea why you are using cache). If multiple executions are done by same executor, it is possible that some results are dropped and when asked they will be re-computed again.
N.B. - Nothing is executed unless a spark action is called. Transformations are chained and optimised by spark engine when an action is called.

As far as I have worked with spark and also with the communication with the cloudera that I had, we should unpersist/uncache the data, if we do not do that job will start to slow down, the problem becomes more severe in case of streaming job.
I have nothing to support my answer but
read here and here for details

Related

What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far regarding RDDs is that, once a RDD is generated, it cannot be modified but can be transformed into another RDD. Now I have several doubts here:
If an RDD is transformed to another RDD on applying some transformation on the RDD, then what happens to the previous RDD? Are both the RDDs stored in the memory?
If an RDD is cached, and some transformation is applied on the cached RDD to generate a new RDD, then can there be a scenario that, there is not enough space in RAM to hold the newly generated RDD? In case, such a scenario occurs, how will Spark handle that?
Thanks in advance!
Due to Spark's lazy evaluation, nothing happens when you do transformations on RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions,
The original RDD stays where it is, and the transformed RDD does not exist / has not been computed yet due to lazy evaluation. Only a query plan is generated for it.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called. Therefore running out of memory shouldn't happen.
Normally when you run out of memory, either you encounter an OOM error, or the cached RDDs in memory will be spilled onto the disk to free up memory.
In order to understand the answers of the questions you have, you need to know couple of things about spark.
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first Question:
You could think of RDD as virtual data structure that does not get filled with values unless there is some action called on it which materializes the rdd/dataframe. When you perform transformations it just creates query plan which shows the lazily evaluation behavior of spark. When action gets called, it perform all the transformation based on the physical plan that gets generated. So, nothing happens to the RDDs. RDD data gets pulled into memory when action gets called.
Answer to your Second Question:
If an RDD is cached and you perform multiple transformations on top of the cache RDD, actually nothing happens to the RDDs as cache is a transformation operation. Also the RDD that you have cached would be in memory when any action would be performed. So, you won't run out of memory.
You could run into memory issues if you are trying to cache each step of the transformation, which should be avoided.(Whether to cache or not Cache a dataframe/RDD is a million dollar question as beginner but you get to understand that as you learn the basics and spark architecture)
Other workflow where you can run out of memory is when you have huge data size and you are caching the rdd after some transformation as you would like to perform multiple actions on it or it is getting used in the workflow multiple times. In this case you need to verify your cluster configuration and need to make sure that it can handle the data that you are intending to cache.

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solutions using set of tables. Suppose there are 5 tables which i need to building my solution and finally i am creating one output table.
Here is my flow
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = datamframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table 5
// finally
dataframe4.saveAsTable
When I save final dataframe that's when all the above dataframe is evaluated.
Is my approach is good? or
Do i need to cache/persist intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of tables you would want to do broadcast hint for any of tables that are relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
This behaviour depends on the value in:
Now broadcast hint will be honoured only if Spark is able to evaluate the value of the table so you might need to cache().
Another option is via Spark checkpoints that can help to truncate local plan for optimisation (also this allows you to resume jobs from checkpoint location, it is similar to writing to HDFS but with some overhead).
In case of broadcasting few houndres of Mb tables, you might need to increase your kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends which join types you will use.
You would probably want to do filter and aggregagtion as early as possible since it will reduce the join surface.
There are many other considerations to be consider in order to properly optimise this. In case of power law distribution of join keys in any of the joins you would need to do salting and explode smaller table.
In your case, in principle, there is not really a cache or persist required Why?
As there are no reuse paths evident (for other Actions or other Transformations within the same Action), it is all sequential.
Also, lazy evaluation and Catalyst.
Try the .explain and see how Spark will process.
However, due to memory eviction possibilities on the Cluster, there may be the need to re-compute on a Worker. There are various settings that you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance. So use .explain. See here an excellent posting: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different but yours seems Ok to answer as I have. In summary: An RDD or DF that is not cached, nor check-pointed, is re-evaluated again each time an Action is invoked on that RDD or DF or if re-accessed within the current Action and no skipped stage situation applies. In your case no issue. Doing otherwise would slow your App down in fact.

Why there are so many partitions required before shuffling data in Apache Spark?

Background
I am a newbie in Spark and want to understand about shuffling in spark.
I have two following questions about shuffling in Apache Spark.
1) Why there is change in no. of partitions before performing shuffling ? Spark does it by default by changing partition count to value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that data is also saved on disk. Is my understanding correct ?
Two questions actually.
Nowhere it it stated that you need to change this parameter. 200 is the default if not set. It applies to JOINing and AGGregating. You make have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity - if more Executors are available. 200 is the default, but if your quantity is huge, more parallelism if possible will speed up processing time - in general.
Assuming an Action has been called - so as to avoid the obvious comment if this is not stated, assuming we are not talking about ResultStage and a broadcast join, then we are talking about ShuffleMapStage. We look at an RDD initially:
DAG dependency involving a shuffle means creation of a separate Stage.
Map operations are followed by Reduce operations and a Map and so forth.
CURRENT STAGE
All the (fused) Map operations are performed intra-Stage.
The next Stage requirement, a Reduce operation - e.g. a reduceByKey, means the output is hashed or sorted by key (K) at end of the Map
operations of current Stage.
This grouped data is written to disk on the Worker where the Executor is - or storage tied to that Cloud version. (I would have
thought in memory was possible, if data is small, but this is an architectural Spark
approach as stated from the docs.)
The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. ShuffleManager keeps track of all
keys/locations once all of the map side work is done.
NEXT STAGE
The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using Block Manager.
The Executor may be re-used or be a new on another Worker, or another Executor on same Worker.
Stages mean writing to disk, even if enough memory present. Given finite resources of a Worker it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence, less re-computation work.
Similar aspects apply to DFs.

Spark createDataFrame(df.rdd, df.schema) vs checkPoint for breaking lineage

I'm currently using
val df=longLineageCalculation(....)
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
newDf.join......
In order to save time when calculating plans, however docs say that checkpointing is the suggested way to "cut" lineage. BUT I don't want to pay the price of saving the RDD to disk.
My process is a batch process which is not-so-long and can be restarted without issues, so checkpointing is not benefit for me (I think).
What are the problems which can arise using "my" method? (Docs suggests checkpointing, which is more expensive, instead of this one for breaking lineages and I would like to know the reason)
Only think I can guess is that if some node fails after my "lineage breaking" maybe my process will fail while the checkpointed one would have worked correctly? (what If the DF is cached instead of checkpointed?)
Thanks!
EDIT:
From SMaZ answer, my own knowledge and the article which he provided. Using createDataframe (which is a Dev-API, so use at "my"/your own risk) will keep the lineage in memory (not a problem for me since I don't have memory problems and the lineage is not big).
With this, it looks (not tested 100%) that Spark should be able to rebuild whatever is needed if it fails.
As I'm not using the data in the following executions, I'll go with
cache+createDataframe versus checkpointing (which If i'm not wrong, is
actually cache+saveToHDFS+"createDataFrame").
My process is not that critical (if it crashes) since an user will be always expecting the result and they launch it manually, so if it gives problems, they can relaunch (+Spark will relaunch it) or call me, so I can take some risk anyways, but I'm 99% sure there's no risk :)
Let me start with creating dataframe with below line :
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
If we take close look into SparkSession class then this method is annotated with #DeveloperApi. To understand what this annotation means please take a look into below lines from DeveloperApi class
A lower-level, unstable API intended for developers.
Developer API's might change or be removed in minor versions of Spark.
So it is not advised to use this method for production solutions, called as Use at your own risk implementation in open source world.
However, Let's dig deeper what happens when we call createDataframe from RDD. It is calling the internalCreateDataFrame private method and creating LogicalRDD.
LogicalRDD is created when:
Dataset is requested to checkpoint
SparkSession is requested to create a DataFrame from an RDD of internal binary rows
So it is nothing but the same as checkpoint operation without saving the dataset physically. It is just creating DataFrame From RDD Of Internal Binary Rows and Schema. This might truncate the lineage in memory but not at the Physical level.
So I believe it's just the overhead of creating another RDDs and can not be used as a replacement of checkpoint.
Now, Checkpoint is the process of truncating lineage graph and saving it to a reliable distributed/local file system.
Why checkpoint?
If computation takes a long time or lineage is too long or Depends too many RDDs
Keeping heavy lineage information comes with the cost of memory.
The checkpoint file will not be deleted automatically even after the Spark application terminated so we can use it for some other process
What are the problems which can arise using "my" method? (Docs
suggests checkpointing, which is more expensive, instead of this one
for breaking lineages and I would like to know the reason)
This article will give detail information on cache and checkpoint. IIUC, your question is more on where we should use the checkpoint. let's discuss some practical scenarios where checkpointing is helpful
Let's take a scenario where we have one dataset on which we want to perform 100 iterative operations and each iteration takes the last iteration result as input(Spark MLlib use cases). Now during this iterative process lineage is going to grow over the period. Here checkpointing dataset at a regular interval(let say every 10 iterations) will assure that in case of any failure we can start the process from last failure point.
Let's take some batch example. Imagine we have a batch which is creating one master dataset with heavy lineage or complex computations. Now after some regular intervals, we are getting some data which should use earlier calculated master dataset. Here if we checkpoint our master dataset then it can be reused for all subsequent processes from different sparkSession.
My process is a batch process which is not-so-long and can be
restarted without issues, so checkpointing is not benefit for me (I
think).
That's correct, If your process is not heavy-computation/Big-lineage then there is no point of checkpointing. Thumb rule is if your dataset is not used multiple time and can be re-build faster than the time is taken and resources used for checkpoint/cache then we should avoid it. It will give more resources to your process.
I think the sparkSession.createDataFrame(df.rdd, df.schema) will impact the fault tolerance property of spark.
But the checkpoint() will save the RDD in hdfs or s3 and hence if failure occurs, it will recover from the last checkpoint data.
And in case of createDataFrame(), it just breaks the lineage graph.

which is faster in spark, collect() or toLocalIterator()

I have a spark application in which I need to get the data from executors to driver and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on Internet, it returns an iterator rather than sending whole RDD instantly, so it has better memory performance, but what about speed? How is the performance between collect() and toLocalIterator() when it comes to execution/computation time?
The answer to this question depends on what would you do after making df.collect() and df.rdd.toLocalIterator(). For example, if you are processing a considerably big file about 7M rows and for each of the records in there, after doing all the required transformations, you needed to iterate over each of the records in the DataFrame and make a service calls in batches of 100.
In the case of df.collect(), it will dumping the entire set of records to the driver, so the driver will need an enormous amount of memory. Where as in the case of toLocalIterator(), it will only return an iterator over a partition of the total records, hence the driver does not need to have enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cause you a lot of expense, where as toLocalIterator() will not and it will be faster and reliable as well.
On the other hand if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also if your file size is so small that Spark's default partitioning logic does not break it down into partitions at all then df.collect() will be more faster.
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However lower memory footprint can partially compensate that, depending on a particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
As for the toLocalIterator, it is used to collect the data from the RDD scattered around your cluster into one only node, the one from which the program is running, and do something with all the data in the same node. It is similar to the collect method, but instead of returning a List it will return an Iterator.
So, after applying a function to an RDD using foreach you can call toLocalIterator to get an iterator to all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it to an RDD again after doing the operations you need, use the SparkContext to parallelize it.

Resources