Need to release the memory used by unused spark dataframes - apache-spark

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the content of the dataframe as part of the process, when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformations on the dataframe, which creates a new dataframe DF2. Then there is a series of 10 steps I perform on DF2. Throughout all of this, I do not need DF1 anymore. How can I be sure that DF1 is no longer sitting in memory and hampering performance? Is there an approach by which I can directly remove DF1 from memory, or does DF1 get removed automatically based on a Least Recently Used (LRU) approach?

That's not how Spark works. Dataframes are lazy: the only things stored in memory are the schemas and the list of transformations you have applied to your dataframes. The data itself is not stored in memory (unless you cache it and apply an action).
Therefore, I do not see any problem in your scenario.
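To make that concrete, here is a minimal PySpark sketch of the point above. The paths and the column name are placeholders, not taken from the question, and spark is assumed to be an active SparkSession:
df1 = spark.read.parquet("/path/to/source")   # only schema/metadata, no rows loaded
df2 = df1.groupBy("some_column").count()      # just extends the logical plan, still no data in memory
df2.write.parquet("/path/to/output")          # the action: rows stream through; df1 is never kept resident unless you explicitly cache it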

Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note there are MANY possible correct ways to optimize RAM usage, and which one is right will depend on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A', 0)], schema=['LETTERS', 'NUMBERS']) \
    .repartition("LETTERS") \
    .write.partitionBy("LETTERS") \
    .parquet(f"/{tmpdir}", mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, then each time Spark finds a reference to df it will read the Parquet files in parallel and compute whatever is specified.
Note that the above solution minimizes RAM usage, but may cost more CPU on every read, and it incurs the one-time cost of writing the data out to Parquet.
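As a hypothetical illustration of that trade-off (the "_counts" and "_positives" output paths below are made up), each downstream action performs its own independent scan of the Parquet files, so nothing stays pinned in RAM between steps:
counts = df.groupBy("LETTERS").count()
counts.write.parquet(f"/{tmpdir}_counts", mode="overwrite")         # first scan
positives = df.filter(df.NUMBERS > 0)
positives.write.parquet(f"/{tmpdir}_positives", mode="overwrite")   # second, independent scan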

Related

How can I make my Spark Accumulator statistics reliable in Azure Databricks?

I am using a Spark accumulator to collect statistics for each of my pipelines.
In a typical pipeline I would read a DataFrame:
df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")
df.count()   # => 7 rows
Then I would write it out to two different locations:
df.write.format("delta").option("header", "true").save("/mnt/prepared/orders")
df.write.format("delta").option("header", "true").save("/mnt/reporting/orders_current/")
Unfortunately, my accumulator statistics get updated on each write operation. It reports 14 rows read, while I have only read the input dataframe once.
How can I make my accumulator properly reflect the number of rows that I actually read?
I am a newbie in Spark. I have checked several threads around the issue, but did not find my answer:
Statistical accumulator in Python
spark Accumulator reset
When are accumulators truly reliable?
The first rule: accumulators aren't 100% reliable. They can be updated multiple times, for example if tasks were restarted or retried.
In your case, although you call read once, that doesn't mean the data won't be read again. The read operation just obtains metadata, such as the schema; it may touch the data if you use inferSchema for some formats, but it doesn't actually pull the data into memory. You can cache the dataframe you read, but that only works for smaller data sets, and it still doesn't guarantee that data won't be evicted and then need to be re-read.
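A minimal sketch of that caching idea, with the accumulator bookkeeping itself omitted and with placeholder output paths (not the ones from the question):
df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")
df.cache()
df.count()   # materializes the cache once
df.write.format("delta").save("/mnt/prepared/orders_delta")        # served from the cache (subject to the eviction caveat above)
df.write.format("delta").save("/mnt/reporting/orders_current/")    # served from the cache
df.unpersist()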

What's the overhead of converting an RDD to a DataFrame and back again?

It was my assumption that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case, and Difference between DataFrame, Dataset, and RDD in Spark does a good job of explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First, they are the internal data structure for tracking changes between stages in order to manage failures, and second, until Spark 1.3 they were the main interface for interacting with users. After Spark 1.3, DataFrames constitute the main interface, offering much richer functionality than RDDs.
There is no significant overhead when converting a DataFrame to an RDD with df.rdd, since DataFrames already keep an instance of their underlying RDD initialized, so returning a reference to that RDD has no additional cost. On the other hand, generating a DataFrame from an RDD requires some extra effort. There are two ways to convert an RDD to a DataFrame: 1) calling rdd.toDF() and 2) calling spark.createDataFrame(rdd, schema). Both methods evaluate lazily, although there is extra overhead for schema validation and building the execution plan (you can check the toDF() code here for more details). That overhead is essentially what you would pay anyway by initializing your data with spark.read.text(...), except that reading directly skips one step: the conversion from RDD to DataFrame.
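A minimal sketch of both conversion directions, assuming an RDD of tuples (the column names and sample data are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

df1 = rdd.toDF(["id", "text"])                            # types inferred by sampling, names supplied
df2 = spark.createDataFrame(rdd, "id INT, text STRING")   # explicit schema, no sampling

back_to_rdd = df1.rdd   # cheap: returns the RDD the DataFrame already wraps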
That conversion overhead is the first reason I would go directly with DataFrames instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you miss significant performance features that DataFrames and Datasets offer through the Spark optimizer (Catalyst) and memory management (Tungsten).
Finally, I would use the RDD interface only if I need features that are missing from DataFrames, such as key-value pairs, the zipWithIndex function, etc. But even then you can access those via df.rdd, which is costless, as already mentioned. As for your case, I believe it would be faster to use a DataFrame directly and use that DataFrame's map function, so that Spark leverages Tungsten and ensures efficient memory management.
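As a hedged sketch of the DataFrame-first approach for your text-file case (the file path, delimiter, and column names are assumptions): read the file straight into a DataFrame and parse the lines with built-in functions instead of an RDD map that builds Row() objects.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lines = spark.read.text("/path/to/input.txt")   # single column named "value"
parsed = lines.select(
    F.split(F.col("value"), ",").getItem(0).alias("id"),
    F.split(F.col("value"), ",").getItem(1).alias("name"),
)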

How to post-process Spark SQL results w/o using UDF

I read
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
It suggests not using UDFs, in order to save serialization/deserialization cost.
In my case, I did a query like this
select MYFUN(f1, f2, ...)
from A ...
I use MYFUN to post-process the query results row by row, for example, sending them to another service.
def my_fun(f1, f2, ...):
    service.send(f1, f2, ...)

session.udf.register('MYFUN', my_fun)
Without using a UDF, I might save the query results to a local Python dataframe, or to a Parquet table on HDFS and then read that back as a dataframe, and process the rows one by one.
The problem is that the result table is large, maybe 1M rows.
In such a case, does it still make sense to remove the UDF?
What is the best practice to populate a Spark SQL result to another service?
Python UDFs are not recommended from a performance point of view, but there is nothing wrong with using them when needed, as in this case: the serialization/deserialization cost is probably negligible compared to the I/O waits introduced by your send call. So it probably doesn't make sense to remove the UDF.
In a more general case, there are two ways with which you can reduce the memory footprint of processing a dataframe. One you already mentioned, is save to file and process the file.
Another way is to use toLocalIterator on your dataframe. With it, only one partition at a time is fetched to the driver, and you can repartition the dataframe beforehand to make those partitions an arbitrary size:
df = df.repartition(100)
for row in df.toLocalIterator():   # rows are fetched to the driver one partition at a time
    send(row)
This way your local memory requirements are reduced to the biggest partition of your repartitioned dataframe.

Spark driver running out of memory when reading multiple files

My program works like this:
Read in a lot of files as dataframes. Among those files there is a group of about 60 files with 5k rows each, where I create a separate Dataframe for each of them, do some simple processing and then union them all into one dataframe which is used for further joins.
I perform a number of joins and column calculations on several dataframes, which finally results in a target dataframe.
I save the target dataframe as a Parquet file.
In the same spark application, I load that Parquet file and do some heavy aggregation followed by multiple self-joins on that dataframe.
I save the second dataframe as another Parquet file.
The problem
If I have just one file instead of 60 in the group of files I mentioned above, everything works with the driver having 8g of memory. With 60 files, the first 3 steps work fine, but the driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file, I do not use the dataframes that were used to calculate the first file, so their number and content should not really matter as long as the size of the first Parquet file remains constant, should it? Do those 60 dataframes get cached somehow and occupy the driver's memory? I don't do any caching myself, and I never collect anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//you have to use serialization configuration if you are using MEMORY_AND_DISK_SER
val rdd1 = sc.textFile("some data")
rdd1.persist(storageLevel.MEMORY_AND_DISK_SER) // marks rdd as persist
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code follow this link
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept, as RDDs, in memory (the default) or in more durable storage such as disk, and/or replicated.
RDDs can be cached using the cache operation; they can also be persisted using the persist operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
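A minimal PySpark sketch of that equivalence (the sample RDDs are placeholders created just for illustration):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd1 = sc.parallelize(range(100))
rdd2 = sc.parallelize(range(100))

rdd1.cache()                                 # shorthand for rdd1.persist(StorageLevel.MEMORY_ONLY)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)   # explicit level: spill to disk when memory is tight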
refer to use of persist and unpersist

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like with respect to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like it's very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells Spark not to evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, partitions are important for Spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ) if you are going to use it multiple times. Doing so prevents Spark from computing it again in memory, and in some cases it can increase the performance of your application by 15%!
For example, if you are going to use the resulting RDD just once, it is safe not to persist it.
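A hedged sketch of that pattern applied to the question's unioned dataframe, in PySpark (dfA and dfB come from the question; "someKey" is a made-up column name): persist only when the result is reused, and unpersist when you are done.
unioned = dfA.unionAll(dfB)   # union() in more recent Spark versions
unioned.persist()

unioned.groupBy("someKey").count().show()       # first use: computes and caches
unioned.filter("someKey IS NOT NULL").count()   # second use: reads from the cache

unioned.unpersist()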
Do I have to repartition?
Beyond just finding the number of partitions, you can read my memoryOverhead issue in Spark post, which discusses how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely about Spark: when you handle big data you don't want unnecessary things lying around in memory, since that threatens the stability of your application.
A DataFrame can be backed by temporary files that Spark creates for you, and it is loaded into your application's memory only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just adds up the numbers of partitions of dataframe 1 and dataframe 2. Both dataframes must have the same number of columns, in the same order, to perform a union. So no worries: even if the partitioning columns differ between the two dataframes, there will be at most m + n partitions.
You don't need to repartition your dataframe after the union; my suggestion is to use coalesce instead of repartition. coalesce combines common partitions, or merges some small partitions, and avoids/reduces shuffling data between partitions.
If you cache/persist the dataframe after each union, you will reduce performance, and lineage is not broken by cache/persist; in that case, garbage collection may clean the cache/memory during some heavy memory-intensive operation, and recomputing it will then increase computation time, since partial recomputation is required for the cleared/removed data.
Since Spark transformations are lazy (unionAll is a lazy operation, and coalesce/repartition are also lazy and only take effect at the first action), try to coalesce the unionAll result at an interval, for example every 8 unions, to reduce the number of partitions in the resulting dataframe. Use checkpoints to break lineage and store data if there are lots of memory-intensive operations in your solution.
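A hedged PySpark sketch of checking the partition count after a union and reducing it with coalesce (dfA, dfB, and optimalNumberOfPartitions are the placeholders from the question):
unioned = dfA.unionAll(dfB)
print(dfA.rdd.getNumPartitions(),
      dfB.rdd.getNumPartitions(),
      unioned.rdd.getNumPartitions())   # typically numAPartitions + numBPartitions

reduced = unioned.coalesce(optimalNumberOfPartitions)   # narrow operation, avoids a full shuffle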
