When I call unpersist on a Dataset, it removes the Dataset's data from memory, but the underlying RDD objects (which hold only metadata such as the query plan) are still in memory.
How can I remove those objects from memory?
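As an illustration of the behavior being described, a minimal sketch (the application name and input path are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unpersist-demo").getOrCreate()

val ds = spark.read.parquet("/data/events").cache() // hypothetical input
ds.count()     // action: materializes the cached blocks

ds.unpersist() // frees the cached data blocks on the executors,
               // but the Dataset object itself (query plan, lineage
               // metadata) stays on the driver until every reference
               // to it is dropped and the JVM garbage-collects it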
Related
If you cache a very large dataset that cannot be stored entirely in memory or on disk, how does Spark handle the partial cache? How does it know which data needs to be recomputed when you go to use that DataFrame again?
Example:
Read a 100 GB dataset into memory as df1
Compute a new DataFrame df2 based on df1
Cache df2
If Spark can only fit 50 GB of cache for df2, what happens when you go to reuse df2 for the next steps? How would Spark know which data it doesn't need to recompute and which it does? Will it need to re-read the data that it couldn't persist?
UPDATE
What happens if you have 5 GB of memory and 5 GB of disk and try to cache a 20 GB dataset? What happens to the other 10 GB of data that can't be cached, and how does Spark know which data it needs to recompute and which it doesn't?
Spark has this default option for DataFrames and Datasets:
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. At this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, Spark stores some of the excess partitions on local disk and reads them back from local disk when they are required. This is slower because I/O is involved.
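For example, a minimal sketch of that default level in action (the input path and transformation are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("memory-and-disk-demo").getOrCreate()

val df1 = spark.read.parquet("/data/large-input")  // hypothetical 100 GB source
val df2 = df1.groupBy("key").count()               // hypothetical transformation

// Equivalent to df2.cache() for DataFrames/Datasets: partitions that
// do not fit in memory spill to the executors' local disks.
df2.persist(StorageLevel.MEMORY_AND_DISK)
df2.count()  // an action is needed before anything is actually cached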
However, to be more specific:
Spark's unit of processing is a partition = 1 task, so the discussion is really about whether a partition or partitions fit into memory and/or local disk.
If a partition of the DF doesn't fit in memory and on disk when using StorageLevel.MEMORY_AND_DISK, then the OS will fail, i.e. kill, the Executor / Worker. Eviction of partitions belonging to other DFs may occur, but not for your own DF. The .cache is either successful or not; there is no re-reading in this case.
I base this on the fact that partition eviction does not occur for partitions belonging to the same underlying RDD. This is not well documented anywhere, but see: How does Spark evict cached partitions?. In the end, partitions of other RDDs may be evicted and recomputed, but you still need enough local disk as well as memory.
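One way to check how much of an RDD actually made it into the cache is the Storage tab of the Spark UI, or programmatically via SparkContext.getRDDStorageInfo (a developer API); a sketch:

// For each cached RDD, report how many partitions were actually stored
// and how the bytes are split between memory and local disk.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}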
A good read is: https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Say my Spark cluster has 100 GB of memory, and during the computation more data (new DataFrames, caches) with a size of 200 GB is generated. In this case, will Spark store some of this data on disk, or will it just OOM?
Spark only starts reading the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task.
If you apply no transformation but only do, for instance, a count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read all the data once more. To avoid reading data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. On cache (which is the same as persist(StorageLevel.MEMORY_ONLY)), it will store all partitions in memory - if they don't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK), it will store as much as it can in memory and put the rest on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
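A sketch of the difference (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val df = spark.read.csv("/data/input.csv")  // hypothetical source

df.count()  // full read from the source
df.count()  // reads the source again: nothing was stored

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // reads the source once more and fills the cache
df.count()  // now served from memory (and local disk, if partitions spilled)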
In Apache Spark, if the data does not fit into memory, Spark simply persists that data to disk. Spark's operators spill data to disk when it does not fit in memory, allowing Spark to run well on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
The persist method in Apache Spark provides several storage levels to persist the data:
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2 and OFF_HEAP.
The OFF_HEAP storage level is experimental.
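A reference sketch of how these levels are selected (rddA through rddF are hypothetical RDDs; note that a given RDD accepts only one storage level, so re-persisting with a different level throws an exception):

import org.apache.spark.storage.StorageLevel

rddA.persist(StorageLevel.MEMORY_ONLY)         // deserialized objects in memory; recomputed on a miss
rddB.persist(StorageLevel.MEMORY_AND_DISK)     // partitions that don't fit spill to local disk
rddC.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized bytes in memory (Java and Scala)
rddD.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, excess spilled to disk
rddE.persist(StorageLevel.DISK_ONLY)           // everything on local disk
rddF.persist(StorageLevel.MEMORY_AND_DISK_2)   // as MEMORY_AND_DISK, replicated on two nodes
// StorageLevel.OFF_HEAP is also available but experimental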
I run into a pretty simple but frustrating pattern when doing ad hoc data analysis:
You have rdd1 cached in memory, then you cache rdd2 in memory, which evicts rdd1 to disk because of memory constraints. If you were to unpersist rdd2, is there any way to tell Spark to move rdd1 back to memory?
I don't think it is possible to instruct Spark to bring rdd1 back into memory.
However, the next time rdd1 is accessed, it gets loaded into memory, given that you are using the MEMORY_AND_DISK persistence level (otherwise it gets recomputed).
If you are looking to reduce space and the in-memory footprint, consider using MEMORY_AND_DISK_SER; this serializes your objects before storing them.
My program works like this:
Read in a lot of files as DataFrames. Among those files there is a group of about 60 files with 5k rows each; I create a separate DataFrame for each of them, do some simple processing, and then union them all into one DataFrame which is used for further joins.
I perform a number of joins and column calculations on a number of DataFrames, which finally results in a target DataFrame.
I save the target DataFrame as a Parquet file.
In the same Spark application, I load that Parquet file and do some heavy aggregation, followed by multiple self-joins on that DataFrame.
I save the second DataFrame as another Parquet file.
The problem
If I have just one file instead of 60 in the group of files mentioned above, everything works with the driver having 8g of memory. With 60 files, the first 3 steps work fine, but the driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file I do not use the DataFrames that were used to calculate the first file, so their number and content should not really matter as long as the size of the first Parquet file remains constant, should they? Do those 60 DataFrames get cached somehow and occupy the driver's memory? I don't do any caching myself, and I never collect anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
import org.apache.spark.storage.StorageLevel

// A serializer configuration (e.g. Kryo) matters when persisting with
// the serialized level MEMORY_AND_DISK_SER
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val rdd1 = sc.textFile("some data")
rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER) // marks rdd1 for persistence; materialized on the first action

val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(StorageLevel.MEMORY_AND_DISK_SER)

rdd2.saveAsTextFile("...") // actions: the persisted RDDs are actually cached here
rdd3.saveAsTextFile("...")

// release the cached blocks once they are no longer needed
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code, follow this link.
Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept, as RDDs, in memory (the default) or on more durable storage such as disk, and/or replicated.
RDDs can be cached using the cache operation. They can also be persisted using the persist operation.
The difference between the cache and persist operations is purely syntactic: cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
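The equivalence for RDDs, as a sketch:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// For RDDs these two calls are identical:
rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY)

// (For Datasets/DataFrames, cache() defaults to MEMORY_AND_DISK instead.)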
Refer to the use of persist and unpersist.
RDDs that have been cached using the rdd.cache() method from the Scala shell are stored in memory.
That means they consume part of the RAM available to the Spark process itself.
Having said that, if RAM is limited and more and more RDDs are cached, when will Spark automatically clean the memory occupied by the RDD cache?
Spark will clean cached RDDs and Datasets / DataFrames (a sketch of the explicit option follows this list):
When it is explicitly asked to by calling RDD.unpersist (How to uncache RDD?) / Dataset.unpersist methods or Catalog.clearCache.
At regular intervals, by the cache cleaner:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
When the corresponding distributed data structure is garbage collected.
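A sketch of the explicit option (rdd, df and spark are hypothetical handles; the automatic mechanisms need no user code):

// (1) Explicit cleanup
rdd.unpersist()              // drop this RDD's cached blocks
df.unpersist()               // same for a Dataset / DataFrame
spark.catalog.clearCache()   // drop everything cached in this session

// (2) LRU eviction and (3) garbage-collection-driven cleanup (the
// ContextCleaner) run automatically in the background.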
Spark will automatically unpersist/clean the RDD or DataFrame if the RDD is no longer used. To check whether an RDD is cached, look at the Storage tab in the Spark UI and inspect the memory details.
From the terminal, you can use rdd.unpersist() or sqlContext.uncacheTable("sparktable") to remove the RDD or tables from memory. Spark is built around lazy evaluation: unless and until you call an action, it does not load or process any data into the RDD or DataFrame.
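From the shell, the cached state can also be inspected and cleared like this; a sketch with a hypothetical table name:

spark.catalog.isCached("sparktable")      // true while the table is cached
spark.catalog.uncacheTable("sparktable")  // the modern equivalent of sqlContext.uncacheTable
rdd.getStorageLevel                       // StorageLevel.NONE after unpersist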