Caching a large stream - apache-spark

I am working on a streaming application in which I am caching a large RDD (in memory only):
Dstream.cache()
Dstream.foreachRDD(..)
Dstream.foreachRDD(..)
I wanted to know: if the DStream cannot fit into memory, is the RDD recomputed or is an exception raised?
I am asking because I am developing a stateful application using the mapWithState function, which relies on an internal stream that is persisted in memory only (https://github.com/wliuxad/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/MapWithStateDStream.scala#L109-109).

Depends on which RDD we're talking about. MapWithStateDStream caches the data inside an OpenHashMapBasedStateMap, which doesn't spill to disk. This means you need sufficient memory for your application to work properly. When you think about it, how could state be evicted? It isn't some RDD being persisted; it is part of your application logic.
One thing that can be evicted is the cached RDD from your source. From your previous example I see that you're using Kafka, which means the cached KafkaRDD will be evicted once Spark sees fit.
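For illustration, here is a minimal mapWithState sketch, assuming a socket source and placeholder host, port, batch interval and checkpoint directory (none of these come from the question). The per-key state lives in the executors' in-memory StateMap, while the source blocks are ordinary cached data that Spark may evict.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Hypothetical stand-in for the Kafka-backed DStream from the question.
val conf = new SparkConf().setAppName("statefulWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/tmp/checkpoints")            // mapWithState requires a checkpoint directory

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1))

// The mapping function keeps a running total per key; the state itself is held
// in the executors' in-memory StateMap and is not spilled to disk.
def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val total = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(total)
  (key, total)
}

val stateful = counts.mapWithState(StateSpec.function(updateCount _))
stateful.print()

ssc.start()
ssc.awaitTermination()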

Related

What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far about RDDs is that once an RDD is generated it cannot be modified, only transformed into another RDD. Now I have several doubts here:
If an RDD is transformed into another RDD by applying some transformation, what happens to the previous RDD? Are both RDDs stored in memory?
If an RDD is cached, and some transformation is applied to the cached RDD to generate a new RDD, can there be a scenario where there is not enough space in RAM to hold the newly generated RDD? If such a scenario occurs, how will Spark handle it?
Thanks in advance!
Due to Spark's lazy evaluation, nothing happens when you do transformations on RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions,
The original RDD stays where it is, and the transformed RDD does not exist / has not been computed yet due to lazy evaluation. Only a query plan is generated for it.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called. Therefore running out of memory shouldn't happen.
Normally when you run out of memory, either you encounter an OOM error, or the cached RDDs in memory will be spilled onto the disk to free up memory.
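A short sketch of that laziness, assuming a placeholder input path and a local master (not from the question): neither RDD occupies executor memory until the action at the end runs.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only; the path and app name are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("lazyEvalDemo").setMaster("local[*]"))

val original    = sc.textFile("data/input.txt")                    // no data is read yet
val transformed = original.map(_.toUpperCase).filter(_.nonEmpty)   // still just a lineage / query plan

// Only the action triggers computation; until then both RDDs exist only as
// driver-side metadata describing the plan.
val firstTen = transformed.take(10)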
In order to understand the answers to your questions, you need to know a couple of things about Spark:
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first question:
You can think of an RDD as a virtual data structure that is not filled with values until some action is called on it, which materializes the RDD/DataFrame. Performing transformations only creates a query plan, which reflects Spark's lazy evaluation behavior. When an action is called, Spark performs all the transformations according to the physical plan it generates. So nothing happens to the RDDs themselves; RDD data is pulled into memory only when an action is called.
Answer to your second question:
If an RDD is cached and you perform multiple transformations on top of the cached RDD, nothing actually happens yet, because cache itself is a lazy operation. The cached RDD only ends up in memory once an action is performed, so you won't run out of memory just by defining transformations.
You could run into memory issues if you try to cache every step of the transformation pipeline, which should be avoided. (Whether or not to cache a DataFrame/RDD is a million-dollar question for a beginner, but you come to understand it as you learn the basics and Spark's architecture.)
Another workflow where you can run out of memory is when you have a huge dataset and you cache the RDD after some transformation because you want to perform multiple actions on it, or because it is used multiple times in the workflow. In that case you need to verify your cluster configuration and make sure it can handle the data you intend to cache.
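A minimal sketch of that reuse pattern, assuming a placeholder S3 path and comma-separated input (not from the question): the RDD is persisted once because several actions read it, and MEMORY_AND_DISK spills partitions that don't fit instead of recomputing them.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("cacheReuseDemo"))

val parsed = sc.textFile("s3a://bucket/logs/")        // placeholder source
  .map(_.split(","))
  .filter(_.length > 3)

// Cache only because `parsed` feeds several actions below; MEMORY_AND_DISK
// spills partitions that don't fit in memory to local disk.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

val total   = parsed.count()                          // first action materializes the cache
val byUser  = parsed.map(a => (a(0), 1)).reduceByKey(_ + _).collect()
val sampled = parsed.take(100)

parsed.unpersist()                                    // release the blocks once done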

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
From what I know so far, there is no data movement when we cache a DataFrame; it is just saved in the executors' memory. So it should be just a matter of setting/unsetting a flag.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger the DataFrame persistence / caching, and that action does add some extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to the executors (memory, disk, or a combination of the two). Moving that data from where it lives to the executors does add some overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on the executors). They act as peers that exchange blocks on demand (using a torrent-like protocol).
Unpersisting a DataFrame simply sends a request (sync or async) to the BlockManagers to remove the RDD blocks. If it happens asynchronously, the overhead is negligible (aside from the extra work the executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD is just an abstraction describing a distributed computation and does nothing at creation time, persist / unpersist really is just setting / unsetting a flag.
The change can be noticed at execution time.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.
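To make the cost model concrete, a hedged sketch (the Parquet path and the eventType column are placeholders): persist is lazy, the first action pays the one-off cost of writing the cache blocks, later actions read them, and an asynchronous unpersist just fires a removal request at the BlockManagers.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("persistOverheadDemo").getOrCreate()

val df = spark.read.parquet("s3a://bucket/events/")   // placeholder source

df.persist()               // lazy: only marks the plan for caching, no work yet
df.count()                 // first action pays the one-off cost of writing the cache blocks

df.groupBy("eventType").count().show()   // subsequent actions read the cached blocks

df.unpersist(blocking = false)           // async removal request to the BlockManagers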

Should you always unpersist earlier cached rdds after you cache an rdd that appears later in the same lineage graph?

I have an rdd that I cache after loading data from s3, since I don't want to have to re-pull from s3 if I lose an executor. I then make a bunch of transformations on that rdd, and then cache again.
At this point, is there any reason to leave the first cached rdd in the cache? Will all later stages just pull from the more recently cached transformation if I don't use the earlier rdd again?
I don't want to have to re-pull from s3 if I lose an executor.
Default caching variants don't protect you from executor loss. Spark provides replicated cache options (MEMORY_ONLY_SER_2, MEMORY_AND_DISK_SER_2, DISK_ONLY_2) which add some protection in case of node failure, but they're more expensive than the non-replicated variants.
is there any reason to leave the first cached rdd in the cache?
If the second one has been materialized then there is no reason to keep the first one, but the LRU cleaner should be able to handle this case without your help, if it becomes necessary.
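A sketch of that pattern under assumed names and an assumed S3 path (nothing here comes from the question): the first RDD uses a replicated level since executor loss was the concern, and it is unpersisted only after the second cache has been materialized by an action.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("unpersistDemo"))

val raw = sc.textFile("s3a://bucket/input/")
raw.persist(StorageLevel.MEMORY_AND_DISK_SER_2)   // replicated variant, since executor loss was the worry

val cleaned = raw.map(_.trim).filter(_.nonEmpty)
cleaned.cache()
cleaned.count()        // materializes `cleaned`; later stages read from this cache

raw.unpersist()        // safe now: nothing downstream still needs the first cached RDD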

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like
Could not compute split, block input-0-1464774108087 not found
and I was wondering if there is a way to reprocess those batches on the side without messing with the currently running application. Just in general; it does not have to be this exact exception.
Thanks in advance
Pradeep
This may happen when your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that when memory is low, Spark can spill data to disk. This should prevent the error.
Also, I don't think this error means that any data was lost during processing; rather, the input block that your block manager added timed out before processing started.
Check the similar question on the Spark User list.
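As a hedged illustration of the StorageLevel suggestion (host and port are placeholders, and this assumes a receiver-based source; a direct Kafka stream is configured differently): receiver-based input streams accept an explicit storage level, and MEMORY_AND_DISK_SER lets Spark spill received blocks to disk instead of dropping them when memory is tight.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("spillableIngest"), Seconds(10))

// Receiver-based sources take a StorageLevel; serialized memory-and-disk storage
// lets low-memory executors spill received blocks rather than lose them.
val lines = ssc.socketTextStream("ingest-host", 9999, StorageLevel.MEMORY_AND_DISK_SER)

lines.foreachRDD { rdd => rdd.count() }   // stand-in for the real processing

ssc.start()
ssc.awaitTermination()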
Edit:
Data is not lost; it was just not present where the task expected it to be. As per the Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read in multiple places that using cache() on an RDD causes it to be stored in memory, but so far I haven't found clear guidelines or rules of thumb on how to determine the maximum amount of data one could cram into memory. What happens if the amount of data I call cache() on exceeds the available memory? Will it cause my job to fail, or will it still complete with a noticeable impact on cluster performance?
Thanks!
As clearly stated in the official documentation, with MEMORY_ONLY persistence (which is equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if the data fits into memory, it can be evicted when new data comes in. In practice, caching is more of a hint than a contract: you cannot depend on caching taking place, but you don't have to do anything special when it does succeed either.
Note:
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
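A small sketch of the difference between the two levels, assuming a placeholder HDFS path: with MEMORY_ONLY the overflow partitions are recomputed from lineage on each reuse, while MEMORY_AND_DISK writes them to local disk instead (a given RDD can only be assigned one storage level, hence the commented-out alternative).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc  = new SparkContext(new SparkConf().setAppName("storageLevelDemo"))
val rdd = sc.textFile("hdfs:///data/large-input")     // placeholder path

// cache() == persist(StorageLevel.MEMORY_ONLY): partitions that don't fit
// are simply not stored and are recomputed from lineage whenever reused.
rdd.persist(StorageLevel.MEMORY_ONLY)

// The Dataset default, MEMORY_AND_DISK, would write the overflow partitions to
// local disk instead, trading recomputation for disk I/O:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()   // either way the job completes; only the performance differs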
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?

Resources