How to investigate a kryo buffer overflow happening in spark? - apache-spark

I encountered a Kryo buffer overflow exception, but I really don't understand what data could require more than the current buffer size. I already have spark.kryoserializer.buffer.max set to 256Mb, and even a toString applied to the dataset items, which should be much bigger than what Kryo requires, takes less than that (per item).
I know I can increase this parameter, and I will right now, but I don't think it is good practice to simply increase resources when hitting a limit without investigating what is happening (just as I wouldn't react to an OOM by simply increasing the RAM allocation without checking what is using more RAM).
=> So, is there a way to investigate what is put into the buffer during execution of the Spark DAG?
I couldn't find anything in the Spark UI.
Note that How Kryo serializer allocates buffer in Spark is not the same question. It asks how it works (and actually no one answers it), whereas I ask how to investigate. In that question, all answers discuss the parameters to use; I know which parameter to use, and I do manage to avoid the exception by increasing it. However, I already consume too much RAM and need to optimize memory usage, the Kryo buffer included.

All data that is sent over the network, written to disk, or persisted in memory is serialized along with the Spark DAG. Hence, the Kryo serialization buffer must be larger than any single object you attempt to serialize, and must be less than 2048m.
https://spark.apache.org/docs/latest/tuning.html#data-serialization
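One way to investigate, rather than only raising the limit, is to measure how big individual records actually become once Kryo serializes them, since the buffer only needs to hold one object at a time. Below is a minimal sketch (Scala), assuming a SparkSession named spark (e.g. in spark-shell) and a Dataset ds containing the records that trigger the overflow; it reuses Spark's own org.apache.spark.serializer.KryoSerializer with the application's configuration.

```scala
// Rough sketch: measure the Kryo-serialized size of a sample of records.
// `spark` and `ds` are assumed placeholders; adjust the sample size as needed.
import org.apache.spark.serializer.KryoSerializer

val kryo = new KryoSerializer(spark.sparkContext.getConf).newInstance()

// Serialize a sample of records on the driver and report their sizes in bytes.
val sizes = ds.take(100).map(r => kryo.serialize(r).limit())
println(s"sampled ${sizes.length} records: max=${sizes.max} bytes, " +
  s"avg=${sizes.sum / sizes.length} bytes")
```

If individual records turn out to be small but the exception persists, the oversized object is often not a data record at all but something like a large task closure or broadcast payload, so the stage and task named in the stack trace (visible in the Spark UI) can help narrow down which operator is being serialized.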

Related

How does Spark handle an out-of-memory error when cached (MEMORY_ONLY persistence) data does not fit in memory?

I'm new to Spark and I am not able to find a clear answer to: what happens when cached data does not fit in memory?
In many places I found that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
For example: let's say 500 partitions are created and 200 of them could not be cached; then those 200 partitions have to be recomputed by re-evaluating the RDD.
If that is the case, then an OOM error should never occur, but it does. What is the reason?
A detailed explanation is highly appreciated. Thanks in advance.
There are different ways you can persist your dataframe in Spark.
1) Persist (MEMORY_ONLY)
When you persist a data frame with MEMORY_ONLY, it is cached in Spark's storage memory as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level and can sometimes cause OOM when the RDD is too big and cannot fit in memory (it can also occur after the recomputation effort).
To answer your question:
If that is the case then an OOM error should never occur, but it does. What is the reason?
Even after recomputation, those RDD partitions still need to fit in memory. If no space is available, the GC will try to free some and allocate again; if that does not succeed, it will fail with an OOM.
2) Persist (MEMORY_AND_DISK)
When you persist a data frame with MEMORY_AND_DISK, it is cached in Spark's storage memory as deserialized Java objects; if memory is not available in the heap, partitions are spilled to disk. To tackle memory issues, Spark will spill part or all of the data to disk. (Note: make sure the nodes have enough disk space, otherwise out-of-disk-space errors will pop up.)
3) Persist (MEMORY_ONLY_SER)
When you persist a data frame with MEMORY_ONLY_SER, it is cached in Spark's storage memory as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY, but it is more CPU-intensive because serialization (and optional compression) is involved (the general suggestion here is to use Kryo for serialization). It still faces OOM issues similar to MEMORY_ONLY.
4) Persist (MEMORY_AND_DISK_SER)
This is similar to MEMORY_ONLY_SER; the one difference is that when no heap space is available, the serialized partitions are spilled to disk, the same as with MEMORY_AND_DISK. You can use this option when you have a tight constraint on disk space and want to reduce IO traffic.
5) Persist (DISK_ONLY)
In this case, heap memory is not used for caching; the RDD's partitions are persisted to disk. Make sure there is enough disk space, and note that this option has a huge IO overhead. Don't use this for dataframes that are used repeatedly.
6) Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)
These are similar to the above-mentioned MEMORY_ONLY and MEMORY_AND_DISK; the only difference is that these options replicate each partition on two cluster nodes, just to be on the safe side. Use these options when you are running on spot instances.
7) Persist (OFF_HEAP)
Off-heap memory generally contains thread stacks, the Spark container application code, network IO buffers, and other OS application buffers. With this option you can also use that part of RAM for caching your RDD (a short sketch of selecting these levels in code follows below).
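For reference, here is a minimal sketch (Scala) of how the levels above are requested in code; the input path and the DataFrame name df are placeholders, and a SparkSession spark is assumed.

```scala
// Hypothetical example; `spark` is the SparkSession and the path is a placeholder.
import org.apache.spark.storage.StorageLevel

val df = spark.read.parquet("/path/to/data")

// Pick exactly one level per dataframe; cache() is shorthand for the default level.
df.persist(StorageLevel.MEMORY_ONLY)         // 1) deserialized objects, memory only
// df.persist(StorageLevel.MEMORY_AND_DISK)     2) spill partitions that don't fit to disk
// df.persist(StorageLevel.MEMORY_ONLY_SER)     3) serialized: smaller, but more CPU
// df.persist(StorageLevel.MEMORY_AND_DISK_SER) 4) serialized + spill to disk
// df.persist(StorageLevel.DISK_ONLY)           5) disk only
// df.persist(StorageLevel.MEMORY_AND_DISK_2)   6) replicated on two nodes
// df.persist(StorageLevel.OFF_HEAP)            7) needs spark.memory.offHeap.enabled/size

df.count()      // persist is lazy; the first action actually materialises the cache
df.unpersist()  // release the cached blocks when done
```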

what is driver memory and executor memory in spark? [duplicate]

This question already has answers here:
How to set Apache Spark Executor memory
(13 answers)
Closed 3 years ago.
I am new to the Spark framework and I would like to know: what are driver memory and executor memory? What is the effective way to get maximum performance from both of them?
Spark needs a driver to coordinate the executors. So the best way to understand it is:
Driver
The one responsible for running the main logic of your code, getting resources from YARN, handling the allocation, and processing a small amount of data for some types of logic. The driver memory is all about how much data you retrieve back to the driver to handle some logic. If you retrieve too much data with an rdd.collect(), your driver will run out of memory. The memory for the driver is usually small; 2GB to 4GB is more than enough if you don't send too much data to it.
Worker
Here is where the magic happens: the workers are the ones responsible for executing your job. The amount of memory they need depends on what you are going to do. If you are just going to run a map function that transforms the data with no aggregation, you usually don't need much memory. But if you are going to run big aggregations, many steps, and so on, you will usually need a good amount of memory. It is also related to the size of the files that you read.
Recommending a proper amount of memory for each case depends entirely on how your job works. You need to understand the impact of each function and monitor your jobs to tune memory usage. Maybe 2GB per worker is what you need, but sometimes 8GB per worker is what you need.
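As a concrete illustration, here is a minimal sketch (Scala) of where those two settings live; the application name and memory values are placeholders, not recommendations.

```scala
// Hypothetical configuration sketch for driver vs. executor memory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning-example")        // placeholder name
  // Executor memory: how much heap each executor (worker JVM) gets.
  .config("spark.executor.memory", "8g")
  // Driver memory: this only takes effect if applied before the driver JVM
  // starts, so in practice it is passed via spark-submit --driver-memory 4g
  // or spark-defaults.conf rather than set here.
  .config("spark.driver.memory", "4g")
  .getOrCreate()
```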

How to solve "job aborted due to stage failure" from "spark.akka.framesize"?

I have a Spark program which does a bunch of column operations and then calls .collect() to pull the results into memory.
I am receiving this problem when running the code:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 302987:27 was 139041896 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
The more full stack trace can be seen here: https://pastebin.com/tuP2cPPe
Now I'm wondering what I need to change to my code and/or configuration to solve this. I have a few ideas:
Increase spark.akka.frameSize, as suggested. I am a bit reluctant to do this because I do not know this parameter very well, and for other jobs I might prefer the default. Is there a way to specify this within an application? And can it be changed dynamically on the fly within the code, similar to the number of partitions?
Decrease the number of partitions before calling collect() on the table. I have a feeling that calling collect() when there are too many partitions is causing this to fail. It is putting too much stress on the driver when pulling all of these pieces into memory.
I do not understand the suggestion Consider...using broadcast variables for large values. How will this help? I still need to pull the results back to the driver whether I have a copy of the data on each executor or not.
Are there other ideas that I am missing? Thx.
I think that error is a little misleading. The error occurs because the result you are trying to download back to your driver is larger than Akka (the underlying networking library used by Spark) can fit in a message. Broadcast variables are used to efficiently SEND data to the worker nodes, which is the opposite direction from what you are trying to do.
Usually you don't want to do a collect when it is going to pull back a lot of data because you will lose any parallelism for the job by trying to download that result to one node. If you have too much data this could either take forever or potentially cause your job to fail. You can try increasing the Akka frame size until it is large enough that your job doesn't fail, but that will probably just break again in the future when your data grows.
A better solution would be to save the result to some distributed filesystem (HDFS, S3) using the RDD write APIs, as sketched below. Then you could either perform more distributed operations on it in follow-on jobs by reading it back in with Spark, or you could download the result directly from the distributed filesystem and do whatever you want with it.
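A rough sketch of that approach (Scala); the output path and the DataFrame name result are placeholders, and a SparkSession spark is assumed.

```scala
// Sketch of "write it out instead of collect()-ing it".
// `result` stands for the DataFrame produced by the column operations above.

// If you do decide to raise the limit instead: spark.akka.frameSize (in MB,
// default 128, which matches the 134217728 bytes in the error) must be set per
// application before the context is created, e.g.
//   spark-submit --conf spark.akka.frameSize=256 ...
// It cannot be changed on the fly, and it only exists on pre-2.0 Spark
// versions that still use Akka.

result.write
  .mode("overwrite")
  .parquet("hdfs:///tmp/job-output")   // placeholder path; an s3a:// path also works

// Follow-on jobs can read it back in a distributed way instead of collect():
val reloaded = spark.read.parquet("hdfs:///tmp/job-output")
```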

What happens if a Spark broadcast join is too large?

In doing Spark performance tuning, I've found (unsurprisingly) that doing broadcast joins eliminates shuffles and improves performance. I've been experimenting with broadcasting on larger joins, and I've been able to successfully use far larger broadcast joins than I expected -- e.g. broadcasting a 2GB compressed (and much larger uncompressed) dataset, running on a 60-node cluster with 30GB memory/node.
However, I have concerns about putting this into production, as the size of our data fluctuates, and I'm wondering what will happen if the broadcast becomes "too large". I'm imagining two scenarios:
A) Data is too big to fit in memory, so some of it gets written to disk, and performance degrades slightly. This would be okay. Or,
B) Data is too big to fit in memory, so it throws an OutOfMemoryError and crashes the whole application. Not so okay.
So my question is: What happens when a Spark broadcast join is too large?
Broadcast variables are plain local objects and, excluding distribution and serialization, they behave like any other object you use. If they don't fit into memory you'll get an OOM. Other than memory paging there is no magic that can prevent that.
So broadcasting is not applicable for objects that may not fit into memory (and they should also leave a lot of free memory for standard Spark operations).
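To keep the broadcast size in check in practice, there are two knobs worth knowing; the sketch below (Scala) uses placeholder table and column names and assumes a SparkSession spark.

```scala
import org.apache.spark.sql.functions.broadcast

// Cap the size of tables Spark will broadcast automatically (in bytes); set it
// to -1 to disable automatic broadcast joins when the small side can grow.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024)

val big   = spark.table("big_table")     // placeholder names
val small = spark.table("small_table")

// Explicit hint: Spark will try to broadcast `small` regardless of the
// threshold; if it does not fit in memory, the job fails with an OOM rather
// than falling back to a shuffle join.
val joined = big.join(broadcast(small), Seq("id"))   // "id" is a placeholder key
```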

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read in multiple places that using cache() on an RDD will cause it to be stored in memory, but so far I haven't found clear guidelines or rules of thumb on how to determine the maximum amount of data one could cram into memory. What happens if the amount of data that I am calling cache() on exceeds the memory? Will it cause my job to fail, or will it still complete with a noticeable impact on cluster performance?
Thanks!
As it is clearly stated in the official documentation with MEMORY_ONLY persistence (equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if data fits into memory it can be evicted if new data comes in. In practice caching is more of a hint than a contract: you cannot depend on caching taking place, but your job does not depend on it for correctness either.
Note:
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?
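A small illustration of that note (Scala): the default level really does differ between the RDD and Dataset APIs, and you can check it from the objects themselves. Names below are placeholders and a SparkSession spark is assumed.

```scala
// cache() is persist() with the API's default storage level.
val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.cache()
println(rdd.getStorageLevel)   // RDD default: MEMORY_ONLY (memory, deserialized)

val df = spark.range(1000000).toDF("id")
df.cache()
println(df.storageLevel)       // Dataset default: MEMORY_AND_DISK (disk + memory, deserialized)
```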

Resources