Right Spark executor memory size given certain data size - apache-spark

Most of the discussions I found online about resource allocation were about the maximum memory setting for --executor-memory, taking a few memory overheads into account.
But for a simple job, such as reading a 100MB file and counting its rows on a cluster with 500GB of total memory across nodes, I surely shouldn't ask for a number of executors and a memory allocation that, with all overheads accounted for, could take up all 500GB, right? Even 1 executor with 3GB or 5GB of memory seems like overkill. How should I think about the right memory size for a job?
Thank you!

Related

How to tune Spark to avoid sort disk spill?

We have an algorithm that currently processes data partition by partition using foreachPartition. I realize this might not be the best way to process data in Spark but, in theory, we should be able to make it work.
We have an issue with Spark spilling data after a sortWithinPartitions call with approximately 45 GB of data in a partition. Our executor has 250 GB of memory defined. In theory, there is enough space to fit the data in memory (unless Spark's overhead for sorting is huge). Yet we experience spills. Is there a way to accurately calculate how much memory per executor we'd need to make it work?
You need to either decrease the number of tasks per executor or increase memory.
The first option is simply to decrease spark.executor.cores.
The second option, obviously, is to increase spark.executor.memory.
The third option is to increase spark.task.cpus, because the number of concurrent tasks per executor is spark.executor.cores / spark.task.cpus. For example, if you set spark.executor.cores=4 and spark.task.cpus=2, the number of concurrent tasks is 4/2=2. If you don't set spark.task.cpus, it defaults to 1, which gives 4/1=4 tasks, and those would consume much more memory.
I prefer the third option because it keeps a balance between occupied memory and cores and allows a task to use more than one core.
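As a rough illustration of the arithmetic in this answer, here is a minimal Python sketch. The function names are hypothetical, and the even split of executor memory across concurrent tasks is a simplifying assumption: Spark does not enforce a fixed per-task share, so treat the second function as a back-of-the-envelope estimate only.

```python
def concurrent_tasks(executor_cores: int, task_cpus: int = 1) -> int:
    """Tasks one executor runs at once: spark.executor.cores // spark.task.cpus."""
    if executor_cores < 1 or task_cpus < 1 or task_cpus > executor_cores:
        raise ValueError("need 1 <= task_cpus <= executor_cores")
    return executor_cores // task_cpus


def memory_per_task_gb(executor_memory_gb: float,
                       executor_cores: int,
                       task_cpus: int = 1) -> float:
    """Assumption: executor memory is shared evenly between concurrent tasks."""
    return executor_memory_gb / concurrent_tasks(executor_cores, task_cpus)


if __name__ == "__main__":
    # The example from the answer: 4 cores, 2 CPUs per task -> 2 tasks at once.
    print(concurrent_tasks(4, 2))
    # Default spark.task.cpus=1 -> 4 tasks at once.
    print(concurrent_tasks(4))
    # Hypothetical 16 GB executor: each task gets roughly twice as much room
    # when spark.task.cpus is raised from 1 to 2.
    print(memory_per_task_gb(16, 4, 1), memory_per_task_gb(16, 4, 2))
```

Raising spark.task.cpus halves the concurrency in this example, which is why each remaining task sees about twice the memory.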

If I set row_cache_size_in_mb to 5 GB in cassandra.yaml, does Cassandra reserve 5 GB of heap memory?

I am running a Cassandra cluster with 32 GB of memory on each node
and a row cache capacity (row_cache_size_in_mb) of 5 GB.
I just want to know: is 5 GB of RAM reserved from my heap for row caching?
It will let the cache grow to that size over time. You can use nodetool info to see the current size and limit, and nodetool setcachecapacity to change it at runtime. Note that it's kind of an estimate, though, and the heap can grow a bit larger. I would be sure to test that the row cache is actually improving things, since in a lot of cases having no row cache can be faster.

Any tips for scaling Spark horizontally

Does anybody have any tips when moving Spark execution from a few large nodes to many, smaller nodes?
I am running a system with 4 executors, each with 24 GB of RAM and 12 cores. If I try to scale that out to 12 executors with 4 cores and 8 GB of RAM each (same total RAM, same total cores, just distributed differently), I run into out-of-memory errors:
Container killed by YARN for exceeding memory limits. 8.8 GB of 8.8 GB physical memory used.
I have increased the number of partitions by a factor of 3 to create more (yet smaller) partitions, but this didn't help.
Does anybody have any tips and tricks for scaling Spark horizontally?
This is a pretty broad question. Executor sizing in Spark is a very complicated kind of black magic: the rules of thumb that were correct in 2015, for example, are obsolete now, and whatever I say here will be obsolete in 6 months with the next release of Spark. A lot comes down to exactly what you are doing and to avoiding key skew in your data.
This is a good place to start to learn and develop your own understanding:
https://spark.apache.org/docs/latest/tuning.html
There are also a multitude of presentations on Slideshare about tuning Spark; try to read or watch the most recent ones. Be sceptical of anything older than 18 months, and just ignore anything older than 2 years.
I will make the assumption that you are using at least Spark 2.x.
The error you're encountering is indeed because of poor executor sizing. What is happening is that your executors are attempting to do too much at once, and running themselves into the ground as they run out of memory.
All other things being equal, these are the current rules of thumb as I apply them:
The short version
3 - 4 virtual (hyperthreaded) cores and 29GB of RAM is a reasonable default executor size (I will explain why later). If you know nothing else, partition your data well and use that.
You should normally aim for a data partition size (in memory) on the order of ~100MB to ~3GB.
The formulae I apply
Executor memory = number of executor cores * partition size * 1.3 (safety factor)
Partition size = size on disk of data / number of partitions * deserialisation ratio
The deserialisation ratio is the ratio of the size of the data in memory to its size on disk. The Java memory representation of the same data tends to be a decent bit larger than on disk.
You also need to account for whether your data is compressed; many common formats like Parquet and ORC use compression such as gzip or snappy.
For snappy compressed text data (very easily compressed), I use ~10X - 100X.
For snappy compressed data with a mix of text, floats, dates etc I see between 3X and 15X typically.
number of executor cores = 3 to 4
The number of executor cores depends entirely on how compute-intensive versus memory-intensive your calculation is. Experiment and see what is best for your use case. I have never seen anyone informed on Spark advocate more than 6 cores.
Spark is smart enough to take advantage of data locality, so the larger your executor, the better the chance that your data is PROCESS_LOCAL.
More data locality is good, up to a point.
When a JVM gets too large (> 50GB), it begins to operate outside what it was originally designed to do, and depending on your garbage collection algorithm, you may begin to see degraded performance and high GC time.
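The two formulae above can be put into a short Python sketch. The function names and the sample inputs (100 GB on disk, 400 partitions, a ratio of 4) are made up for illustration; only the formulae and the 1.3 safety factor come from the answer.

```python
def partition_size_gb(size_on_disk_gb: float,
                      num_partitions: int,
                      deser_ratio: float) -> float:
    """Partition size = size on disk / number of partitions * deserialisation ratio."""
    return size_on_disk_gb / num_partitions * deser_ratio


def executor_memory_gb(executor_cores: int,
                       partition_gb: float,
                       safety_factor: float = 1.3) -> float:
    """Executor memory = executor cores * partition size * safety factor."""
    return executor_cores * partition_gb * safety_factor


if __name__ == "__main__":
    # Hypothetical input: 100 GB of compressed data, 400 partitions,
    # and a deserialisation ratio of 4 (low end of the mixed-data range).
    part = partition_size_gb(100, 400, 4)   # 1.0 GB in memory per partition
    print(part)
    # 4 cores each holding one such partition, plus the 30% safety margin.
    print(executor_memory_gb(4, part))
```

With these assumed numbers a 4-core executor would want a little over 5 GB; the deserialisation ratio dominates the result, so it is worth measuring on a sample of the real data rather than guessing.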
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
There also happens to be a performance trick in Java: if your JVM is smaller than 32GB, it can use 32-bit compressed pointers rather than 64-bit pointers, which saves space and reduces cache pressure.
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
It also so happens that YARN adds 7% or 384MB of RAM (whichever is larger) to your executor size as overhead / safety factor, which is where the 29GB rule of thumb comes from: 29GB + 7% ~= 31GB, which stays just under 32GB.
You mentioned that you are using 12-core, 24GB RAM executors. This sends up a red flag for me.
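The overhead rule can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the 7% fraction and 384MB floor are the figures given in this answer, and the 10% fraction in the last call is my assumption, chosen because it reproduces the 8.8 GB limit shown in the error message in the question.

```python
def yarn_container_mb(executor_memory_mb: float,
                      overhead_fraction: float = 0.07,
                      min_overhead_mb: float = 384) -> float:
    """Executor memory plus overhead, where overhead is the larger of a
    fixed floor and a fraction of the executor memory."""
    overhead = max(min_overhead_mb, executor_memory_mb * overhead_fraction)
    return executor_memory_mb + overhead


if __name__ == "__main__":
    # 29 GB executor with 7% overhead: about 31 GB, below the 32 GB mark.
    print(yarn_container_mb(29 * 1024) / 1024)
    # Small executors hit the 384 MB floor instead of the percentage.
    print(yarn_container_mb(500))
    # Assumed 10% overhead on an 8 GB executor gives the 8.8 GB limit
    # that YARN reported in the question.
    print(yarn_container_mb(8 * 1024, overhead_fraction=0.10) / 1024)
```

The actual fraction depends on the Spark version and on any explicit memory overhead setting, so it is worth confirming against the configuration of the cluster in question.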
Why?
Because every "core" in an executor is assigned one "task" at a time. A task is equivalent to the work required to compute the transformation of one partition from "stage" A to "stage" B.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-taskscheduler-tasks.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-DAGScheduler-Stage.html
If your executor has 12 cores, then it is going to try to do 12 tasks simultaneously with a 24GB memory budget. 24GB / 12 cores = 2GB per core. If your partitions are larger than 2GB, you will get an out-of-memory error. If the particular transformation doubles the size of the input (even intermediately), then you need to account for that as well.
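The per-core budget argument can be written as a small check. The names below are hypothetical and the even split per core is the same simplification the answer uses; the expansion factor stands for a transformation that temporarily grows its input.

```python
def per_task_budget_gb(executor_memory_gb: float, executor_cores: int) -> float:
    """Memory available to each concurrently running task, split evenly."""
    return executor_memory_gb / executor_cores


def partition_fits(partition_gb: float,
                   executor_memory_gb: float,
                   executor_cores: int,
                   expansion: float = 1.0) -> bool:
    """True if one partition, grown by `expansion`, fits in the per-task budget."""
    budget = per_task_budget_gb(executor_memory_gb, executor_cores)
    return partition_gb * expansion <= budget


if __name__ == "__main__":
    # Both layouts from the question give the same budget per task.
    print(per_task_budget_gb(24, 12), per_task_budget_gb(8, 4))
    # A 1.5 GB partition fits on its own...
    print(partition_fits(1.5, 24, 12))
    # ...but not if the transformation doubles it along the way.
    print(partition_fits(1.5, 24, 12, expansion=2.0))
```

Since 24GB / 12 cores and 8GB / 4 cores both come to 2GB per task, the per-core budget alone does not distinguish the two layouts in the question.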

The actual executor memory does not match the executor-memory I set

I have a Spark 2.0.1 cluster with 1 master (slaver1) and 2 workers (slaver2, slaver3); every machine has 2GB of RAM. When I run the command
./bin/spark-shell --master spark://slaver1:7077 --executor-memory 500m
and check the executor memory in the web UI (slaver1:4040/executors/), I find it is 110MB.
The memory you are looking at is storage memory. Spark divides its memory (called Spark memory) into 2 regions: storage memory and execution memory.
The total can be calculated with this formula:
(“Java Heap” - “Reserved Memory”) * spark.memory.fraction
To give you an overview: storage memory is the pool used both for storing Apache Spark cached data and as temporary space for “unrolling” serialized data. All “broadcast” variables are also stored there as cached blocks.
To check the total memory provided, go to the Spark master UI at Spark-Master-Ip:8080 (the default port); near the top you will find a field called Memory, which shows the total memory used by Spark.
Thanks
From Spark 1.6 onwards, memory is divided as follows.
There is no hard boundary between execution and storage memory. If storage needs more memory, it takes it from execution memory, and vice versa.
Execution and storage memory together are given by (ExecutorMemory - 300MB) * spark.memory.fraction
In your case (500 - 300) * 0.75 = 150MB. There will be a 3 to 5% error in the executor memory that is actually allocated.
300MB is the reserved memory.
User memory = (ExecutorMemory - 300MB) * (1 - spark.memory.fraction)
In your case (500 - 300) * 0.25 = 50MB
Java memory: Runtime.getRuntime().maxMemory()
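The formulas in this answer are easy to check numerically. The sketch below uses hypothetical function names. The 0.75 fraction is the one used in the answer; the 0.6 fraction is my addition, on the understanding that Spark 2.x lowered the default of spark.memory.fraction to 0.6, which would apply to the 2.0.1 cluster in the question.

```python
RESERVED_MB = 300  # reserved memory, as stated in the answer


def spark_memory_mb(heap_mb: float, memory_fraction: float) -> float:
    """Execution + storage memory: (heap - reserved) * spark.memory.fraction."""
    return (heap_mb - RESERVED_MB) * memory_fraction


def user_memory_mb(heap_mb: float, memory_fraction: float) -> float:
    """User memory: (heap - reserved) * (1 - spark.memory.fraction)."""
    return (heap_mb - RESERVED_MB) * (1 - memory_fraction)


if __name__ == "__main__":
    # Numbers from the answer: 500 MB heap, fraction 0.75.
    print(spark_memory_mb(500, 0.75), user_memory_mb(500, 0.75))
    # Assumed Spark 2.x default fraction of 0.6 for the same heap.
    print(spark_memory_mb(500, 0.6), user_memory_mb(500, 0.6))
```

With a fraction of 0.6 the result is about 120MB, which is closer to the 110MB seen in the UI. The remaining gap is plausible because the heap figure to plug in is what Runtime.getRuntime().maxMemory() reports, and that is typically somewhat below the configured 500MB.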

Spark: Memory Usage

I am measuring memory usage for an application (WordCount) in Spark with ps -p WorkerPID -o rss. However the results don’t make any sense. Because for every amount of data (1MB, 10MB, 100MB, 1GB, 10GB) there is the same amount of memory used. For 1GB and 10GB data the result of the measurement is even less than 1GB. Is Worker the wrong process for measuring memory usage? Which process of the Spark Process Model is responsible for memory allocation?
Contrary to popular belief, Spark doesn't have to load all data into main memory. Moreover, WordCount is a trivial application, and the amount of memory it requires depends only marginally on the input:
The amount of data loaded per partition with SparkContext.textFile depends on configuration, not on input size (see for example: Why does partition parameter of SparkContext.textFile not take effect?).
The size of the key-value pairs is roughly constant with typical input.
Intermediate data can be spilled to disk if needed.
Last but not least, the amount of memory used by executors is capped by configuration.
Keeping all of that in mind, behavior different from what you see would be troubling at best.
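To make the point about roughly constant key-value size concrete, here is a small sketch in plain Python, not Spark, so it only illustrates the principle. A streaming word count keeps one counter per distinct word, so its state grows with the vocabulary rather than with the number of input lines. The helper names and the sample sentence are made up for the example.

```python
from collections import Counter
from typing import Iterable, Iterator


def word_count(lines: Iterable[str]) -> Counter:
    """Count words one line at a time; only the counter is kept in memory."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def repeated_lines(n: int) -> Iterator[str]:
    """Generate n copies of a sample line without materialising them."""
    for _ in range(n):
        yield "to be or not to be"


if __name__ == "__main__":
    small = word_count(repeated_lines(10))
    large = word_count(repeated_lines(10_000))
    # 1000x more input, same number of distinct keys held in memory.
    print(len(small), len(large))
    print(large["to"])
```

The input grows a thousandfold between the two runs, yet the counter holds the same four keys in both.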
