I was watching a video on apache spark here . Where the speaker Paco Nathan says the following
"If you have 128 GB of RAM, you are not going to throw them all at once at the jvm.That will just cause a lot of garbage collection. And so one of the things with spark is, use more sophisticated ways to leverage the memory space, do more off-heap."
I am not able to understand what he says with regard to how spark efficiently handles this scenario.
also more specifically i completely did not understand the statement
"If you have 128 GB of RAM you are not going to throw them all at once at the jvm.That will just cause of lot of garbage collection"
Can someone explain what the reasoning actually is behind these statements ?
"If you have 128 GB of RAM you are not going to throw them all at once
at the jvm.That will just cause of lot of garbage collection"
This means that you will not assign all the memory to the JVM only when there is memory requirement for other stuff like garbage collection, off-heap operations, etc.
Spark does this by assigning fractions of the memory(that you have assigned to Spark executors) for such operations as shown in image below(for Spark 1.5.0):
Related
I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and cores when I have a relatively smaller operation to do as if I give maximum resources, it is going to be underutilised .
For instance,
if I have to just do a merge job (read files from hdfs and write one single huge file back to hdfs using coalesce) for about 60-70 GB (assume each file is of 128 mb in size which is the block size of HDFS) of data(in avro format without compression), what would be the ideal memory, no of executor and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data combining it and writing it out, then you will need very little memory per cpu because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations all of those will require much ore memory. The exception to this rule is that spark isn't really tuned for large files and generally is much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.
This question already has answers here:
How to set Apache Spark Executor memory
(13 answers)
Closed 3 years ago.
I am new to spark framework and i would like to know what is driver memory and executor memory? what is the effective way to get the maximum performance from both of them?
Spark need a driver to handle the executors. So the best way to understand is:
Driver
The one responsible to handle the main logic of your code, get resources with yarn, handle the allocation and handle some small amount of data for some type of logic. The Driver Memory is all related to how much data you will retrieve to the master to handle some logic. If you retrieve too much data with a rdd.collect() your driver will run out of memory. The memory for the driver usually is small 2Gb to 4Gb is more than enough if you don't send too much data to it.
Worker
Here is where the magic happens, the worker will be the one responsible to execute your job. The amount of memory depends of what you are going to do. If you just going to do a map function where you just going to transform the data with no type of aggregation, you usually don't need much memory. But if you are going to run big aggregations, a lot of steps and etc. Usually you will use a good amount of memory. And it is related to the size of your files that you will read.
Tell you a proper amount of memory for each case all depends of how your job will work. You need to understand what is the impact of each function and monitor to tune your memory usage for each job. Maybe 2Gb per worker is what you need, but sometimes 8Gb per workers is what you need.
I am trying to join two large spark dataframes and keep running into this error:
Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This seems like a common issue among spark users, but I can't seem to find any solid descriptions of what spark.yarn.executor.memoryOverheard is. In some cases it sounds like it's a kind of memory buffer before YARN kills the container (e.g. 10GB was requested, but YARN won't kill the container until it uses 10.2GB). In other cases it sounds like it's being used to to do some kind of data accounting tasks that are completely separate from the analysis that I want to perform. My questions are:
What is the spark.yarn.executor.memoryOverhead being using for?
What is the benefit of increasing this kind of memory instead of
executor memory (or the number of executors)?
In general, are there things steps I can take to reduce my
spark.yarn.executor.memoryOverhead usage (e.g. particular
datastructures, limiting the width of the dataframes, using fewer executors with more memory, etc)?
Overhead options are nicely explained in the configuration document:
This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
This also includes user objects if you use one of the non-JVM guest languages (Python, R, etc...).
I want to make a simple query over approximately 10 mio rows.
I have 32GB RAM (20GB is free). And Cassandra is using so much memory, that the available RAM is used to a maximum, and the process is killed.
How can I optimize Cassandra? I have read about "Tuning Java resources" and changing the Java heap sizing, but I still have no solution.
Cassandra will use up as much memory as is available to it on the system. It's a greedy process and will use any available memory for caching, similar to the way the kernel page cache works. Don't worry if Cassandra is using all your hosts memory, it will just be in cache and will be released to other processes if necessary.
If your query is suffering from timeouts this will probably be from reading too much data from a single partition so that the query doesn't return in under read_request_timeout_in_ms. If this is the case you should look at making your partition sizes smaller.
Running spark job with 1 TB data with following configuration :
33G executor memory
40 executors
5 cores per executor
17 g memoryoverhead
What are the possible reasons for this Error?
Where did you get that warning from? Which particular logs? Your lucky you even get a warning :). Indeed 17g seems like enough, but then you do have 1TB of data. I've had to use more like 30g for less data than that.
The reason for the error is that yarn uses extra memory for the container that doesn't live in the memory space of the executor. I've noticed that more tasks (partitions) means much more memory used, and shuffles are generally heavier, other than that I've not seen any other correspondences to what I do. Something somehow is eating memory unnecessarily.
It seems the world is moving to Mesos, maybe it doesn't have this problem. Even better, just use Spark stand alone.
More info: http://www.wdong.org/wordpress/blog/2015/01/08/spark-on-yarn-where-have-all-my-memory-gone/. This link seems kinda dead (it's a deep dive into the way YARN gobbles memory). This link may work: http://m.blog.csdn.net/article/details?id=50387104. If not try googling "spark on yarn where have all my memory gone"
One possible issue is that your virtual memory is getting very large in proportion to your physical memory. You may want to set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml to see what happens. If the error stops, then that may be the issue.
I answered a similar question elsewhere and provided more information there: https://stackoverflow.com/a/42091255/3505110