Behaviour of Mesos when using cgroups for Spark - apache-spark

I was wondering what the behaviour of Spark in fine-grained mode on Mesos would be, when cgroups are enabled.
One concern is the following: when I use Mesos+spark without cgroups, it already shows that the actual spark executor process uses at least 10% more memory, than what it promised to Mesos it would use. When enabling cgroups, would it kill the Spark-executors?
Second, how is file-cache handled? Spark relies heavily on file-cache. Is file-cache accounted to the amount of memory in Mesos? Probably not, but could we influence this? So for example, ideally I want Spark to use 8GB in total, of which 5GB should be used for the java process -- assuming that spark plays nicely and does not grow beyond 5GB -- and 3GB should be used as file-cache (max).
I hope someone has experience with this, because in order to test these things myself I would have to go through a lot of support requests from our cluster sysadmin, as cgroups rely on root credentials at one point - and I'd hate it to be in vain without having asked others.

To answer your first question, it seems you've got something mixed up with how cgroups work. The executor simply would not (,and it indeed does, as I can confirm) be able to allocate more memory than the cgroups would allow. So Mesos would not actually act as an process killer or anything*. But, some types of programs would indeed break on being unable to allocate more memory and it depends on the program if it then quits, or is able to run fine, but perhaps with less memory and/or performance.
For your second question, there don't seem to be any configuration settings in order to influence the actual cgroup memory amount. There seems to be a 1-to-1 mapping between the executor memory setting and what Spark gets from Mesos. However, I do think there is a hidden factor, because I can see Spark asks for roughly 5.8GB, but actually I set the executor memory to 5GB. (I'll update the ticket once I can find this hidden factor of probably 15% in the source code.)
Update, the setting you'd want is spark.mesos.executor.memoryOverhead. You can give a number in MegaBytes which is added to the executor memory as the total memory which will be used as Mesos resource, and thus as a cgroup memory limit.
*=Update2, actually cgroups by default does kill processes which grow beyond the control group's limit. I can confirm that the memory.oom_control in /cgroups/memory/x/ is set to '0' (which counter-intuitively is enabled). However in the case of Spark, it is the aformentioned 10-15% overhead which gives enough leeway to not encounter OOM.

Related

Slurm memory resource management

Despite a thorough read of https://slurm.schedmd.com/slurm.conf.html, there is several things I don't understand regarding how Slurm manages the memory resource. My slurm.conf contains
DefMemPerCPU=1000
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=orion Gres=gpu:RTX2080Ti:4 MemSpecLimit=2048 RealMemory=95232
When not specifying --mem, jobs are launched with MIN_MEMORY=0 and don't seem to fail when allocating memory. What is the maximum memory job can use and how to display it?
When specifying --mem=0, jobs are pending waiting for resources. How can this be?
The value provided in DefMemPerCPU=1000 doesn't seem to have an effect. Is it related to SelectTypeParameters=CR_Core_Memory? If so, what is the equivalent for CPU cores?
Ultimately, what should be the configuration for having a default memory limit?

Apache Yarn - Allocating more memory than the Physical memory or RAM

I was considering changing yarn.nodemanager.resource.memory-mb to a value higher than the RAM available on my machine. Doing a quick search revealed that not many people are doing this.
Many long lived applications on yarn, are bound to have a jvm heap space allocation in which some of their memory is more frequently used and some of it is rarely used. In this case, it would make perfect sense for such applications to have some of their infrequently used memory portions swapped to disk and reallocating the available physical memory to other applications that need it.
Given the above background, can someone either please corroborate my reasoning or offer an alternate perspective? Also, can you please also clarify how the parameter yarn.nodemanager.vmem-pmem-ratio would work in the above case?
This is not a good idea. Trying to use more memory than what is available will eventually crash your Node Manager hosts.
There already is a feature called opportunistic containers which uses spare memory not used by the NMs and adds more containers to those hosts. Refer to:
YARN-1011 [Umbrella] Schedule containers based on utilization of currently allocated containers
In addition, Pepperdata has a product that does almost the same thing if you can't wait for YARN-1011.
https://www.pepperdata.com/products/capacity-optimizer/
As for yarn.nodemanager.vmem-pmem-ratio, don't enable this as it's not recommended anymore.
YARN-782 vcores-pcores ratio functions differently from vmem-pmem ratio in misleading way

How to determine Node --max_old_space_size for a given memory limit?

We run Node processes inside Docker containers with hard memory caps of 1GB, 2GB, or 4GB. Each container generally just runs a single Node process (plus maybe a tiny shell script wrapper). Let's assume for the purposes of this question that the Node process never forks more processes.
For our larger containers, if we don't set --max_old_space_size ourselves, then in the version of Node we use (on a 64-bit machine) it defaults to 1400MB. (This will change to 2048MB in a later version of Node.)
Ideally we want our Node process to use as much of the container as possible without going over and running out of memory. The question is — what number should we use? My understanding is that this particular flag tunes the size of one of the largest pools of memory used by Node, but it's not the only pool — eg, there's a "non-old" part of the heap, there's stack, etc. How much should I subtract from the container's size when setting this flag in order to stay away from the cgroup memory limit but still make maximal use of the amount of memory allowed in this container?
I do note that from the same place where kMaxOldSpaceSizeHugeMemoryDevice is defined, it looks like the default "max semi space" is 16MB and the default "max executable size" is 512MB. So I suspect this means I should subtract at least 528 from the container's memory limit when determining the value for this flag. But surely there are other ways that Node uses memory?
(To be more specific, we are a hosting service that sells containers of particular sizes to our users, most of which use them for Node processes. We'd like to be able to advise our customers as to what flag to set so that they neither are killed by our limits nor pay us for capacity that Node's configuration doesn't let them actually use.)
There is, unfortunately, no particularly satisfactory answer to this question.
The constants you've found control the size of the garbage-collected heap, but as you've already guessed, there are many ways to consume memory that's not part of that heap:
For example, big strings and big TypedArrays are typically managed by the embedder (i.e. node and its modules, not V8 itself), and outside the GC'ed heap.
Node modules, in general, can consume whatever memory they want. Presumably you don't want to restrict what modules your customers can run, but that implies that you also can't predict how much memory those modules are going to require.
V8 also uses temporary memory outside the GC'ed heap for parsing and compilation. Numbers depend on the code that's being run, from a few kilobytes up to a gigabyte or more (e.g. for huge asm.js codebases) anything is possible. These are relatively short-lived memory consumption peaks, so on the one hand you probably don't want to limit long-lived heap memory to account for them, but on the other hand that means they can make your processes run into the system limit.

Why swap needs to be turned off in Datastax Cassandra?

I am new to Datastax cassandra. While going through the installation procedure of cassandra. It is recommended that the swap area of OS should be turned off. Does anyone provide the reason for that? Will it affect any OS level operations ?
In production, if your database is using swap you will have very bad performance. In a ring of Cassandra nodes, you are better off having one node go completely down than allowing it to limp along in swap.
The easiest way to ensure that you never go into swap is to simply disable it.
If you dont disable swap space, when there is an Out of Memory(when the address space is used up by cassandra mmap) problem at OS level, the OS will try to take a slice of the JVM which is in turn actually trying to clear it via JNI by default. Now , your JVM is slowed down as a slice of its heap memory is lost. Now GC will be happening along with the cassandra write operation with lesser heap memory. This brings the overall performance of the cassandra node down and gradually kills it one point of time where there is no more memory left at OS level.
Thats why they suggest you two things.
Bundle jna.jar. When there is a GC operation for mmap of cassandra, it is all done by java code and not JNI portion that is shipped by default in cassandra. So, it avoids a portion of address space that JNI tries to store in case of native operations.
Disable swap space . A low performing cassandra node will slow down all the operations at its client end. Even the replication to this node will be slowed down and hence you read/write will appear slower than you think. A node shall die and restart when such an Out of Memory occurs instead of taking a portion of JVM that slows down the entire process.

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

Resources