Why swap needs to be turned off in Datastax Cassandra? - linux

I am new to Datastax cassandra. While going through the installation procedure of cassandra. It is recommended that the swap area of OS should be turned off. Does anyone provide the reason for that? Will it affect any OS level operations ?

In production, if your database is using swap you will have very bad performance. In a ring of Cassandra nodes, you are better off having one node go completely down than allowing it to limp along in swap.
The easiest way to ensure that you never go into swap is to simply disable it.

If you dont disable swap space, when there is an Out of Memory(when the address space is used up by cassandra mmap) problem at OS level, the OS will try to take a slice of the JVM which is in turn actually trying to clear it via JNI by default. Now , your JVM is slowed down as a slice of its heap memory is lost. Now GC will be happening along with the cassandra write operation with lesser heap memory. This brings the overall performance of the cassandra node down and gradually kills it one point of time where there is no more memory left at OS level.
Thats why they suggest you two things.
Bundle jna.jar. When there is a GC operation for mmap of cassandra, it is all done by java code and not JNI portion that is shipped by default in cassandra. So, it avoids a portion of address space that JNI tries to store in case of native operations.
Disable swap space . A low performing cassandra node will slow down all the operations at its client end. Even the replication to this node will be slowed down and hence you read/write will appear slower than you think. A node shall die and restart when such an Out of Memory occurs instead of taking a portion of JVM that slows down the entire process.

Related

Apache Yarn - Allocating more memory than the Physical memory or RAM

I was considering changing yarn.nodemanager.resource.memory-mb to a value higher than the RAM available on my machine. Doing a quick search revealed that not many people are doing this.
Many long lived applications on yarn, are bound to have a jvm heap space allocation in which some of their memory is more frequently used and some of it is rarely used. In this case, it would make perfect sense for such applications to have some of their infrequently used memory portions swapped to disk and reallocating the available physical memory to other applications that need it.
Given the above background, can someone either please corroborate my reasoning or offer an alternate perspective? Also, can you please also clarify how the parameter yarn.nodemanager.vmem-pmem-ratio would work in the above case?
This is not a good idea. Trying to use more memory than what is available will eventually crash your Node Manager hosts.
There already is a feature called opportunistic containers which uses spare memory not used by the NMs and adds more containers to those hosts. Refer to:
YARN-1011 [Umbrella] Schedule containers based on utilization of currently allocated containers
In addition, Pepperdata has a product that does almost the same thing if you can't wait for YARN-1011.
https://www.pepperdata.com/products/capacity-optimizer/
As for yarn.nodemanager.vmem-pmem-ratio, don't enable this as it's not recommended anymore.
YARN-782 vcores-pcores ratio functions differently from vmem-pmem ratio in misleading way

Behaviour of Mesos when using cgroups for Spark

I was wondering what the behaviour of Spark in fine-grained mode on Mesos would be, when cgroups are enabled.
One concern is the following: when I use Mesos+spark without cgroups, it already shows that the actual spark executor process uses at least 10% more memory, than what it promised to Mesos it would use. When enabling cgroups, would it kill the Spark-executors?
Second, how is file-cache handled? Spark relies heavily on file-cache. Is file-cache accounted to the amount of memory in Mesos? Probably not, but could we influence this? So for example, ideally I want Spark to use 8GB in total, of which 5GB should be used for the java process -- assuming that spark plays nicely and does not grow beyond 5GB -- and 3GB should be used as file-cache (max).
I hope someone has experience with this, because in order to test these things myself I would have to go through a lot of support requests from our cluster sysadmin, as cgroups rely on root credentials at one point - and I'd hate it to be in vain without having asked others.
To answer your first question, it seems you've got something mixed up with how cgroups work. The executor simply would not (,and it indeed does, as I can confirm) be able to allocate more memory than the cgroups would allow. So Mesos would not actually act as an process killer or anything*. But, some types of programs would indeed break on being unable to allocate more memory and it depends on the program if it then quits, or is able to run fine, but perhaps with less memory and/or performance.
For your second question, there don't seem to be any configuration settings in order to influence the actual cgroup memory amount. There seems to be a 1-to-1 mapping between the executor memory setting and what Spark gets from Mesos. However, I do think there is a hidden factor, because I can see Spark asks for roughly 5.8GB, but actually I set the executor memory to 5GB. (I'll update the ticket once I can find this hidden factor of probably 15% in the source code.)
Update, the setting you'd want is spark.mesos.executor.memoryOverhead. You can give a number in MegaBytes which is added to the executor memory as the total memory which will be used as Mesos resource, and thus as a cgroup memory limit.
*=Update2, actually cgroups by default does kill processes which grow beyond the control group's limit. I can confirm that the memory.oom_control in /cgroups/memory/x/ is set to '0' (which counter-intuitively is enabled). However in the case of Spark, it is the aformentioned 10-15% overhead which gives enough leeway to not encounter OOM.

How to shrink the page talbe size of a process?

mongodb server map all db files into RAM. Along with size of database becoming bigger, the server will has a huge page table which is up to 3G bytes.
Is there a way to shrink it when the server is running?
mongodb version is 2.0.4
Mongodb will memory-map all of the data files that it creates, plus the journal files (if you're using journaling). There is no way to prevent this from happening. This means that the virtual memory size of the MongoDB process will always be roughly twice the size of the data files.
Note that the OS memory management system will page out unused RAM pages, so that the physical memory size of the process will typically be much less than the virtual memory size.
The only way to reduce the virtual memory size of the 'mongod' process is to reduce the size of the MongoDB data files. The only way to reduce the size of the data files is to take the node offline and perform a 'repair'.
See here for more details:
- http://www.mongodb.org/display/DOCS/Excessive+Disk+Space#ExcessiveDiskSpace-RecoveringDeletedSpace
Basically you are asking to do something that the MongoDB manual recommends not to: http://docs.mongodb.org/manual/administration/ulimit/ in this specific scenario. Recommended however does not mean required and it is just a guideline really.
This is just the way MongoDB runs and something you have got to accept unless you wish to toy around and test out different scenarios and how they work.
You probably want to reduce the used memory of the process. You could use the ulimit bash builtin (before starting your server, perhaps in some /etc/rc.d/mongodb script) which calls the setrlimit(2) syscall

Solr uses too much memory

We have a Solr 3.4 instance running on Windows 2008 R2 with Oracle Java 6 Hotspot JDK that becomes unresponsive. When we looked at the machine, we noticed that the available physical memory went to zero.
The Tomcat7.exe process was using ~70Gigs (Private Working Set) but Working Set (Memory) was using all the memory on the system. There were no errors in the Tomcat / Solr logs. We used VMMap to identify that the memory was being used for memory mapping the Solr segement files.
Restarting Tomcat fixed the problem temporarily, but it eventually came back.
We then tried decreasing the JVM size to give more space for the memory mapped files, but then the Solr eventually becomes unresponsive with the old generation at 100%. Again resetting fixed the problem, but it did not throw an out-of-memory exception before we reset.
Currently our spidey sense is telling us that the cache doesn't shrink when there is memory pressure, and that maybe there are too many MappedByteBuffers hanging around so that the OS can not free up the memory from memory mapped files.
There are too many parameters and too little information to help with any details. This answer is also quite old as are the mentioned systems.
Here are some things that helped in my experience:
rather decrease RAM usage in Tomcat as well as in SOLR to decrease the risk of swapping. Leave the system space to breath.
if this "starts" to appear without any changes to Tomcat or the SOLR config - than maybe it is due to the fact that the amount of data that SOLR has to index and query has increased. This can either mean that the original config was never good to begin with or that the limit of the current resources have been reached and have to be reviewed. Which one is it?
check the queries (if you can influence them): move any subquery constructs that are often requested into filterqueries, move very individual request constructs into the regular query parameter. Decrease query cache, increase/keep filter query cache - or decrease filter cache in case filter queries aren't used that much in your system.
check the schema.xml of SOLR on configuration errors (could just be misconceptions really). I ran into this once: while importing fields would be created en masse causing the RAM to overflow.
if it happens during import: check whether the import process is set to autocommit and commits rather often and does also optimize - maybe it is possible to commit less often and optimize only once at the end.
upgrade Java, Tomcat and SOLR
It's probably due to the new default policy SOLR uses for directories (it tries to map them to RAM or something like that).
Read this:
http://grokbase.com/t/lucene/solr-user/11789qq3nm/virtual-memory-usage-increases-beyond-xmx-with-solr-3-3

unbuffered I/O in Linux

I'm writing lots and lots of data that will not be read again for weeks - as my program runs the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, the amount of memory my app uses does not increase - neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystems cache - since I do not intend to read this data for a long time I'm hoping to bypass the systems buffers, such that my data is written directly to disk. I dont have dreams of improving perf or being a super ninja, my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so dont spend time optimizing for those cases.
On Windows I've faced similar problems and fixed the problem using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH - the machines memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen but on Linux. On Windows there is the restriction of writing in sector sized pieces, I'm happy with this restriction for the amount of gain I've measured.
is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes at an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, trying to do research on this flag to confirm it's what you want I found this interesting piece telling you that unbuffered I/O is a bad idea, Linus describing it as "brain damaged". According to that you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that it is not mandatory but if you do not its performance will suck x1000 because every unaligned write will need a read first).
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea of fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write, but instead wait until you've done a reasonable amount (50M, 100M maybe).
So in short
after every (significant chunk) of writes,
Call fdatasync followed by posix_fadvise( ... POSIX_FADV_DONTNEED)
This will flush the data to disc and immediately remove them from the OS cache, leaving space for more important things.
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.

Resources