Multiple Cassandra nodes go down - cassandra

We have a 12-node Cassandra cluster across 2 different datacenters. We are migrating data from a SQL database to Cassandra through a .NET application, and there is another .NET app that reads data from Cassandra. Recently we have been seeing one node or another go down (nodetool status shows DN and the service is stopped on it). Below is the output of nodetool status. We have to start the service again to get it working, but then it stops again.
https://ibb.co/4P1T453
Path to the log: https://pastebin.com/FeN6uDGv

So in looking through your pastebin, I see a few things that can be adjusted.
First I'm reasonably sure that this is your primary issue:
Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out,
especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
From GNU Error Codes:
Macro: int ENOMEM
“Cannot allocate memory.” The system cannot allocate more virtual
memory because its capacity is full.
-Xms12G, -Xmx12G, -Xmn3000M,
How much RAM is on your instance? From what I'm seeing, your node is dying from an OOM (out-of-memory) error. My guess is that you're allocating too much RAM to the heap, leaving not enough for the OS and page cache. In fact, I wouldn't allocate much more than 50%-60% of RAM to the heap.
For example, I mostly build instances on 16GB of RAM, and I've found that a 10GB max heap is about as high as you'd want to go on that.
-XX:+UseParNewGC, -XX:+UseConcMarkSweepGC
In fact, as you're using CMS GC, I wouldn't go higher than 8GB for max heap size.
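For example, a minimal sketch of what the heap settings could look like in Cassandra's jvm.options (assuming a 3.11-style jvm.options file; older setups put these in cassandra-env.sh instead):
-Xms8G
-Xmx8G
-Xmn3000M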
Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low,
recommended value: 1048575, you can change it with sysctl.
This means you haven't adjusted your limits.conf or sysctl.conf. Check through the guide (DSE 6.0 - Recommended Production Settings), but generally it's a good idea to add the following to these files:
/etc/security/limits.conf
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
/etc/sysctl.conf
vm.max_map_count = 1048575
Note: After adjusting sysctl.conf, you'll want to run a sudo sysctl -p or reboot.
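As a quick sanity check afterwards (run the ulimit commands from a login shell of the user that runs Cassandra, since limits.conf is applied via PAM at login):
sysctl vm.max_map_count
ulimit -l
ulimit -n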
Is swap disabled? : false,
You will want to disable swap. If Cassandra starts swapping contents of RAM to disk, things will get really slow. Run a swapoff -a and then edit /etc/fstab and remove any swap entries.
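For example (a rough sketch; review /etc/fstab by hand rather than scripting the edit):
sudo swapoff -a
# comment out or remove any swap entries in /etc/fstab, then confirm:
free -m     # the Swap row should show 0 total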
tl;dr Summary
Set your initial and max heap sizes to 8GB (heap new size is fine).
Modify your limits.conf and sysctl.conf files appropriately.
Disable swap.
It's also a good idea to get on the latest version of 3.11 (3.11.4).
Hope this helps!

Related

What is docker --kernel-memory

Good day. I know that Docker containers use the host's kernel (which is why containers are considered lightweight VMs); here is the source. However, after reading the Runtime Options part of the Docker documentation, I came across an option named --kernel-memory. The doc says:
The maximum amount of kernel memory the container can use.
I don't understand what it does. My guess is that every container allocates some memory in the host's kernel space. If so, what is the reason? Isn't it a vulnerability to let a user process allocate memory in kernel space?
The whole CPU/memory limitation mechanism is based on cgroups.
You can find all settings applied by docker run (whether via arguments or via defaults) under /sys/fs/cgroup/memory/docker/<container ID> for memory, or /sys/fs/cgroup/cpu/docker/<container ID> for CPU.
So for --kernel-memory:
Reading: cat memory.kmem.limit_in_bytes
Writing: echo 2167483648 | sudo tee memory.kmem.limit_in_bytes
There are also memory.kmem.usage_in_bytes and memory.kmem.max_usage_in_bytes, which (rather self-explanatorily) show the current usage and the highest usage seen so far.
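For example, to read those counters for a specific container (a sketch; my_container is a placeholder name, and the paths assume cgroup v1):
CID=$(docker inspect --format '{{.Id}}' my_container)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.max_usage_in_bytes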
CGroup docs about Kernel Memory
For the functionality itself, I recommend reading the kernel docs for cgroups v1 rather than the Docker docs:
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
With the Kernel memory extension, the Memory Controller is able to
limit the amount of kernel memory used by the system. Kernel memory is
fundamentally different than user memory, since it can't be swapped
out, which makes it possible to DoS the system by consuming too much
of this precious resource.
[..]
The memory used is
accumulated into memory.kmem.usage_in_bytes, or in a separate counter
when it makes sense. (currently only for tcp). The main "kmem" counter
is fed into the main counter, so kmem charges will also be visible
from the user counter.
Currently no soft limit is implemented for kernel memory. It is future
work to trigger slab reclaim when those limits are reached.
and
2.7.2 Common use cases
Because the "kmem" counter is fed to the main user counter, kernel
memory can never be limited completely independently of user memory.
Say "U" is the user limit, and "K" the kernel limit. There are three
possible ways limits can be set:
U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING: In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
Clumsy Attempt at a Conclusion
Given a running container with --memory="2g" --memory-swap="2g" --oom-kill-disable using
cat memory.kmem.max_usage_in_bytes
10747904
That is about 10 MB of kernel memory in a normal state. It would make sense to me to limit it, say to 20 MB of kernel memory; the container should then be killed or throttled to protect the host. But because, according to the docs, there is no way to reclaim that memory, and the OOM killer starts killing processes on the host even with plenty of free memory (see https://github.com/docker/for-linux/issues/1001), using it seems rather impractical to me.
The quoted option to set it >= memory.limit_in_bytes is not really helpful in that scenario either.
Deprecated
--kernel-memory is deprecated as of Docker v20.10, because upstream (i.e. the Linux kernel) came to the same realization.
What can we do then?
ULimit
The Docker API exposes HostConfig|Ulimit, which sets the same kinds of limits you would otherwise configure in /etc/security/limits.conf. For docker run the syntax is --ulimit <type>=<soft>:<hard>. Use cat /etc/security/limits.conf or man setrlimit to see the categories. You can try to protect your system from kernel-memory exhaustion, e.g. from a process that forks endlessly, with --ulimit nproc=500:500, but be careful: nproc is counted per user, not per container, so the containers' processes add up.
To prevent DoS (intentional or unintentional), I would suggest limiting at least nofile and nproc. Maybe someone can elaborate further.
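For instance (illustrative numbers only; my_image is a placeholder):
docker run -d --ulimit nofile=65536:65536 --ulimit nproc=500:500 my_image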
sysctl:
docker run --sysctl can change kernel variables for message queues, shared memory and networking, e.g. docker run --sysctl net.ipv4.tcp_max_orphans= for orphaned TCP connections, which defaults on my system to 131072; at roughly 64 kB of kernel memory each, that is about 8 GB gone on a malfunction or DoS. Maybe someone can elaborate further.
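As an illustration only (the value is made up, and it assumes your kernel exposes this sysctl per network namespace):
docker run -d --sysctl net.ipv4.tcp_max_orphans=4096 my_image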

Prioritize write cache over read cache on Linux

My PC (with 4 GB of RAM) runs several I/O-bound applications, and I want to avoid as many writes to my SSD as possible.
In /etc/sysctl.conf file I have set:
vm.dirty_background_ratio = 75
vm.dirty_ratio = 90
vm.dirty_expire_centisecs = 360000
vm.swappiness = 0
And in /etc/fstab I added the commit=3600 parameter.
According to the free command, my PC usually runs with about 1 GB of RAM used by applications and about 2500 MB available. So with my settings I should be able to write at least 1500-2000 MB of data without actually touching the disk.
I have done some tests with moderate writes (300 MB - 1000 MB), and using free and cat /proc/meminfo | grep Dirty I noticed that often, a short time after these writes (far less than the dirty_expire_centisecs time), the dirty bytes drop to a value close to 0.
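For reference, one simple way to watch this while testing:
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'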
I suspect that the subsequent read operations fill the page cache until the machine is near an OOM condition and is forced to flush the dirty pages, ignoring my sysctl.conf settings (correct me if my hypothesis is wrong).
So the question is: is it possible to disable only read caching (AFAIK it is not), or at least to change the page-cache replacement policy to give more priority to the write cache, so that the read cache cannot force a flush of pending writes (maybe by tweaking the kernel source)? I know I could easily solve this problem using tmpfs or a union filesystem like AUFS or OverlayFS, but for many reasons I would like to avoid them.
Sorry for my bad English, I hope you understand my question. Thank you.

How can I increase OpenFabrics memory limit for Torque jobs?

When I run an MPI job over InfiniBand, I get the following warning. We use the Torque resource manager.
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: host1
Registerable memory: 65536 MiB
Total memory: 196598 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
I've read the link in the warning message, and what I've done so far is:
Append options mlx4_core log_num_mtt=20 log_mtts_per_seg=4 to /etc/modprobe.d/mlx4_en.conf.
Make sure the following lines are present in /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
Append session required pam_limits.so to /etc/pam.d/sshd.
Make sure ulimit -c unlimited is uncommented in /etc/init.d/pbs_mom.
Can anyone help me to find out what I'm missing?
Your mlx4_core parameters allow for the registration of 2^20 * 2^4 * 4 KiB = 64 GiB only. With 192 GiB of physical memory per node, and given that it is recommended to have at least twice as much registerable memory, you should set log_num_mtt to 23, which would increase the limit to 512 GiB, the closest power of two greater than or equal to twice the amount of RAM. Be sure to reboot the node(s) or unload and then reload the kernel module.
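That would make the line in /etc/modprobe.d/mlx4_en.conf look like this (keeping your existing log_mtts_per_seg value):
options mlx4_core log_num_mtt=23 log_mtts_per_seg=4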
You should also submit a simple Torque job script that executes ulimit -l in order to verify the limits on locked memory and make sure there is no such limit. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory but rather the limit on the size of core dump files.
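A minimal sketch of such a check (the script name and PBS options are placeholders):
#!/bin/bash
#PBS -N check-memlock
#PBS -l nodes=1:ppn=1
ulimit -l
Submit it with qsub check-memlock.sh; the job's output file should contain "unlimited".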

What is the best practice for kdump disk size

We have Red Hat 6 servers with around 64 GB of memory, and we are planning to configure kdump. I am confused about the disk size I should set. Red Hat suggests memory + 2% (which means around ~66 GB of disk space). I need your suggestion on the best size to define for kdump.
First, don't enable kdump unless Red Hat support tells you to. Kdumps don't really produce anything useful for most Linux 'customers'.
Second, kdump could (potentially) dump the entire contents of RAM into the dump file. If you have 64 GB of RAM AND it is full when the kdump is triggered, then yes, the space for your kdump file will need to be what Red Hat suggested. That said, most problems can be identified from partial dumps. Red Hat support has even advised running 'head -c' on the file before sending it in to reduce its size, usually trimming it down to the first 64 MB.
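For example, something along these lines (the file name and size are placeholders; follow whatever Red Hat support actually requests):
head -c 64M vmcore > vmcore-first64M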
Finally, remember to disable kdump after you have finished troubleshooting the issue. This isn't something you want running constantly on any system above a 'Development/Test' level. Most importantly, remember to clean this space out after a kdump has occurred.
Unless the system has enough memory, the kdump crash recovery service will not be operational. For the information on minimum memory requirements, refer to the Required minimums section of the Red Hat Enterprise Linux Technology capabilities and limits comparison chart. When kdump is enabled, the minimum memory requirements increase by the amount of memory reserved for it. This value is determined by a user, and when the crashkernel=auto option is used, it defaults to 128 MB plus 64 MB for each TB of physical memory (that is, a total of 192 MB for a system with 1 TB of physical memory).
In Red Hat Enterprise Linux 6, the crashkernel=auto only reserves memory if the system has 4 GB of physical memory or more.
To configure the amount of memory that is reserved for the kdump kernel, as root, open the /boot/grub/grub.conf file in a text editor and add the crashkernel=<size>M (or crashkernel=auto) parameter to the list of kernel options.
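The kernel line would then look something like this (kernel version and root device are placeholders for your own system):
kernel /vmlinuz-2.6.32-<release> ro root=/dev/mapper/<vg>-<lv> crashkernel=auto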
I had the same question before.
The amount of memory reserved for the kdump kernel can be estimated with the following scheme:
base memory to be reserved = 128 MB
plus an additional 64 MB for each TB of physical RAM present in the system.
So, for example, if a system has 1 TB of memory, 192 MB (128 MB + 64 MB) will be reserved.
So I believe 128MB will be sufficient in your case.
You may want to have a look at this link Configuring crashkernel on RHEL6.2 (and later) kernels

Swappiness in JVM

I stumbled upon an interesting problem last week when I noticed that one of our production servers, which runs Apache HTTP Server in front of Tomcat, stopped, reporting an HTTP outage.
On investigating further, it seemed the issue was due to the JVM causing memory pages to be swapped out quickly. This resulted in the swap space filling up completely, causing a memory error the next time a page had to be moved to swap.
Investigating further, it seems that there is a JVM swappiness factor that is set to 60% by default on some of our Linux distributions. Based on some research, this could be a high value for a web service with high traffic. Our swap space was set to 2 GB.
The swap details are:
Filename Type Size Used Priority
/dev/sda3 partition 2096472 1261420 -1
From /proc/meminfo
SwapCached: 944668 kB
Our JVM properties are as follows:
-Xmx6g -Xms4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:NewSize=2g -XX:MaxNewSize=2g -XX:ParallelGCThreads=8
The server runs with 12 GB RAM.
Swappiness does not go well with the JVM's GC process, so I tried reducing swappiness to 0, but that did not change anything. We still see cases where the entire swap space is consumed, resulting in an OutOfMemory error.
How can I tune the JVM performance?
The swappiness factor is actually a system-wide setting in Linux, not JVM-specific.
My personal experience with some kinds of applications is that they need swap to be turned off in order to work without hiccups, period. While I can't say for sure if your application belongs to this category, I have seen apps being swapped out despite enough free RAM being available. As you pointed out, GC and swap don't mix well. Swappiness is just an indication to the OS, so its impact on how much gets swapped out may be larger or smaller depending on circumstances. My suggestion would be to try turning swap completely off.
In order to do this, you need to be root or have sudo access. Comment out the line describing swap in /etc/fstab, which will prevent swap from being turned on after a reboot. Then, in order to turn swap off for the current server run, run swapoff -a. This may take up to a few minutes if there's data in swap that needs to be pulled back into RAM. Then, check the last line of the output of free to ensure that the total size of available swap is 0. After this, observe your app in order to determine if turning swap off solved your issue.
If your physical RAM is 12 GB and your heap is set to only 6 GB (max), then you have plenty of RAM. Is this server dedicated to the Tomcat instance?
Take a look at the verbose GC log files to see what the memory usage is over time. Check your access log files to confirm whether the access requests have increased over time.
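If GC logging is not already enabled, flags along these lines will produce it on a Java 7/8 CMS setup (the log path is only an example):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log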
