PostgreSQL out of memory: Linux OOM killer

I am having issues with a large query, which I suspect is due to wrong settings in my postgresql.conf. My setup is PostgreSQL 9.6 on Ubuntu 17.10 with 32 GB RAM and a 3 TB HDD. The query runs pgr_dijkstraCost to create an OD matrix of ~10,000 points in a network of 25,000 links. The resulting table is therefore expected to be very big (~100,000,000 rows with columns from, to, cost). However, a simple test such as
select x, 1 as c2, 2 as c3 from generate_series(1,90000000)
succeeds.
The query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
 Function Scan on pgr_dijkstracost  (cost=393.90..403.90 rows=1000 width=24)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=196.82..196.83 rows=1 width=32)
           ->  Seq Scan on building_nodes b  (cost=0.00..166.85 rows=11985 width=4)
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=196.82..196.83 rows=1 width=32)
           ->  Seq Scan on building_nodes b_1  (cost=0.00..166.85 rows=11985 width=4)
This leads to a crash of PostgreSQL:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
Running dmesg, I could trace it down to an out-of-memory issue:
Out of memory: Kill process 5630 (postgres) score 949 or sacrifice child
[ 5322.821084] Killed process 5630 (postgres) total-vm:36365660kB,anon-rss:32344260kB, file-rss:0kB, shmem-rss:0kB
[ 5323.615761] oom_reaper: reaped process 5630 (postgres), now anon-rss:0kB,file-rss:0kB, shmem-rss:0kB
[11741.155949] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[11741.155953] postgres cpuset=/ mems_allowed=0
While running the query I can also observe with top that my free RAM goes down to 0 before the crash. The amount of committed memory just before the crash:
$ grep Commit /proc/meminfo
CommitLimit: 18574304 kB
Committed_AS: 42114856 kB
I would expect the HDD to be used to write/buffer temporary data when RAM is not enough, but the available space on my HDD does not change during processing. So I began to dig for missing settings (expecting issues due to my relocated data directory), following different sites:
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
https://www.credativ.com/credativ-blog/2010/03/postgresql-and-linux-memory-management
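For reference, the overcommit behaviour those pages describe can be inspected and adjusted with sysctl; a minimal sketch (the values are only examples, tune them to your RAM and swap):
# Show the current overcommit policy (0 = heuristic, 2 = strict accounting)
sysctl vm.overcommit_memory vm.overcommit_ratio
# Strict accounting, as discussed on the PostgreSQL kernel-resources page;
# with overcommit_memory=2: CommitLimit = swap + RAM * overcommit_ratio / 100
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=90   # example value only, tune to your RAM/swap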
My original postgresql.conf settings are the defaults, except for the changed data directory:
data_directory = '/hdd_data/postgresql/9.6/main'
shared_buffers = 128MB # min 128kB
#huge_pages = try # on, off, or try
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
dynamic_shared_memory_type = posix # the default is the first option
I changed the config:
shared_buffers = 128MB
work_mem = 40MB # min 64kB
maintenance_work_mem = 64MB
I reloaded with sudo service postgresql reload and tested the same query, but found no change in behavior. Does this simply mean that such a large query cannot be done? Any help is appreciated.
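To verify that the reload actually applied the new values (some settings, such as shared_buffers, only take effect after a full restart), a quick check from psql, assuming local access as the postgres user:
sudo -u postgres psql -c "SHOW work_mem;"
sudo -u postgres psql -c "SHOW shared_buffers;"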

I'm having similar trouble, though not with PostgreSQL (which is running happily): what is happening is simply that the kernel cannot allocate more RAM to the process, whichever process that is.
It would certainly help to add some swap to your configuration.
To check how much RAM and swap you have, run: free -h
On my machine, here is what it returns:
total used free shared buff/cache available
Mem: 7.7Gi 5.3Gi 928Mi 865Mi 1.5Gi 1.3Gi
Swap: 9.4Gi 7.1Gi 2.2Gi
You can clearly see that my machine is quite overloaded: about 8 GB of RAM and 9 GB of swap, of which 7 GB are used.
When a RAM-hungry process got killed after an out-of-memory condition, I saw both RAM and swap at 100% usage.
So, allocating more swap may well help in your case.
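A minimal sketch of adding a swap file (the size is only an example; adjust it to your disk and workload):
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots
free -h                                                      # verify the new swap shows up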

Related

What is the best way to calculate the real and combined memory usage of forked processes?

I am working on a per-process memory monitoring (Bash) script, but it turns out to be more of a headache than I thought, especially for forked processes such as PostgreSQL. There are a couple of reasons:
RSS is a potential value to use as memory usage; however, it also includes shared libraries etc. that are used by other processes
PSS is another potential value which (should) show only the private memory of a process. The problem here is that PSS can only be retrieved from /proc/<pid>/smaps, which requires elevated capability privileges (or root)
USS (calculated as Private_Dirty + Private_Clean, source: How does smem calculate RSS, USS and PSS?) could also be a candidate, but here again we need access to /proc/<pid>/smaps
For now I am trying to solve the forked process problem by looping through each PID's smaps (as suggested in https://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/), for example:
for pid in $(pgrep -a -f "postgres" | awk '{print $1}' | tr "\n" " " ); do grep "^Pss:" /proc/$pid/smaps; done
Maybe some of the postgres processes should be excluded, I am not sure.
Using this method to calculate and sum the PSS and USS values results in:
PSS: 4817 MB - USS: 4547 MB - RES: 6176 MB - VIRT: 26851 MB used
Obviously this only works with elevated privileges, which I would prefer to avoid. Whether these values actually represent the truth is unknown, because other tools/commands show yet again different values.
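A hedged sketch of that summation over the postgres processes (it still needs root, since it reads /proc/<pid>/smaps, and the pgrep pattern is only an example):
total_pss=0; total_uss=0
for pid in $(pgrep -x postgres); do
    pss=$(awk '/^Pss:/ {sum += $2} END {print sum+0}' /proc/$pid/smaps)
    uss=$(awk '/^Private_(Clean|Dirty):/ {sum += $2} END {print sum+0}' /proc/$pid/smaps)
    total_pss=$((total_pss + pss))
    total_uss=$((total_uss + uss))
done
echo "PSS: $((total_pss / 1024)) MB - USS: $((total_uss / 1024)) MB"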
Unfortunately top and htop are unable to combine the postgres processes. atop is able to do this and seems (by gut feeling) to be the most accurate, with the following values:
NPROCS SYSCPU USRCPU VSIZE RSIZE PSIZE SWAPSZ RDDSK WRDSK RNET SNET MEM CMD 1/1
27 56m50s 16m40s 5.4G 1.1G 0K 2308K 0K 0K 0 0 11% postgres
Now to the question: What is the suggested and best way to retrieve the most accurate memory usage of an application with forked processes, such as PostgreSQL?
And in case atop already does an accurate calculation, how does atop get to the RSIZE value? Note that this value is shown both as root and as a non-root user, which probably means that /proc/<pid>/smaps is not used for the calculation.
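For what it's worth, the resident size is also exposed per process in /proc/<pid>/status as VmRSS, which is normally readable without root; a minimal sketch of summing it (the assumption being that this is roughly the figure a tool like atop shows as RSIZE):
total_rss=0
for pid in $(pgrep -x postgres); do
    rss=$(awk '/^VmRSS:/ {print $2}' /proc/$pid/status)
    total_rss=$((total_rss + rss))
done
echo "Summed VmRSS: $((total_rss / 1024)) MB"   # overcounts shared memory, like plain RSS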
Please comment if more information is needed.
EDIT: I actually found a bug in the pgrep pattern in my final script; it falsely matched a lot more than just the postgres processes.
The new output now shows the same RES value as seen in atop RSIZE:
Script output:
PSS: 205 MB - USS: 60 MB - RES: 1162 MB - VIRT: 5506 MB
atop summarized postgres output:
NPROCS SYSCPU USRCPU VSIZE RSIZE PSIZE SWAPSZ RDDSK WRDSK RNET SNET MEM CMD
27 0.04s 0.10s 5.4G 1.1G 0K 2308K 0K 32K 0 0 11% postgres
But the question of course remains, unless summing the RSS (RES) values is already the most accurate way. Let me know your thoughts, thanks :)

Write high bandwidth real-time data to SSD in Linux

I have a real-time process that receives 16 kB of data every 200 µs for about 1 hour. I need to store this data.
I have a 240 GB SSD on a SATA III channel, and I thought I could use it as a plain storage device without any filesystem on it. I am running a 5.4.0-109-generic kernel with 8 GB of RAM.
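For reference, a back-of-the-envelope check of the stated rate (assuming 16 kB = 16384 bytes every 200 µs):
echo "16384 / 0.0002 / 1000000" | bc -l      # ~82 MB/s sustained write rate
echo "16384 / 0.0002 * 3600 / 10^9" | bc -l  # ~295 GB for a full hour of capture
So a full hour of capture would amount to roughly 295 GB, slightly more than the 240 GB drive holds; the stalls described below, however, appear long before capacity becomes an issue.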
Here is what I have done so far:
I set up a 1 GB shared-memory region (shm) where I write the data, and I use a semaphore to tell a logger process when data is available.
In the logger process:
I open the SSD:
fd = open("/dev/sdb",O_WRONLY|O_LARGEFILE);
I wait for the data to be available in the shm and then I write a chunk, writing_size, of it to the SSD:
written_size = write(fd,local_buffer,writing_size);
I checked and written_size is always equal to writing_size;
I perform an fsync() after N cycles:
if (written_cycles > N)
    ret = fsync(fd);
I checked and fsync never returns -1.
I set the I/O scheduler of /dev/sdb to noop and experimented with different values of writing_size and N. The final values I came up with are writing_size = 64 kB and N = 16.
The behavior that I am seeing is this:
the whole process works very well until about 17 GB have been written. At that point the logger process is put into "uninterruptible sleep" (D state) quite often and for quite some time, 1 or 2 seconds. This is still fine, as the shared-memory buffer takes ~13 s to fill up. When the data written reaches 20 GB, the logger process is put to sleep for much longer, until it reaches 13 s and I start to lose data. The threshold at which the process starts being put to sleep is quite repeatable, 16-17 GB, but the maximum amount of data I can save before losing some is random.
This is the best I can achieve so far with my method and the writing_size and N tuning mentioned previously.
I tried setting the logger process's nice value to -20, with no improvement.
It also looks like the noop I/O scheduler does not support ionice, so I tried the CFQ scheduler with the highest ionice priority, but performance was even worse.
I suspect the logger process is being put to sleep waiting for I/O, but I do not understand why it happens only after a certain number of bytes have been written. iotop shows that the I/O bandwidth of the logger process is stable at around 85 MB/s.
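One hedged way to check whether page-cache writeback throttling is involved (writes to /dev/sdb without O_DIRECT still go through the page cache) is to watch the dirty-page counters and thresholds while the logger runs:
# Current writeback thresholds (percent of RAM, or absolute values if the *_bytes variants are set)
sysctl vm.dirty_ratio vm.dirty_background_ratio
# Watch how much dirty data is queued for writeback while the logger is running
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'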
I welcome any suggestions.
PS: I did try to mmap the SSD and do memcpy instead of write()+fsync(), but mmap is slower and the results are worse.

Docker container memory usage seems incorrect

I have a container started with docker-compose (file format version 2) that has a memory limit of 32 MB.
Whenever I run the container I can monitor the used resources like so:
docker stats 02bbab9ae853
It shows the following:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
02bbab9ae853 client-web_postgres-client-web_1_e4513764c3e7 0.07% 8.078MiB / 32MiB 25.24% 5.59MB / 4.4MB 135GB / 23.7MB 0
What looks really weird to me is the memory part:
8.078MiB / 32MiB 25.24%
If, outside the container, I list the Postgres PIDs, I get:
$ pgrep postgres
23051, 24744, 24745, 24746, 24747, 24748, 24749, 24753, 24761
If I stop the container and re-run the above command, I get no PIDs.
That is clear proof that all those PIDs were created by the stopped container.
Now, if I re-run the container, take every PID, calculate its RSS memory usage and sum everything up with a Python method, I don't get the ~8 MB Docker reports but a much higher value, not even close to it (~100 MB or so).
This is the Python method I'm using to calculate the RSS memory (with the imports it needs):
from subprocess import check_output
import psutil

def get_process_memory(name):
    # Sum the RSS (in bytes) of every process whose name matches `name`
    total = 0.0
    try:
        for pid in map(int, check_output(["pgrep", name]).split()):
            total += psutil.Process(pid).memory_info().rss
    except Exception:
        pass
    return total
Does anybody know why the memory declared by docker is so different?
This is of course a problem for me, because the applied memory limit doesn't seem to be respected.
I'm using a Raspberry Pi.
That's because Docker reports only the RSS from the cgroup's memory.stat, while you actually need to sum up cache, rss and swap (https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt). More about that in https://sysrq.tech/posts/docker-misleading-containers-memory-usage/
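A hedged sketch of reading those counters straight from the container's cgroup (the path assumes cgroup v1 with the default cgroupfs layout, and the swap line only appears when swap accounting is enabled):
CID=$(docker inspect --format '{{.Id}}' client-web_postgres-client-web_1_e4513764c3e7)
grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/docker/$CID/memory.stat
cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes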

Linux `top` command: how much process memory is physically stored in swap space?

Let's say I run my program on a 64-bit Linux machine with 64 GB of RAM. In my very small C program, immediately after start-up, I do
void *p = sbrk(1024ull * 1024 * 1024 * 120);
thus moving my data segment break forward by 120 GB.
After the above sbrk call, the top entry for my process shows RES at some low value, VIRT at 120g, and SWAP at 120g.
After this operation I write something into the first 90 GB of that region:
memset(p, 0xAB, 1024ull * 1024 * 1024 * 90);
This causes some changes in the top entry for my process: VIRT expectedly remains at 120g, RES becomes almost 64g, and SWAP drops to around 56g.
The global Swap stats in the header of the top output show that swap usage increases, which is expected, since my program has to push about 26 GB of memory pages into the swap file.
So, according to the above observations, the SWAP column simply reports my process's non-resident address space, regardless of whether that address space has been "materialized", i.e. regardless of whether I have already written something into that region of virtual memory.
But is there any way to figure out how much of that SWAP size has actually been "materialized" and backed by something stored in the swap file? I.e. is there any way to make top display that 26 GB value for my process?
The behavior depends on the version of procps you are using. For instance, in version 3.0.5 the SWAP value equals:
task->size - task->resident
which is exactly what you are seeing. The top(1) man page says:
VIRT = SWAP + RES
procps-ng, however, reads /proc/<pid>/status and sets SWAP correctly:
https://gitlab.com/procps-ng/procps/blob/master/proc/readproc.c#L383
So you can either update procps or look at /proc/<pid>/status directly.
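A minimal sketch of reading the real swapped-out size directly (the field is VmSwap, reported in kB; "myprog" is a placeholder for your program's name):
pid=$(pgrep -x myprog)
grep VmSwap /proc/$pid/status    # actual swapped-out size of the process, in kB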

How to scale ejabberd Server machine on CentOS to handle 200 K connections?

I am working on a fairly powerful ejabberd instance: a 40-core CPU machine with 160 GB of RAM.
The issue is that I am unable to scale up to 200 K parallel connections.
The sysctl config is as follows:
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
#http://linux-ip.net/html/ether-arp.html#ether-arp-flux
net.ipv4.conf.all.arp_filter = 1
kernel.exec-shield=1
kernel.randomize_va_space=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.ip_local_port_range = 12000 65535
fs.nr_open = 20000500
fs.file-max = 1000000
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000
The /etc/security/limits.conf entries are as follows:
* soft core 900000
* hard rss 900000
* soft nofile 900000
* hard nofile 900000
* soft nproc 900000
* hard nproc 900000
The machine starts to lose connections when the server reaches around 112 K.
Things that happen around 112 K connections:
CPU usage goes up to 200-300 % (but that is the usual spike)
Background: when everything is normal, CPU usage goes up to about 80 % (only two CPUs are doing actual work)
I am unable to work on the machine; I am using top and ss to see what is going on on the server. The machine just stops responding at this point and the connections begin to drop.
The saving grace is that the connections don't drop abruptly, but at the same rate at which they were connected.
I am using Tsung to generate the load. There are 4 load-generator boxes hitting 4 different IPs mapped internally to a single machine.
Any suggestions or opinions are very welcome.
As a first step, you need to establish what the bottleneck is in your case:
CPU
Memory
System limits (open sockets, open files)
Application architecture
If possible, add a resource-tracking application to your node, e.g. recon. It will allow you to check the length of process queues, memory fragmentation, etc. In our production system, the amount of memory consumed by the Erlang VM differed between what the system reported and what the Erlang VM itself reported, due to Transparent Huge Pages (the system was virtualized). There may be other issues that are not obvious when inspecting the node with system tools.
So I would propose:
Determine the processes with the longest message queues; they will be responsible for slowing down the system, because the Erlang VM needs to scan a process's whole inbox when it receives a message
Determine the processes with the largest amount of allocated memory
Determine how much memory Erlang itself thinks is allocated
Also, it would be good if you added the parameters used to start the Erlang VM.
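On the system-limits side, a hedged quick check of what the running Erlang VM actually got, and how many sockets are currently in use (assuming the default VM process name beam.smp):
pid=$(pgrep -x beam.smp)
grep -E 'Max open files|Max processes' /proc/$pid/limits   # effective limits of the running VM
ss -s                                                      # summary of current socket usage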
Addition
Forgot to mention that it may be worth looking at the tuning WhatsApp did to their Erlang nodes to handle hundreds of thousands of simultaneous connections:
The WhatsApp Architecture Facebook Bought For $19 Billion
