ArangoDB: Too many open files

Since a few days ago we have been encountering a problem with our ArangoDB installation. A few minutes to an hour after startup, all connections to the database are refused. The Arango log file says that there are "Too many open files". An "lsof | grep arango | wc -l" shows that the database has around 50,000 open file handles, which is well under the maximum allowed by the Linux system (around 3 million).
Does anyone have an idea where this error comes from?
We are using Ubuntu Linux with a 3.13 kernel, 30 GB of RAM and three cores. The database is still very small, with around 1.5 million entries and a size of 50 GB.
Thx, secana
EDIT:
"netstat -anpt | fgrep 2480" shows:
root@syssec-graphdb-001-test:~# netstat -anpt | fgrep 2480
tcp 0 0 10.215.17.193:2480 0.0.0.0:* LISTEN 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53453 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.31:49299 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53155 ESTABLISHED 7741/arangod
"ulimit -n" has a result of 1024, so I think that the ~50,000 are all arango processes together.
Last lines in log file before the database died:
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': 'Too many open files'
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': Too many open files
2015-05-26T12:20:43Z [9672] DEBUG [arangod/VocBase/collection.cpp:1632] cannot open '/data/arangodb/databases/database-235999516/collection-28464454696', check failed
2015-05-26T12:20:43Z [9672] ERROR cannot open document collection from path '/data/arangodb/databases/database-235999516/collection-28464454696'

It looks like it will make sense to increase the maximum number of open files a process is allowed to manage. Given the stated database size of around 50 GB, the (presumably default) value of 1024 seems to be too low.
arangod will require one file descriptor for each parallel client connection. That may not be many, but in the face of HTTP keep-alive connections this could already account for several file descriptors.
Additionally, each datafile of an active collection will need to be memory-mapped and cost one file descriptor as well. With the default datafile size of 32 MB, a database size of 50 GB (on disk) will already consume 1,600 file descriptors:
50 GB database size / 32 MB per datafile = 1,600 datafiles
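As a rough cross-check, you can count the datafiles actually present on disk; the path pattern below is taken from the log excerpt above, so adjust it to your own data directory:
# count datafiles on disk (each one is memory-mapped and costs a file descriptor)
ls /data/arangodb/databases/*/collection-*/datafile-*.db | wc -l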
Increasing the ulimit -n value for the arangod user and environment will therefore make sense. You can confirm that arangod can actually use the configured number of file descriptors by starting it with the option --server.descriptors-minimum <value>, e.g.
--server.descriptors-minimum 32768
for that many file descriptors. If arangod cannot actually obtain the specified number of file descriptors, it will fail at startup with a fatal error. Of course that option can also be put into the arangod.conf file.
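A minimal sketch of how this could look for the user that runs arangod (the user name, the value of 32768, and the limits.conf approach are illustrative assumptions):
# /etc/security/limits.conf (illustrative entries; adjust user and value)
#   arangodb  soft  nofile  32768
#   arangodb  hard  nofile  32768

# in the arangod environment, verify the limit and start with the safety check
ulimit -n
arangod --server.descriptors-minimum 32768   # other startup options omitted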
Additionally, the default size for (new) datafiles can be increased via the journalSize parameter for collections. That won't help right now, but will lower the number of required file descriptors for data saved in the future.

For emergencies when you can't restart the database, as in my case, you will find this blog post very useful: it explains how you can change the ulimit of a running process.
If your distribution has util-linux 2.21, you can use the "prlimit" tool, or you can compile the small example C program from the blog post, which worked great for me.
To check the actual limits of a process you can use:
cat /proc/<PID>/limits
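For example, a minimal prlimit sketch (using PID 7741 from the netstat output above; the new limit of 65536 is just an illustrative value):
# raise the soft and hard open-file limits of the running process
prlimit --pid 7741 --nofile=65536:65536
# confirm the change took effect
grep 'open files' /proc/7741/limits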
Good luck!

Related

unreasonable netperf benchmark results

I used the netperf benchmark with the following commands:
server side:
netserver -4 -v -d -N -p
client side:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
And I received the results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
16384  131072 1        1       60.00    9147.83
16384  131072
But when I changed the client to a single CPU (same machine) by adding "maxcpus=1 nr_cpus=1" to the kernel command line and ran the following command:
netperf -H -p -l 60 -t TCP_RR
I received the following results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
16384  131072 1        1       60.00    10183.33
16384  131072
Q: I don't understand how the performance improved when I decreased the number of CPUs from 64 to 1.
Some technical information: I used the Standard_L64s_v3 instance type on Azure; OS: sles:15:sp2
• The ‘netperf’ command you executed on the client side is shown below; it is the same before and after changing the number of CPUs on the client side, and yet you see an improvement in performance after decreasing the number of vCPUs on the client VM:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The above command runs a TCP request/response (TCP_RR) test between the ‘Server’ and ‘Client’ hosts for a period of 60 seconds.
• The CPU utilization measurement mechanism uses ‘/proc/stat’ on Linux to record the time spent during such test runs. The code for this mechanism can be found in ‘src/netcpu_procstat.c’, so you can check that source file for the details.
Also, the CPU utilization mechanism in a virtual guest environment, i.e., a virtual machine, may not reflect the actual utilization as it would on bare metal, because much of the networking processing happens outside the context of the virtual machine. As the netperf documentation by Hewlett-Packard (linked below) puts it:
https://hewlettpackard.github.io/netperf/doc/netperf.html
If one is looking to measure the added overhead of a virtualization mechanism, rather than rely on CPU utilization, one can rely instead on netperf _RR tests - path-lengths and overheads can be a significant fraction of the latency, so increases in overhead should appear as decreases in transaction rate. Whatever you do, DO NOT rely on the throughput of a _STREAM test. Achieving link-rate can be done via a multitude of options that mask overhead rather than eliminate it.
As a result, I would suggest you rely on other monitoring tools available in Azure, e.g., Azure Monitor, Application Insights, etc.
Looking more closely at your netperf command line:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The -H option expects a hostname as an argument, and the -p option expects a port number as an argument. As written, the "-p" will be interpreted as a hostname, and when I tried it, it failed. I assume you've omitted some of the command line?
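For reference, a fully specified client invocation might look something like this (using the 10.0.0.28 address from your output, and assuming netserver listens on its default control port 12865):
netperf -H 10.0.0.28 -p 12865 -l 60 -T 1,1 -t TCP_RR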
The -T option will bind where netperf and netserver will run (in this case on vCPU 1 on the netperf side and vCPU 1 on the netserver side) but it will not necessarily control where at least some of the network stack processing will take place. So, in your 64-vCPU setup, the interrupts for the networking traffic and perhaps the stack will run on a different vCPU. In your 1-vCPU setup, everything will be on the one vCPU. It is quite conceivable you are seeing the effects of cache-to-cache transfers in the 64-vCPU case leading to lower transaction/s rates.
Going to multi-processor will increase aggregate performance, but it will not necessarily increase single thread/stream performance. And single thread/stream performance can indeed degrade.
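If you want to see where the NIC interrupts actually land in the 64-vCPU case, a quick diagnostic sketch (the interface name eth0 is an assumption; substitute your own):
# per-CPU interrupt counts for the NIC queues
grep -E 'CPU|eth0' /proc/interrupts
# which CPUs each IRQ is allowed to run on
for irq in /proc/irq/[0-9]*; do echo "$(basename $irq): $(cat $irq/smp_affinity_list)"; done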

Is it possible to monitor all write access to the filesystem by all processes under Linux

Is it possible to monitor all write access to the filesystem by all processes under Linux?
I have several different mounted filesystems; a lot of them are tmpfs.
I'm interested in all writes to the root filesystem, except for tmpfs, devtmpfs, etc.
I'm looking for something that will output: <PID xy> writes n bytes to /target/filepath.
What monitoring tool can list all these write syscalls? Can they be filtered by mount point?
iotop (kernel version 2.6.20 or higher) or dstat could help you, e.g. iotop -o -b -d 10, as discussed in this similar thread.
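A comparable dstat invocation might be the following (assuming your dstat build includes the top-io and top-bio plugins):
dstat --top-io --top-bio 10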
/proc/diskstats has data for all the block devices.
https://www.kernel.org/doc/Documentation/iostats.txt
The /proc/diskstats file displays the I/O statistics of block devices. Each line contains the following 14 fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
For more details refer to Documentation/iostats.txt
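As a small sketch, per-device write volume can be derived from field 10 (sectors are counted in 512-byte units):
# print megabytes written per block device since boot
awk '{ printf "%-12s %10.1f MB written\n", $3, $10 * 512 / 1048576 }' /proc/diskstats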
You can write a SystemTap script to monitor filesystem operations. You could also visit Brendan D. Gregg's blog, where there are many monitoring tools.
fatrace (File Activity Trace)
fatrace reports file access events (Open, Read, Write, Close) from all running processes. Its main purpose is to find processes which keep waking up the disk unnecessarily and thus prevent some power saving.
When running it outputs one line per event in this format:
<timestamp> <processName(id)>: <accessType> </path/to/file>
For example:
23:10:21.375341 Plex Media Serv(2290): W /srv/dev-disk-by-uuid-UID/Plex/Library/Application Support/Plex Media Server/Logs/Plex Media Server.log
From this you can easily get all the necessary info:
Timestamp, from the --timestamp option
Process name (who is accessing)
File operation (O-pen, R-ead, W-rite, C-lose)
Filepath (where it is writing to).
You can limit the search scope with --current-mount to only record events on partition/mount of current directory.
So simply cd into the volume which corresponds to your spinning HDD first, and there run fatrace with the --current-mount option.
Without this option, all (real) partitions/mount points are being watched.
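A short usage sketch, assuming the volume of interest is mounted under /srv (the path and the grep filter are illustrative):
cd /srv                        # change into the watched volume first
sudo fatrace --current-mount --timestamp | grep ': W '   # keep only write events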
Very practical: with it I easily found out that the reason my NAS disk was spinning 24/7, even when nobody was accessing the NAS and no maintenance tasks were about to run, was unnecessary logging by the Plex Media Server.

rabbitmq inet_gethost reading /etc/hosts file ~50 times per second

My task is to reduce the load on a Linux machine running RabbitMQ. First place in top is taken by inet_gethost 4 (there are two such processes, but one is constantly sitting at the top of top). I started analyzing that process with strace, and it shows a huge number of opens and reads of /etc/hosts. strace -fp 4571 -e open 2> count && wc -l count revealed over 50 reads of that file per second. The question is whether that kind of behavior is normal, or whether it is the result of a badly configured RabbitMQ or some networking settings.

SIPP: open file limit > FD_SETSIZE

I am trying to start SIPp 3.3 on openSUSE 11 from a bash console spawned by Java.
When I start SIPp with
proc = Runtime.getRuntime().exec("/bin/bash", null, wd);
...
printWriter.println("./sipp -i "+Config.IP+" -sf uac.xml "+Config.IP+":5060");
the error stream gives the following output
Warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE = 1024
Resolving remote host '137.58.120.17'... Done.
What does the warning mean? And is it possible that the bash terminal freezes because of this warning?
How can I remove this warning?
I'm the maintainer of SIPp and I've been looking into the FD_SETSIZE issues recently.
As is mentioned at Increasing limit of FD_SETSIZE and select, FD_SETSIZE is the maximum file descriptor that can be passed to the select() call, as it uses a bit-field internally to keep track of file descriptors. SIPp had code in it to check its own maximum open file limit (i.e. the one shown by ulimit -n), and if it was larger than FD_SETSIZE, to reduce it to FD_SETSIZE in order to avoid issues with select().
However, this has actually been unnecessary for a while - SIPp has used poll() rather than select() (which doesn't have the FD_SETSIZE limit, and has been POSIX-standardised and portable since 2001) since before I became maintainer in 2012. SIPp now also uses epoll where available for even better performance, as of the v3.4 release.
I've now removed this FD_SETSIZE check in the development code at https://github.com/SIPp/sipp, and replaced it with a more sensible check - making sure that the maximum number of open sockets (plus the maximum number of open calls, each of which may open its own media socket) is below the maximum number of file descriptors.
This warning is supposedly related to the multi-socket transport options in SIPp, e.g. -t un or -t tn (though I have observed it generate these warnings even when not specifying one of these).
SIPp includes an option that controls this warning message:
-skip_rlimit : Do not perform rlimit tuning of file descriptor limits. Default: false.
Though it has the desired effect for me of suppressing the warning output, on its own it seems like a slightly dangerous option. Although I'm not certain what will happen if you include this option and SIPp attempts to open more sockets than FD_SETSIZE allows, you may avoid possible problems on that front by also including the -max_socket argument:
-max_socket : Set the max number of sockets to open simultaneously. This option is
significant if you use one socket per call. Once this limit is reached,
traffic is distributed over the sockets already opened. Default value is
50000
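Put together, a hedged invocation based on the command from the question could look like this (the <local_ip> placeholder and the socket cap of 1024 are illustrative):
./sipp -i <local_ip> -sf uac.xml -skip_rlimit -max_socket 1024 137.58.120.17:5060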
It means pretty much what it says... Your per-process open file limit (ulimit -n) is greater than the pre-defined constant FD_SETSIZE, which is 1024. So the program is adjusting your open file limit down to match FD_SETSIZE.

Node.js struggling with lots of concurrent connections

I'm working on a somewhat unusual application where 10k clients are precisely timed to all try to submit data at once, every 3 mins or so. This 'ab' command fairly accurately simulates one barrage in the real world:
ab -c 10000 -n 10000 -r "http://example.com/submit?data=foo"
I'm using Node.js on Ubuntu 12.4 on a Rackspace Cloud VPS instance to collect these submissions; however, I'm seeing some very odd behavior from Node, even when I remove all my business logic and turn the HTTP request handler into a no-op.
When the test gets about 90% done, it hangs for a long period of time. Strangely, this happens consistently at 90% - for c=n=10k, at 9000; for c=n=5k, at 4500; for c=n=2k, at 1800. The test actually completes eventually, often with no errors. But both ab and node logs show continuous processing up till around 80-90% of the test run, then a long pause before completing.
When node is processing requests normally, CPU usage is typically around 50-70%. During the hang period, CPU goes up to 100%. Sometimes it stays near 0. Between the erratic CPU response and the fact that it seems unrelated to the actual number of connections (only the % complete), I do not suspect the garbage collector.
I've tried this running 'ab' on localhost and on a remote server - same effect.
I suspect something related to the TCP stack, possibly involving closing connections, but none of my configuration changes have helped. My changes:
ulimit -n 999999
When I listen(), I set the backlog to 10000
Sysctl changes are:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_orphans = 20000
net.ipv4.tcp_max_syn_backlog = 10000
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 10000
I have also noticed that I tend to get this msg in the kernel logs:
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
I'm puzzled by this message since the TCP backlog queue should be deep enough to never overflow. If I disable SYN cookies, the "Sending cookies" changes to "Dropping connections".
I speculate that this is some sort of linux TCP stack tuning problem and I've read just about everything I could find on the net. Nothing I have tried seems to matter. Any advice?
Update: Tried with tcp_max_syn_backlog, somaxconn, netdev_max_backlog, and the listen() backlog param set to 50k with no change in behavior. Still produces the SYN flood warning, too.
Are you running ab on the same machine running node? If not do you have a 1G or 10G NIC? If you are, then aren't you really trying to process 20,000 open connections?
Also, if you are changing net.core.somaxconn to 10,000, are there absolutely no other sockets open on that machine? If there are, then 10,000 is not high enough.
Have you tried using the Node.js cluster module to spread the open connections across multiple processes?
I think you might find this blog post and also the previous ones useful
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
