rabbitmq inet_gethost reading ~50 times per second /etc/hosts file - linux

My task is to reduce the load on a Linux machine running RabbitMQ. The top spot in top is taken by inet_gethost 4 (there are two such processes, but one of them constantly sits at the top of the list). I started analyzing that process with strace, and it shows a huge number of opens and reads of /etc/hosts. strace -fp 4571 -e open 2> count && wc -l count revealed over 50 reads of that file per second. Is this behavior normal, or is it the result of a badly configured RabbitMQ or of some networking settings?

Related

Is it possible to monitor all write access to the filesystem of all process under linux

Is it possible to monitor all write access to the filesystem by all processes under Linux?
I have several different mounted filesystems. A lot of them are tmpfs.
I'm interested in all writes to the root filesystem, excluding tmpfs, devtmpfs, etc.
I'm looking for something that will output: <PID xy> wrote n bytes to /target/filepath.
What monitoring tool can list all these write syscalls? Can they be filtered by mount point?
iotop (kernel version 2.6.20 or higher) or dstat could help you, e.g. iotop -o -b -d 10, as discussed in this similar thread.
/proc/diskstats has data for all the block devices.
https://www.kernel.org/doc/Documentation/iostats.txt
The /proc/diskstats file displays the I/O statistics of block devices. Each line contains the following 14 fields:
1 - major number
2 - minor number
3 - device name
4 - reads completed successfully
5 - reads merged
6 - sectors read
7 - time spent reading (ms)
8 - writes completed
9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)
For more details refer to Documentation/iostats.txt
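The field list above can be turned into a small parser. This is only a sketch: the field names in FIELDS are labels chosen here for illustration, not names used by the kernel, and columns beyond the 14 documented ones are simply ignored. Note that these counters are per block device, so they show how much is written, not which process wrote to which file.

```python
from typing import Dict, List

# Labels for the 14 documented columns, in order (names are this sketch's own).
FIELDS = [
    "major", "minor", "device",
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
]


def parse_diskstats_line(line: str) -> Dict[str, object]:
    """Map the first 14 whitespace-separated columns of one line to named fields."""
    parts = line.split()
    if len(parts) < len(FIELDS):
        raise ValueError("expected at least 14 fields, got %d" % len(parts))
    record: Dict[str, object] = {}
    # zip() stops after 14 columns, so any extra columns are dropped.
    for name, value in zip(FIELDS, parts):
        record[name] = value if name == "device" else int(value)
    return record


def read_diskstats(path: str = "/proc/diskstats") -> List[Dict[str, object]]:
    """Parse every non-empty line of the given diskstats file."""
    with open(path) as handle:
        return [parse_diskstats_line(line) for line in handle if line.strip()]
```

Calling read_diskstats() twice a few seconds apart and subtracting sectors_written gives the write volume per device over that window.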
You can write a SystemTap script to monitor filesystem operations. You could also visit Brendan Gregg's blog, which covers many monitoring tools.
fatrace (File Activity Trace)
fatrace reports file access events (Open, Read, Write, Close) from all running processes. Its main purpose is to find processes which keep waking up the disk unnecessarily and thus prevent some power saving.
When running it outputs one line per event in this format:
<timestamp> <processName(id)>: <accessType> </path/to/file>
For example:
23:10:21.375341 Plex Media Serv(2290): W /srv/dev-disk-by-uuid-UID/Plex/Library/Application Support/Plex Media Server/Logs/Plex Media Server.log
From this you can easily get all the necessary information:
Timestamp from the --timestamp option
Process name (who is accessing)
File operation (O-pen R-ead W-rite C-lose)
Filepath (where is it writing to).
You can limit the search scope with --current-mount to only record events on the partition/mount of the current directory.
So simply cd into the volume that corresponds to your spinning HDD first, and run fatrace there with the --current-mount option.
Without this option, all (real) partitions/mount points are being watched.
Very practical
With it I easily found out why my NAS disk was spinning 24/7, even when nobody accessed the NAS and no maintenance tasks were about to run: unnecessary logging by the Plex Media Server.
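To keep only the write events from such output, the line format shown above can be parsed with a regular expression. This is a rough sketch based solely on the example line in this answer; the names FatraceEvent, parse_fatrace_line and is_write are made up for illustration, and other fatrace versions may format lines differently.

```python
import re
from typing import NamedTuple, Optional


class FatraceEvent(NamedTuple):
    timestamp: str
    process: str
    pid: int
    access: str
    path: str


# timestamp, process name (may contain spaces), pid in parentheses,
# access letters, then the path (which may also contain spaces).
LINE_RE = re.compile(r"^(\S+) (.+?)\((\d+)\): (\S+) (.+)$")


def parse_fatrace_line(line: str) -> Optional[FatraceEvent]:
    """Return the parsed event, or None if the line does not match the format."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    timestamp, process, pid, access, path = match.groups()
    return FatraceEvent(timestamp, process, int(pid), access, path)


def is_write(event: FatraceEvent) -> bool:
    """True when the access letters include W (write)."""
    return "W" in event.access
```

Feeding each line of the fatrace output through parse_fatrace_line and keeping the events for which is_write is true yields the "who wrote where" list the question asks for, although without byte counts.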

ArangoDB Too many open files

For a few days now we have been encountering a problem with our ArangoDB installation. A few minutes, or up to an hour, after startup all connections to the database are refused. The arango log file says that there are "Too many open files". A "lsof | grep arango | wc -l" shows that the database has around 50,000 open file handles, which is well below the maximum allowed by the Linux system (around 3 million).
Does anyone have an idea where this error comes from?
We are using Ubuntu Linux with a 3.13 kernel, 30 GB RAM and three cores. The database is still very small, with around 1.5 million entries and a size of 50 GB.
Thx, secana
EDIT:
"netstat -anpt | fgrep 2480" shows:
root#syssec-graphdb-001-test:~# netstat -anpt | fgrep 2480
tcp 0 0 10.215.17.193:2480 0.0.0.0:* LISTEN 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53453 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.31:49299 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53155 ESTABLISHED 7741/arangod
"ulimit -n" has a result of 1024, so I think that the ~50,000 are all arango processes together.
Last lines in log file before the database died:
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': 'Too many open files'
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': Too many open files
2015-05-26T12:20:43Z [9672] DEBUG [arangod/VocBase/collection.cpp:1632] cannot open '/data/arangodb/databases/database-235999516/collection-28464454696', check failed
2015-05-26T12:20:43Z [9672] ERROR cannot open document collection from path '/data/arangodb/databases/database-235999516/collection-28464454696'
It looks like it will make sense to increase the max. number of open files a process is allowed to manage. Given the stated database size of around 50 GB, the (presumably default) value of 1024 seems to be too low.
arangod will require one file descriptor for each parallel client connection. That may not be many, but in the face of HTTP keep-alive connections this could already account for several file descriptors.
Additionally, each datafile of an active collection will need to be memory-mapped and cost one file descriptor as well. With the default datafile size of 32 MB, a database size of 50 GB (on disk) will already consume 1,600 file descriptors:
50 GB database size / (32 MB default size / 1 datafile) = 1600 datafiles
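The same arithmetic as a small sketch, using 1 GB = 1024 MB as the calculation above does. The function names and the client_connections parameter are illustrative only; the result is a lower bound that ignores any other descriptors arangod may hold.

```python
import math


def estimate_datafiles(database_size_gb: float, datafile_size_mb: float = 32.0) -> int:
    """Number of datafiles needed to hold the database, rounded up."""
    size_mb = database_size_gb * 1024
    return math.ceil(size_mb / datafile_size_mb)


def estimate_descriptors(database_size_gb: float,
                         datafile_size_mb: float = 32.0,
                         client_connections: int = 0) -> int:
    """One descriptor per datafile plus one per parallel client connection."""
    return estimate_datafiles(database_size_gb, datafile_size_mb) + client_connections


print(estimate_datafiles(50))        # 1600
print(estimate_datafiles(50, 64))    # 800
```

Either figure is already close to or above a limit of 1024, which matches the failure in the log.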
Increasing the ulimit -n value for the arangod user and environment therefore will make sense. You can confirm that arangod can actually use the configured number of file descriptors by starting it with option --server.descriptors-minimum <value>, e.g.
--server.descriptors-minimum 32768
for that many file descriptors. If arangod cannot effectively use that specified amount of file descriptors, it will fail at start with a fatal error. Of course that option can also be put into the arangod.conf file.
Additionally, the default size for (new) datafiles can be increased via the journalSize parameter for collections. That won't help right now, but will lower the number of required file descriptors for data saved in the future.
For emergencies when you can't restart the database, as in my case, this blog post is very useful: it explains how to change the ulimit of a running process.
If your distribution has util-linux 2.21 or later, you can use the "prlimit" tool; otherwise you can compile the small example C program from the blog post, which worked great for me.
To check the actual limits of a process you can use:
cat /proc/<PID>/limits
Good luck!
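If you want to check the limit from a script rather than by eye, the "Max open files" row of /proc/<PID>/limits can be parsed as below. This is a sketch that assumes the usual column layout of that file (columns separated by runs of two or more spaces); the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple


def parse_open_files_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) for 'Max open files'.

    Values are kept as strings because a limit may read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by two or more spaces; the limit name
            # itself contains single spaces.
            columns = re.split(r"\s{2,}", line.strip())
            return columns[1], columns[2]
    return None


def read_open_files_limit(pid: int) -> Optional[Tuple[str, str]]:
    """Read and parse /proc/<pid>/limits for the given process."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_open_files_limit(handle.read())
```

For the arangod process in the question, read_open_files_limit(7741) would be expected to report a soft limit of 1024 before the change.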

Get the load, cpu usage and time of executing a bash script

I have a bash script that I plan to run every 5 or 15 mins using crontab based on the load it puts on server.
I can find the run time of the script, but I am not sure how to find the load, memory usage and CPU usage.
Can someone help me?
Also, any suggestions for a rough benchmark that would help me decide whether the script puts too much load on the server and should run every 15 mins rather than every 5 mins?
Thanks in Advance!
You can use "top -b"; top reports CPU usage, memory usage, and so on.
Insert these lines in your script. top will run in the background, and you terminate it as soon as your test is over.
ssh server_name "nohup top -b -d 0.5 >> file_name &"
The top process runs in the background because of the &, -d 0.5 reports the CPU status every 0.5 seconds, and the output is redirected to file_name for later analysis.
To kill the process after your test, insert the following in your script:
ssh server_name "kill \`ps -elf | grep 'top -b' | grep -v grep | sed 's/ */ /g' |cut -d ' ' -f4\`"
Your main testing script should sit between the top command and the command that kills top.
I presumed you are running the script from the client side; if not, ignore "ssh server_name".
If you are running it from the client side, ssh will ask you for the password; to avoid this, follow these 3 simple steps
This will definitely solve the issue.
You can check following utilities
pidstat for CPU load, man page
pmap for memory load, man page
Although you might need to make measurements also for the child processes of your executable, in order to collect summarizing information
For memory, use free -m. Your actual memory available is the second number next to +/- buffers/cache (in megabytes with -m) (source).
For CPU, it's a bit more complicated. Start by looking at cat /proc/stat | grep 'cpu ' (note the space). You'll see something like this:
cpu 2255 34 2290 22625563 6290 127 456
The columns are from left to right, "user, nice, system, idle". CPU usage is usually calculated as (user+nice+system) / (user+nice+system+idle). However, these numbers show the number of "time units" that the CPU has spent doing that thing since boot, and thus are always increasing. If you were to do the aforementioned calculation, you'd get the CPU usage average since boot. To get a point-in-time usage, you have to take 2 samples, find their difference, and calculate the usage from that. To be clear, that will be the average CPU usage between your samples. (source)
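The two-sample calculation described above could look like the sketch below. It uses only the first four columns (user, nice, system, idle), exactly as in the formula in this answer, and ignores the remaining columns; the function names are illustrative.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int, int, int]:
    """Extract user, nice, system and idle from the aggregate 'cpu ' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not an aggregate 'cpu' line")
    user, nice, system, idle = (int(value) for value in fields[1:5])
    return user, nice, system, idle


def cpu_usage_percent(first: str, second: str) -> float:
    """Average CPU usage (percent) between two samples of the 'cpu ' line."""
    u1, n1, s1, i1 = parse_cpu_line(first)
    u2, n2, s2, i2 = parse_cpu_line(second)
    busy = (u2 - u1) + (n2 - n1) + (s2 - s1)
    total = busy + (i2 - i1)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return 100.0 * busy / total
```

To use it, read the first line of /proc/stat, run (or wait for) the script under test, read the line again, and pass both strings to cpu_usage_percent.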

Logging VMStat data to file

I am trying to create some capacity planning reports, and one of the requirements is to have info on memory usage for a few Unix servers.
Now my knowledge of Unix is very low. I usually just log on and run a few scripts.
But for this report I need to gather vmstat data and produce reports based on the previous week's data, broken down by hour, where each hourly figure is an average of vmstat data taken every 10 seconds.
So first question: is VMStat logging on by default and if so what location on the server is the data output to?
If not how can I set this up?
Thanks
vmstat is a command that you run.
To generate one week of virtual memory stats spaced out at ten-second intervals, note that a week is 604,800 seconds, i.e. 60,480 ten-second intervals; less the last one, that is 60,479.
So the command you want is:
nohup vmstat 10 60479 > myvmstatfile.dat &
This will make a very big file myvmstatfile.dat
EDIT: RobKielty (The & will put this job in the background, the nohup will prevent the task from hanging up when you logout of the command shell. If you ran this command it would be prudent to monitor the disk partition to which this file was being written to. Use df -h /path/to/directory/where/outputfile/resides to monitor the disk space usage.)
I have no idea what you need to do with the data, so I can't help you there.
Create a crontab entry (crontab -e) like this
0 0 * * 0 /path/to/my/vmstat_script.sh
The file vmstat_script.sh will contain the following bash script commands.
#!/bin/bash
# vmstat_script.sh
vmstat 10 60479 > myvmstatfile.dat
mv myvmstatfile.dat myvmstatfile.dat.`date +%Y-%m-%d`
This will create one file per week with a name like myvmstatfile.dat.2012-07-01
The command I use for monitoring the Linux vm metrics is below:
nohup vmstat 10 720| (while read; do echo "$(date +%d-%m-%Y" "%H:%M:%S) $REPLY"; done) >> nameofLogfile.log
Here nohup keeps vmstat running after you log out; append & to the command to run it in the background.
It will run for 2 hours with interval of 10 secs.
This is the best command for generating graphs and reports as timestamp will also be included in logs along with different metrics, so that we can filter the logs accordingly.
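For the hourly breakdown the question asks for, a timestamped log like the one produced above can be averaged per hour. The sketch below assumes each line starts with a "dd-mm-YYYY HH:MM:SS" timestamp followed by the vmstat columns, and that column index 3 (0-based, within the vmstat fields) is "free"; that index depends on your vmstat version, so check it against the header line. The function name is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def hourly_average(lines: Iterable[str], column: int) -> Dict[str, float]:
    """Average one vmstat column per hour.

    `column` is the 0-based index within the vmstat fields, i.e. not
    counting the two timestamp tokens at the start of each line.
    """
    buckets: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], "%d-%m-%Y %H:%M:%S")
            value = int(parts[2 + column])
        except (ValueError, IndexError):
            continue  # header line, blank line, or malformed line
        buckets[stamp.strftime("%Y-%m-%d %H:00")].append(value)
    return {hour: sum(values) / len(values) for hour, values in buckets.items()}
```

Passing an open log file as `lines` works, since iterating over a file yields one line at a time.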

Get CPU usage in shell script?

I'm running some JMeter tests against a Java process to determine how responsive a web application is under load (500+ users). JMeter will give the response time for each web request, and I've written a script to ping the Tomcat Manager every X seconds which will get me the current size of the JVM heap.
I'd like to collect stats on the server of the % of CPU being used by Tomcat. I tried to do it in a shell script using ps like this:
PS_RESULTS=`ps -o pcpu,pmem,nlwp -p $PID`
...running the command every X seconds and appending the results to a text file. (for anyone wondering, pmem = % mem usage and nlwp is number of threads)
However I've found that this gives a different definition of "% of CPU Utilization" than I'd like - according to the manpages for ps, pcpu is defined as:
cpu utilization of the process in "##.#" format. It is the CPU time used divided by the time the process has been running (cputime/realtime ratio), expressed as a percentage.
In other words, pcpu gives me the % CPU utilization for the process for the lifetime of the process.
Since I want to take a sample every X seconds, I'd like to be collecting the CPU utilization of the process at the current time only - similar to what top would give me
(CPU utilization of the process since the last update).
How can I collect this from within a shell script?
Use top -b (and other switches if you want different outputs). It will just dump to stdout instead of jumping into a curses window.
The most useful tool I've found for monitoring a server while performing a test such as JMeter on it is dstat. It not only gives you a range of stats from the server, it outputs to csv for easy import into a spreadsheet and lets you extend the tool with modules written in Python.
User load: top -b -n 2 |grep Cpu |tail -n 1 |awk '{print $2}' |sed 's/.[^.]*$//'
System load: top -b -n 2 |grep Cpu |tail -n 1 |awk '{print $3}' |sed 's/.[^.]*$//'
Idle load: top -b -n 1 |grep Cpu |tail -n 1 |awk '{print $5}' |sed 's/.[^.]*$//'
Each result is truncated to a whole number.
Off the top of my head, I'd use the /proc filesystem view of the system state - Look at man 5 proc to see the format of the entry for /proc/PID/stat, which contains total CPU usage information, and use /proc/stat to get global system information. To obtain "current time" usage, you probably really mean "CPU used in the last N seconds"; take two samples a short distance apart to see the current rate of CPU consumption. You can then munge these values into something useful. Really though, this is probably more a Perl/Ruby/Python job than a pure shell script.
You might be able to get the rough data you're after with /proc/PID/status, which gives a Sleep average for the process. Pretty coarse data though.
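A rough sketch of the /proc/PID/stat sampling idea follows. Per man 5 proc, utime and stime are fields 14 and 15 of that file, counted in clock ticks; the function names, the default interval and the sample_process_cpu helper are assumptions made for illustration. A result above 100 is possible for a multi-threaded process such as Tomcat on a multi-core machine.

```python
import os
import time


def parse_cpu_ticks(stat_text: str) -> int:
    """Return utime + stime (in clock ticks) from the contents of /proc/PID/stat."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split after the last ')' and count fields from there (field 3 onward).
    rest = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def process_cpu_percent(ticks_before: int, ticks_after: int,
                        elapsed_seconds: float, ticks_per_second: int) -> float:
    """CPU usage of one process over the sampling window, as a percentage of one core."""
    if elapsed_seconds <= 0 or ticks_per_second <= 0:
        raise ValueError("elapsed time and tick rate must be positive")
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * elapsed_seconds)


def sample_process_cpu(pid: int, interval: float = 1.0) -> float:
    """Take two samples `interval` seconds apart and return the usage between them."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")

    def read_ticks() -> int:
        with open("/proc/%d/stat" % pid) as handle:
            return parse_cpu_ticks(handle.read())

    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    return process_cpu_percent(before, after, interval, ticks_per_second)
```

Calling sample_process_cpu with the Tomcat PID every X seconds and appending the result to a file gives a top-like "since the last sample" figure.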
Also use 1 as the iteration count, so you get the current snapshot without waiting $delay seconds for another one.
top -b -n 1
This will not give you a per-process metric, but the Stress Terminal UI is super useful to know how badly you're punishing your boxes. Add -c flag to make it dump the data to a CSV file.
