How can I record what process or kernel activity is using the disk in GNU/Linux? - linux

On a particular Debian server, iostat (and similar) report an unexpectedly high volume (in bytes) of disk writes going on. I am having trouble working out which process is doing these writes.
Two interesting points:
Tried turning off system services one at a time to no avail. Disk activity remains fairly constant and unexpectedly high.
Despite the writing, do not seem to be consuming more overall space on the disk.
Both of those make me think that the writing may be something that the kernel is doing, but I'm not swapping, so it's not clear to me what Linux might try to write.
Could try out atop:
http://www.atcomputing.nl/Tools/atop/
but would like to avoid patching my kernel.
Any ideas on how to track this down?

iotop is good (great, actually).
If you have a kernel from before 2.6.20, you can't use most of these tools.
Instead, you can try the following (which should work for almost any 2.6 kernel IIRC):
sudo -s
dmesg -c
/etc/init.d/klogd stop
echo 1 > /proc/sys/vm/block_dump
rm /tmp/disklog
watch "dmesg -c >> /tmp/disklog"
CTRL-C when you're done collecting data
echo 0 > /proc/sys/vm/block_dump
/etc/init.d/klogd start
exit (quit root shell)
cat /tmp/disklog | awk -F"[() \t]" '/(READ|WRITE|dirtied)/ {activity[$1]++} END {for (x in activity) print x, activity[x]}'| sort -nr -k2
The dmesg -c lines clear your kernel log . The logger is then shut off, manually (using watch) dumped to a disk (the memory buffer is small, which is why we need to do this). Let it run for about five minutes or so, and then CTRL-c the watch process. After shutting off the logging and restarting klogd, analyze the results using the little bit of awk at the end.

If you are using a kernel newer than 2.6.20 that is very easy, as that is the first version of Linux kernel that includes I/O accounting. If you are compiling your own kernel, be sure to include:
CONFIG_TASKSTATS=y
CONFIG_TASK_IO_ACCOUNTING=y
Kernels from Debian packages already include these flags, so there is no need for recompiling your kernel. Standard utility for accessing I/O accounting data in real time is iotop(1). It gives you a complete list of processes managed by I/O scheduler, and displays per process statistics for read, write and total I/O bandwidth used.

You may want to investigate iotop for Linux. There are some Solaris versions floating around, but there is a Debian package for example.

You can use the UNIX-command lsof (list open files). That prints out the process, process-id, user for any open file.

You could also use htop, enabling IO_RATR column. Htop is an exelent top replacement.

Brendan Gregg's iosnoop script can (heuristically) tell you about currently using the disk on recent kernels (example iosnoop output).

You could try to use SystemTap , it has a lot of examples , and if I'm not mistaken , it shows how to do this sort of thing .

I've recently heard about Mortadelo, a Filemon clone, but have not checked it out myself yet:
http://gitorious.org/mortadelo

Related

ionice 'idle' not having the expected effects

We're working with a reasonably busy web server. We wanted to use rsync to do some data-moving which was clearly going to hammer the magnetic disk, so we used ionice to put the rsync process in the idle class. The queues for both disks on the system (SSD+HDD) are set to use the CFQ scheduler.
The result... was that the disk was absolutely hammered and the website performance was appalling.
I've done some digging to see if any tuning might help with this.
The man page for ionice says:
Idle: A program running with idle I/O priority will only get disk time
when no other program has asked for disk I/O for a defined grace period.
The impact of an idle I/O process on normal system activity should be zero.
This "defined grace period" is not clearly explained anywhere I can find with the help of Google. One posting suggest that it's the value of fifo_expire_async but I can't find any real support for this.
However, on our system, both fifo_expire_async and fifo_expire_sync are set sufficiently long (250ms, 125ms, which are the defaults) that the idle class should actually get NO disk bandwidth at all. Even if the person who believes that the grace period is set by fifo_expire_async is plain wrong, there's not a lot of wiggle-room in the statement "The impact of an idle I/O process on normal system activity should be zero".
Clearly this is not what's happening on our machine so I am wondering if CFQ+idle is simply broken.
Has anyone managed to get it to work? Tips greatly appreciated!
Update:
I've done some more testing today. I wrote a small Python app to read random sectors from all over the disk with short sleeps in between. I ran a copy of this without ionice and set it up to perform around 30 reads per second. I then ran a second copy of the app with various ionice classes to see if the idle class did what it said on the box. I saw no difference at all between the results when I used classes 1, 2, 3 (real-time, best-effort, idle). This, despite the fact that I'm now absolutely certain that the disk was busy.
Thus, I'm now certain that - at least for our setup - CFQ+idle does not work. [see Update 2 below - it's not so much "does not work" as "does not work as expected"...]
Comments still very welcome!
Update 2:
More poking about today. Discovered that when I push the I/O rate up dramatically, the idle-class processes DO in fact start to become starved. In my testing, this happened at I/O rates hugely higher than I had expected - basically hundreds of I/Os per second. I'm still trying to work out what the tuning parameters do...
I also discovered the rather important fact that async disk writes aren't included at all in the I/O prioritisation system! The ionice manpage I quoted above makes no reference to that fact, but the manpage for the syscall ioprio_set() helpfully states:
I/O priorities are supported for reads and for synchronous (O_DIRECT,
O_SYNC) writes. I/O priorities are not supported for asynchronous
writes because they are issued outside the context of the program
dirtying the memory, and thus program-specific priorities do not
apply.
This pretty significantly changes the way I was approaching the performance issues and I will be proposing an update for the ionice manpage.
Some more info on kernel and iosched settings (sdb is the HDD):
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
/etc/debian_version = 9.3
(cd /sys/block/sdb/queue/iosched; grep . *)
back_seek_max:16384
back_seek_penalty:2
fifo_expire_async:250
fifo_expire_sync:125
group_idle:8
group_idle_us:8000
low_latency:1
quantum:8
slice_async:40
slice_async_rq:2
slice_async_us:40000
slice_idle:8
slice_idle_us:8000
slice_sync:100
slice_sync_us:100000
target_latency:300
target_latency_us:300000
AFAIK, the only opportunity to solve your problem is using CGroup v2 (kernel v. 4.5 or newer). Please see the following article:
https://andrestc.com/post/cgroups-io/
Also please note, that you may use the systemd's wrappers to configure CGroup limits on per-service basis:
http://0pointer.de/blog/projects/resources.html
Add nocache to that and you're set (you can join it with ionice and nice):
https://github.com/Feh/nocache
On Ubuntu install with:
apt install nocache
It simply omits cache on IO and thanks to that other processes won't starve when the cache is flushed.
It's like calling the commands with O_DIRECT, so now you can limit the IO for example with:
systemd-run --scope -q --nice=19 -p BlockIOAccounting=true -p BlockIOWeight=10 -p "BlockIOWriteBandwidth=/dev/sda 10M" nocache youroperation_here
I usually use it with:
nice -n 19 ionice -c 3 nocache youroperation_here

Core dump is created, but not written to a file?

I'm trying to get a core dump of a proprietary application running on an embedded linux system, for which I wrote some plugins.
What I did was:
ulimit -c unlimited
echo "/tmp/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
kill -3 <PID>`
However, no core dump is created. '/tmp/cores' exists and is writable for everyone, and the disk has enough space available. When I try the same thing with sleep 100 & as an example process and then kill it, the core dump is created.
I tried the example for the pipe syntax from the core manpage, which writes some parameters and the size of the core dump into a file called core.info. This file IS created, and the size is greater than 0. So if the core dump is created, why isn't it written to /tmp/cores? To be sure, I also searched for core* on the file system - it's not there. dmesg doesn't show any errors (but it does if I pipe the core dump to an invalid program).
Some more info: The system is probably based on Debian, but I'm not quite sure. GDB is not available, as well as many other tools - there is only busybox for basic stuff.
The process I'm trying to debug is automatically restarted soon after being killed.
So, I guess one solution would be to modify the example program in order to write the dump to a file instead of just counting bytes. But why doesn't it work just normally if there obviously is some data?
If your proprietary application calls setrlimit(2) with RLIMIT_CORE set to 0, or if it is setuid, no core dump happens. See core(5). Perhaps use strace(1) to find out. And you could install  gdb (perhaps by [cross-] compiling it). See also gcore(1).
Also, check (and maybe set) the limit in the invoking shell. With bash(1) use ulimit builtin. Otherwise, cat /proc/self/limits should display the limits. If you don't have bash you could code a small wrapper in C calling setrlimit then execve ...

LINUX: How to lock the pages of a process in memory

I have a LINUX server running a process with a large memory footprint (some sort of a database engine). The memory allocated by this process is so large that part of it needs to be swapped (paged) out.
What I would like to do is to lock the memory pages of all the other processes (or a subset of the running processes) in memory, so that only the pages of the database process get swapped out. For example I would like to make sure that i can continue to connect remotely and monitor the machine without having the processes impacted by swapping. I.e. I want sshd, X, top, vmstat, etc to have all pages memory resident.
On linux there are the mlock(), mlockall() system calls that seem to offer the right knob to do the pinning. Unfortunately, it seems to me that I need to make an explicit call inside every process and cannot invoke mlock() from a different process or from the parent (mlock() is not inherited after fork() or evecve()).
Any help is greatly appreciated. Virtual pizza & beer offered :-).
It has been a while since I've done this so I may have missed a few steps.
Make a GDB command file that contains something like this:
call mlockall(3)
detach
Then on the command line, find the PID of the process you want to mlock. Type:
gdb --pid [PID] --batch -x [command file]
If you get fancy with pgrep that could be:
gdb --pid $(pgrep sshd) --batch -x [command file]
Actually locking the pages of most of the stuff on your system seems a bit crude/drastic, not to mention being such an abuse of the mechanism it seems bound to cause some other unanticipated problems.
Ideally, what you probably actually want is to control the "swappiness" of groups of processes so the database is first in line to be swapped while essential system admin tools are the last, and there is a way of doing this.
While searching for mlockall information I ran across this tool. You may be able to find it for your distribution. I only found the man page.
http://linux.die.net/man/8/memlockd
Nowadays, the easy and right way to tackle the problem is cgroup.
Just restrict memory usage of database process:
1. create a memory cgroup
sudo cgcreate -g memory:$test_db -t $User:$User -a $User:$User
2. limit the group's RAM usage to 1G.
echo 1000M > /sys/fs/cgroup/memory/$test_db/memory.limit_in_bytes
or
echo 1000M > /sys/fs/cgroup/memory/$test_db/memory.soft_limit_in_bytes
3. run the database program in the $test_db cgroup
cgexec -g memory:$test_db $db_program_name

Run an untrusted C program in a sandbox in Linux that prevents it from opening files, forking, etc.?

I was wondering if there exists a way to run an untrusted C program under a sandbox in Linux. Something that would prevent the program from opening files, or network connections, or forking, exec, etc?
It would be a small program, a homework assignment, that gets uploaded to a server and has unit tests executed on it. So the program would be short lived.
I have used Systrace to sandbox untrusted programs both interactively and in automatic mode. It has a ptrace()-based backend which allows its use on a Linux system without special privileges, as well as a far faster and more poweful backend which requires patching the kernel.
It is also possible to create a sandbox on Unix-like systems using chroot(1), although that is not quite as easy or secure. Linux Containers and FreeBSD jails are a better alternative to chroot. Another alternative on Linux is to use a security framework like SELinux or AppArmor, which is what I would propose for production systems.
We would be able to help you more if you told as what exactly it is that you want to do.
EDIT:
Systrace would work for your case, but I think that something based on the Linux Security Model like AppArmor or SELinux is a more standard, and thus preferred, alternative, depending on your distribution.
EDIT 2:
While chroot(1) is available on most (all?) Unix-like systems, it has quite a few issues:
It can be broken out of. If you are going to actually compile or run untrusted C programs on your system, you are especially vulnerable to this issue. And if your students are anything like mine, someone WILL try to break out of the jail.
You have to create a full independent filesystem hierarchy with everything that is necessary for your task. You do not have to have a compiler in the chroot, but anything that is required to run the compiled programs should be included. While there are utilities that help with this, it's still not trivial.
You have to maintain the chroot. Since it is independent, the chroot files will not be updated along with your distribution. You will have to either recreate the chroot regularly, or include the necessary update tools in it, which would essentially require that it be a full-blown Linux distribution. You will also have to keep system and user data (passwords, input files e.t.c.) synchronized with the host system.
chroot() only protects the filesystem. It does not prevent a malicious program from opening network sockets or a badly-written one from sucking up every available resource.
The resource usage problem is common among all alternatives. Filesystem quotas will prevent programs from filling the disk. Proper ulimit (setrlimit() in C) settings can protect against memory overuse and any fork bombs, as well as put a stop to CPU hogs. nice(1) can lower the priority of those programs so that the computer can be used for any tasks that are deemed more important with no problem.
I wrote an overview of sandboxing techniques in Linux recently. I think your easiest approach would be to use Linux containers (lxc) if you dont mind about forking and so on, which don't really matter in this environment. You can give the process a read only root file system, an isolated loopback network connection, and you can still kill it easily and set memory limits etc.
Seccomp is going to be a bit difficult, as the code cannot even allocate memory.
Selinux is the other option, but I think it might be more work than a container.
Firejail is one of the most comprehensive tools to do that - it support seccomp, filesystem containers, capabilities and more:
https://firejail.wordpress.com/features-3/
You can use Qemu to test assignments quickly. This procedure below takes less than 5 seconds on my 5 year old laptop.
Let's assume the student has to develop a program that takes unsigned ints, each on their own line, until a line with "-1" arrives. The program should then average all the ints and output "Average: %f". Here's how you could test program completely isolated:
First, get root.bin from Jslinux, we'll use that as the userland (it has the tcc C-compiler):
wget https://github.com/levskaya/jslinux-deobfuscated/raw/master/root.bin
We want to put the student's submission in root.bin, so set up the loop device:
sudo losetup /dev/loop0 root.bin
(you could use fuseext2 for this too, but it's not very stable. If it stabilizes, you won't need root for any of this)
Make an empty directory:
mkdir mountpoint
Mount root.bin:
sudo mount /dev/loop0 mountpoint
Enter the mounted filesystem:
cd mountpoint.
Fix rights:
sudo chown -R `whoami` .
mkdir -p etc/init.d
vi etc/init.d:
#!/bin/sh
cd /root
echo READY 2>&1 > /dev/ttyS0
tcc assignment.c 2>&1 > /dev/ttyS0
./a.out 2>&1 > /dev/ttyS0
chmod +x etc/init.d/rcS
Copy the submission to the VM:
cp ~/student_assignment.c root/assignment.c
Exit the VM's root FS:
cd ..
sudo umount mountpoint
Now the image is ready, we just need to run it. It will compile and run the submission after booting.
mkfifo /tmp/guest_output
Open a seperate terminal and start listening for guest output:
dd if=/tmp/guest_output bs=1
In another terminal:
qemu-system-i386 -kernel vmlinuz-3.5.0-27-generic -initrd root.bin -monitor stdio -nographic -serial pipe:/tmp/guestoutput
(I just used the Ubuntu kernel here, but many kernels will work)
When the guest output shows "READY", you can send keys to the VM from the qemu prompt.
For example, to test this assignment, you could do
(qemu) sendkey 1
(qemu) sendkey 4
(qemu) sendkey ret
(qemu) sendkey 1
(qemu) sendkey 0
(qemu) sendkey ret
(qemu) sendkey minus
(qemu) sendkey 1
(qemu) sendkey ret
Now Average = 12.000000 should appear on the guest output pipe. If it doesn't, the student failed.
Quit qemu: quit
A program passing the test is here: https://stackoverflow.com/a/14424295/309483. Just use tcclib.h instead of stdio.h.
Try User-mode Linux. It has about 1% performance overhead for CPU-intensive jobs, but it may be 6 times slower for I/O-intensive jobs.
Running it inside a virtual machine should offer you all the security and restrictions you want.
QEMU would be a good fit for that and all the work (downloading the application, updating the disk image, starting QEMU, running the application inside it, and saving the output for later retrieval) could be scripted for automated tests runs.
When it goes about sanboxing based on ptrace (strace) check-out:
"sydbox" sandbox and "pinktrace" programming library ( it's C99 but there are bindings to python and ruby as far as I know).
Collected links related to topic:
http://www.diigo.com/user/wierzowiecki/sydbox
(sorry that not direct links, but no enough reputation points yet)
seccomp and seccomp-bpf accomplish this with the least effort: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
ok thanks to all the answers they helped ME a lot. But i would suggest none of them as an solution for the person who asked the original question. All mentioned tools require to much work for the purpose to test students code as a teacher,tutor,prof. The best way in this case would be in my opinion virtualbox. Ok, its emulates an complete x68-system and has nothing to do with the meaning of sandboxing in this way but if i imagine my programming teacher it would be the best for him. So "apt-get install virtualbox" on debian based systems, all others head over to http://virtualbox.org/ , create a vm, add an iso, click install, wait some time and be lucky. It will be much easier to use as to set up user-mode-linux or doing some heavy strace stuff...
And if you have fears about your students hacking you i guess you have an authority problem and a solution for that would be threaten them that you will sue the living daylights out of them if you can prove just one bite of maleware in the work they give you...
Also if there is a class and 1% of it is as good as he could do such things, dont bore them with such simple tasks and give them some big ones where they have to code some more. Integrative learning is best for everyone so dont relay on old deadlocked structures...
And of cause, never use the same computer for important things (like writing attestations and exams), that you are using for things like browsing the web and testing software.
Use an off line computer for important things and an on line computer for all other things.
However to everyone else who isnt a paranoid teacher (dont want to offend anybody, i am just the opinion that you should learn the basics about security and our society before you start being a programmers teacher...)
... where was i ... for everyone else:
happy hacking !!

How to set CPU load on a Red Hat Linux box?

I have a RHEL box that I need to put under a moderate and variable amount of CPU load (50%-75%).
What is the best way to go about this? Is there a program that can do this that I am not aware of? I am happy to write some C code to make this happen, I just don't know what system calls will help.
This is exactly what you need (internet archive link):
https://web.archive.org/web/20120512025754/http://weather.ou.edu/~apw/projects/stress/stress-1.0.4.tar.gz
From the homepage:
"stress is a simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPL."
Find a simple prime number search program that has source code. Modify the source code to add a nanosleep call to the main loop with whichever delay gives you the desired CPU load.
One common way to get some load on a system is to compile a large software package over and over again. Something like the Linux kernel.
Get a copy of the source code, extract the tar.bz2, go into the top level source directory, copy your kernel config from /boot to .config or zcat /proc/config.gz > .config, the do make oldconfig, then while true; do make clean && make bzImage; done
If you have an SMP system, then make -j bzImage is fun, it will spawn make tasks in parallel.
One problem with this is adjusting the CPU load. It will be a maximum CPU load except for when waiting on disk I/O.
You could possibly do this using a Bash script. Use " ps -o pcpu | grep -v CPU" to get the CPU Usage of all the processes. Add all those values together to get the current usage. Then have a busy while loop that basically keeps on checking those values, figuring out the current CPU usage, and waiting a calculated amount of time to keep the processor at a certain threshhold. More detail is need, but hopefully this will give you a good starting point.
Take a look at this CPU Monitor script I found and try to get some other ideas on how you can accomplish this.
It really depends what you're trying to test. If you're just testing CPU load, simple scripts to eat empty CPU cycles will work fine. I personally had to test the performance of a RAID array recently and I relied on Bonnie++ and IOZone. IOZone will put a decent load on the box, particularly if you set the file size higher than the RAM.
You may also be interested in this Article.
Lookbusy enables set value of CPU load.
Project site
lookbusy -c util[-high_util], --cpu-util util[-high_util]
i.e. 60% load
lookbusy -c 60
Use the "nice" command.
a) Highest priority:
$ nice -n -20 my_command
or
b) Lowest priority:
$ nice -n 20 my_command
A Simple script to load & hammer the CPU using awk. The script does mathematical calculations and thus CPU load peaks up on higher values passwd to loadserver.sh .
checkout the script # http://unixfoo.blogspot.com/2008/11/linux-cpu-hammer-script.html
You can probably use some load-generating tool to accomplish this, or run a script to take all the CPU cycles and then use nice and renice on the process to vary the percentage of cycles that the process gets.
Here is a sample bash script that will occupy all the free CPU cycles:
#!/bin/bash
while true ; do
true
done
Not sure what your goal is here. I believe glxgears will use 100% CPU.
So find any process that you know will max out the CPU to 100%.
If you have four CPU cores(0 1 2 3), you could use "taskset" to bind this process to say CPUs 0 and 1. That should load your box 50%. To load it 75% bind the process to 0 1 2 CPUs.
Disclaimer: Haven't tested this. Please let us know your results. Even if this works, I'm not sure what you will achieve out of this?

Resources