Get notification on cgroup process change? - linux

Basically, inotify which normally serves to notify on filesystem changes doesn't work within the cgroup virtual filesystem.
Essentially I want a way to get a notification similar to inotify when a process in a cgroup either is dies or forks. I tried attaching inotify to the tasks virtual file inside the cgroup filesystem but that does nothing when a process forks on its own, only when a usespace tool actually manually writes to it to influence the cgroup.

inotify does not work on such virtual file system, be it cgroup, proc or sys.
Note: I tried this too, it would have been very handy in some situations, but nope. :-)
This is because the files and directories do not actually exist per see (for example they take 0 disk space), they are produced for you on the fly by the kernel as you visit them.
So the alternative would be to actively visit the files and dir in a busy loop periodically, which is so ugly that it is not a real alternative in most cases.
And this is why programs such as top, htop and such consume so much CPU. They do actually and actively browse the proc virtual file system rather than inotify or select or stuff like that in an eventing manner.
EDIT:
But there are some things that could help you though:
1/ For recent kernels (cgroups have been re-designed):
Look at:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
I quote:
2-3. [Un]populated Notification
Each non-root cgroup has a "cgroup.events" file which contains
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. [...]
1/ For older kernels:
You may want to have a look at notify_on_release and release_agent. Have a look at:
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
notify_on_release flag: run the release agent on exit?
release_agent: the path to use for release notifications (this file exists in the top cgroup only)
And the sections "1.4 What does notify_on_release do ?" and "1.5 What does clone_children do ?"

Related

Limit the percentage of CPU a process tree is allowed to use?

Can I limit the percentage of CPU a running process and all its current and future children can use, combined? I've heard about the cpulimit tool, but that seems to ignore child processes.
Edit: So, the answer I found requires cpulimit to run constantly untill we want the limit to stay in effect, since it is doing the limiting by actively sending suspend and then continue signals to the process. Are there perhaps other ways to achieve this limiting effect, perhaps without the need for such a secondary process running in the background?
Yes!
Just as I was writing this question, found out that I was trying an old version of cpulimit.
The new version supports limiting child processes too.
$ cpulimit -h
Usage: cpulimit [OPTIONS...] TARGET
OPTIONS
-l, --limit=N percentage of cpu allowed from 0 to 400 (required)
-v, --verbose show control statistics
-z, --lazy exit if there is no target process, or if it dies
-i, --include-children limit also the children processes
-h, --help display this help and exit
TARGET must be exactly one of these:
-p, --pid=N pid of the process (implies -z)
-e, --exe=FILE name of the executable program file or path name
COMMAND [ARGS] run this command and limit it (implies -z)
Report bugs to <marlonx80#hotmail.com>.
I've been researching this problem for the last few days, and I found at least two more options: cgroups and CPU affinity
Given that this topic has been viewed more than 2k times, and it's been difficult to find a good source of information, let my post my notes here for future reference.
cgroups
Caveats: you can't use cgroups inside of Docker and you need root access for a one-time setup.
There is cgroups v1 and v2. As of 2020-04-22, only Red Hat has switched to v2 by default and you can't use both at the same time. I.e., you're probably on v1.
You need root to create a cgroup directory / configure your system to create one at startup and delegate access to your non-root user, like so:
v1: mkdir /sys/fs/cgroup/cpu/<directory>/ && chown -R user /sys/fs/cgroup/cpu/<directory>/ (this is specific to restricting CPU usage - there are other cgroup 'controllers' that use different directories; a process can be in multiple cgroups)
v2: mkdir /sys/fs/cgroup/unified/<directory> && chown -R user /sys/fs/cgroup/unified/<directory> (this is a unified cgroup hierarchy, and you can control all cgroup 'controllers' via a single cgroup; a process can be in only one cgroup, and that cgroup must not contain other cgroups - i.e., a leaf cgroup)
Configure the cgroup by writing to control files in this tree:
Configure CPU quota using cpu.cfs_quota_us and cpu.cfs_period_us, e.g., echo 10000 > cpu.cfs_quota_us
Add a process to the new cgroup by writing its pid to the cgroup.procs control file. (All subprocesses are automatically in the same cgroup.)
This link has more info:
https://drill.apache.org/docs/configuring-cgroups-to-control-cpu-usage/
CPU affinity
You can only use CPU affinity to limit CPU usage to an integer number of logical CPUs (aka cpu cores), not to a specific percentage. On today's multi-core systems, that may be good enough.
The documentation for this feature is at $ man sched_setaffinity. Note that cgroups also support setting CPU affinity through the cpuset controller.

Write module of kernel (Linux), which to save the page of process from removing to the swap

Need to save the page of process (the user part!) from removing to the swap.
I need to do it in the kernel, only. (language C I know)
(Maybe insert hook in shrink_page_list?)
I have IDs of processes, which need to save and threshold amount of physical memory in the system (We fill, while it isn't filled). IDs and threshold write in /proc, /dev or /sys.
How to approach this?
What files to look at?
What tutorials to read?
Maybe there are examples that are somehow are related with this task.
Info: I compilling kernel of Debian Lenny, use Qemu for start it on my Ubuntu.
See get_user_pages. http://www.makelinux.net/ldd3/chp-15-sect-3.
Use get_user_pages, you can get whatever page you want and keep it locked in memory.
Even better, look at the comments on the source at
http://lxr.free-electrons.com/source/mm/gup.c#L637

How to check the state of Linux threads?

How could I check the state of a Linux threads using codes, not tools? I want to know if a thread is running, blocked on a lock, or asleep for some other reason. I know the Linux tool "top" could do this work. But how to implement it in my own codes. Thanks.
I think you should study in details the /proc file system, also documented here, inside kernel source tree.
It is the way the Linux kernel tells things to outside!
There is a libproc also (used by ps and top, which reads /proc/ pseudo-files).
See this question, related to yours.
Reading files under /proc/ don't do any disk I/O (because /proc/ is a pseudo file system), so goes fast.
Lets say your process id is 100.
Go to /proc/100/task directory and there you could see multiple directories representing each threads.
then inside each subdirectory e.g. /proc/100/task/10100 there is a file named status.
the 2nd line inside this file is the state information of the thread.
You could also find it with by looking at the cgroup hierarchy of the service that your process belongs. Cgroups have a file called "tasks" and this file lists all the tasks of a service.
For example:
cat /sys/fs/cgroup/systemd/system.slice/hello.service/tasks
Note: cgroup should be enabled in your linux kernel.

trigger alert when a specified command executes in linux

I have 3 samba shares mounted in my system, but suddenly, one of them gets umounted without my permision. Maybe one of houndreds of scripts which run in my crontab, but i dont know which one.
I've reviewed all /var/log directory looking for umount word without success, then i want to log when command umount is executed and which process is running it.
Maybe with syslog, maybe with another log, maybe a mail to my box....
Thanks a lot.
I have this software:
mount: mount-2.12q
mount.cifs version: 1.14-3.5.4
Unmounting does not only happen by calling the umount binary, many programs might do it. See the manual page (man syscalls) and search for umount. This said, you would have to hook the corresponding syscall and see who invokes it. I'm not sure, but most probably it's possible to disconnect inside the kernel by calling the corresponding method directly, so functionality might bypass the syscall interface which is mainly required for userspace interaction. In this case you would have to use some debugging technique on the kernel itself, which maybe is a little much for finding your problem!
You may have success using strace on an already running process (man strace), for example smbd, and see if this process invokes umount, which is quite possible.
Anyways, if you can recompile your kernel from source, you might add some printk message inside the function that is used to unmount a device to see which process did it (this would be my approach for cases where nothing else, including strace, helps).
Since the mount is a change in the filesystem, maybe the inode-observer incron is a solution for you. Another option might be the auditd.

How to "hibernate" a process in Linux by storing its memory to disk and restoring it later?

Is it possible to 'hibernate' a process in linux?
Just like 'hibernate' in laptop, I would to write all the memory used by a process to disk, free up the RAM. And then later on, I can 'resume the process', i.e, reading all the data from memory and put it back to RAM and I can continue with my process?
I used to maintain CryoPID, which is a program that does exactly what you are talking about. It writes the contents of a program's address space, VDSO, file descriptor references and states to a file that can later be reconstructed. CryoPID started when there were no usable hooks in Linux itself and worked entirely from userspace (actually, it still does work, depending on your distro / kernel / security settings).
Problems were (indeed) sockets, pending RT signals, numerous X11 issues, the glibc caching getpid() implementation amongst many others. Randomization (especially VDSO) turned out to be insurmountable for the few of us working on it after Bernard walked away from it. However, it was fun and became the topic of several masters thesis.
If you are just contemplating a program that can save its running state and re-start directly into that state, its far .. far .. easier to just save that information from within the program itself, perhaps when servicing a signal.
I'd like to put a status update here, as of 2014.
The accepted answer suggests CryoPID as a tool to perform Checkpoint/Restore, but I found the project to be unmantained and impossible to compile with recent kernels.
Now, I found two actively mantained projects providing the application checkpointing feature.
The first, the one I suggest 'cause I have better luck running it, is CRIU
that performs checkpoint/restore mainly in userspace, and requires the kernel option CONFIG_CHECKPOINT_RESTORE enabled to work.
Checkpoint/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: /krɪʊ/, Russian: криу), is a software tool for Linux operating system. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space.
The latter is DMTCP; quoting from their main page:
DMTCP (Distributed MultiThreaded Checkpointing) is a tool to transparently checkpoint the state of multiple simultaneous applications, including multi-threaded and distributed applications. It operates directly on the user binary executable, without any Linux kernel modules or other kernel modifications.
There is also a nice Wikipedia page on the argument: Application_checkpointing
The answers mentioning ctrl-z are really talking about stopping the process with a signal, in this case SIGTSTP. You can issue a stop signal with kill:
kill -STOP <pid>
That will suspend execution of the process. It won't immediately free the memory used by it, but as memory is required for other processes the memory used by the stopped process will be gradually swapped out.
When you want to wake it up again, use
kill -CONT <pid>
The more complicated solutions, like CryoPID, are really only needed if you want the stopped process to be able to survive a system shutdown/restart - it doesn't sound like you need that.
Linux Kernel has now partially implemented the checkpoint/restart futures:https://ckpt.wiki.kernel.org/, the status is here.
Some useful information are in the lwn(linux weekly net):
http://lwn.net/Articles/375855/ http://lwn.net/Articles/412749/ ......
So the answer is "YES"
The issue is restoring the streams - files and sockets - that the program has open.
When your whole OS hibernates, the local files and such can obviously be restored. Network connections don't, but then the code that accesses the internet is typically more error checking and such and survives the error conditions (or ought to).
If you did per-program hibernation (without application support), how would you handle open files? What if another process accesses those files in the interim? etc?
Maintaining state when the program is not loaded is going to be difficult.
Simply suspending the threads and letting it get swapped to disk would have much the same effect?
Or run the program in a virtual machine and let the VM handle suspension.
Short answer is "yes, but not always reliably". Check out CryoPID:
http://cryopid.berlios.de/
Open files will indeed be the most common problem. CryoPID states explicitly:
Open files and offsets are restored.
Temporary files that have been
unlinked and are not accessible on the
filesystem are always saved in the
image. Other files that do not exist
on resume are not yet restored.
Support for saving file contents for
such situations is planned.
The same issues will also affect TCP connections, though CryoPID supports tcpcp for connection resuming.
I extended Cryopid producing a package called Cryopid2 available from SourceForge. This can
migrate a process as well as hibernating it (along with any open files and sockets - data
in sockets/pipes is sucked into the process on hibernation and spat back into these when
process is restarted).
The reason I have not been active with this project is I am not a kernel developer - both
this (and/or the original cryopid) need to get someone on board who can get them running
with the lastest kernels (e.g. Linux 3.x).
The Cryopid method does work - and is probably the best solution to general purpose process
hibernation/migration in Linux I have come across.
The short answer is "yes." You might start by looking at this for some ideas: ELF executable reconstruction from a core image (http://vx.netlux.org/lib/vsc03.html)
As others have noted, it's difficult for the OS to provide this functionality, because the application needs to have some error checking builtin to handle broken streams.
However, on a side note, some programming languages and tools that use virtual machines explicitly support this functionality, such as the Self programming language.
This is sort of the ultimate goal of clustered operating system. Mathew Dillon puts a lot of effort to implement something like this in his Dragonfly BSD project.
adding another workaround: you can use virtualbox. run your applications in a regular virtual machine and simply "save the machine state" whenever you want.
I know this is not an answer, but I thought it could be useful when there are no real options.
if for any reason you don't like virtualbox, vmware and Qemu are as good.
Ctrl-Z increases the chances the process's pages will be swapped, but it doesn't free the process's resources completely. The problem with freeing a process's resources completely is that things like file handles, sockets are kernel resources the process gets to use, but doesn't know how to persist on its own. So Ctrl-Z is as good as it gets.
There was some research on checkpoint/restore for Linux back in 2.2 and 2.4 days, but it never made it past prototype. It is possible (with the caveats described in the other answers) for certain values of possible - I you can write a kernel module to do it, it is possible. But for the common value of possible (can I do it from the shell on a commercial Linux distribution), it is not yet possible.
There's ctrl+z in linux, but i'm not sure it offers the features you specified. I suspect you asked this question since it doesn't

Resources