I have a CentOS image in VirtualBox. When I do curl [url] | tee -a [file], where [url] is the URL of a large file, the system starts to kill all new processes and I get a "Killed" response in the console for any command except kill and cd. How can I disable the OOM killer?
The OOM Killer is your friend; why would you want to disable it? When the system is running out of memory, the kernel must start killing processes in order to stay operational. So let's be honest: you need the OOM Killer.
Instead, you might consider tuning the OOM Killer with a configuration that suits your needs; even then, your current problem may persist.
In light of that, it may be better to find a more memory-efficient way of doing the tasks you are doing.
If you don't want "your friend", the OOM killer, to kill innocent processes, a short answer is:
sysctl -w vm.overcommit_memory=2
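If that works for you and you want the setting to survive a reboot, here is a minimal sketch of a persistent configuration (the file name is hypothetical; any file that sysctl reads at boot will do):
# /etc/sysctl.d/90-overcommit.conf  (hypothetical file name)
vm.overcommit_memory = 2
vm.overcommit_ratio = 80    # optional; with mode 2 the commit limit is roughly swap + this percentage of RAM
You can apply it without rebooting with sysctl -p /etc/sysctl.d/90-overcommit.conf.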
More verbose answers and recommended reading:
Effects of configuring vm.overcommit_memory
How to disable the oom killer in linux?
Turn off the Linux OOM killer by default?
When I use nvidia-smi, I find that nearly 20 GB of GPU memory is missing somewhere (the listed processes total 17745 MB, while Memory-Usage shows 37739 MB):
Then I used nvitop, and you can see that a "No Such Process" entry has actually taken my GPU resources. However, I cannot kill this PID:
>>> sudo kill -9 118238
kill: (118238): No such process
How can I get rid of this ghost process without interrupting others?
I have found the solution in this answer: https://stackoverflow.com/a/59431785/6563277.
First, I ran sudo fuser -v /dev/nvidia* to see all the processes using my GPU memory that nvidia-smi had failed to show.
There, I saw some "ghost" Python processes, and after killing them, the GPU memory was freed up.
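As a hedged sketch of how to spot such ghosts automatically (the loop is illustrative, not a polished tool; it assumes fuser and nvidia-smi behave as described above):
# list PIDs that hold /dev/nvidia* but are not reported by nvidia-smi
for pid in $(sudo fuser /dev/nvidia* 2>/dev/null); do
    nvidia-smi --query-compute-apps=pid --format=csv,noheader | grep -qw "$pid" \
        || echo "possible ghost: $pid $(ps -o comm= -p "$pid")"
done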
I'm frequently facing an issue where kswapd0 runs hot on one of my Linux machines. What could be the reason for that? Looking more into it, I understood that it is probably due to low memory, so I tried the options below to avoid it:
echo 1 > /proc/sys/vm/drop_caches
cat /proc/sys/vm/drop_caches
sudo cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=60
but they did not yield fruitful results. What would be the best method to avoid this, or does some action need to be taken on the machine's RAM? Any suggestions?
Every time we observe this, all the running apps are killed automatically and kswapd0 occupies the entire CPU and memory.
I'm trying to test a behaviour which is hard to reproduce in a controlled environment.
Use case:
Linux system; usually Redhat EL 5 or 6 (we're just starting with RHEL 7 and systemd, so it's currently out of scope).
There're situations where I need to restart a service. The script we use for stopping the service usually works quite well; it sends a SIGTERM to the process, which is designed to handle it; if the process doesn't handle the SIGTERM within a timeout (usually a couple of minutes) the script sends a SIGKILL, then waits a couple minutes more.
The problem is: in some (rare) situations, the process doesn't exit after a SIGKILL; this usually happens when it's badly stuck on a system call, possibly because of a kernel-level issue (corrupt filesystem, or not-working NFS filesystem, or something equally bad requiring manual intervention).
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old one was still running; we're fixing this with a stronger locking system (so that at least the new process doesn't start if the old one is running), but I find it difficult to test the whole thing because I haven't found a way to simulate a hard-stuck process.
So, the question is:
How can I manually simulate a process that doesn't exit when sending a SIGKILL to it, even as a privileged user?
If your process is stuck doing I/O, you can simulate the situation this way:
lvcreate -n lvtest -L 2G vgtest
mkfs.ext3 -m0 /dev/vgtest/lvtest
mount /dev/vgtest/lvtest /mnt
dmsetup suspend /dev/vgtest/lvtest && dd if=/dev/zero of=/mnt/file.img bs=1M count=2048 &
This way the dd process will get stuck waiting for I/O and will ignore every signal. Note that on recent kernels, signals are no longer ignored when processes are waiting for I/O on an NFS filesystem.
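Assuming the volume names used above, you can undo the test afterwards roughly like this (a sketch; once I/O is resumed the stuck dd finishes or can be killed and reaped):
dmsetup resume /dev/vgtest/lvtest
umount /mnt
lvremove /dev/vgtest/lvtest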
Well... how about just not sending the SIGKILL? That way your environment will behave as if it was sent but the process didn't quit.
Once a process is in the "D" state (TASK_UNINTERRUPTIBLE), it is in a kernel code path whose execution cannot be interrupted while the task is being processed, which means sending any signal to the process is not useful and will be ignored.
This can be caused by a device driver getting too many interrupts from the hardware, too many incoming network packets, data from NIC firmware, or by being blocked on a hard disk performing I/O. Normally this clears very quickly, and threads remain in this state only for a very short span of time.
Therefore, look at the syslog and sar reports from the time when the process was stuck in the D state. If you find stack traces in the log, try searching bugzilla.kernel.org for similar issues or seek support from your Linux vendor.
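A hedged way to spot tasks in that state and see what they are waiting on (where the kernel exposes it), using standard procps ps options:
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'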
I would code it the opposite way. Have your server process write its pid to e.g. /var/run/yourserver.pid (this is common practice). Have the starting script read that file and test that the process does not exist, e.g. with kill -0, or with
yourserver_pid=$(cat /var/run/yourserver.pid)
if [ -f /proc/$yourserver_pid/exe ]; then
    echo "yourserver (pid $yourserver_pid) still seems to be running" >&2
fi
You could improve that by running readlink /proc/$yourserver_pid/exe and comparing the result to /usr/bin/yourserver.
BTW, having a process still alive a few seconds after a SIGKILL is a serious situation (the common case when it could happen is if the process is stuck in a D state, waiting for some NFS server), and you probably should detect and syslog it (e.g. with logger in your script).
I would also try to first send SIGTERM, wait a few seconds, send SIGQUIT, wait a few seconds, and at last send SIGKILL, and only a few seconds later test that the server process has gone.
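A rough sketch of that escalation, assuming the /var/run/yourserver.pid convention above (the service name and paths are placeholders):
#!/bin/sh
# hedged sketch: escalate TERM -> QUIT -> KILL, then check for a survivor and log it
pid=$(cat /var/run/yourserver.pid)
for sig in TERM QUIT KILL; do
    kill -s $sig "$pid" 2>/dev/null || exit 0   # kill fails once the process is gone
    sleep 5
done
if [ "$(readlink /proc/$pid/exe 2>/dev/null)" = "/usr/bin/yourserver" ]; then
    logger -p daemon.err "yourserver (pid $pid) survived SIGKILL; probably stuck in D state"
    exit 1
fi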
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running;
This is a bug at the OS/kernel level, not in your service script. The situation is rare and hard to simulate because the OS is supposed to kill the process when a SIGKILL is delivered. So I guess your goal is to make your script work well even under a buggy kernel. Is that correct?
You can attach gdb to the process; SIGKILL won't remove such a process from the process list, but it will flag it as a zombie, which might still be acceptable for your purpose.
void#tahr:~$ ping 8.8.8.8 > /tmp/ping.log &
[1] 3770
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 S 0:00 ping 8.8.8.8
void#tahr:~$ sudo gdb -p 3770
...
(gdb)
Other terminal
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 t 0:00 ping 8.8.8.8
sudo kill -9 3770
...
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 Z 0:00 [ping] <defunct>
First terminal again
(gdb) quit
When doing
tail -f /var/log/apache2/access.log
It shows logs and then
Killed
I have to re-execute tail -f to see new logs.
How do I make tail -f continually display logs without killing itself?
The first thing I'd do is try --follow=name instead of plain -f. Your problem could be happening because your log file is being rotated out. From the man page:
With --follow (-f), tail defaults to following the file descriptor, which means that even if a tail'ed file is renamed, tail will continue to track its end. This default behavior is not desirable when you really want to track the actual name of the file, not the file descriptor (e.g., log rotation). Use --follow=name in that case. That causes tail to track the named file in a way that accommodates renaming, removal and creation.
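If rotation turns out to be the cause, following the name (with --retry so tail keeps waiting through the rename/recreate, equivalent to tail -F) should behave as expected:
tail --follow=name --retry /var/log/apache2/access.log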
tail -f should not get killed.
By the way, tail does not kill itself; it is killed by something else, for example because the system is out of memory or a resource limit is too restrictive.
Figure out what kills your tail, using for example gdb or strace. Also check your environment, at least ulimit -a and dmesg, for any clues.
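As a hedged illustration of the strace approach, running tail under strace records which signal finally kills it:
strace -f -o /tmp/tail.trace tail -f /var/log/apache2/access.log
# after tail dies, the end of /tmp/tail.trace should show something like:
# +++ killed by SIGKILL +++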
If your description is correct, and tail actually displays
Killed
then it is probably not happening as a result of log rotation. Log rotation will cause tail to stop displaying new lines, but even if the file is deleted, tail will not be killed.
Rather, some other process on the system, or perhaps the kernel, is sending it a signal 9 (SIGKILL). Possible causes for this include:
A user in another terminal issuing a command such as kill -9 1234 or pkill -9 tail
Some other tool or daemon (although I can't think of any that would do this)
The kernel itself can send SIGKILL to your process. One scenario under which it would do this is if the OOM (out of memory) killer kicked in. This happens when all RAM and swap in the system are used. The kernel will select a process which is using a lot of memory and kill it. If this were happening it would be visible in syslog, but it is quite unlikely that tail would use that much memory.
The kernel can send you SIGKILL if RLIMIT_CPU (the limit on the amount of CPU time your process has used) is exceeded. If you leave tail running for long enough, and you have a ulimit set, then this can happen. To check for this (and other resource limitations), use ulimit -a.
In my opinion, either the first or last of these explanations seems most likely.
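A quick, hedged way to check for both of those causes (exact kernel log wording varies by distribution):
dmesg | grep -iE 'out of memory|killed process'   # evidence of the OOM killer
ulimit -t                                         # CPU-time limit; "unlimited" rules out RLIMIT_CPU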
You need to use tail -F logfile; it will not get terminated if the log file rotates.
I have started nginx, and when I stop it as root with
/etc/init.d/nginx stop
after that I type
ps aux | grep nginx
and get a response like tcp LISTEN 2124 nginx WORKER
kill -9 2124 # tried with kill -QUIT 2124, kill -KILL 2124
and afterwards, when I type again
ps aux | grep nginx
I get a response like tcp LISTEN 2125 nginx WORKER
and so on.
How do I kill this immortal Chuck Norris worker?
After kill -9 there's nothing more to do to the process - it's dead (or doomed to die). The reason it sticks around is because either (a) its parent process hasn't waited for it yet, so the kernel holds the process table entry to keep its status until the parent does so, or (b) the process is stuck on a system call into the kernel that is not finishing (which usually means a buggy driver and/or hardware).
In the first case, getting the parent to wait for the child, or terminating the parent, should work. Most programs don't have a clear way to make them "wait for a child", so that may not be an option.
In the second case, the most likely solution is to reboot. There may be tools that could clear such a condition, but that's not common. Depending on just what that kernel processing is doing, it may be possible to get it to unblock by other means - but that requires knowledge of that processing. For example, if the process is blocked on a kernel lock that some other process is somehow holding indefinitely, terminating that other process could alleviate the problem.
Note that the ps command can distinguish these two cases as well: processes from case (a) show up in the 'Z' state (and may also be marked "defunct"), while those from case (b) show up in the 'D' state. See the ps man page for more info: http://linux.die.net/man/1/ps.
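A hedged one-liner to list such leftover entries, assuming standard procps ps:
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^[ZD]/'   # zombies (Z) and uninterruptible sleepers (D)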
I had the same issue.
In my case GitLab was responsible for bringing up the nginx workers.
When I completely removed GitLab from my server, I was able to kill the nginx workers.
ps -aux | grep "nginx"
Search for the workers and check in the first column who is bringing them up.
Kill or uninstall the responsible process, then kill the workers again; they will stop respawning ;D
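A hedged way to see not only the owning user but also the parent that keeps respawning the workers:
ps -eo pid,ppid,user,cmd | grep '[n]ginx'   # the PPID column shows which process keeps spawning the workers
# then look that PPID up with ps to see whether it is gitlab, monit, supervisord, etc.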
I was having a similar issue.
Check if you are using any auto-healer like Monit or Supervisor that restarts the workers whenever you try to stop them. If yes, disable it.
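Hedged checks for such a supervising daemon, assuming one of these tools is installed:
sudo monit summary          # lists services Monit is watching, if Monit is present
sudo supervisorctl status   # lists programs Supervisor manages, if Supervisor is present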
My workers were being spawned due to changes I had forgotten I made with update-rc.d on Ubuntu.
So I installed sysv-rc-conf, which gives a clean interface for controlling which services start on boot; you can disable them from there, and I assure you: no Chuck Norris resurrection :D