How to check if a process is in hang state (Linux) [closed]

Is there any command in Linux through which I can know whether a process is in a hung state?

There is no such command, but once I had to do a very dumb hack to accomplish something similar. I wrote a Perl script which periodically (every 30 seconds in my case) did the following (a rough sketch of the loop is shown below):
ran ps to get the list of PIDs of the watched processes (along with exec time, etc.)
looped over the PIDs
started gdb, attached to each process by its PID, dumped a stack trace from it using thread apply all where, and detached from the process
declared a process hung if:
its stack trace didn't change and its exec time didn't change after 3 checks, or
its stack trace didn't change and its exec time indicated 100% CPU load after 3 checks
killed the hung process to give the monitoring application a chance to restart the hung instance.
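A rough, hypothetical reconstruction of that loop in Python (the original Perl is long gone): the PID list, the 30-second interval, and the "3 identical samples" rule are assumptions, and it requires gdb to be installed and permission to ptrace the watched processes.

    import subprocess, time

    PIDS = [12345]        # hypothetical PIDs of the watched processes
    SAMPLES = {}          # pid -> last few (stack trace, cpu time) samples

    def stack_of(pid):
        # Attach gdb in batch mode and dump every thread's stack.
        out = subprocess.run(
            ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all where"],
            capture_output=True, text=True)
        return out.stdout

    def cpu_ticks(pid):
        # utime + stime in clock ticks, from /proc/<pid>/stat.
        fields = open(f"/proc/{pid}/stat").read().split()
        return int(fields[13]) + int(fields[14])

    while True:
        for pid in PIDS:
            history = SAMPLES.setdefault(pid, [])
            history.append((stack_of(pid), cpu_ticks(pid)))
            del history[:-3]      # keep only the last 3 samples
            # Declare it hung when nothing changed over 3 checks (the original
            # hack also flagged "same stack trace + 100% CPU" as hung).
            if len(history) == 3 and len(set(history)) == 1:
                print(f"{pid} looks hung")
                # subprocess.run(["kill", "-9", str(pid)])  # as in the original hack
        time.sleep(30)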
But that was a very, very crude hack, done to meet an about-to-be-missed deadline, and it was removed a few days later, after a fix for the buggy application was finally installed.
Otherwise, as all the other responders have quite correctly commented, there is no general way to find out whether a process is hung: a hang can occur for far too many reasons, often tied to the application logic.
The only reliable way is for the application itself to be capable of indicating whether it is alive or not. The simplest approach is, for example, a periodic "I'm alive" log message.

You could check the files
/proc/[pid]/task/[thread ids]/status
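For example, a minimal sketch (the PID is hypothetical) that prints the name and scheduler state of every thread of a process:

    import glob

    pid = 12345  # hypothetical PID of the process to inspect
    for path in glob.glob(f"/proc/{pid}/task/*/status"):
        with open(path) as f:
            for line in f:
                # "State:" is e.g. R (running), S (sleeping), D (disk sleep), Z (zombie)
                if line.startswith(("Name:", "State:")):
                    print(path, line.strip())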

What do you mean by ‘hang state’? Typically, a process that is unresponsive and using 100% of a CPU is stuck in an endless loop. But there's no way to determine whether that has actually happened or whether the process will eventually reach the loop's exit condition and carry on.
Desktop hang detectors just work by sending a message to the application's event loop and seeing if there's any response. If there isn't for a certain amount of time, they decide the app has ‘hung’... but it's entirely possible it was just doing something complicated and will come back to life in a moment once it's done. In any case, that's not an approach you can use for an arbitrary process.

Unfortunately, there is no "hung" state for a process. A hang can be a deadlock, in which case the threads in the process are in the blocked state. It can also be a livelock, where the process keeps running but does the same thing again and again; such a process is in the running state. So, as you can see, there is no definite hung state.
As suggested, you can use the top command to see whether the process is using 100% CPU or a lot of memory.
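If you'd rather sample that from a script than eyeball top, here is a rough sketch (the PID and the 5-second window are assumptions) that estimates a process's CPU usage from /proc:

    import os, time

    def cpu_ticks(pid):
        # utime + stime, in clock ticks, from /proc/<pid>/stat
        fields = open(f"/proc/{pid}/stat").read().split()
        return int(fields[13]) + int(fields[14])

    pid = 12345                              # hypothetical PID
    hz = os.sysconf("SC_CLK_TCK")            # clock ticks per second
    before = cpu_ticks(pid)
    time.sleep(5)
    used = (cpu_ticks(pid) - before) / hz    # CPU seconds consumed in 5 wall seconds
    print(f"~{100 * used / 5:.0f}% CPU over the last 5 s")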

Related

Jobs being killed for an unidentifiable reason [closed]

On Linux, I try to run a Fortran executable (or even recompile it and then run it) and the job is killed immediately; the shell just prints "Killed". Now, if I copy the whole directory, the program runs just fine in the "new" directory, but never in the original. This happens repeatedly, though not universally, and seems random to me. Even though I have a workaround, I am still wondering why this happens at all.
Run your program under strace to find out what it is doing before it gets killed. Just speculating, but could it be allocating a huge amount of memory? If system memory is exhausted, the out-of-memory (OOM) killer usually kills the process that is using memory most aggressively. Check /var/log/syslog to see whether the OOM killer kicked in.
Also see What killed my process and why? and Will Linux start killing my processes without asking me if memory gets short?.
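A small sketch of that workflow (the binary name is an assumption, and on systemd-based distributions the OOM messages may land in the journal or dmesg rather than /var/log/syslog):

    import subprocess

    # Run the suspect binary under strace, logging every syscall to trace.log;
    # the last lines of the log show what happened right before the kill.
    subprocess.run(["strace", "-f", "-o", "trace.log", "./a.out"])   # ./a.out is hypothetical

    # Then look for OOM-killer activity in the kernel log.
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in dmesg.splitlines():
        if "Out of memory" in line or "oom-killer" in line:
            print(line)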

MarkLogic Filesystem Log entry [closed]

I am seeing some slow MarkLogic cluster log entries like the ones below:
2020-01-14 05:55:22.649 Info: Slow background cluster.background.clusters, 5.727 sec
2020-01-14 05:55:22.649 Info: Slow background cluster.background.hosts.AssignmentManager, 5.581 sec
I suspect the filesystem is running slow and is not able to keep up with MarkLogic. I am also seeing the following log entry:
2020-01-14 05:55:53.380 Info: Linux file system mount option 'barrier' is default; recommend faster 'nobarrier' if storage has non-volatile write cache
I want to know what the above log entry means in MarkLogic. How can I be sure whether or not the filesystem is having slowness problems?
The meaning of the "slow" messages is that a background activity took longer than expected. It is an indicator of starvation.
From your question it's impossible to say what is causing it. Typically it's related to the underlying physical infrastructure MarkLogic is running on. MarkLogic doesn't have its own filesystem or other resources; it uses the OS's filesystem, memory, etc., and if the available physical resources are not enough for MarkLogic to serve the requested load, background operations will take longer than expected. This will always be reflected in the log.
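As for the barrier hint, you can at least verify which mount options are actually in effect for the volume holding your forests. Here is a rough sketch (the data directory path is an assumption, and whether nobarrier is safe depends entirely on your storage having a non-volatile write cache):

    # Find the mount that holds MarkLogic's data directory and show its options.
    data_dir = "/var/opt/MarkLogic"    # assumed default data directory
    best = None

    with open("/proc/mounts") as f:
        for line in f:
            device, mount_point, fstype, options, *_ = line.split()
            # Keep the longest mount point that is a prefix of the data directory.
            if data_dir.startswith(mount_point):
                if best is None or len(mount_point) > len(best[1]):
                    best = (device, mount_point, fstype, options)

    print(best)   # check the options field for 'barrier' / 'nobarrier'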
You can read more here:
Understanding "slow background" messages
https://help.marklogic.com/Knowledgebase/Article/View/508/0/understanding-slow-infrastructure-notifications
Introduction
In more recent versions of MarkLogic Server, "slow background" error log messages were added to note and help diagnose slowness.
Details
For "Slow background" messages, the system is timing how long it took to do some named background activity. These activities should not take long and the "slow background" message is an indicator of starvation. The activity can be slow because:
it is waiting on a mutex or semaphore held by some other slow thread;
the operating system is stalling it, possibly because the system is thrashing due to low memory.
Looking at the "slow background" messages in isolation is not sufficient to understand the reason; we just know that a lot of time passed since the last time the time-of-day clock was read. To understand the actual cause, additional evidence will need to be gathered from the time of the incident.
Notes:
In general, we do not time how long it takes to acquire a mutex or semaphore as reading the clock is usually more expensive than getting a mutex or semaphore.
We do not time things that usually take about a microsecond.
We do time things that usually take about a millisecond.
Related Articles
Knowledgebase: Understanding Slow Infrastructure Notifications
Knowledgebase: Understanding slow 'journal frame' entries in the ErrorLog (https://help.marklogic.com/Knowledgebase/Article/View/460/0/understanding-slow-journal-frame-entries-in-the-errorlog)
Knowledgebase: Hung Messages in the ErrorLog (https://help.marklogic.com/Knowledgebase/Article/View/35/0/hung-messages-in-the-errorlog)

Watchdog for a single process [Linux] [duplicate]

This question already has answers here:
How to use Linux software watchdog?
(9 answers)
I need to make sure that a chosen process is not hung. I thought I'd program this process to write to some /proc file that would be periodically monitored by some other process/module. If there is no change in the file for some time, the application would be considered hung, just like with a watchdog on a microcontroller.
However, I don't know if this is the best approach. As I'm not really that deep into Linux internals, I thought it better to ask which way is the easiest before I start learning how to write kernel modules, the /proc filesystem, etc. Ha!
I've found some information on Monit (https://mmonit.com/monit/). Maybe this would be better?
What would you recommend to be the best way to implement the "watchdog" functionality here?
Thanks a lot!
Paweł
An OS-independent solution is to create a watchdog thread that runs periodically and supports one or more software watchdogs, which are simply implemented as status bits or bytes. The process in question is responsible for patting the watchdog (clearing the status). The watchdog thread is a loop that checks the status: if it has been cleared, it sets it again; if it has not been cleared, it raises an alarm. You can adjust the timing so that the status is not checked on every pass through the loop.
This solution is quite flexible. You can also tie it into the hardware watchdog, patting the hardware watchdog only if all software watchdogs have been patted.
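A minimal sketch of that pattern in Python (the check period and the alarm action are assumptions):

    import threading, time

    class SoftWatchdog:
        """Status flag set by the watchdog thread and cleared ('patted') by the worker."""

        def __init__(self, name):
            self.name = name
            self.status = False          # False = recently patted

        def pat(self):                   # the watched code calls this regularly
            self.status = False

        def check(self):                 # the watchdog thread calls this periodically
            if self.status:              # still set: the worker never patted it
                print(f"ALARM: {self.name} appears hung")   # kill/restart/notify here
            self.status = True           # set it; the worker must clear it again

    def watchdog_loop(dogs, period=10.0):
        while True:
            time.sleep(period)
            for dog in dogs:
                dog.check()

    main_dog = SoftWatchdog("main-loop")
    threading.Thread(target=watchdog_loop, args=([main_dog],), daemon=True).start()

    # ...and in the code path being watched:
    while True:
        # one unit of real work here
        main_dog.pat()
        time.sleep(1)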

Recovering from fork bomb by having a kernel patch allowing to run only recovery process [closed]

WAS: Reading another question on SO that was migrated to SU (https://superuser.com/questions/435690/does-linux-have-any-measures-to-protect-against-fork-bombs), I was thinking of a solution at the kernel level.
I read one proposal on LWN (http://lwn.net/Articles/435917/), but that proposal focuses on detecting a fork bomb in order to prevent it.
I would rather focus on recovery, since detection basically means the system is already unusable, which any user of the system will soon notice anyway.
I'll broaden the context beyond fork bombs: what if your system is unresponsive and you can't get a usable console on it, but you still don't want to reboot it, even cleanly?
So THE QUESTION:
Is it possible to tell the kernel, via some SysRq command, to enter a recovery shell that will run only one process (and refuse to fork more) with the intent of killing the faulty processes? Has this feature ever been implemented? If not, why not?
Note that I am not talking about SysRq+i, which sends SIGKILL to all processes, but about something that behaves like a SIGSTOP for all processes; it could even be another kernel started via kexec alongside the first, allowing one to inspect and then resume the original.
You can always limit, for a non-root user, the maximum number of processes with the setrlimit(2) syscall and RLIMIT_NPROC.
You could use the bash ulimit builtin (or limit if using zsh as your shell). You can also use /etc/security/limits.conf and/or /etc/pam.d/ to limit it system-wide (while still tuning the limit user by user if so desired). PAM is very powerful for that.
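For completeness, a tiny sketch of the setrlimit route from Python (the limit of 200 is an arbitrary example); lowering the limit needs no special privileges, and everything started afterwards inherits it:

    import resource, subprocess

    # Cap the number of processes/threads this user may own (soft, hard).
    resource.setrlimit(resource.RLIMIT_NPROC, (200, 200))

    # Children inherit the limit; once the user owns 200 processes,
    # further fork()s fail with EAGAIN instead of taking the machine down.
    subprocess.run(["/bin/sh", "-c", "echo limit applied"])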
I don't think you need some risky kernel patch. You just want to administer your machine with care.
And you don't need to care about root fork bombs: if a malicious (or careless) user gets root access, your Linux system is doomed anyway (even without root fork bombs). Nobody cares about them because, by definition, root is trusted and is expected to behave carefully and cleverly. (Likewise, root can run /bin/rm -rf /, but that is usually just as stupid as a root fork bomb, hence no protection exists against either mistake.)
And a kernel patch would be difficult: you want root to be able to run several processes (at least the recovery shell and its child command, possibly piped), not only one. Kernel patches can also be brittle and crash the entire system...
Of course you are free to patch your kernel, since it is free software. However, making an interesting patch and getting the kernel community interested in it is also a social issue (and a much harder thing to achieve). Good luck. LKML is a better place to discuss that.
PS. Sending SIGSTOP to every non-init process won't help much with a root fork bomb: you won't be able to type any shell command, because your shell would also be stopped!
PPS. The LWN article quoted in the question has comments mentioning cgroups, which could be relevant.

What is the difference between CFQ, Deadline, and NOOP? [closed]

I'm recompiling my kernel, and I want to choose an I/O scheduler. What's the difference between these?
If you compile them all in, you can select at boot time, or per device, which scheduler to use. There is no need to pick one at compile time, unless you are targeting an embedded device where every byte counts. See Documentation/block/switching-sched.txt for details on switching per device or system-wide at boot.
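A quick sketch of the per-device runtime switch (the device name is an assumption, writing requires root, and newer blk-mq kernels list schedulers such as mq-deadline instead):

    dev = "sda"                                     # hypothetical block device
    path = f"/sys/block/{dev}/queue/scheduler"

    # The active scheduler is shown in brackets, e.g. "noop deadline [cfq]".
    print(open(path).read().strip())

    # Writing a scheduler name (as root) switches it immediately for this device.
    with open(path, "w") as f:
        f.write("deadline")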
The CFQ scheduler allows you to set priorities via the ionice(1) tool or the ioprio_set(2) system call. This allows giving precedence to some processes, or forcing others to do their I/O only when the system's block devices are relatively idle. The queues are implemented by segregating the I/O requests from processes into per-process queues and handling the requests from each queue in a manner similar to CPU scheduling. Details on configuring it can be found in Documentation/block/cfq-iosched.txt.
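For instance (the PID is hypothetical), ionice can re-prioritize a process that is already running:

    import subprocess

    pid = 12345   # hypothetical PID of an I/O-heavy process

    # Class 2 is "best effort"; priorities run from 0 (highest) to 7 (lowest).
    subprocess.run(["ionice", "-c", "2", "-n", "7", "-p", str(pid)])

    # Class 3 ("idle") only gets disk time when nothing else wants it:
    # subprocess.run(["ionice", "-c", "3", "-p", str(pid)])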
The deadline scheduler, by contrast, looks at all writes from all processes at once; it sorts the writes by sector number and issues them in linear fashion. The deadlines mean that it tries to write each block before its deadline expires, but within those deadlines it is free to re-arrange blocks as it sees fit. Details on configuring it can be found in Documentation/block/deadline-iosched.txt.
Probably very little in practice.
In my testing, I found that in general NOOP is a bit better if you have a clever RAID controller. Others have reported similar results, but your workload may be different.
However, you can select them at runtime (without a reboot), so don't worry about it at compile time.
My understanding is that the "clever" schedulers (CFQ and deadline) are only really helpful for traditional spinning-disk devices that don't have a RAID controller.
