Health check for application - linux

I would like to know what methods exist to check the health of a process. Suppose that 10000 processes are running on a system
and that, whenever any of these processes goes down, it has to be brought back up.

Use the process ID (PID) and periodically poll whether the process is still alive; if it is dead, revive it.
However, if you have 10000 processes, you will probably hit the OS's process limit first. I suggest redesigning your program so that you don't need that many processes in the first place.
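A minimal sketch of the polling idea, assuming the monitor already knows the PIDs it cares about. The function names are made up for illustration, and the check cannot tell whether a PID has been reused by an unrelated process.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the checks but delivers nothing
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_dead(pids):
    """Return the PIDs from `pids` that no longer exist."""
    return [pid for pid in pids if not pid_alive(pid)]
```

A monitor would call find_dead() on a timer and restart whatever it returns.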

Re-spawning processes that go down is usually handled by a dedicated launcher program that fork()s and exec()s the program and then waits for a SIGCHLD indicating that the child process has ended.
For boot-time applications (servers etc.), daemons like upstart can do this for you automatically.
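As a rough sketch of such a launcher, the loop below starts a command, blocks until it exits, and starts it again. The name supervise and the max_restarts bound are assumptions for illustration; a real launcher would normally loop forever and back off between restarts.

```python
import subprocess
import time


def supervise(argv, max_restarts=3, delay=0.0):
    """Run argv and restart it each time it exits, up to max_restarts times.

    Returns the list of exit codes that were observed.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        child = subprocess.Popen(argv)   # start the child process
        exit_codes.append(child.wait())  # block until the child has ended
        time.sleep(delay)                # optional pause before the restart
    return exit_codes
```

Waiting on the child directly avoids polling: the launcher sleeps in wait() until the kernel reports that the child has terminated.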

While others are pointing out that applications for this already exist (which you really should use unless you have a clear reason not to), I'll throw out a random idea for a custom solution.
If you control all N processes, give them one shared memory area N bits large (so 10000 processes need about 1.25 KB, not bad). When starting each process, give it a number i ranging from 0 to N-1. Every T seconds, each process sets bit i in the shared memory to 1. A monitoring process checks every k*T seconds that all N bits are 1, resetting them all to 0 as it goes; any bit still at 0 identifies a process that has stopped reporting.
This is still O(n), which you won't avoid, but the primitives are all really fast and should scale fine up to the OS thread limit.
An alternative way to obtain i would be to just use the PID, but then the shared memory has to be larger (it will probably still be OK, though; for example, the default Linux PID range has traditionally been limited to 32768).
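The bookkeeping for this idea can be sketched as follows. A plain bytearray stands in for the shared mapping here (an assumption; a real setup would use an mmap or similar shared region), and the helper names are hypothetical.

```python
def set_alive(buf, i):
    """Worker i marks itself alive by setting bit i."""
    buf[i // 8] |= 1 << (i % 8)


def sweep(buf, n):
    """Monitor: return the indices whose bit is still 0, then clear all bits."""
    missing = [i for i in range(n) if not buf[i // 8] & (1 << (i % 8))]
    for k in range(len(buf)):
        buf[k] = 0
    return missing


n = 10000
buf = bytearray((n + 7) // 8)  # 1250 bytes; stands in for the shared area
```

Note that the |= above is a read-modify-write, so two processes updating bits in the same byte at the same moment could lose an update. Using atomic operations, or one byte per process instead of one bit, would avoid that at a modest cost in memory.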

There is a utility called monit which does what you are looking for. However, it is meant for watching a few important processes on Linux, and here all 10000 processes are important!

Related

GC not able to collect back memory using fork-emulation on Windows

Let me begin by saying that I do not have in-depth knowledge of Perl, so please pardon me if I have missed something obvious :)
In the system (running in Windows environment) that I am looking at, we have a perl process which has to download ~5000-6000 files. Since each file can be independently downloaded, we forked separate threads for each file. The thread is supposed to download the file and die. On running the process, I noticed that the memory of the process goes up to ~1.7 GB and then dies due to the memory limit of each process.
On searching and asking a few people, I came across this concept of circular referencing due to which the garbage collector will not free up memory. I searched a bit and found the Devel-Cycle package which can find out if there are any cycles in the object. I got this package and added a line to check if the main object in the process has any cycles. find_cycle came back with the following statement for each thread.
DBD::Oracle::db FIRSTKEY failed: handle 2 is owned by thread 256004 not current thread c0ea29c (handles can't be shared between threads and your driver may need a CLONE method added) at C:/Program Files/Perl/site/lib/Devel/Cycle.pm line 151.
I got to know that DB handles cannot be shared between threads. I looked at the code again and realised that after the fork happens, the child process does actually create a new DB handle (which I guess is why the process still continues to run fine till it reaches the memory limit). I guess there might be more db handles from the parent in the object that are not used by the child but are still referenced.
Questions that I have:
Is the circular reference the only reason for the problem or could there be other issues causing the process to use so much memory?
Could the sharing of the handle cause the blow-up in memory (in other words, is the shared DB handle causing the GC to not free up space)?
If it is indeed the shared DB handle, I guess I can just say $dbHandle = 0 to get rid of the reference (if $dbHandle is referencing that particular handle). Am I correct here?
I am trying to go through the code to see where else there is a reference to the parent DB handle (and found at least one more reference). Is there any other way I can do this? Is there a method to print out all the properties of an object?
EDIT:
Not all threads (due to the perl fork call in windows) are spawned at the same time. It spawns a max of n number of threads (where n is a configurable number). Once a thread has finished its execution, the process spawns another thread. At this moment n is set to 10, however I had changed n to 1 (so only one extra thread is running at one time), and I still hit the memory limit.
Edit: It turns out that this does not solve the OP's problem. It might still be helpful for a future reader.
We do not really know a lot about your situation, and your program sounds too complex to me to just fork it 6000 times. But I will still attempt to answer; please correct me if my assumptions are wrong.
It appears you are on Windows. It is important to note that Windows has no fork() system call. And as you specifically say that you "fork", I assume that you actually use the Perl fork command. On Windows, Perl emulates fork() as best it can, which basically means that all the forked processes you see are in fact just threads within the original process, pretending to be processes. To do this, each one copies the complete interpreter state. See http://perldoc.perl.org/perlfork.html for more information. The following part in particular seems to apply to you:
Resource limits
In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc.
If you fork that many pseudo-processes, you need a lot of memory, because the interpreter state has to be copied just as many times. Depending on the complexity of your program and how it is structured, that may very well be a non-trivial amount of memory.
And as http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx tells us, the 1.7 GB you mentioned is not far from the 2 GB that some Windows versions impose as the memory limit for a single process.
My wild guess would be that you are simply hitting that limit by spawning so many threads, each with its own copy of the interpreter state and everything else.
You will probably be a lot better off using a threading library instead of asking Perl to emulate individual processes for you. Needless to say (I hope), you do not gain any advantage from having 6000 threads over, let's say, 16. If you try to have all of them do something at the same time, you will most likely experience slowdowns, depending on how the threading is handled.
In addition to the comments already provided, I want to emphasize the point DeVadder made about the behavior of fork on Windows, and that Perl threading is likely a better solution. But are you sure that the DBD module is safe to use from multiple processes, forks or threads without setting some extra parameters?
I had a similar error when using the DBI module to access a SQLite DB in parallel code using the threads module. It was solved by setting the 'use_immediate_transaction' option on the database handle provided by DBI to 1. In case you aren't familiar with how Perl threads work: they aren't lightweight threads; each one gets a copy of the interpreter and of everything you have in memory at the time of its creation. Even when I created the database handle separately in each "thread", I would get 'database locked' and various other errors. Without some of these extra options, DBD may not function correctly in a multiprocessing environment.
Also, why make 6000 forks? Use Thread::Queue and the threads module, make a worker pool of a few workers (one per core?) and recycle the workers. You are incurring a lot of overhead on every fork for no gain.
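To illustrate the shape of the worker-pool suggestion, here is a small sketch. It is written in Python rather than Perl, so it is an analogy for threads plus Thread::Queue and not a drop-in fix; fetch is a hypothetical placeholder for the real download.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(name):
    """Hypothetical placeholder for downloading one file."""
    return f"downloaded {name}"


def fetch_all(names, workers=8):
    """Process every name with a fixed pool of reusable workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands items to idle workers and returns results in input order
        return list(pool.map(fetch, names))
```

The point is that the number of workers stays fixed at a small value no matter how many files there are, so the per-worker startup cost is paid only a few times.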

How to identify if a long-running process died?

I'm working on a daemon that communicates with several processes. The daemon can't monitor the processes all the time, but it must be able to properly identify when a process dies, so that it can release the scarce resources it holds for it.
The processes can communicate with the daemon, giving it some information at the start, but not vice versa. So the daemon can't just ask a process its identity.
The simplest form would be to use just their PID. But eventually another process could be assigned the same PID without my tool noticing.
A better approach would be to use the PID plus the time the process started. A new process with the same PID would have a distinct start time. But I couldn't find a way to get the process start time in a POSIX way. Using ps or looking at /proc/<pid>/stat does not seem portable enough.
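For reference, this is roughly what the PID-plus-start-time identity looks like on Linux when read from /proc/<pid>/stat. It is a Linux-only sketch, which is exactly the portability concern raised above, and process_identity is a made-up name.

```python
import os


def process_identity(pid):
    """Return (pid, starttime), with starttime in clock ticks since boot."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces, so split only what follows the last ')'.
    fields = stat[stat.rindex(")") + 1:].split()
    # fields[0] is field 3 (state), so field 22 (starttime) is fields[19].
    return pid, int(fields[19])
```

Two processes that happen to get the same PID at different times yield different tuples, so the daemon can compare the stored tuple with a fresh one.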
A more complicated idea that seems POSIX-compliant would be:
Each process creates a temporary file.
Locks it using flock
Tells my daemon "my identity is connected with this file".
At any time, the daemon can check the temporary file. If it's locked, the process is alive. If it's not, the process is dead.
But this seems unnecessarily complicated.
Is there a better, or standard way?
Edit: The daemon must be able to resume after a restart, so it's not possible to keep a persistent connection for each process.
"But I couldn't find a way to get the process start time in a POSIX way."
Try the standard "etime" format specifier: LC_ALL=C ps -o etime= -p "$PIDS"
In fairness, I would probably construct my own table of live processes rather than relying on the process table and elapsed time. That's fundamentally your file-locking approach, though I'd probably aggregate all the lockfiles together in a known place and name them by PID, e.g., /var/run/my-app/8819.lock. Indeed, this might even be retrofitted on to the long-running processes, since file locks on file descriptors can be inherited across exec().
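A sketch of the lock-file scheme, under the assumption that flock() is available (it is widespread but comes from BSD rather than POSIX; fcntl() record locks are the POSIX-specified alternative). The function names are hypothetical.

```python
import fcntl
import os


def hold_lock(path):
    """Called by a monitored process: create its lock file and lock it.

    The returned descriptor has to stay open for as long as the process
    wants to be considered alive.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd


def owner_alive(path):
    """Called by the daemon: True if some process still holds the lock."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return False
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True   # the lock is held, so its owner is still running
    else:
        return False  # we obtained the lock, so nobody was holding it
    finally:
        os.close(fd)  # closing also releases a lock we may have taken
```

Because the kernel releases the lock once the holder's descriptor is closed, which happens at process exit at the latest, the daemon needs no persistent connection and can run the check again after its own restart.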
(Of course, if the long-running processes I cared about had a common parent, then I'd rather query the common parent, who can be a reliable authority on which processes are running and which are not.)
The standard way is the unnecessarily complicated one. That's life in a POSIX-compliant environment...
Methods other than the file exist and have various benefits and tradeoffs. Most of the "standard" IPC mechanisms would work for this as well: a socket, pipe, message queue, shared memory... Basically, pick one mechanism that allows your application to announce to the daemon that it has started (and maybe that it's exiting, for an orderly shutdown). In between, it could send periodic "I'm still here" messages and the daemon could notice when it doesn't get one, or the daemon could poll periodically, or something similar. There are quite a few ways to accomplish what you want, but without knowing more about the exact architecture you're trying to achieve, it's difficult to point at the "one best way".
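Whichever transport carries the messages, the daemon-side bookkeeping for periodic "I'm still here" messages could look something like this. HeartbeatTable and its methods are invented for the sketch, and the injectable clock exists only to make the logic easy to exercise.

```python
import time


class HeartbeatTable:
    """Remembers when each client last said "I'm still here"."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}

    def beat(self, client_id):
        """Record a start-up announcement or a periodic heartbeat."""
        self.last_seen[client_id] = self.clock()

    def goodbye(self, client_id):
        """Record an orderly shutdown."""
        self.last_seen.pop(client_id, None)

    def expired(self):
        """Return the clients whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)
```

The daemon would call expired() on a timer and release the resources held for every client it returns.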

The number of pthread_mutex objects in a running system

I have a strange question. I have to count the number of pthread_mutex objects in a running system, for example Debian, Ubuntu, a system on a microcontroller, etc. I have to do it without LD_PRELOAD, interrupting, function overloading and the like, and I have to be able to do it at an arbitrary time.
Does anybody have an idea how I can do it? Can you show me a way?
To count the threads:
ps -eLf will give you a list of all the threads and processes currently running on the system.
However you ask for a list of all threads that HAVE executed on the system, presumably since some arbitrary point in the past - are you sure that is what you mean? You could run ps as a cron job and poll the system every X minutes, but you would miss threads that were born and died between jobs. You would also have a huge amount of data to deal with.
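On Linux, the same snapshot that ps -eLf prints can be approximated by counting the entries under /proc/<pid>/task. This is a Linux-specific sketch and count_threads is a made-up name; like ps, it only sees threads that exist at the moment of the scan.

```python
import os


def count_threads():
    """Count the threads of all processes visible in /proc right now."""
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            total += len(os.listdir(f"/proc/{entry}/task"))
        except OSError:
            pass  # the process exited, or is hidden from us, mid-scan
    return total
```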
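Counting the mutexes is not possible this way: a pthread_mutex is just a data structure in each process's own memory, and the system keeps no registry of them.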

Is it possible to "hang" a Linux box with a SCHED_FIFO process?

I want to have a real-time process take over my computer. :)
I've been playing a bit with this. I created a process which is essentially a while (1) (never blocks nor yields the processor) and used schedtool to run it with SCHED_FIFO policy (also tried chrt). However, the process was letting other processes run as well.
Then someone told me about sched_rt_runtime_us and sched_rt_period_us. So I set the runtime to -1 in order to make the real-time process take over the processor (and also tried making both values the same), but it didn't work either.
I'm on Linux 2.6.27-16-server, in a virtual machine with just one CPU. What am I doing wrong?
Thanks,
EDIT: I don't want a fork bomb. I just want one process to run forever, without letting other processes run.
There's another protection I didn't know about.
If you have just one processor and want a SCHED_FIFO process like this (one that never blocks nor yields the processor voluntarily) to monopolize it, besides giving it a high priority (not really necessary in most cases, but doesn't hurt) you have to:
Set sched_rt_runtime_us to -1 or to the value in sched_rt_period_us
If you have group scheduling configured, set /cgroup/cpu.rt_runtime_us to -1 (assuming you mount the cgroup filesystem on /cgroup)
Apparently, I had group scheduling configured and wasn't bypassing that last protection.
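To check the first of these settings before running an experiment, a small helper can read the two sysctl files. This is a sketch: rt_throttling_enabled is a hypothetical name, the directory parameter exists only so the logic can be tried against other files, and the cgroup-level limit is not covered.

```python
from pathlib import Path


def rt_throttling_enabled(proc_sys_kernel="/proc/sys/kernel"):
    """Return True if the kernel reserves CPU time for non-real-time tasks."""
    base = Path(proc_sys_kernel)
    runtime = int((base / "sched_rt_runtime_us").read_text())
    period = int((base / "sched_rt_period_us").read_text())
    # -1 disables throttling; runtime == period leaves no reserved slice.
    return runtime != -1 and runtime < period
```

With the usual defaults (950000 out of 1000000 microseconds), real-time tasks are limited to 95% of each period, which is why a spinning SCHED_FIFO task still lets other processes run.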
If you have N processors, and want your N processes to monopolize the processor, you just do the same but launch all of them from your shell (the shell shouldn't get stuck until you launch the last one, since it will have processors to run on). If you want to be really sure each process will go to a different processor, set its CPU affinity accordingly.
Thanks to everyone for the replies.
I'm not sure about schedtool, but if you successfully change the scheduler using sched_setscheduler to SCHED_FIFO, then run a task which does not block, then one core will be entirely allocated to the task. If this is the only core, no SCHED_OTHER tasks will run at all (i.e. anything except a few kernel threads).
I've tried it myself.
So I speculate that either your "non blocking" task was blocking, or your schedtool program failed to change the scheduler (or changed it for the wrong task).
Also, you can make your process SCHED_FIFO with the highest priority (99 on Linux; 1 is the lowest real-time priority). A process that never blocks would then run forever and would not be preempted by lower-priority tasks.
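A sketch of the sched_setscheduler route, using the wrappers in Python's os module. The helper names are hypothetical, and switching to a real-time policy normally requires root or CAP_SYS_NICE, so become_fifo reports failure instead of raising.

```python
import os


def fifo_priority_range():
    """Return the (lowest, highest) priority the system allows for SCHED_FIFO."""
    return (os.sched_get_priority_min(os.SCHED_FIFO),
            os.sched_get_priority_max(os.SCHED_FIFO))


def become_fifo(priority):
    """Switch the calling process to SCHED_FIFO; False if not permitted."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False
    # Read the policy back to confirm that the change really took effect.
    return os.sched_getscheduler(0) == os.SCHED_FIFO
```

Reading the policy back addresses the suspicion raised above that the scheduler was never actually changed for the intended task.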

monitor and kill runaway processes using 100% IO?

I have a few processes that have to be run at high priority (chrt 98) and that will occasionally hard-lock and peg one core at 100% (not a huge deal). More importantly, such a process will use all the IO on the system, so much so that it's impossible to log into the machine via ssh to kill it, or to perform any task on the machine that isn't already loaded into RAM. If I happen to have something like htop already running, I am able to end the process fine. Is there any utility or way to monitor for this type of runaway process and kill anything that uses 100% of system IO for more than X amount of time? Thanks!
Can't you start the program with nice (and with a lower priority)? This way you should at least be able to ssh into the box and kill it easily.
The better solution would of course be to fix the behaviour of the offending process (details needed).
This serverfault thread also seems to contain what you ask for specifically.
Assuming that it's disk IO that the app is consuming, can you just move the filesystems it's accessing onto separate disks? That way you'll have IO to spare on the disks which the OS is installed on, and should be able to log in and manage (i.e. kill!) the process.
As another poster said, running your process with nice is the way to go, but you did mention that you want to run it at a high priority, which is odd... be aware that if you're running a process at the highest priority and it's pegged, your monitoring system might not even be able to kill it, unless your monitor is at a higher priority still. Anyway....
god, as well as several other process management tools, can easily kill a process if it's misbehaving in any of several ways. The config looks like this: you set checks at a particular interval, and then you can say "after five checks, nuke it if it's been above 98% CPU usage consistently":
restart.condition(:cpu_usage) do |c|
  c.above = 98.percent
  c.times = 5
end
Another, different take that you might have a look at is chpst from the runit system - it allows you to elegantly set bounds on things (but for CPU limiting, nice is still the tool I'd reach for first).
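The "above the limit for five checks in a row" rule from the god configuration above does not depend on any particular tool, and can be sketched in a few lines. ConsecutiveTrigger is an invented name, and where the readings come from (CPU or IO statistics) is left open.

```python
class ConsecutiveTrigger:
    """Fires once a reading has exceeded `threshold` for `times` checks in a row."""

    def __init__(self, threshold, times):
        self.threshold = threshold
        self.times = times
        self.streak = 0

    def check(self, value):
        """Feed one reading; return True when the process should be killed."""
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.times
```

Requiring several consecutive readings keeps a short, legitimate burst from getting the process killed.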

Resources