Meaning of "Sleeping" process [duplicate] - linux

What causes these sleeping processes that I see in top? If I were to call PHP's sleep() function, would that add to the sleeping count I see in top? Are there any disadvantages to having a high number in sleeping?

A process is sleeping when it is blocked, waiting for something. For example, it might have called read() and is waiting on data to arrive from a network stream.
sleep() is indeed one way to have your process sleep for a while. Sleeping is, however, the normal state of all but heavily compute-bound processes - sleeping is essentially what a process does when it isn't doing anything else. It's the normal state of affairs for most of your processes to be sleeping - if that's not the case, it tends to indicate that you need more CPU horsepower.
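As a minimal illustration (a hedged sketch, not from the original answer): a program blocked in read() will show up as sleeping in top or ps until input arrives, consuming no CPU while it waits.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];
    /* read() blocks here; until input arrives on stdin the process
       sits in state S ("sleeping"), using no CPU time at all */
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    printf("read %zd bytes\n", n);
    return 0;
}
While this program waits at the read() call, top will report it as sleeping, exactly as described above.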

A sleeping process is like a suspended process.
A process sleeps when:
It is doing an I/O operation (blocked waiting for I/O)
You put it to sleep explicitly by calling sleep()
The status of any process can be:
Ready: it is ready for execution and sits in the queue waiting for the processor, with a specific priority
Sleeping: it was running and was blocked for an I/O operation, or it executed sleep()
Running: the processor is currently executing it.
Status  Meaning
R       Runnable
T       Stopped
P       Waiting on Pagein
D       Waiting on I/O
S       Sleeping (< 20 seconds)
I       Idle (sleeping > 20 seconds)
Z       Zombie or defunct
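On Linux you can observe these states directly via /proc. A minimal sketch (an illustration added here, not part of the original answer): the parent forks a child that blocks in sleep(), then reads the child's State line from /proc/<pid>/status.
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {      /* child: blocks in sleep(), so it will show state S */
        sleep(30);
        _exit(0);
    }
    sleep(1);            /* give the child time to enter its sleep */

    char path[64], line[256];
    snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return 1;
    while (fgets(line, sizeof line, f)) {
        if (line[0] == 'S' && line[1] == 't') {   /* the "State:" line */
            fputs(line, stdout);                  /* e.g. "State: S (sleeping)" */
            break;
        }
    }
    fclose(f);
    kill(pid, SIGKILL);  /* clean up the sleeping child */
    return 0;
}
Running it should print something like "State: S (sleeping)", matching the S row in the table above.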

They are processes which aren't running on the CPU right now. This is not necessarily a bad thing.
If you have huge numbers (10,000 on a server system, for example) of processes sleeping, the amount of memory etc used to keep track of them may make the system less efficient for non-sleeping processes.
Otherwise, it's fine.
Most normal server systems have 100 to 1000 sleeping processes much of the time; this is not a big deal.
Just because they're not doing anything just now doesn't mean they won't, very soon. Keeping them in memory, ready, reduces latency when they are required.

To go into a bit more detail here, the S state means the process is waiting on a timer or a slow device, while the D state means it is waiting on a fast device.
What constitutes a fast device vs a slow device is not terribly well defined, but generally, all serial, network, and terminal devices are slow devices, while disks are fast devices.

Related

OS Round Robin thread scheduling

Let's assume you have an OS that tries to run threads in round-robin scheduling.
I know there are two instances when the OS will try to switch between multiple threads: (there could be more...)
When the current thread actually yields the CPU earlier on its own.
When the OS receives a timer interrupt.
The question: let's say the OS has a maximum compute-bound time of, say, 5 ms (i.e., the OS receives a timer interrupt every 5 ms). This assumption means that each thread can own a CPU core for a maximum of 5 ms at a time.
What happens if a process/thread finishes its time slice earlier than 5 ms? Will this cause the next scheduled thread to get a compute-bound time of less than 5 ms, since the next timer interrupt will occur then and the thread will have no choice but to give up the CPU?
Specific Example:
What happens if a process/thread finishes its time slice earlier than 5 ms, let's say at 2 ms?
I know another thread will be scheduled, but will that thread get a full time slice of 5 ms, or will it only have 3 ms before the next timer interrupt occurs?
This question is likely dependent on the OS (not provided). On mainstream OSs, a yielding task typically waits for a resource or for a given time. The operating system will not reschedule it unless it becomes ready (completed I/O operation, available lock, timeout, etc.).
When a new task is ready, the OS scheduler is free either to wait for the end of the time slice or to reschedule immediately. It is common for the new task to be scheduled right away, to increase the responsiveness of multithreaded applications (waiting a few milliseconds when code tries to take a lock that is already held is not reasonable). That being said, this behaviour is often not implemented so directly: Windows makes use of priority boosts to achieve it, while on Linux, CFS tries to make the schedule fair so that all tasks get a balanced share of the available resources (e.g., cores). The priority of the target tasks also matters. The OS of some famous gaming consoles uses a round-robin scheduler by default, and only schedules lower-priority tasks if there is no high-priority task; on such a system, when a higher-priority task becomes ready, the current one is interrupted directly (with no delay beyond the overhead of a context switch).
Put shortly, the OS does not necessarily have to wait for a timer interrupt to do a context switch. Also, yes, time slices are generally never left empty; they are reused by other tasks. Whether the newly scheduled task gets a full time slice depends on the actual OS scheduler. Also note that a thread does not really "give up the CPU": user-land threads have no real control over this. In practice, either a schedule-like kernel call is made during a system call (causing a context switch of the current task), or a timer interrupt causes kernel code to be executed that typically makes this schedule-like call at the end of a time slice.
There are a lot of ways threads can yield, e.g. posting or waiting on a semaphore, input, output, etc. If a thread yields, or it's scheduling timer times out (5ms), the OS will rummage through the list of threads to see what else can be run.
That rummaging literally involves running through the list of threads and seeing what their status is.
Some threads may be listed as "preempted" (i.e. they hadn't yielded, the OS 5ms scheduling timer timed out and the OS had suspended them in favour of another) in which case one of those can simply be reinstated (registers restored, program counter set, CPU picks up from that point). Round Robin scheduling is simply an extra piece of information, namely "When did this thread last get run?", the OS favouring the thread that's not been run for longer than all the others.
Others will be listed as waiting on I/O (so those can't be run), and yet more will be listed as waiting on locks like semaphores (so those can't be run either).
Note that something like a sem_post() is also a yield, giving the OS a chance to do this rummaging, perhaps finding a thread that is listed as waiting on the semaphore that's just been posted.
If an OS determines that across all processes there are no runnable threads at all (nothing waiting on semaphores, everything hung up waiting for I/O), and there is nothing for it to do itself, what happens next depends on the CPU. For some older CPUs, the OS would literally have to enter an infinite loop waiting for some interrupt to fire from some device. More modern CPUs have an instruction which will suspend execution until some interrupt fires.
Basically, the OS scheduler is part interrupt service routine (responding to timer or device interrupts), and part "ordinary" code that simply manages lists of threads when threads voluntarily yield.
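To make the "rummaging" concrete, here is a toy sketch in C (illustrative only, not real kernel code, and all names are invented): it scans a list of thread records and picks the ready one that has gone longest without running, which is the essence of the round-robin selection described above.
#include <stdio.h>

enum state { READY, RUNNING, WAITING_IO, WAITING_LOCK };

struct thread {
    const char *name;
    enum state  state;
    int         last_run;   /* logical timestamp of its last turn on the CPU */
};

/* Round-robin pick: among READY threads, choose the one that has
   not run for the longest time. Returns -1 if nothing is runnable. */
static int pick_next(struct thread *t, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (t[i].state == READY && (best < 0 || t[i].last_run < t[best].last_run))
            best = i;
    return best;
}

int main(void) {
    struct thread threads[] = {
        { "worker-a", READY,        3 },
        { "worker-b", WAITING_IO,   1 },   /* can't run: waiting on I/O   */
        { "worker-c", READY,        1 },
        { "worker-d", WAITING_LOCK, 2 },   /* can't run: waiting on a lock */
    };
    int n = pick_next(threads, 4);
    if (n >= 0)
        printf("next to run: %s\n", threads[n].name);  /* worker-c: ready, waited longest */
    else
        printf("nothing runnable: enter the idle loop / halt instruction\n");
    return 0;
}
If pick_next() returns -1, that corresponds to the "no runnable threads at all" case above, where the OS idles the CPU until an interrupt fires.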

What could delay pthread_join() after threads have exited successfully?

My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));
pthread_exit((void *)EXIT_SUCCESS);  /* EXIT_SUCCESS is 0; pthread_exit takes a void * */
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
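For reference, a minimal variant of the join loop above with that instrumentation (assuming the same threads array and WORKER_THREADS constant from the question's code):
time_t now;
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
    time(&now);
    /* reveals whether the wait is on the first join, the last, or one in between */
    printf("Joined thread %d, %s", i, ctime(&now));
}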
A couple of other potentially relevant facts that have occurred to me since my original posting: I am using the GMP (GNU Multiprecision Arithmetic) library, which I can't really imagine matters; and I am also using a 3rd party (open source) library to create "canonical graphs," and that library, in order to be used in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.

Does a thread waiting on IO also block a core?

In the synchronous/blocking model of computation we usually say that a thread of execution will wait (be blocked) while it waits for an IO task to complete.
My question is simply will this usually cause the CPU core executing the thread to be idle, or will a thread waiting on IO usually be context switched out and put into a waiting state until the IO is ready to be processed?
A CPU core is normally not dedicated to one particular thread of execution. The kernel is constantly switching processes being executed in and out of the CPU. The process currently being executed by the CPU is in the "running" state. The list of processes waiting for their turn are in a "ready" state. The kernel switches these in and out very quickly. Modern CPU features (multiple cores, simultaneous multithreading, etc.) try to increase the number of threads of execution that can be physically executed at once.
If a process is I/O blocked, the kernel will just set it aside (put it in the "waiting" state) and not even consider giving it time in the CPU. When the I/O has finished, the kernel moves the blocked process from the "waiting" state to the "ready" state so it can have its turn ("running") in the CPU.
So your blocked thread of execution blocks only that: the thread of execution. The CPU and the CPU cores continue to have other threads of execution switched in and out of them, and are not idle.
For most programming languages used in standard ways, the answer is that it will block your thread, but not your CPU.
You would need to explicitly reserve a CPU for a particular thread (affinity) for one thread to block an entire CPU. To be more explicit, see this question:
You could call the SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your process to set it to run just on this core (or, even better, SetThreadAffinityMask just on the thread that does the time-critical task).
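On Linux, the rough equivalent is the CPU-affinity API (sched_setaffinity / pthread_setaffinity_np). A minimal sketch, with an invented worker function; note that pinning only restricts where a thread may run, and a thread blocked on I/O still leaves its core free for others:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    sleep(2);                         /* blocked here: state S, core stays free */
    printf("worker ran on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* restrict the worker to core 0 */
    pthread_setaffinity_np(t, sizeof set, &set);

    pthread_join(t, NULL);
    return 0;
}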
If we assume it's not async, then I would say that in that case the thread owning the I/O operation would certainly be put on the waiting queue, and its state would be "waiting".
Context-switching wise, IMO, it may need a bit more explanation, since the term context switch can mean/involve many things (swapping in/out, page-table updates, register updates, etc.). Depending on the current state of execution, a second thread that belongs to the same process might be scheduled to run while the thread blocked on the I/O operation is still waiting.
In that case, the context switch would most likely be limited to changing register values on that core (though the owning process might even get swapped out if there is not much memory left).
No. In Java, a blocked thread does not participate in scheduling.

Linux thread sleep vs read

In my application there is a Linux thread that needs to be active every 10 ms,
thus I use usleep(10*1000). Result: the thread never wakes up after 10 ms, but always after 20 ms. OK, this is related to the scheduler timeslice, CONFIG_HZ, etc.
I tried using usleep(1*1000) (that is, 1 ms), but the result was the same: the thread always wakes up after 20 ms.
But in the same application another thread handles network events (UDP packets) that arrive every 10 ms. There is a blocking recvfrom (or select), and it wakes up every 10 ms when there is an incoming packet.
Why is this so? Does select have to put the thread to 'sleep' when there are no packets? Why does it behave differently, and how can I make my thread active every 10 ms (more or less) without external network events?
Thanks,
Rafi
You seem to be under the common impression that these modern preemptive multitaskers are all about timeslices and quantums.
They are not.
They are all about software and hardware interrupts, and the timer hardware interrupt is only one of many that can set a thread ready and change the set of running threads. The hardware interrupt from a NIC that causes a network driver to run is an example of another one.
If a thread is blocked waiting for UDP datagrams, and a datagram becomes available because of a NIC interrupt running a driver, the blocked thread will be made ready as soon as the NIC driver has run, because the driver will signal the thread and request an immediate reschedule on exit. If your box is not overloaded with higher-priority ready threads, it will be set running 'immediately' to handle the datagram that is now available. This mechanism provides high-performance I/O and has nothing to do with any timers.
The timer interrupt runs periodically to support sleep() and other system-call timeouts. It runs at a fairly low frequency / high interval (like 1/10ms), because it is another overhead that should be minimised. Running it at a higher frequency would give finer timer granularity at the expense of increased interrupt-state and rescheduling overhead that is not justified in most desktop installations.
Summary: your timer operations are subject to 10ms granularity but your datagram I/O responds quickly.
Also, why does your thread need to be active every 10 ms? What are you polling for?
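Independent of the polling question: if a periodic 10 ms wakeup is genuinely needed, one common approach on Linux is sleeping until an absolute deadline with clock_nanosleep, which avoids drift (whether the kernel can actually honor 10 ms still depends on its timer configuration). A minimal sketch:
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 5; i++) {
        next.tv_nsec += 10 * 1000 * 1000;            /* next deadline: +10 ms */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec  += 1;
            next.tv_nsec -= 1000000000L;
        }
        /* sleep until an absolute deadline, so work done between wakeups
           does not accumulate as drift the way relative usleep() calls do */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        printf("tick %d\n", i);                      /* periodic work goes here */
    }
    return 0;
}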

Linux Process States

In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run because the operation might take a long time to complete – at least 5ms seek time for a disk is common, and 5ms is 10 million CPU cycles, an eternity from the point of view of the program!
From the programmer point of view (also said "in userspace"), this is called a blocking system call. If you call write(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary; it continues, in the kernel, running the system call code. Most of the time it goes all the way up to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block on disk is submitted to the proper hardware, which is a very fast operation most of the time.
THEN the process is put in sleep state (in kernel space, blocking is called sleeping – nothing is ever 'blocked' from the kernel point of view). It will be awakened once the hardware has finally fetched the proper data, then the process will be marked as runnable and will be scheduled. Eventually, the scheduler will run the process.
Finally, in userspace, the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke most I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system calls return immediately and only report submitting the disk operation. The programmer will have to explicitly check at a later time whether the operation completed, successfully or not, and fetch its result (e.g., with select(2)). This is called asynchronous or event-based programming.
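A minimal sketch of the non-blocking pattern (illustrative; it uses a pipe because on Linux O_NONBLOCK has no effect on regular-file reads, while it does apply to pipes, sockets, and terminals):
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

int main(void) {
    int fds[2];
    if (pipe(fds) == -1) return 1;
    fcntl(fds[0], F_SETFL, O_NONBLOCK);   /* mark the read end non-blocking */

    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof buf);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        puts("no data yet: read() returned instead of sleeping");

    write(fds[1], "hi", 2);               /* make some data available */

    fd_set rset;
    FD_ZERO(&rset);
    FD_SET(fds[0], &rset);
    select(fds[0] + 1, &rset, NULL, NULL, NULL);  /* blocks until readable */
    n = read(fds[0], buf, sizeof buf);
    printf("read %zd bytes after select()\n", n);
    return 0;
}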
Most answers here mentioning the D state (which is called TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel space code path, when that code path can't be interrupted (because it would be too complex to program), with the expectation that it would block only for a very short time. I believe that most "D states" are actually invisible; they are very short lived and can't be observed by sampling tools such as 'top'.
You can encounter unkillable processes in the D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume to always reach local disks and fast error detection (on SATA, an error timeout would be around a few 100 ms), and NFS, which actually fetches data from the network which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually send signals to NFS process clients by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick.…
While waiting for read() or write() on a file descriptor to return, the process will be put in a special kind of sleep, known as "D" or "Disk Sleep". This is special because the process cannot be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, used when it's assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on an fd opened in non-blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
A process performing I/O will be put in the D state (uninterruptible sleep), which frees the CPU until there is a hardware interrupt which tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler, which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In the 2.6 kernel there is an O(1) time complexity scheduler, so no matter how many processes are running, it will assign CPUs in constant time. It is more complicated than that, though, since 2.6 introduced preemption, and CPU load balancing is not an easy algorithm. In any case, it's efficient, and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in the "D" state (uninterruptible sleep) are responsible for hangs of the ps process. For me this has happened many times with RedHat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
cd /proc
for i in [0-9]*; do echo -n "$i: "; grep ^State "$i/status"; done | grep D
To find the current directory of the process and, maybe, the mounted NFS disk that has issues, you can use a command similar to the following example (replace 31134 with the sleeping process's number):
# ls -l /proc/31134/cwd
lrwxrwxrwx 1 pippo users 0 Aug 2 16:25 /proc/31134/cwd -> /auto/pippo
I found that issuing the umount command with the -f (force) switch on the related mounted NFS file system was able to wake up the sleeping process:
umount -f /auto/pippo
The file system wasn't unmounted, because it was busy, but the related process did wake up and I was able to solve the issue without rebooting.
Assuming your process is a single thread, and that you're using blocking I/O, your process will block waiting for the I/O to complete. The kernel will pick another process to run in the meantime based on niceness, priority, last run time, etc. If there are no other runnable processes, the kernel won't run any; instead, it'll tell the hardware the machine is idle (which will result in lower power consumption).
Processes that are waiting for I/O to complete typically show up in state D in, e.g., ps and top.
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
Generally the process will block. If the read operation is on a file descriptor marked as non-blocking or if the process is using asynchronous IO it won't block. Also if the process has other threads that aren't blocked they can continue running.
The decision as to which process runs next is up to the scheduler in the kernel.
