I have a data acquisition application running under Linux 2.6.37 on a DM8148 with TI Linux. I have two threads:
a thread named IDE, scheduled as SCHED_RR, priority 114 (75), which wakes every 2 ms and collects data arriving from a HW FIFO at 200 KiB/s into a 30 MiB ring buffer:
while (1) {
    sleep(ms);
    while (DataInFIFO) {
        CollectToRingBuffer();
        SignalToWriter();
    }
}
a thread named WriterIDE, scheduled as SCHED_RR, priority 113 (74), which writes this ring buffer out to a USB disk-on-key (DOK):
while (1) {
    WaitForSignal();
    writeToFileOnDOK();
}
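For reference, SCHED_RR priorities like the ones quoted above would typically be set with something along these lines; this is only a sketch, the helper name is made up, and I'm assuming the numbers in parentheses (75 and 74) are the POSIX sched_priority values:

#include <pthread.h>
#include <sched.h>

/* Sketch: give an already-created pthread SCHED_RR with a given
 * real-time priority (e.g. 75 for the collector, 74 for the writer). */
static int set_rr_priority(pthread_t thread, int rt_prio)
{
    struct sched_param sp = { .sched_priority = rt_prio };
    /* returns 0 on success, an errno value (e.g. EPERM) on failure */
    return pthread_setschedparam(thread, SCHED_RR, &sp);
}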
I know from measurements of the write() call that this USB write can sometimes "hang" for 1.5 or even 2 seconds while writing to the DOK. But I was sure that, since I gave the collector a 30 MiB buffer, which is enough for about 150 s at 200 KiB/s, everything would be OK.
No! It is not!
I added time-measuring code. What I see is that when the writer hangs for a long time (e.g. 1342 ms), the latency to re-enter the collector thread is also very large (306 ms). This causes HW FIFO overflow and data inconsistency.
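The time-measuring code mentioned above could be as simple as this sketch (my assumption of how it was done; it wraps the write() call with clock_gettime(CLOCK_MONOTONIC) and logs the slow calls):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Sketch: measure how long a single write() to the DOK takes, in ms. */
static ssize_t timed_write(int fd, const void *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = write(fd, buf, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ms = (t1.tv_sec - t0.tv_sec) * 1000
            + (t1.tv_nsec - t0.tv_nsec) / 1000000;
    if (ms > 100)                       /* log only the suspiciously long calls */
        fprintf(stderr, "write() of %zu bytes took %ld ms\n", len, ms);
    return n;
}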
I checked the spread of thread priorities in the system (ps command): nothing is real-time except my threads. All system tasks are scheduled as SCHED_OTHER (TS in ps output), even the kernel USB threads. Only the IRQ tasks are FF, and even those have lower priority than mine.
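For reference, the kind of ps invocation used for that check might look like this (one line per thread, showing the scheduling class TS/FF/RR and the real-time priority; exact field names can vary between procps versions):

ps -eLo pid,lwp,cls,rtprio,pri,comm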
I don't know where to go from here... :-(
I know that threads cannot actually run in parallel on the same core, but a regular desktop system normally has hundreds or even thousands of threads, which is of course far more than today's typical 4-core CPU. So the system actually runs some thread for X time, then switches to run another thread for Y time, and so on.
My question is, how does the system decide how much time to execute each thread?
I know that when a program calls sleep() on a thread for some amount of time, the operating system can use this time to execute other threads, but what happens when a program does not call sleep at all?
E.g.:
#include <stdbool.h>
#include <stdio.h>

int main(int argc, char const *argv[])
{
    while (true)
        printf("busy");
    return 0;
}
When does the operating system decide to suspend this thread and execute another?
The OS keeps a container of all those threads that can use CPU execution (usually such threads are described as being 'ready'). On most desktop systems, this is a very small fraction of the total number of threads. Most threads in such systems are waiting on either I/O (this includes sleeping - waiting on timer I/O) or inter-thread signaling; such threads cannot use CPU execution and so the OS does not dispatch them onto cores.
A software syscall (e.g. a request to open a file, a request to sleep or wait for a signal from another thread), or a hardware interrupt from a peripheral device (e.g. a disk controller, NIC, KB, mouse), may cause the set of ready threads to change and so initiate a scheduling run.
When run, the scheduler decides what set of ready threads to assign to the available cores. The algorithm it uses is a compromise that tries to optimize overall performance by balancing the cost of expensive context switches against the need for responsive I/O. The kernel CAN stop any thread on any core and preempt it, but it would surely prefer not to :)
So:
My question is, how does the system decide how much time to execute each thread?
Essentially, it does not. If the set of ready threads is not greater than the number of cores, there is no need to stop/control/influence a CPU-intensive loop - it can be allowed to run on forever, taking up a whole core.
Note that your example is very poor - the printf() call will request output from the OS and, if not immediately available, the OS will block your seemingly 'CPU only' thread until it is.
but what happens when a program does not call sleep at all?
It's just one more thread. If it is purely CPU-intensive, then whether it runs continually depends upon the loading on the box and the number of cores available, as already described. It can, of course, get blocked by requesting I/O or electing to wait for a signal from another thread, so removing itself from the set of ready threads.
Note that one I/O device is a hardware timer. This is very useful for timing out system calls and providing Sleep() functionality. It does have a side-effect on boxes where the number of ready threads is larger than the number of cores available to run them (i.e. the box is overloaded, or the tasks it runs have no limit on CPU use): it can result in the available cores being shared out among the ready threads, giving the illusion of running more threads than the machine is physically capable of. (Try not to get hung up on Sleep() and the timer interrupt - it is just one of many interrupts that can change thread state.)
It is this behaviour of the timer hardware, interrupt and driver that gives rise to the appalling 'quantum', 'time-sharing', 'round-robin' etc. confusion and FUD that surrounds the operation of modern preemptive kernels.
A preemptive kernel, together with its drivers etc., is a state machine. Syscalls from running threads and hardware interrupts from peripheral devices go in; a set of running threads comes out.
It depends on which type of scheduling your OS is using. For example, let's take round-robin:
In order to schedule processes fairly, a round-robin scheduler generally employs time-sharing, giving each job a time slot or quantum (its allowance of CPU time) and interrupting the job if it is not completed by then. The job is resumed the next time a time slot is assigned to that process. If the process terminates or changes its state to waiting during its attributed time quantum, the scheduler selects the first process in the ready queue to execute.
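As a rough illustration of the idea (a toy simulation, not how a real kernel implements it), handing out a fixed quantum to each ready job in round-robin order looks like this:

#include <stdio.h>

/* Toy round-robin: 3 jobs with different CPU demands, 4-unit quantum. */
int main(void)
{
    int remaining[] = { 10, 4, 7 };      /* CPU time each job still needs */
    const int njobs = 3, quantum = 4;
    int left = njobs, t = 0;

    while (left > 0) {
        for (int i = 0; i < njobs; i++) {
            if (remaining[i] == 0)
                continue;                 /* job already finished */
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            printf("t=%2d: job %d runs for %d units\n", t, i, slice);
            t += slice;
            remaining[i] -= slice;
            if (remaining[i] == 0)
                left--;                   /* job done, leaves the ready queue */
        }
    }
    return 0;
}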
There are other scheduling algorithms as well; you will find this link useful: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html
The operating system has a component called the scheduler that decides which thread should run and for how long. There are essentially two basic kinds of schedulers: cooperative and preemptive. Cooperative scheduling requires that the threads cooperate and regularly hand control back to the operating system, for example by doing some kind of IO. Most modern operating systems use preemptive scheduling.
In preemptive scheduling the operating system gives a time slice for the thread to run. The OS does this by setting a handler for a CPU timer: the CPU regularly runs a piece of code (the scheduler) that checks if the current thread's time slice is over, and possibly decides to give the next time slice to a thread that is waiting to run. The size of the time slice and how to choose the next thread depends on the operating system and the scheduling algorithm you use. When the OS switches to a new thread it saves the state of the CPU (register contents, program counter etc) for the current thread into main memory, and restores the state of the new thread - this is called a context switch.
If you want to know more, the Wikipedia article on Scheduling has lots of information and pointers to related topics.
I have a real-time process sending occasional communication over RS232 to a high speed camera. I have several other real-time processes occupying a lot of CPU time, doing image processing on several GPU boards using CUDA. Normally the serial communication is very fast, with a message and response taking about 50 ms every time. However, when the background processes are busy doing image processing, the serial communication slows way down, often taking multiple seconds (sometimes more than 10 seconds).
In summary, during serial communication, Process A is delayed if Process B, C, etc., are very busy, even though process A has the highest priority:
Process A (real-time, highest priority): occasional serial communication
Processes B, C, D, etc. (real-time, lower priority): heavy CPU and GPU processing
When I change the background processes to be SCHED_OTHER (non-real-time) processes, the serial communication is fast; however, this isn't a solution for me, because the background processes need to be real-time processes (when they are not, the GPU processing doesn't keep up adequately with the high speed camera).
Apparently the serial communication is relying on some non-real-time process in the system, which is being pre-empted by my real-time background processes. I think if I knew which process was being used for serial communication, I could increase its priority and solve the problem. Does anyone know whether serial communication relies on any particular process running on the system?
I'm running RHEL 6.5, with the standard kernel (not PREEMPT_RT). It has dual 6-core CPUs.
At Erki A's suggestion, I captured an strace. Apparently it is a select() system call which is slow (the "set roi2" is the command to the camera, and the "Ok!" at the end is the response from the camera):
write(9, "set roi2"..., 26) = 26 <0.001106>
ioctl(9, TCSBRK, 0x1) = 0 <0.000263>
select(10, [9], NULL, NULL, {2, 0}) = 1 (in [9], left {0, 0}) <2.252840>
read(9, "Ok!\r\n", 4096) = 5 <0.000092>
The slow select() makes it seem like the camera itself is slow to respond. However, I know that isn't true, because of how the speed is impacted by changing the background process priorities. Is select() in this case dependent on a certain other process running?
If I skip the select() and just do the read(), the read() system call is the slow one.
Depending on your serial device/driver, the serial communications are most likely relying on a kernel worker thread (kworker) to shift the incoming serial data from the interrupt service routine buffers to the line discipline buffers. You could increase the priority of the kernel worker thread; however, worker threads process a shared work queue, so increasing the priority of the worker thread will raise the priority of the serial processing along with a whole bunch of other stuff that possibly doesn't need the boost.
You could modify the serial driver to use a dedicated high-priority work queue rather than a shared one. Another option would be to use a tasklet; however, both of these require driver-level modifications.
I suspect the most straightforward solution would be to set the com port to low-latency mode, either from the command line via the setserial command:
setserial /dev/ttySxx low_latency
or programmatically:
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

struct serial_struct serinfo;
int fd = open("/dev/ttySxx", O_RDWR | O_NOCTTY);
ioctl(fd, TIOCGSERIAL, &serinfo);
serinfo.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serinfo);
close(fd);
This will cause the serial port interrupt handler to transfer the incoming data to the line discipline immediately rather than deferring the transfer by adding it to a work queue. In this mode, when you call read() from your application, you will avoid the possibility of the read() call sleeping, which it would otherwise do, if there is work in the work queue to flush. This sleep is probably the cause of your intermittent delays.
You can use strace to see where it locks up. If it is more than 10 seconds, it should be easy to see.
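For example, attaching strace to the running process with per-syscall timing (-T) and following its threads (-f) would look something like this; the PID is a placeholder for your process:

strace -f -T -e trace=read,write,select,ioctl -p <pid>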
In my application there is a Linux thread that needs to be active every 10 ms,
so I use usleep(10*1000). Result: the thread never wakes up after 10 ms but always after 20 ms. OK, it is related to the scheduler timeslice, CONFIG_HZ, etc.
I tried using usleep(1*1000) (that is, 1 ms), but the result was the same: the thread always wakes up after 20 ms.
But in the same application, another thread handles network events (UDP packets) that come in every 10 ms. There is a blocking recvfrom() (or select()), and it wakes up every 10 ms when there is an incoming packet.
Why is this so? Does select() have to put the thread to 'sleep' when there are no packets? Why does it behave differently, and how can I make my thread active every 10 ms (more or less) without external network events?
Thanks,
Rafi
You seem to be under the common impression that these modern preemptive multitaskers are all about timeslices and quantums.
They are not.
They are all about software and hardware interrupts, and the timer hardware interrupt is only one of many that can set a thread ready and change the set of running threads. The hardware interrupt from a NIC that causes a network driver to run is an example of another one.
If a thread is blocked, waiting for UDP datagrams, and a datagram becomes available because of a NIC interrupt running a driver, the blocked thread will be made ready as soon as the NIC driver has run, because the driver will signal the thread and request an immediate reschedule on exit. If your box is not overloaded with higher-priority ready threads, it will be set running 'immediately' to handle the datagram that is now available. This mechanism provides high-performance I/O and has nothing to do with any timers.
The timer interrupt runs periodically to support sleep() and other system-call timeouts. It runs at a fairly low frequency / long interval (e.g. every 10 ms), because it is another overhead that should be minimised. Running the interrupt at a higher frequency would give finer timer granularity at the expense of increased interrupt and rescheduling overhead that is not justified in most desktop installations.
Summary: your timer operations are subject to 10ms granularity but your datagram I/O responds quickly.
Also, why does your thread need to be active every 10 ms? What are you polling for?
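If the 10 ms period really is needed, one approach sometimes used (a sketch only, and it helps only if the kernel has high-resolution timers enabled, which the granularity described above suggests may not be the case here) is to sleep until an absolute deadline with clock_nanosleep() instead of a relative usleep():

#include <time.h>

/* Sketch: wake every 10 ms against an absolute deadline, so that any
 * lateness in one cycle does not accumulate into the next one. */
void periodic_loop(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        next.tv_nsec += 10 * 1000 * 1000;          /* +10 ms */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        /* ... do the 10 ms work here ... */
    }
}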
In my program, whose RSS is 65 GB, calling fork makes sys_clone->dup_mm->copy_page_range consume more than 2 seconds. In this case, one CPU sits at 100% sys while executing the fork; at the same time, one thread cannot get CPU time until the fork finishes. The machine has 16 CPUs, and the other CPUs are idle.
So my question is: while one CPU is busy with the fork, why doesn't the scheduler migrate the process waiting on that CPU to another, idle CPU? In general, when and how does the scheduler migrate processes between CPUs?
I searched this site, and the existing threads don't answer my question:
How Linux scheduler schedules processes on multi-core processors?
Can a multi-core processor run multiple processes at the same time?
rss is 65G, when call fork, sys_clone->dup_mm->copy_page_range will consume more than 2 seconds
While doing fork (or clone), the VMAs of the existing process have to be copied into the VMAs of the new process. The dup_mm function (kernel/fork.c) creates the new mm and does the actual copy. There are no direct calls to copy_page_range, but I think the static function dup_mmap may be inlined into dup_mm, and it does call copy_page_range.
In dup_mmap several locks are taken, both in the new mm and in the old oldmm:
356 down_write(&oldmm->mmap_sem);
After taking the mmap_sem reader/writer semaphore, there is a loop over all mmaps to copy their metainformation:
381 for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)
Only after the loop (which is long in your case) is mmap_sem unlocked:
465 out:
468 up_write(&oldmm->mmap_sem);
While the mmap_sem rwsem is held by a writer, no other reader or writer can do anything with the mmaps in oldmm.
one thread cannot get cpu time until fork finish
So my question is one cpu was busy on fork, why the scheduler don't migrate the process waiting on this cpu to other idle cpu?
Are you sure that the other thread is ready to run and is not trying to do anything with mmaps, like:
mmaping something new or unmapping something not needed,
growing or shrinking its heap (brk),
growing its stack,
pagefaulting
or many other activities...?
Actually, the wait-cpu thread is my I/O thread, which sends/receives packets from the client. In my observation, the packets always arrive, but the I/O thread cannot receive them.
You should check the stack of your wait-cpu thread (there is even a SysRq for this) and the kind of I/O it does. mmaping a file is the variant of I/O that will be blocked on mmap_sem by the fork.
Also, you can check the "last used CPU" of the wait-cpu thread, e.g. in the top monitoring utility, by enabling the thread view (H key) and adding the "Last used CPU" column to the output (fj in older versions; f, scroll to P, Enter in newer ones). I think it is possible that your wait-cpu thread was already on another CPU, just not allowed (not ready) to run.
If you are using fork only in order to exec, it can be useful to:
either switch to vfork+exec (or just to posix_spawn; see the sketch after this list). vfork suspends your process (but may not suspend your other threads, which is dangerous) until the new process does exec or exit, but exec-ing may be faster than waiting for 65 GB of mmaps to be copied,
or avoid doing the fork from a multithreaded process with several active threads and multi-GB virtual memory. You can create a small helper process (without the multi-GB mmaps), communicate with it using IPC, sockets, or pipes, and ask it to fork and do everything you want.
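A minimal posix_spawn sketch for the first option (the command and arguments here are made-up placeholders):

#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

/* Sketch: start a child without copying the 65 GB parent address space.
 * Depending on the C library version, posix_spawn() uses a vfork-style
 * clone, so the parent's mmaps are not duplicated. */
int run_helper(void)
{
    pid_t pid;
    char *argv[] = { "/bin/true", NULL };    /* placeholder command */

    int err = posix_spawn(&pid, "/bin/true", NULL, NULL, argv, environ);
    if (err != 0)
        return err;                          /* errno-style error code */

    int status;
    waitpid(pid, &status, 0);
    return status;
}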
So I'm learning about threads at the moment and I'm wondering how some things are handled. For example, say I have a program where one thread listens for input and another performs some calculation on a single processor. When the calculation thread is running, what happens if the user should press a button intended for the input thread? Won't the input get ignored by the input thread until it is switched to that specific thread?
It depends a good deal on how the input mechanism is implemented. One easy-but-very-inelegant way to implement I/O is continuous polling... in that scenario, the input thread might sit in a loop, reading a hardware register over and over again, and when the value in the register changes from 0 to 1, the input thread would know that the button is pressed:
void inputThread()
{
while(1)
{
if (some_register_indicates_the_button_is_pressed()) react();
}
}
The problem with this method is that it's horribly inefficient -- the input thread is using billions of CPU cycles just checking the register over and over again. In a multithreaded system running this code, the thread scheduler would switch the CPU between the busy-waiting input thread and the calculation thread every quantum (e.g. once every 10 milliseconds), so the input thread would use half of the CPU cycles and the calculation thread would use the other half. In this system, if the input thread was running at the instant the user pressed the button, the input would be detected almost instantaneously, but if the calculation thread was running, the input wouldn't be detected until the next time the input thread got to run, so there might be as much as a 10 ms delay. (Worse, if the user released the button too soon, the input thread might never notice it was pressed at all.)
An improvement over continuous polling is scheduled polling. It works the same as above, except that instead of the input thread just polling in a loop, it polls once, then sleeps for a little while, then polls again:
void inputThread()
{
while(1)
{
if (some_register_indicates_the_button_is_pressed()) react();
usleep(30000); // sleep for 30 milliseconds
}
}
This is much less inefficient than the first case, since every time usleep() is called, the thread scheduler puts the input thread to sleep and the CPU is made immediately available for any other threads to use. usleep() also sets a hardware timer, and when that hardware timer goes off (30 milliseconds later) it raises an interrupt. The interrupt causes the CPU to leave off whatever it was doing and run the thread-scheduling code again, and the thread-scheduling code will (in most cases) realize that it's time for usleep() to return, and wake up the input thread so it can do another iteration of its loop. This still isn't perfect: the input thread is still using a small amount of CPU on an ongoing basis -- not much, but if you do many instances of this it starts to add up. Also, the problem of the thread being asleep the whole time the button is held down is still there, and potentially even more likely.
Which leads us to interrupt-driven I/O. In this model, the input thread doesn't poll at all; instead it tells the OS to notify it when the button is pressed:
void inputThread()
{
while(1)
{
sleep_until_button_is_pressed();
react();
}
}
The OS's notification facility, in turn, has to set things up so that the OS is notified when the button is pressed, so that the OS can wake up and notify the input thread. The OS does this by telling the button's control hardware to generate an interrupt when the button is pressed; once that interrupt goes off, it works much like the timer interrupt in the previous example; the CPU runs the thread scheduler code, which sees that it's time to wake up the input thread, and lets the input thread run. This mechanism has very nice properties: (1) the input thread gets woken up ASAP when the button is pressed (there's no waiting around for the calculation thread to finish its quantum first), and (2) the input thread doesn't eat up any CPU cycles at all, except when the button is pushed. Because of these advantages, it's this sort of mechanism that is used in modern computers for any non-trivial I/O.
Note that on a modern PC or Mac, there's much more going on than just two threads and a hardware button; e.g. there are dozens of hardware devices (keyboard, mouse, video card, hard drive, network card, sound card, etc) and dozens of programs running at once, and it's the operating system's job to mediate between them all as necessary. Despite all that, the general principles are still the same; let's say that in your example the button the user clicked wasn't a physical button but an on-screen GUI button. In that case, something like the following sequence of events would occur:
User's finger presses the left mouse button down
Mouse's internal hardware sends a mouse-button-pressed message over the USB cable to the computer's USB controller
Computer's USB controller generates an interrupt
Interrupt causes the CPU to break out of the calculation thread's code and run the OS's scheduler routine
The thread scheduler sees that the USB interrupt line indicates a USB event is ready, and responds by running the USB driver's interrupt handler code
USB driver's interrupt handler code reads in the event, sees that it is a mouse-button-pressed event, and passes it along to the window manager
Window manager knows which window has the focus, so it knows which program to forward the mouse-button-pressed event to
Window manager tells the OS to wake up the input thread associated with that window
Your input thread wakes up and calls react()
If you're running on a single processor system, then yes.
Short answer: yes, threads always interact. The problems start to appear when they interact in a non-predictable way. Every thread in a process has access to the entire process memory space, so changing memory in one thread may spoil the data for another thread.
Well, there are multiple ways threads can communicate with each other. One of them is having a global variable and using it as a buffer for communication between threads.
Regarding the button you asked about: there must be a thread containing an event loop. Within this thread, input won't be ignored, in my experience.
You can see some of my threads about this topic:
Here, I was interested in how to make a three-thread application that communicates through events.
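As an illustration of the shared-variable approach mentioned above (a sketch with made-up names; a bare global needs a mutex, and a condition variable lets the waiting thread sleep instead of polling):

#include <pthread.h>

/* Shared "buffer" plus the synchronisation that makes it safe to share. */
static int button_pressed = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Called from the input thread when it sees the button event. */
void notify_button_press(void)
{
    pthread_mutex_lock(&lock);
    button_pressed = 1;
    pthread_cond_signal(&cond);      /* wake the waiting thread */
    pthread_mutex_unlock(&lock);
}

/* Called from the other thread; blocks (using no CPU) until notified. */
void wait_for_button_press(void)
{
    pthread_mutex_lock(&lock);
    while (!button_pressed)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    button_pressed = 0;
    pthread_mutex_unlock(&lock);
}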
The thread waiting for user input will be made ready 'immediately'. On most OSes, threads that were waiting on I/O and have become ready are given a temporary priority boost and, even on a single-core CPU, will 'immediately' preempt another thread that was running at the same priority.
So, if a single-core CPU is running a calculation and another, waiting, thread of the same priority gets input, it will probably run straightaway.