Unfair Linux thread scheduling in a single process

I have a process with two threads.
The first thread does async work: it waits for I/O on descriptors and for timer events in epoll_wait.
The second thread does a lot of I/O and memory work: it reads data from disk, processes it in memory, allocates a lot of new memory, writes it back to disk, and so on.
The problem is that the first thread blocks in epoll_wait for much longer than the requested timeout (e.g. the timeout was specified as 1500 ms, but the return from epoll_wait actually occurs after 10 seconds).
I can reliably reproduce this behavior in a virtual machine (VirtualBox with Ubuntu 16.04).
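For illustration, here is a minimal sketch of that two-thread structure (hypothetical stand-in code; the real program uses boost::asio, as the GDB output below shows, and the 64 MiB churn size is an assumption):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <time.h>

/* Thread 1: blocks in epoll_wait with a ~1500 ms timeout and reports overshoot. */
static void *io_thread(void *arg)
{
    int epfd = epoll_create1(0);
    struct epoll_event events[16];
    for (;;) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int n = epoll_wait(epfd, events, 16, 1500);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_nsec - t0.tv_nsec) / 1000000;
        if (ms > 3000)
            fprintf(stderr, "epoll_wait(1500 ms) returned after %ld ms (n=%d)\n", ms, n);
    }
    return NULL;
}

/* Thread 2: stands in for the heavy disk/memory work. */
static void *worker_thread(void *arg)
{
    for (;;) {
        char *p = malloc(64 << 20);
        if (p) { memset(p, 1, 64 << 20); free(p); }
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, io_thread, NULL);
    pthread_create(&t2, NULL, worker_thread, NULL);
    pthread_join(t1, NULL);
    return 0;
}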
Example of behavior from GDB:
Thread 2.1 "se.real" hit Breakpoint 1, boost::asio::detail::epoll_reactor::run (this=0x826ebe0, block=true, ops=...) at /opt/com/include/boost/158/boost/asio/detail/impl/epoll_reactor.ipp:392
392 in /opt/com/include/boost/158/boost/asio/detail/impl/epoll_reactor.ipp
16:36:38.986826839
$17 = 1945
Thread 2.1 "se.real" hit Catchpoint 3 (call to syscall epoll_wait), 0xf7fd8be9 in __kernel_vsyscall ()
16:36:38.992081396
<INSIDE KERNEL>
Thread 2.1 "se.real" hit Catchpoint 3 (returned from syscall epoll_wait), 0xf7fd8be9 in __kernel_vsyscall ()
16:36:54.681444938
Breakpoint 1 is set on the instruction just before the call to epoll_wait; the printed value is the timeout argument (1945 ms).
The printed times come from the shell command date +"%T.%N".
Catchpoint 3 is a syscall catchpoint for the epoll_wait syscall (the first hit is the syscall entry, the second is the return).
We can easily see that we spent ~16 seconds in the kernel when 1945 ms were requested.
I have gathered a perf record with -e 'sched:*' events from another reproduction, and I clearly see:
se.real 4277 [001] 113049.144027: sched:sched_switch: prev_comm=se.real prev_pid=4277 prev_prio=120 prev_state=t|K ==> next_comm=strace next_pid=4142 next_prio=120
se.real 4277 [001] 113056.407952: sched:sched_stat_runtime: comm=se.real pid=4277 runtime=153767 [ns] vruntime=409222246640 [ns]
There is no other sched event for thread 4277 (the first thread, with async I/O and epoll_wait) for ~7 seconds. In the meantime there is a lot of scheduling activity between these two events. This activity involves the second thread (the one doing a lot of I/O/memory work), swapper/kswapd, and other userspace processes.
The question is: what can I do to give the first thread a chance to run?
Update: changing the scheduling policy of the process to SCHED_FIFO doesn't solve the problem; I can still reliably reproduce the issue.
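For reference, a minimal sketch of giving only one thread (rather than the whole process) SCHED_FIFO via pthread_setschedparam; this is an assumed illustration of the kind of change tried above, not the poster's actual code, and as noted it did not cure the problem here:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Priority 10 is an arbitrary example value; needs root or CAP_SYS_NICE. */
    struct sched_param sp = { .sched_priority = 10 };
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
    else
        printf("calling thread now runs under SCHED_FIFO\n");
    return 0;
}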

Related

Perf stat counts context-switches in what way?

perf stat displays some interesting statistics that can be gathered from examining hardware and software counters.
In my research, I couldn't find any reliable information about what counts as a context-switch in perf stat. In spite of my efforts, I was unable to understand the kernel code in its entirety.
Suppose my InfiniBand network application calls a blocking read system call in the event mode 2000 times, and perf stat counts 1,241 context switches. Do the context-switches refer to scheduling in, scheduling out, or both?
The __schedule() function (kernel/sched/core.c) increments the switch_count counter whenever prev != next.
It seems that perf stat's context-switches include involuntary switches as well as voluntary switches.
It seems to me that only deschedule events are counted, given that the current context is the one that runs the scheduler code and increments the nvcsw and nivcsw counters in its task_struct.
output from perf stat -- my_application:
1,241 context-switches
Meanwhile, if I only count the sched:sched_switch event the output is close to the expected number.
output from perf stat -e sched:sched_switch -- my_application:
2,168 sched:sched_switch
Is there a difference between context-switches and the sched:sched_switch event?
I think you only get a count for context-switches if a different task actually runs on a core that was running one of your threads. A read() that blocks, but resumes before any user-space code from any other task runs on the core, probably won't count.
Just entering the kernel at all for a system-call clearly doesn't count; perf stat ls only counts one context-switch in a largish directory for me, or zero if I ls a smaller directory like /. I get much higher counts, like 711 for a recursive ls of a directory that I hadn't accessed recently, on a magnetic HDD. So it spent significant time waiting for I/O, and maybe running bottom-half interrupt handlers.
The fact that the count can be odd means it's not counting both deschedule and re-schedule separately; since I'm looking at counts for a single-threaded process that eventually exited, if it was counting both the count would have to be even.
I expect the counting is done when schedule() decides that current should change to point to a new task that isn't this one. (current is the Linux kernel's per-core variable that points to the task_struct of the current task, e.g. a user-space thread.) So every time that happens to a thread that's part of your process, you get 1 count.
Indeed, the OP helpfully tracked down the source code; it's in __schedule in kernel/sched/core.c. For example, in Linux 6.1:
static void __sched notrace __schedule(unsigned int sched_mode)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    // and some other declarations I omitted
    ...
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);                 // stands for run queue
    prev = rq->curr;
    ...
    switch_count = &prev->nivcsw;     // either Num InVoluntary CSWs I think
    ...
    if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
        ...
        switch_count = &prev->nvcsw;  // or Num Voluntary CSWs
    }
    next = pick_next_task(rq, prev, &rf);
    ...
    if (likely(prev != next)) {
        ...
        ++*switch_count;              //// INCREMENT THE SELECTED COUNTER
        ...
        trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
        // then make some function calls to actually do the context switch
        ...
    }
I would guess the context-switches perf event sums both involuntary and voluntary switches away from a thread. (Assuming that's what nv and niv stand for.)
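One way to look at the two counters separately from user space is /proc/[pid]/status, where nvcsw/nivcsw show up as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. A small sketch (not part of the original answer):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Print this process's voluntary/involuntary context-switch counters. */
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof line, f))
        if (strstr(line, "ctxt_switches"))
            fputs(line, stdout);
    fclose(f);
    return 0;
}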

What is the necessity of calling set_current_state with TASK_RUNNING after the thread is woken up?

As far as I understand, after a Linux kernel thread goes into interruptible sleep, it can be woken up by two things:
by a wake_up family function call or
by a signal.
I have seen the following pattern in the kernel. Why is it necessary to call set_current_state(TASK_RUNNING) at line 8? Isn't the thread already in the TASK_RUNNING state at that point?
1  set_current_state(TASK_INTERRUPTIBLE);
2  spin_lock(&list_lock);
3  if (list_empty(&list_head)) {
4      spin_unlock(&list_lock);
5      schedule();
6      spin_lock(&list_lock);
7  }
8  set_current_state(TASK_RUNNING);
9
10 /* Rest of the code ... */
11 spin_unlock(&list_lock);
If list_empty(&list_head) is false, it will not call schedule() and go to sleep. In that case the thread needs to set its own state back to TASK_RUNNING to prevent inadvertent sleeps later.
This is to avoid the lost wake-up problem. Read this.
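The point of the ordering above is that the task marks itself TASK_INTERRUPTIBLE before checking the condition, so a wake-up arriving between the check and schedule() simply sets it back to TASK_RUNNING and is not lost. A loose user-space analogue of the same rule (pthreads, purely illustrative; the names are made up) is to test the predicate and sleep under the same lock:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool item_ready = false;

/* Consumer: the predicate check and the sleep happen under the same lock,
 * so a wake-up cannot slip in between them and get lost. */
void consumer_wait(void)
{
    pthread_mutex_lock(&lock);
    while (!item_ready)
        pthread_cond_wait(&cond, &lock);   /* atomically unlocks and sleeps */
    item_ready = false;
    pthread_mutex_unlock(&lock);
}

/* Producer: publish the item, then signal. */
void producer_post(void)
{
    pthread_mutex_lock(&lock);
    item_ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}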

Excessive Linux Latency

Is a latency of 50 ms normal on a Linux system?
I have a program with many threads; one thread controls the movement of an object with a motor and photocells.
I have done many things to get minimum latency, but I always get 50 ms delays that cause a position error in the object.
Things I did (a sketch of the equivalent calls follows the list):
- set nice to -20
- priority of the photocell-control thread: SCHED_FIFO, 99
- kernel configuration: CONFIG_PREEMPT=y
- mlockall (MCL_CURRENT | MCL_FUTURE);
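A sketch of roughly equivalent calls for the steps above (assumed equivalents, not the original program; here SCHED_FIFO is applied to the calling thread):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };
    int rc;

    if (setpriority(PRIO_PROCESS, 0, -20) != 0)                   /* nice -20 */
        perror("setpriority");
    rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);  /* RT prio 99 */
    if (rc != 0)
        fprintf(stderr, "pthread_setschedparam: error %d\n", rc);
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)                  /* lock memory */
        perror("mlockall");

    /* ... start the photocell-control loop here ... */
    return 0;
}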
Many times, I lose 50 ms waiting for a photocell. I think the problem is not another of my threads, but a process in the kernel.
Is it possible to reduce this latency? Is it possible to know what is consuming these extra 50 ms?
The thread that controls the photocells makes many "read" calls. Can this cause problems?
/**********/
The situation now is:
There is only one thread, running an infinite empty loop that only reads the time at the start and at the end of the loop.
No access to disk, no access to GPIO, no serial ports, nothing.
The loop often takes 50 milliseconds.
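A sketch of that measurement loop (assumed; the original code is not shown), reading the clock at the top and bottom of an otherwise empty loop and flagging iterations that take unexpectedly long:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* empty loop body: nothing between the two clock reads */
        clock_gettime(CLOCK_MONOTONIC, &end);
        long ms = (end.tv_sec - start.tv_sec) * 1000
                + (end.tv_nsec - start.tv_nsec) / 1000000;
        if (ms >= 50)
            printf("iteration took %ld ms\n", ms);
    }
    return 0;
}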
I have not set CPU affinity; my processor has only one core.
I have been running tests in my program.
This is the code in the main function, before the program starts the threads, that causes the 50 ms latency:
struct sched_param lsPrio;
lsPrio.sched_priority = 1;
if (sched_setscheduler (0, SCHED_FIFO, &lsPrio) != 0)
printf ("FALLO sched_set\n");
If I comment out these lines, the latency drops to about 1 ms.
Why do these lines cause latency?

Process / thread scheduling on Linux: X server not running on other cpu cores?

I am unable to understand what I think is a peculiar situation with regard to process/thread scheduling on Linux.
[Env: Ubuntu 12.10, kernel ver 3.5.0-...]
A 'test' application (call it sched_pthread) will have a total of three threads: 'main' plus two others; main() spawns the two new threads (a condensed sketch appears after the thread descriptions below):
Thread 1 [main()]:
Runs as SCHED_NORMAL (or SCHED_OTHER). It:
Creates two threads (Thread 2 and Thread 3 below); they will automatically inherit the scheduling policy and priority of main.
Prints the character “m” to the terminal in a loop.
Terminates.
Thread 2 [t2]:
Sleeps for 2 seconds.
Changes its scheduling policy to SCHED_FIFO, setting its real-time priority to the value passed on the command line.
Prints the character “2” to the terminal in a loop.
Terminates.
Thread 3 [t3]:
Changes its scheduling policy to SCHED_FIFO, setting its real-time priority to the value passed on the command line plus 10.
Sleeps for 4 seconds.
Prints the character “3” to the terminal in a loop.
Terminates.
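A condensed sketch of the described program (the actual source is linked below; the helper names, loop lengths, and default priority here are illustrative only):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int rt_prio;                             /* real-time priority from argv[1] */

static void go_fifo(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
        fprintf(stderr, "pthread_setschedparam failed\n");
}

static void spin_print(char c)                  /* "prints ... in a loop" */
{
    for (long i = 0; i < 100000000L; i++)
        if (i % 1000000 == 0) { putchar(c); fflush(stdout); }
}

static void *t2(void *arg) { sleep(2); go_fifo(rt_prio);      spin_print('2'); return NULL; }
static void *t3(void *arg) { go_fifo(rt_prio + 10); sleep(4); spin_print('3'); return NULL; }

int main(int argc, char **argv)
{
    pthread_t a, b;
    rt_prio = (argc > 1) ? atoi(argv[1]) : 8;
    pthread_create(&a, NULL, t2, NULL);
    pthread_create(&b, NULL, t3, NULL);
    spin_print('m');                            /* main stays SCHED_OTHER */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}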
We run it as root.
As per the scheduling policy, we should first see main() print 'm' for about 2 s; then it should get preempted by t2 (as it awakens after 2 s) and we should see '2' appearing on the terminal for about 2 s, after which t3 wakes up (it was asleep for 4 s); it should now preempt everyone else and emit '3' to the display; after it dies, we should see '2's until t2 dies, then 'm's until main() dies.
So okay, this works when I test it in console mode (no X server).
Of course, I take care to run it as:
sudo taskset 02 ./sched_pthrd 8
so that in effect it runs on only 1 processor core.
When I run the same thing in graphical mode (with X), after the initial 'm's by main(), there is a long-ish pause (a few seconds) during which nothing appears on the screen; then all of a sudden we get the 2's and 3's and m's slapped onto the screen!
This can be explained: the X server (Xorg) was preempted by the SCHED_FIFO threads and hence could not 'paint' pixels on the screen.
However (here's the question at last): how come the Xorg process was not scheduled/migrated onto some other core, so that it could continue updating the screen in parallel with the RT threads?
taskset verifies that the CPU affinity mask of Xorg is 'f' (1111b); I have 4 cores on my laptop.
Any ideas??
Here's the source code:
https://dl.dropboxusercontent.com/u/9301413/code_shared/so_sched_pthrd.c
-or-
http://goo.gl/PLHBrC
TIA!
-Kaiwan.

How are system calls interrupted by signal?

My understanding is as follows:
A blocking syscall normally places the process in the TASK_INTERRUPTIBLE state, so that when a signal is delivered, the kernel puts the process into the TASK_RUNNING state. The process is then scheduled to run at the next timer tick, and the syscall is interrupted.
But I did a small test, and it failed. I wrote a usermode process that called sleep(), and I changed the process's state to TASK_RUNNING in the kernel, but sleep() was not interrupted at all and the process kept sleeping.
Then I tried wake_up_process(process); it failed.
Then I tried set_tsk_thread_flag(process, TIF_SIGPENDING); it failed.
Then I tried set_tsk_thread_flag(process, TIF_SIGPENDING) together with wake_up_process(process), and it succeeded! sleep() was interrupted and the process started to run.
So it's not that simple. Does anyone know exactly how system calls are interrupted by a signal?
Check out __send_signal from signal.c. It calls complete_signal near the end, which eventually calls this little function:
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
    set_tsk_thread_flag(t, TIF_SIGPENDING);
    /*
     * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
     * case. We don't check t->state here because there is a race with it
     * executing another processor and just now entering stopped state.
     * By using wake_up_state, we ensure the process will wake up and
     * handle its death signal.
     */
    if (!wake_up_state(t, state | TASK_INTERRUPTIBLE))
        kick_process(t);
}
And that's how you do it. Note that it is not enough to set the thread flag: you have to use a wakeup function to ensure the process is scheduled.
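From user space, the visible effect of that wake-up path is that the blocked call returns early with EINTR (when the handler is installed without SA_RESTART). A small illustrative sketch, not taken from the original answer:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void handler(int sig) { (void)sig; }     /* empty handler, no SA_RESTART */

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigaction(SIGALRM, &sa, NULL);

    alarm(1);                                   /* deliver SIGALRM in 1 second */

    struct timespec ts = { .tv_sec = 10, .tv_nsec = 0 };
    if (nanosleep(&ts, NULL) == -1 && errno == EINTR)
        printf("nanosleep was interrupted by the signal (EINTR)\n");
    return 0;
}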
