I'm running in a driver's context in the Linux kernel. This driver writes a value to a register, an operation which takes some time (~5 msec). I would like to sleep during that time in order to give the CPU to other threads, but it is very important to me to get the CPU back immediately after I wake up (there's a short timeout which I must not exceed).
The same question goes for taking a mutex: say I'm blocking on a mutex (and triggering a reschedule). How can I ensure that I'll get the CPU back immediately when that mutex is released?
Is there a way to do this? What does it involve? (Setting the priority of the process? A special scheduling mode? Changing the kernel config?)
I'll rephrase the question about the mutex since it's a bit more complicated:
I have a mutex which is used by important threads (important because of that timeout limit). I would like to take this mutex, knowing that if I block on it and get rescheduled, the lock will be released quickly (because these threads will have a high priority), and immediately after that, my blocked thread will be able to run (and not some other, unrelated program).
This way I can save CPU time while not risking a timeout violation.
I currently use busy waiting in order to avoid rescheduling (my kernel is non-preemptive), but I don't like this solution.
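You said you are observing delays while writing. I think in this situation you can use the schedule_timeout function. Device drivers use this technique while writing to registers so that they don't lock up the system. Note that the task state has to be set first (e.g. set_current_state(TASK_UNINTERRUPTIBLE)); otherwise the task stays runnable and the call returns without waiting for the timeout. Recently, I came across a problem where writing to a register was causing delays; I am planning to use schedule_timeout in my case too.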
Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?

Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.). It seems that if all operations on kernel-level threads go through system calls, the answer to my question would be yes. However, I've searched a lot to find out whether kernel-level threads are managed entirely through system calls and found nothing. And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time. So it seems impossible for programmers to write explicit system calls to manage threads. I'd appreciate any ideas.
Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler which finds out that the current task has to wait while the kernel fetches data from swap space or from a file (if the virtual memory is part of a memory mapped file), or has to wait for the kernel to free up RAM by sending other pages to swap space (if the virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so user-space task switching is faster; but only if the kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if the kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads), doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spent doing other work).
And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For example: most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority threads" possibilities); and most CPUs have thermal management (where the scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.
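To get a rough feel for the gap described above, a micro-benchmark along these lines can be used. This is only a sketch: the function names are made up for illustration, getppid() is picked as a cheap call that (on Linux with glibc, as far as I know) really enters the kernel every time, and the numbers depend heavily on the CPU, the kernel version and any mitigations in effect.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>

/* Trivial procedure used as the "normal call" baseline. */
static int plain_call(void)
{
    return 1;
}

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost of one ordinary call. The volatile function pointer
   keeps the compiler from inlining or deleting the call. */
double ns_per_procedure_call(long iters)
{
    int (*volatile fn)(void) = plain_call;
    volatile int sink = 0;
    double start = now_ns();

    for (long i = 0; i < iters; i++)
        sink += fn();
    return (now_ns() - start) / (double)iters;
}

/* Average cost of one call that crosses into the kernel
   (assumption: getppid() is not cached in user space). */
double ns_per_system_call(long iters)
{
    volatile long sink = 0;
    double start = now_ns();

    for (long i = 0; i < iters; i++)
        sink += (long)getppid();
    return (now_ns() - start) / (double)iters;
}
```

Calling both with the same iteration count and printing the two averages gives a machine-specific comparison; the ratio, not the absolute numbers, is the interesting part.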
Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived cpu throughput is based on pipelining, speculative execution and fetching. The syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching, is a store and instruction barrier, and may flush the write fifo.
So, the processor slows down to its ‘spec’ speed around the syscall, accelerating back up until the syscall return, whereupon it does about the exact same thing.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and not conciliatory enough apologies from not embarrassed enough cpu product managers. Google spectre as an example, then follow the associated links.
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and these modes (with the possible exception of the iAPX 432 / 80286) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
C#: When will thread switching most probably occur?

I was wondering when .NET would most probably switch from one thread to another?
I understand we can't predict exactly when this will happen, but is there any intelligence in this? For example, when a thread is executing, will the runtime try to wait for a method to return or a loop to finish before switching?
I'm not an expert on .NET, but in general scheduling is handled by the kernel. A switch happens for one of these reasons:
Your thread's timeslice has expired (threads/processes only get a certain amount of CPU time).
Your thread has blocked for I/O.
Some other, more obscure reason, like waiting for an IPC message, a network packet or something.
Threads can be preempted at any point along their execution path, be it in a loop or returning from a function. This in general isn't handled by the underlying VM (.NET or JVM) but is controlled by the OS.
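The causes listed above can be observed from inside a process. The sketch below is in C rather than C#, and assumes a Linux/glibc system where getrusage() fills in ru_nvcsw (switches that happened because a thread blocked) and ru_nivcsw (switches that happened because the kernel preempted it). The struct and function names are my own.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <time.h>

struct switch_counts {
    long voluntary;    /* thread gave up the CPU by blocking */
    long involuntary;  /* thread was preempted by the scheduler */
};

/* Snapshot of the context switch counters for the whole process.
   Both fields stay 0 if getrusage() fails. */
struct switch_counts read_switch_counts(void)
{
    struct rusage ru;
    struct switch_counts c = {0, 0};

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        c.voluntary = ru.ru_nvcsw;
        c.involuntary = ru.ru_nivcsw;
    }
    return c;
}

/* Sleep for the given time and report how many voluntary switches
   were recorded while doing so. */
long voluntary_switches_while_sleeping(long millis)
{
    struct timespec delay = { millis / 1000, (millis % 1000) * 1000000L };
    struct switch_counts before = read_switch_counts();

    nanosleep(&delay, NULL);

    struct switch_counts after = read_switch_counts();
    return after.voluntary - before.voluntary;
}
```

Comparing the two counters before and after a CPU-bound loop, and again before and after a blocking call, shows which kind of switch each one triggers. Some sandboxed environments may not maintain these counters at all.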
Of course there is 'intelligence', of a sort:). The set of running threads can only change upon an interrupt, either:
An actual hardware interrupt from a peripheral device, eg. disk, NIC, KB, mouse, timer.
A software interrupt (i.e. a system call) that can change the state of one or more threads. This encompasses sleep calls and calls to wait/signal on inter-thread synchronization objects, as well as I/O calls that request data that is not immediately available.
If there is no interrupt, the OS cannot change the set of running threads because it is not entered. The OS does not know or care about loops, function/methods calls, (except those that make system calls as above), gotos or any other user-level flow-control mechanisms.
I've only just read your question, so this may no longer be relevant, but after reading the answers above I want to make one thing clear:
Threads are managed (as far as I know) by the process they belong to, not by the operating system. That is the main reason why working with multiple threads is faster than working with multiple processes: threads share data, and switching between them is faster than the context switch between processes performed by the short-term scheduler.
(NOTE: There are two types of threads, user-mode threads and kernel-mode threads, and an OS can have both of them or just one of them. In any case, a thread running in a user application environment is considered a user-mode thread and is managed by the process it belongs to.)
multi-threading in fedora

I've written a multi-threaded program with pthread. My CPU is dual core, but the program does not run in parallel. I attached the system monitor output below.
My question is, does Fedora 13 support multi-threading?
Your question is incomplete, so this answer may not be fully effective. I will revise it when there is more information.
However, here are a few things you should be able to work out:
Are any threads waiting for each other?
Is there a deadlock among the threads, where both threads are effectively sleeping?
Is there too much I/O involved? (Waiting on sockets, reading or writing to disk, and even heavy use of printf count.)
Do any of the threads sleep for long periods (usleep, nanosleep, etc.)?
If any of the above conditions is true, the threads will not run in parallel even if a CPU is available, because a thread that is waiting cannot execute until the work it is waiting on is done.
The second limitation of your question is the measurement. Your chart shows whole-system throughput. Even with one CPU, thread switching can look transparent, because the threads switch within a matter of a few tens or hundreds of milliseconds. And if both of your threads run on the same CPU, you can never see when they switched. In fact, the graph you are seeing is shared not only by your 2 threads but by all the other processes running on the system.
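As a baseline for comparison, here is a minimal pthread sketch in which the threads never wait on each other, do no I/O and never sleep, so nothing from the checklist above applies. The names (run_workers, online_cpus, MAX_WORKERS) are invented for this example, and _SC_NPROCESSORS_ONLN is a glibc extension rather than strict POSIX. Build with -pthread and without optimization, or the compiler may reduce the loop to a formula.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <unistd.h>

#define MAX_WORKERS 16

struct job {
    unsigned long iters;   /* how many loop iterations to run */
    unsigned long result;  /* filled in by the worker */
};

/* CPU-bound work: no locks, no I/O, no sleeping. */
static void *worker(void *arg)
{
    struct job *j = arg;
    unsigned long sum = 0;

    for (unsigned long i = 0; i < j->iters; i++)
        sum += i;
    j->result = sum;
    return NULL;
}

/* Number of CPUs the kernel currently has online. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Start up to n workers (capped at MAX_WORKERS), wait for all of
   them, and return the sum of the results of those that started. */
unsigned long run_workers(int n, unsigned long iters)
{
    pthread_t tid[MAX_WORKERS];
    struct job jobs[MAX_WORKERS];
    unsigned long total = 0;
    int started = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;
    for (int i = 0; i < n; i++) {
        jobs[i].iters = iters;
        jobs[i].result = 0;
        if (pthread_create(&tid[i], NULL, worker, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].result;
    }
    return total;
}
```

If online_cpus() reports 2 and run_workers(2, N) with a large N takes about as long as run_workers(1, N), the threads are running in parallel. If it takes roughly twice as long, something is serializing them.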
prevent linux thread from being interrupted by scheduler

How do you tell the thread scheduler in linux to not interrupt your thread for any reason? I am programming in user mode. Does simply locking a mutex accomplish this? I want to prevent other threads in my process from being scheduled when a certain function is executing. They would block and I would be wasting cpu cycles with context switches. I want any thread executing the function to be able to finish executing without interruption even if the thread's timeslice is exceeded.
How do you tell the thread scheduler in linux to not interrupt your thread for any reason?
Can't really be done; you need a real-time system for that. The closest thing you'll get with Linux is to
set the scheduling policy to a real-time policy, e.g. SCHED_FIFO, and also set the PTHREAD_EXPLICIT_SCHED attribute. Even then, though, IRQ handlers and other stuff will interrupt your thread and run.
However, if you only care about the threads in your own process not being able to do anything, then yes, having them block on a mutex your running thread holds is sufficient.
The hard part is to coordinate all the other threads to grab that mutex whenever your thread needs to do its thing.
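A sketch of the SCHED_FIFO plus PTHREAD_EXPLICIT_SCHED setup mentioned above, using the standard pthread attribute calls. The helper names and the fallback behaviour are my own choices for illustration: real-time policies normally need root or CAP_SYS_NICE, so this version falls back to a default thread when the real-time request is refused.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>

/* Placeholder for the function that should not be delayed. */
void *demo_body(void *arg)
{
    (void)arg;
    return NULL;
}

/* Try to create a thread that runs under SCHED_FIFO at the given
   priority. Returns 0 on success or a pthread error number. */
static int create_fifo(pthread_t *thread, int priority,
                       void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    struct sched_param param = {0};
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    param.sched_priority = priority;

    /* Without EXPLICIT_SCHED the new thread would inherit the
       creator's policy and ignore the two settings below. */
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    if (rc == 0)
        rc = pthread_attr_setschedparam(&attr, &param);
    if (rc == 0)
        rc = pthread_create(thread, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

/* Returns 1 if the thread runs under SCHED_FIFO, 0 if it had to
   fall back to default scheduling, -1 if no thread was created. */
int start_thread_prefer_fifo(pthread_t *thread, int priority,
                             void *(*fn)(void *), void *arg)
{
    if (create_fifo(thread, priority, fn, arg) == 0)
        return 1;
    if (pthread_create(thread, NULL, fn, arg) == 0)
        return 0;
    return -1;
}
```

A return value of 0 is worth logging in a real program, since it means the thread has no real-time guarantees at all. On Linux the valid SCHED_FIFO priority range is 1 to 99.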
You should architect your sw so you're not dependent on the scheduler doing the "right" thing from your app's point of view. The scheduler is complicated. It will do what it thinks is best.
Context switches are cheap. You say
I would be wasting cpu cycles with context switches.
but you should not look at it that way. Use the multi-threaded machinery of mutexes and blocked / waiting processes. The machinery is there for you to use...
You can't. If you could, what would prevent your thread from never giving up the CPU and starving other threads?
The best you can do is set your threads priority so that the scheduler will prefer it over lower priority threads.
Why not simply let the competing threads block, then the scheduler will have nothing left to schedule but your living thread? Why complicate the design second guessing the scheduler?
Look into real-time scheduling under Linux. I've never done it, but if you really do NEED this, it is as close as you can get in user application code.
Please point me to tools or a way to monitor which thread is running at the millisecond level

Please point me to tools or a way to monitor which thread is running at the millisecond level. Thanks.
Suppose I have 3 threads running, and I want information like below:
0 - 20ms thread1
20 - 40ms thread2
40 - 50ms thread1
50 - 70ms thread3
NOTES: I prefer to solve this problem without hacking into the kernel.
This is on a MIPS platform with the 2.6.21 Linux kernel.
The top command can provide some information about threads, but not much.
You can use LTTng to trace scheduling activity (along with lots of other things!) with a suitably configured kernel.
That said, I looked at your nabble link - your real problem seems to be that your write thread is blocking the read thread, right? One thing to consider trying would be to use a database that supports concurrent reads and writes. Or use a locking protocol to block the write thread when the read thread is active.
For example, you could have a mutex, a condvar, and a wants_read value. Before each write, the write thread takes the mutex and checks the wants_read value. If it's nonzero, it blocks on the condvar. Meanwhile, the read thread increments wants_read under the mutex when it begins and, when done, decrements it and broadcasts on the condvar. This should cause the write thread to block as soon as it is safe when the read thread wants in.
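Roughly, that protocol could look like the following with pthreads. This is a sketch under a few assumptions of mine: the writer keeps the mutex held for the duration of one write (so a reader that arrives mid-write waits for just that one write), and the "database write" is stood in for by a plain assignment. All the function names are invented.

```c
#include <pthread.h>

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t readers_done = PTHREAD_COND_INITIALIZER;
static int wants_read = 0;   /* number of readers currently active */

/* Read thread: announce that a read is starting. If a write is in
   progress, this blocks on the mutex until that write finishes. */
void reader_begin(void)
{
    pthread_mutex_lock(&db_lock);
    wants_read++;
    pthread_mutex_unlock(&db_lock);
}

/* Read thread: the read is over, wake a waiting writer. */
void reader_end(void)
{
    pthread_mutex_lock(&db_lock);
    wants_read--;
    pthread_cond_broadcast(&readers_done);
    pthread_mutex_unlock(&db_lock);
}

/* Write thread: wait until no reader is active, then do one write. */
void writer_write(int *record, int value)
{
    pthread_mutex_lock(&db_lock);
    while (wants_read != 0)
        pthread_cond_wait(&readers_done, &db_lock);
    *record = value;   /* stand-in for one database write */
    pthread_mutex_unlock(&db_lock);
}

/* Current reader count, mainly useful for debugging. */
int pending_readers(void)
{
    pthread_mutex_lock(&db_lock);
    int n = wants_read;
    pthread_mutex_unlock(&db_lock);
    return n;
}
```

The read thread wraps each query in reader_begin() / reader_end(), and the write thread calls writer_write() once per write. Keeping each write small matters, because a reader can be delayed by at most one write in progress.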
For the specific problem you mentioned in the comment: a thread without usleep will busy-loop and take much of the processor's time, and then you will get slow database search responses.
In general, if you want to check the thread scheduling sequence and do not want to bother installing LTTng, you can use a trick I have used. I add some simple syscalls like open, close or time with invalid parameters to the thread's key path (this has low overhead compared to printf, and printf sometimes involves a thread lock), and then use the strace tool to track all these threads. In the strace log you can see when each thread is scheduled in and when other threads are scheduled in. That gives you a general idea of what each thread spends most of its time doing, and which thread takes most of the system's time.
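A marker for that trick might look like the sketch below. The function name and the way the marker id is encoded into the (deliberately invalid) file descriptor are just one possible convention, not something strace requires.

```c
#include <errno.h>
#include <unistd.h>

/* Emit a cheap, harmless system call that is easy to spot in an
   strace log. For id >= 0 the descriptor is always negative, so
   close() fails with EBADF and touches no real file.
   Returns the error code of the marker call (EBADF is expected). */
int trace_mark(int id)
{
    int saved = errno;          /* do not disturb the caller's errno */
    int rc = close(-1000 - id);
    int err = (rc == -1) ? errno : 0;

    errno = saved;
    return err;
}
```

Put trace_mark(1), trace_mark(2), ... at the interesting points of each thread and run the program under something like "strace -f -tt -e trace=close ./yourprog" (yourprog being a placeholder). Each marker then shows up as a timestamped close call on a recognizable negative descriptor, tagged with the id of the thread that made it.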
LTTng is definitely the best tool for such problems, provided you can get it to work.
Intel Concurrency Checker will work on windows and linux. I haven't used it, so I don't know a lot of details, but I have heard that it will do performance measurements. It might be worth a try.
