How does Go preempt goroutines on Windows? - multithreading

I read that goroutines are now preemptible. The preemption is done via a sysmon goroutine that sends stop signals to goroutines which have used up their time slice. On POSIX systems, I believe this is done through pthread_kill. My question is: how does this work on Windows, since Windows doesn't support thread signaling? I had assumed that the Go runtime might be using a POSIX thread library like pthreads4w, but I just saw that even in pthreads4w, pthread_kill doesn't support sending signals.

The comments in runtime/preempt.go give an overview of how preemption works in the runtime. Specifically, on asynchronous preemption:
Preemption at asynchronous safe-points is implemented by suspending the thread using an OS mechanism (e.g., signals) and inspecting its state to determine if the goroutine was at an asynchronous safe-point.
So how does async preemption work on Windows? As mentioned in the original proposal for non-cooperative preemption of goroutines:
Other considerations
... signaled preemption is quite easy to support in Windows because it provides SuspendThread and GetThreadContext ...
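The Windows SuspendThread function suspends a thread by its handle, and GetThreadContext retrieves the processor state of the suspended thread. The runtime's use of these functions is in runtime/os_windows.go: preemptM suspends the target thread, reads its context, and, if the goroutine is at an asynchronous safe-point, rewrites the saved context so that the thread calls asyncPreempt once ResumeThread lets it run again.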

Related

How are threads/processes parked and woken in Linux, prior to futex?

Before the futex system calls existed in Linux, what underlying system calls were used by threading libraries like pthreads to block/sleep a thread and to subsequently wake those threads from userland?
For example, if a thread tries to acquire a mutex, the userland implementation will block the thread (perhaps after a short spinning interval), but I can't find the syscalls that are used for this (other than futex which are a relatively recent creation).
Before futex and the current pthreads implementation for Linux, NPTL (which requires kernel 2.6 or newer), there were two other threading libraries offering the POSIX Thread API on Linux: LinuxThreads and NGPT (which was based on GNU Pth). LinuxThreads was the only widely used libpthread for years (and it can still be found in some odd, unmaintained micro-libc variants that target 2.4 kernels; other micro-libc variants may have their own built-in pthread-like API on top of futex+clone). GNU Pth itself is not a kernel-level thread library: it runs everything in a single process and switches between user-level "threads" itself.
You should know that there are several threading models, which differ in whether the kernel knows about some or all of the user threads. This determines how many CPU cores a program can use by adding threads, what each thread costs, and how many threads can be started. The models are named M:N, where M is the number of userspace threads and N is the number of threads schedulable by the OS kernel:
"1:1" kernel-level threading - every userspace thread is schedulable by the OS kernel. This is implemented in LinuxThreads, NPTL and many modern OSes.
"N:1" user-level threading - userspace threads are scheduled in userspace and are all invisible to the kernel, which schedules only one process (so it may use only 1 CPU core). GNU Pth (GNU Portable Threads) is an example, and there are many other implementations for various computer architectures.
"M:N" hybrid threading - some entities are visible to and schedulable by the OS kernel, but each of them may host more than one user-space thread, and user-space threads sometimes migrate between kernel-visible threads.
With the 1:1 model there are many classic sleep mechanisms/APIs in Unix, such as select/poll, signals and other IPC APIs. As I remember, LinuxThreads used a separate process for every thread (with fully shared memory), and there was a special manager "thread" (process) to emulate some POSIX thread features. Wikipedia says that SIGUSR1/SIGUSR2 were used in LinuxThreads for internal communication between threads, and IBM says the same: "The synchronization of primitives is achieved by means of signals. For example, threads block until awoken by signals." See also the project FAQ, http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#H.4, "With LinuxThreads, I can no longer use the signals SIGUSR1 and SIGUSR2 in my programs! Why?":
LinuxThreads needs two signals for its internal operation. One is used to suspend and restart threads blocked on mutex, condition or semaphore operations. The other is used for thread cancellation.
On ``old'' kernels (2.0 and early 2.1 kernels), there are only 32 signals available and the kernel reserves all of them but two: SIGUSR1 and SIGUSR2. So, LinuxThreads has no choice but use those two signals.
With the "N:1" model a thread may call a blocking syscall and block everything (some libraries convert some blocking syscalls into async ones, or use some SIGALRM or SIGVTALRM magic); or it may call a (very) special internal threading function that does user-space thread switching by rewriting the machine state registers (like switch_to in the Linux kernel: save the IP/SP and other registers, then restore the IP/SP and registers of the other thread). So the kernel does not wake any user thread directly; it just schedules the whole process, and the user-space scheduler implements the thread synchronization logic (or just calls sched_yield or select when no thread has work to do).
With the M:N model things are very complicated... I don't know much about NGPT... There is one paragraph about NGPT in POSIX Threads and the Linux Kernel, Dave McCracken, OLS2002,330, page 5:
There is a new pthread library under development called NGPT. This library is based on the GNU Pth library, which is an M:1 library. NGPT extends Pth by using multiple Linux tasks, thus creating an M:N library. It attempts to preserve Pth’s pthread compatibility while also using multiple Linux tasks for concurrency, but this effort is hampered by the underlying differences in the Linux threading model. The NGPT library at present uses non-blocking wrappers around blocking system calls to avoid blocking in the kernel.
Some papers and posts: POSIX Threads and the Linux Kernel, Dave McCracken, OLS2002,330, LWN post about NPTL 0.1
The futex system call is used extensively in all synchronization
primitives and other places which need some kind of
synchronization. The futex mechanism is generic enough to support
the standard POSIX synchronization mechanisms with very little
effort. ... Futexes also allow the implementation of inter-process
synchronization primitives, a sorely missed feature in the old
LinuxThreads implementation (Hi jbj!).
NPTL design pdf:
5.5 Synchronization Primitives
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application.
Fortunately some new functionality was added to the kernel to implement all kinds of synchronization primitives: futexes [Futex]. The underlying principle is simple but powerful enough to be adaptable to all kinds of uses. Callers can block in the kernel and be woken either explicitly, as a result of an interrupt, or after a timeout.
Futex stands for "fast userspace mutex." It is a kernel facility for building locks whose uncontended path stays entirely in user space; the kernel is entered only when a thread has to wait or a waiter has to be woken, so it implements the wait system for you. Before and after futex(), threads were put to sleep and awoken via a change in their process state. The process states are:
Running state
Sleeping state
Un-interruptible sleeping state (e.g. blocking in a syscall like read() or write())
Defunct/zombie state
When a thread is suspended, it is put into (interruptible) 'sleep' state. Later, it can be woken via the wake_up() function, which operates on its task structure within the kernel. As far as I can tell, wake_up is a kernel function, not a syscall. The kernel doesn't need a syscall to wake or sleep a task; it (or a process) simply changes the task structure to reflect the state of the process. When the Linux scheduler next deals with that process, it treats it according to its state (again, the states are listed above).
Short story: futex() implements a wait system for you. Without it, you need a data structure that's accessible from the main thread and from the sleeping thread in order to wake up a sleeping thread. All of this is done with userland code. The only thing you might need from the kernel is a mutex, the specifics of which do include locking mechanisms and mutex data structures, but don't inherently wake or sleep the thread. The syscalls you're looking for don't exist. Essentially, most of what you're talking about can be achieved from userspace, without a syscall, by manually keeping track of data conditions that determine whether and when to sleep or wake a thread.

Where does the wait queue for threads lie in POSIX pthread mutex lock and unlock?

I was going through the concurrency section of REMZI, and the mutex section confused me:
To avoid busy waiting, mutex implementations employ a park() / unpark() mechanism (on SunOS), which puts a waiting thread in a queue with its thread ID. Later on, pthread_mutex_unlock() removes one thread from the queue so that it can be picked by the scheduler. Similarly, an implementation of Futex (the mutex implementation on Linux) uses the same mechanism.
It is still unclear to me where the queue lies. Is it in the address space of the running process or somewhere inside the kernel?
Another doubt I had is regarding condition variables. Do pthread_cond_wait() and pthread_cond_signal() use normal signals and wait methods, or do they use some variant of it?
Doubt 1: It is still unclear to me where the queue lies. Is it in the address space of the running process or somewhere inside the kernel?
For a contended mutex, the wait queue is maintained in the kernel address space. On Linux this is done by the futex mechanism, which keys the queue on the address of the lock word in user memory. Threads, possibly from different processes, queue up there and wait to be woken; see the futex_wait kernel function.
Doubt 2: Another doubt I had is regarding condition variables. Do pthread_cond_wait() and pthread_cond_signal() use normal signals and wait methods, or do they use some variant of it?
Modern Linux does not use signals for condition variable signaling. See NPTL: The New Implementation of Threads for Linux for more details:
The addition of the Fast Userspace Locking (futex) into the kernel enabled a complete reimplementation of mutexes and other synchronization mechanisms without resorting to interthread signaling. The futex, in turn, was made possible by the introduction of preemptive scheduling to the kernel.

How is preemptive scheduling implemented for user-level threads in Linux?

With user-level threads there are N user-level threads running on top of a single kernel thread. This is in contrast to pthreads where only one user thread runs on a kernel thread.
The N user-level threads are preemptively scheduled on the single kernel thread. But what are the details of how that is done?
I heard something that suggested that the threading library sets things up so that a signal is sent by the kernel and that is the mechanism to yank execution from an individual user-level thread to a signal handler that can then do the preemptive scheduling.
But what are the details of how state such as registers and thread structs are saved and/or mutated to make this all work? Is there maybe a very simple implementation of user-level threads that is useful for learning the details?
To get the details right, use the source! But this is what I remember from when I read it...
There are two ways user-level threads can be scheduled: voluntarily and preemptively.
Voluntary scheduling: threads must call a function periodically to pass the use of the CPU to another thread. This function is called yield() or schedule() or something like that.
Preemptive scheduling: the library forcefully removes the CPU from one thread and passes it to another. This is usually done with timer signals, such as SIGALRM (see man ualarm for the details).
About how to do the real switch, if your OS is friendly and provides the necessary functions, that is easy. In Linux you have the makecontext() / swapcontext() functions that make swapping from one task to another easy. Again, see the man pages for details.
Unfortunately, these functions were removed from POSIX (in POSIX.1-2008), so other Unix systems may not have them. If that's the case, there are other tricks that can be done. The most popular one was calling sigaltstack() to set up an alternate stack for handling signals, then kill() on yourself to get onto the alternate stack, and longjmp() from the signal handler to the actual user-mode thread you want to run. Clever, huh?
As a side note, in Windows user-mode threads are called fibers and are fully supported also (see the docs of CreateFiber()).
The last resort is using assembler, which can be made to work almost everywhere but is totally system specific. The steps to create a UMT (user-mode thread) would be:
Allocate a stack.
Allocate and initialize a UMT context: a struct to hold the value of the relevant CPU registers.
And to switch from one UMT to another:
Save the current context.
Switch the stack.
Restore the next context in the CPU and jump to the next instruction.
These steps are relatively easy to do in assembler, but quite impossible in plain C without support from any of the tricks cited above.

Pthread Concepts

I'm studying threads and I am not sure if I understand some concepts. What is the difference between preemption and yield? So far I know that preemption is a forced yield but I am not sure what it actually means.
Thanks for your help.
Preemption is when one thread stops another thread from running so that it may run.
To yield is when a thread voluntarily gives up processor time.
Have a gander at these...
http://en.wikipedia.org/wiki/Preemption_(computing)
http://en.wikipedia.org/wiki/Thread_(computing)
The difference is how the OS is entered.
'yield' is a software interrupt, AKA a system call, one of the many that may result in a change in the set of running threads (there are lots of other system calls that can do this: blocking reads, synchronization calls). yield() is called from a running thread and may result in another ready (but not running) thread of the same priority being run instead of the calling thread, if there is one.
The exact behaviour of yield() is somewhat hardware/OS/language-dependent. Unless you are developing low-level lock-free thread comms mechanisms, and you are very good at it, it's best to just forget about yield().
Preemption is the act of interrupting one thread and dispatching another in its place. It can only occur after a hardware interrupt. When hardware interrupts, its driver is entered. The driver may decide that it can usefully make a thread ready (e.g. a thread is blocked on a read() call to the driver and the driver has accumulated a nice, big buffer of data). The driver can do this by signaling a semaphore and exiting via the OS (which provides an entry point for just such a purpose). This driver exit path causes a reschedule and, probably, makes the reading thread run instead of some other thread that was running before the interrupt: the other thread has been preempted. Essentially and simply, preemption occurs when the OS decides to interrupt-return to a different set of threads than the one that was interrupted.
Yield: The thread calls a function in the scheduler, which potentially "parks" that thread, and starts another one. The other thread is one which called yield earlier, and now appears to return from it. Many functions can have yielding semantics, such as reading from a device.
Preempt: an external event comes into the system: some kind of interrupt (clock, network data arriving, disk I/O completing ...). Whichever thread is running at that time is suspended, and the machine is running operating system code in the interrupt context. When the interrupt is serviced, and it's time to return from the interrupt, a scheduling decision can be made to keep the interrupted thread parked, and instead resume another one. That is a preemption. If/when that original thread gets to run again, the context which was saved by the interrupt will be activated and it will pick up exactly where it left off.
Scheduling systems which rely on yield exclusively are called "cooperative" or "cooperative multitasking" as opposed to "preemptive".
Traditional (read: old, 1970's and 80's) Unix is cooperatively multitasked in the kernel, with a preemptive user space. The kernel routines are trusted to yield in a reasonable time, and so preemption is disabled when running kernel code. This greatly simplifies kernel coding and improves reliability, at the expense of performance, especially when multiple processors are introduced. Linux was like this for many years.

Non-preemptive Pthreads?

Is there a way to use pthreads without a scheduler, so that a context switch occurs only if a thread explicitly yields or is blocked on a mutex/cond? If not, is there a way to minimize the scheduling overhead, so that forced context switches occur as rarely as possible?
The question refers to the Linux gcc/g++ implementation of POSIX threads.
You can use Pth (a.k.a. GNU Portable Threads), a non-preemptive thread library. Configuring it with --enable-pthread will create a plug-in replacement for pthreads. I just built and tested this on my Mac and it works fine for a simple pthreads program.
From the README:
Pth is a very portable POSIX/ANSI-C based library for Unix platforms
which provides non-preemptive priority-based scheduling for multiple
threads of execution (aka `multithreading') inside event-driven
applications. All threads run in the same address space of the server
application, but each thread has its own individual program-counter,
run-time stack, signal mask and errno variable.
The thread scheduling itself is done in a cooperative way, i.e., the
threads are managed by a priority- and event-based non-preemptive
scheduler. The intention is, that this way one can achieve better
portability and run-time performance than with preemptive scheduling.
The event facility allows threads to wait until various types of
events occur, including pending I/O on filedescriptors, asynchronous
signals, elapsed timers, pending I/O on message ports, thread and
process termination, and even customized callback functions.
Additionally Pth provides an optional emulation API for POSIX.1c
threads (`Pthreads') which can be used for backward compatibility to
existing multithreaded applications.
If you have a process running in normal user land, context switches will naturally happen as part of the system operation - there is always another process that needs the CPU time. Preemptive context switches between your threads are quite well optimized by the OS already and are bound to be necessary sometimes.
If you really happen to have problems with excessive context switching, you are best off tweaking the Linux scheduler first, which is off-topic here. pthread_setschedprio and pthread_setschedparam can set some hints, but are limited to setting priorities, and the interpretation of these priorities is implementation-defined, i.e. up to the Linux scheduler.
