Does OpenBSD support parallel kernel access (multithreading)?

I tried to figure out whether multiple processes or threads can execute concurrent syscalls without one of them sleeping.
That is to say: does OpenBSD use something like a Big Kernel Lock?
One would expect that parallel kernel access is possible. I looked into the syscall interface (code reading and kernel debugging) and didn't find anything that would strike me as a BKL.
However, when I look into the fork syscall implementation, it appears to me that some global data is accessed without locking (e.g. nprocesses). I was wondering whether the scheduler somehow prevents parallel syscalls, or whether I am overlooking something.
So: does OpenBSD support parallel kernel access, and how about the other BSDs?

Indeed, OpenBSD has a rather archaic model, which uses interrupt priority levels, distinct for each subsystem. See spl(9).
The mechanism originally allowed some preemption, but only from higher-priority interrupts. On modern implementations the priority levels are of course implemented with mutexes.
The scheduler uses splsched.
So there are several locks, and syscalls take place in parallel (across different CPUs) but serialize on these locks at certain points, depending on the subsystem boundaries they cross. In other words, no two threads will ever run code from the same subsystem concurrently; of course this could change at any moment if a lock is split or replaced.
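The calling pattern, as seen throughout the kernel, is to raise the level around accesses to a subsystem's shared data and then restore it. Below is a user-space mock of that spl(9) idiom (splbio(), the level number, and the function body are illustrative stand-ins, not OpenBSD source):

    /* User-space mock of the spl(9) calling pattern. In the real kernel,
       splbio() raises the interrupt priority level (or takes the
       subsystem's mutex on modern SMP implementations) and splx()
       restores the previous one. */
    static int ipl;                      /* current "priority level"      */

    static int splbio(void) { int s = ipl; ipl = 5; return s; }
    static void splx(int s) { ipl = s; }

    void touch_buffer_queues(void)
    {
        int s = splbio();                /* enter the block-I/O subsystem */
        /* ... modify the subsystem's shared data safely ... */
        splx(s);                         /* drop back to the saved level  */
    }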
Other systems:
This model is inherited from NetBSD, so the situation there is about the same.
FreeBSD has transitioned to a more granular approach, with some parts being lockless, much like Linux.
DragonflyBSD improves upon FreeBSD by providing serialization tokens for synchronization, and an inherently lockless approach to key mechanisms, like memory allocation (both userspace and kernel).


Do any operating systems utilize user threads only?

We're reading a basic/simple guide to Operating Systems in my CS class. The text gives multiple examples of OSes that use 1:1 threading, and some that formerly did hybrid M:N. But there are no examples of user threads (N:1).
This isn't a homework question, I'm just genuinely curious if this is or was a thing. Have any OSs utilized exclusively user threads? Or is there any software or programming language that does? It seems like with the right scheduling it could be very fast? Thank you!
Spent forever on Google and can't find any explicit answer to this!
Do any operating systems utilize user threads only?
No (though not in the way you're expecting: it's true by definition). Whatever a program feels like doing in user space is none of the operating system's business and cannot be considered something the OS itself does.
Essentially there are three cases:
the OS is a single-tasking OS (and user-space programs use libraries or whatever to provide threading if/when they want it). E.g. MS-DOS.
the OS is a multi-tasking OS, where the OS only knows about processes (and user-space programs use libraries or whatever to provide threading if/when they want it; a minimal sketch of such a library's core trick follows this list). E.g. early Unix.
the OS/kernel provides threads (leading to 1:1 or M:N).
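To make the first two cases concrete: all a user-space threading library needs is a way to save and restore execution contexts, which the portable ucontext(3) API provides. A minimal hand-rolled sketch (my own illustration, not any particular library):

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, thr_ctx;

    static void user_thread(void) {
        puts("hello from a user-space thread");
        swapcontext(&thr_ctx, &main_ctx);   /* cooperative yield */
    }

    int main(void) {
        static char stack[64 * 1024];       /* the "thread's" stack */
        getcontext(&thr_ctx);
        thr_ctx.uc_stack.ss_sp = stack;
        thr_ctx.uc_stack.ss_size = sizeof stack;
        thr_ctx.uc_link = &main_ctx;
        makecontext(&thr_ctx, user_thread, 0);
        swapcontext(&main_ctx, &thr_ctx);   /* "schedule" the thread; the
                                               kernel sees only one thread
                                               the whole time */
        puts("back in main");
        return 0;
    }

The kernel never learns that a switch happened, which is exactly why none of this counts as something the OS itself does.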
It seems like with the right scheduling it could be very fast?
User-space threading isn't "very fast"; it's significantly worse for most things. The reasons are:
it can't work when there are multiple CPUs (so the nice 8-core CPU you're currently using becomes 87.5% wasted). You need M:N threading at a minimum to avoid this performance disaster.
it breaks thread priorities badly - e.g. CPUs wasting time doing unimportant work while important work isn't being done, because one process doesn't know anything about threads that belong to any other process (or their priorities). The scheduler must be aware of all threads to avoid this performance disaster (and if one process knew about all threads belonging to all other processes it would be a security disaster).
almost all thread switches are caused by devices: threads have to wait for disk, network, keyboard, "wall clock time", etc., forcing the scheduler to find some other thread to run; and when the thing a thread was waiting for occurs, the thread becomes runnable again and may preempt less important work that was running at the time. All devices involve the kernel (even for micro-kernels, where the kernel is needed to pass messages and so on), so almost all thread switches involve the kernel. By doing threading in user space you just end up with the kernel wasting time notifying user space (so user space can do some scheduling) instead of the kernel doing the scheduling itself (without wasting time on notifications).
User-space threading is better for the rare situations where the kernel doesn't have to be involved anyway, which is limited to:
thread creation and termination; but only if memory (for thread state, thread stack, thread-local storage) is pre-allocated and recycled, and only if kernel-level "thread recycling" isn't done instead (e.g. pre-creating kernel threads and putting them back in a "free thread pool" rather than telling the kernel to terminate and re-create them later).
locking (e.g. mutexes) where all threads using the lock belong to the same process; though 1 kernel thread (and no need for locks) is still better than multiple user-space threads (sharing 1 kernel thread) fighting over the same lock with extra pointless overhead.

Do all types of interprocess/interthread communication need system calls?

In Linux,
do all types of interprocess communication need system calls?
Types of interprocess communication include:
Pipes
Signals
Message Queues
Semaphores
Shared Memory
Sockets
Do all types of interthread communication need system calls?
I would like to know whether all interprocess and interthread communication involves switching from user mode to kernel mode, so that the OS kernel runs to perform the communication. Since system calls all involve such a switch, I am asking whether the communication needs system calls.
For example, shared memory can be used for both interprocess and interthread communication, but I am not sure whether it requires system calls, or involvement of the OS kernel taking over the CPU to perform something.
Thanks.
For interprocess communication I am pretty sure you cannot avoid system calls.
For interthread communication I cannot give you a definitive answer, but my educated guess would be "yes-and-no". You see, you can communicate between threads using thread-safe queues, and the only thing that a thread-safe queue needs in order to work is a lock. If a lock is unavailable at the moment that a thread wants to obtain it, then of course the system must be involved in order to put the thread in a waiting mode. But if the lock is available to obtain, then the thread should be able to proceed without the need for any system call.
That's what I would guess, and I would be quite disappointed to find out that things do not actually work this way, because that would mean that code which I have up until now been considering pretty innocent in fact has a tremendous additional hidden overhead.
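That educated guess matches how locks are typically implemented on Linux: the uncontended path is a single atomic instruction in user space, and the kernel is entered via futex(2) only when a thread must sleep or a sleeper must be woken. A simplified sketch (a production lock would also track whether waiters exist, so it can skip the wake on release):

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static atomic_int lock;                 /* 0 = free, 1 = held */
    static long counter;

    static void lock_acquire(void) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock, &expected, 1))
            return;                         /* fast path: no system call */
        do {                                /* slow path: sleep in the
                                               kernel while the word is 1 */
            syscall(SYS_futex, &lock, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        } while (!atomic_compare_exchange_strong(&lock, &expected, 1));
    }

    static void lock_release(void) {
        atomic_store(&lock, 0);
        /* Simplified: wake unconditionally instead of tracking waiters. */
        syscall(SYS_futex, &lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            lock_acquire();
            counter++;
            lock_release();
        }
        return NULL;
    }

    int main(void) {                        /* build with: cc -pthread ... */
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter); /* 200000 */
        return 0;
    }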
Yes, every IPC channel is set up by some system call (see syscalls(2)).
It might happen that the IPC channel was set up by a previous program (e.g. the program running in the same process before execve); for example, when running a pipeline like ls | ./yourprog, it is the shell that called pipe(2), not yourprog.
Since threads in the same process (by definition) share a common address space, they can communicate using shared data. However, they often need some syscall for synchronization (e.g. with mutexes); see e.g. futex(7). That is because you want to avoid spinlocks (i.e. wasting CPU power while waiting). But in practice you should use pthreads(7).
In practice you cannot use shared memory (like shm_overview(7)) without synchronization (e.g. with semaphores; see sem_overview(7)). Notice that cache coherence is tricky and sometimes makes the memory model non-intuitive (and processor-specific).
At least, you do not need a system call for each read/write to shared memory. Setting up shared memory will always involve system calls, and synchronizing threads/processes will often involve them.
You could use flags in shared memory for synchronization, but note that reads and writes of flags may not be atomic actions.
(For example, you could set a location in shared memory to 0 at the beginning and then check it for non-zero, while the other process sets it to non-zero when it is ready for something.)
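As a concrete sketch of that flag scheme, here C11 atomics guarantee untorn reads and writes; the single mmap() sets the region up, after which the signalling itself makes no system calls (the parent busy-waits, which is exactly the CPU-burning trade-off mentioned above):

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* One mmap() syscall sets the region up; the flag is then read
           and written with ordinary (atomic) memory accesses. */
        atomic_int *flag = mmap(NULL, sizeof *flag, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (flag == MAP_FAILED)
            return 1;
        atomic_store(flag, 0);
        if (fork() == 0) {            /* child: the "producer" */
            /* ... prepare data in shared memory ... */
            atomic_store(flag, 1);    /* atomic store: no torn write */
            _exit(0);
        }
        while (atomic_load(flag) == 0)
            ;                         /* parent spins: no syscall, burns CPU */
        puts("child signalled readiness");
        wait(NULL);
        return 0;
    }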

Linux system call for creating process and thread

I read in a paper that the underlying system call to create processes and threads is actually the same, and thus the cost of creating processes over threads is not that great.
First, I want to know what the system call is that creates processes/threads (possibly a sample code or a link?).
Second, is the author correct to assume that creating processes instead of threads is inexpensive?
EDIT:
Quoting the article:
Replacing pthreads with processes is surprisingly inexpensive, especially on Linux where both pthreads and processes are invoked using the same underlying system call.
Processes are usually created with fork; threads (lightweight processes) are usually created with clone nowadays. Anecdotally, however, 1:N thread models also exist, and they do neither.
Both fork and clone map to the same kernel function, do_fork, internally. This function can create a lightweight process that shares the address space with the old one, or a separate process (among many other options), depending on what flags you feed it. The clone syscall is more or less a direct forwarding of that kernel function (and is used by the higher-level threading libraries), whereas fork wraps do_fork in the functionality of the 50-year-old traditional Unix function.
The important difference is that fork guarantees that a complete, separate copy of the address space is made. This, as Basil correctly points out, is done with copy-on-write nowadays and therefore is not nearly as expensive as one would think.
When you create a thread, it just reuses the original address space and the same memory.
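A small experiment with the glibc clone(3) wrapper (a sketch) makes the flag-dependent behaviour visible: with CLONE_VM the child shares the parent's address space, like a thread; drop the flag and the child writes to its own copy-on-write copy, like fork():

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int shared = 0;

    static int child_fn(void *arg) {
        shared = 42;                   /* visible to the parent only if
                                          the address space is shared */
        return 0;
    }

    int main(void) {
        size_t sz = 64 * 1024;
        char *stack = malloc(sz);
        /* CLONE_VM => thread-like sharing; drop it for fork-like COW.
           Pass the top of the stack: it grows down on most machines. */
        pid_t pid = clone(child_fn, stack + sz, CLONE_VM | SIGCHLD, NULL);
        waitpid(pid, NULL, 0);
        printf("shared = %d\n", shared);   /* 42 with CLONE_VM, 0 without */
        free(stack);
        return 0;
    }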
However, one should not assume that creating processes is generally "lightweight" on unix-like systems because of copy-on-write. It is somewhat less heavy than for example under Windows, but it's nowhere near free.
One reason is that although the actual pages are not copied, the new process still needs a copy of the page table. This can be several kilobytes to megabytes of memory for processes that use larger amounts of memory.
Another reason is that although copy-on-write is invisible and a clever optimization, it is not free, and it cannot do magic. When data is modified by either process, which inevitably happens, the affected pages fault and must be copied.
Redis is a good example where you can see that fork is anything but lightweight (it uses fork to do background saves).
The underlying system call to create threads is clone(2) (it is Linux-specific). By the way, the list of Linux system calls is in syscalls(2), and you can use the strace(1) command to understand the syscalls made by some process or command. Processes are usually created with fork(2) (or vfork(2), which is not very useful these days). However, you could create them (and some C standard libraries do) with some particular form of clone. I guess that the kernel shares some code to implement clone, fork, etc. (since some functionality, e.g. management of the virtual address space, is common).
Indeed, process creation (and also thread creation) is generally quite fast on most Unix systems (because they use copy-on-write machinery for the virtual memory), typically a small fraction of a millisecond. But you can hit pathological cases (e.g. thrashing) that make it much longer.
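You can check the "fraction of a millisecond" figure on your own machine with a crude micro-benchmark (a sketch; the number grows with the parent's address-space size, since page tables must be copied):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        enum { N = 1000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (fork() == 0)
                _exit(0);              /* child exits immediately */
            wait(NULL);                /* parent reaps it */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f us per fork+exit+wait\n", ns / N / 1000.0);
        return 0;
    }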
Since most C standard library implementations are free software on Linux, you could study the source code of the one on your system (often GNU glibc, but sometimes musl-libc or something else).

What are the thread limitations when working on Linux compared to processes for network/IO-bound apps?

I've heard that under Linux on a multicore server it would be impossible to reach top performance when you have just 1 process but multiple threads, because Linux has some limitations on IO, so that 1 process with 8 threads on an 8-core server might be slower than 8 processes.
Any comments? Are there other limitations which might slow the application down?
The application is a network C++ application, serving hundreds of clients, with some disk IO.
Update: I am concerned that there are some more IO-related issues beyond the locking I implement myself... Aren't there any issues doing simultaneous network/disk IO in several threads?
Drawbacks of Threads
Threads:
Serialize on memory operations. That is, the kernel (and in turn the MMU) must service operations such as mmap() that perform page allocations.
Share the same file descriptor table. There is locking involved in making changes and performing lookups in this table, which stores stuff like file offsets and other flags. Every system call that uses this table, such as open(), accept(), fcntl(), must lock it to translate an fd to an internal file handle, and again when making changes (the sketch after this list illustrates the sharing).
Share some scheduling attributes. Processes are constantly evaluated to determine the load they're putting on the system, and scheduled accordingly. Lots of threads implies a higher CPU load, which the scheduler typically dislikes, and it will increase the response time on events for that process (such as reading incoming data on a socket).
May share some writable memory. Any memory being written to by multiple threads (especially slow if it requires fancy locking) will generate all kinds of cache-contention and convoying issues. For example, heap operations such as malloc() and free() operate on a global data structure (which can to some degree be worked around). There are other global structures too.
Share credentials, this might be an issue for service-type processes.
Share signal handling; signals will interrupt the entire process while they're handled.
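To make the shared descriptor table concrete, here's a small sketch: an fd opened by one thread is immediately valid in every other thread of the process, which is exactly why lookups and changes need that lock:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fd = -1;

    static void *opener(void *arg) {
        fd = open("/dev/null", O_WRONLY);  /* slot allocated in the
                                              process-wide fd table */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, opener, NULL);
        pthread_join(t, NULL);
        /* The main thread uses the fd directly: one table per process.
           After fork(), by contrast, each process has its own copy. */
        printf("write via other thread's fd: %zd\n", write(fd, "x", 1));
        close(fd);
        return 0;
    }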
Processes or Threads?
If you want to make debugging easier, use threads.
If you are on Windows, use threads. (Processes are extremely heavyweight in Windows).
If stability is a huge concern, try to use processes. (One SIGSEGV/PIPE is all it takes...).
If threads aren't available, use processes. (Not so common now, but it did happen).
If your threads share resources that can't be used from multiple processes, use threads. (Or provide an IPC mechanism to allow communicating with the "owner" thread of the resource.)
If you use resources that are only available on a one-per-process basis (and you need one per context), obviously use processes.
If your processing contexts share absolutely nothing (such as a socket server that spawns and forgets connections as it accept()s them), and CPU is a bottleneck, use processes and single-threaded runtimes (which are devoid of all kinds of intense locking such as on the heap and other places).
One of the biggest differences between threads and processes is this: threads use software constructs to protect data structures, while processes use hardware (which is significantly faster).
Links
pthreads(7)
About Processes and Threads (MSDN)
Threads vs. Processes
It really should make no difference; it's mostly a question of design.
A multi-process app may have to do less locking but may use more memory. Sharing data between processes may be harder.
On the other hand, multi-process can be more robust. You can call exit() and quit a child safely, mostly without affecting the others.
It depends on how interdependent the clients are. I usually recommend the simplest solution.

What interprocess locking calls should I monitor?

I'm monitoring a process with strace/ltrace in the hope of finding and intercepting a call that checks, and potentially activates, some kind of globally shared lock.
While I've dealt with and read about several forms of interprocess locking on Linux before, I'm drawing a blank on what calls to look for.
Currently my only suspect is futex() which comes up very early on in the process' execution.
Update0
There is some confusion about what I'm after. I'm monitoring an existing process for calls to persistent interprocess memory or equivalent. I'd like to know what system and library calls to look for. I have no intention of calling these myself, so naturally futex() will come up; I'm sure many libraries implement their locking calls in terms of it, etc.
Update1
I'd like a list of function names, or a link to documentation, that I should monitor at the ltrace and strace levels (specifying which). Any other good advice about how to track down the global lock I have in mind would be great.
If you can start the monitored process under valgrind, then there are two projects:
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer
and Helgrind
http://valgrind.org/docs/manual/hg-manual.html
Helgrind is aware of all the pthread abstractions and tracks their effects as accurately as it can. On x86 and amd64 platforms, it understands and partially handles implicit locking arising from the use of the LOCK instruction prefix.
So these tools can detect even atomic memory accesses, and they will check pthread usage.
flock is another good one
There are many system calls that can be used for locking: flock, fcntl, and even creat.
When you are using pthreads/sem_* locks, they may be executed in user space, so you'll never see them in strace: futex is called only for pending operations, i.e. when you actually need to wait.
Some operations can be done in user space only, like spinlocks; you'll never see those unless they back off by waiting on a timer, in which case you may see only stuff like nanosleep while one lock waits for another.
So there is no "generic" way to trace them.
On systems with glibc ~ >= 2.5 (glibc + NPTL) you can use (a minimal sketch of the first option follows this list):
process-shared POSIX unnamed semaphores (last parameter to sem_init)
POSIX mutexes (with PTHREAD_PROCESS_SHARED passed to pthread_mutexattr_setpshared)
POSIX named semaphores (obtained from sem_open/sem_unlink)
System V (SysV) semaphores: semget, semop
On older systems with glibc 2.2 or 2.3 with LinuxThreads, or on embedded systems with uClibc, you can use ONLY System V (SysV) semaphores for interprocess communication.
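As a minimal example of the first option, the sketch below puts a process-shared unnamed semaphore in anonymous shared memory; under strace, the setup appears as mmap(), while futex() shows up only when sem_wait() actually has to block:

    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {            /* build with: cc -pthread ... */
        /* The semaphore must live in memory both processes can see. */
        sem_t *sem = mmap(NULL, sizeof *sem, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        sem_init(sem, /* pshared = */ 1, /* value = */ 0);
        if (fork() == 0) {
            sem_post(sem);      /* FUTEX_WAKE only if the parent sleeps */
            _exit(0);
        }
        sem_wait(sem);          /* this is where strace shows futex()   */
        puts("posted by child");
        wait(NULL);
        sem_destroy(sem);
        return 0;
    }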
Update 1: any IPC and sockets must be checked.
