Concurrently run socket and other computation in C++ - multithreading

I'm sorry if this question is too easy.
I would like to implement the following scenario in C++.
There is a collection of functions to be evaluated: f_1, f_2, and so on.
While f_i is being evaluated, the program is also sending data to or receiving data from another host.
When f_i finishes, it produces a return value.
The program should then immediately move to the socket part to send that value to, or receive something from, another machine.
At the same time, the computation of some f_j that has not been evaluated yet should start.
I know multithreading may solve this problem.
But how does one thread know when a computation in another specific thread has finished?
If the socket were replaced by file I/O, I think the same approach would apply.
I would really appreciate it if you could suggest a way to solve this, or a reference that explains how.

You should probably have at least one I/O thread with an event loop to handle your sockets.
This I/O thread can dispatch computations to a thread pool. Once a computation finishes, it should let the I/O thread know that the result is ready to be sent. There are a few methods to do that. One simple method is for the compute thread to allocate the computation result on the heap and write a pointer to it into a pipe. The I/O thread's event loop notices that data is available at the read end of the pipe, reads the pointer to the result, and starts sending it in a non-blocking fashion.
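A minimal sketch of the pipe-based notification described above, assuming a POSIX system. The names compute and run_io_loop are hypothetical, one thread per job stands in for a real thread pool, and a printf stands in for the non-blocking send; a real event loop would also have its sockets in the poll set.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one of the f_i computations.
static std::string compute(int i) {
    return "result of f_" + std::to_string(i);
}

// Starts `jobs` computations on worker threads, then acts as the I/O thread:
// it sleeps in poll() and wakes up each time a worker reports a result.
// Returns the number of results received, or -1 if the pipe cannot be created.
int run_io_loop(int jobs) {
    int fds[2];  // fds[0] is the read end, fds[1] the write end
    if (pipe(fds) != 0) return -1;

    std::vector<std::thread> workers;
    for (int i = 1; i <= jobs; ++i) {
        workers.emplace_back([i, wr = fds[1]] {
            auto* result = new std::string(compute(i));
            // A pointer is much smaller than PIPE_BUF, so this write is atomic
            // and pointers written by different workers cannot interleave.
            if (write(wr, &result, sizeof result) !=
                static_cast<ssize_t>(sizeof result)) {
                delete result;
            }
        });
    }

    int received = 0;
    pollfd pfd{fds[0], POLLIN, 0};
    while (received < jobs) {
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            break;
        }
        if (pfd.revents & POLLIN) {
            std::string* raw = nullptr;
            if (read(fds[0], &raw, sizeof raw) !=
                static_cast<ssize_t>(sizeof raw)) {
                break;
            }
            std::unique_ptr<std::string> result(raw);  // I/O thread owns it now
            std::printf("would send: %s\n", result->c_str());
            ++received;
        }
    }
    for (auto& t : workers) t.join();
    close(fds[0]);
    close(fds[1]);
    return received;
}
```

Calling run_io_loop(3) from main starts three computations and returns once all three results have reached the I/O thread. Passing a pointer rather than the data itself keeps the pipe traffic small and hands ownership of the result to the I/O thread.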

Related

Using "finish" in one thread in gdb, it never returns. Does that mean the thread is stuck?

I'm trying to debug a thread issue. In most cases it's easy, but with this one something gets stuck and I'm having difficulty finding the thread that causes the issue. (It happens after several hours, and writing logs breaks the timings, so it's difficult: I can't just change the code to help me find the culprit.)
Today, I tried the following:
I hit Ctrl-C.
Then I tried to determine which thread it is:
thread 9
where
That looked like a good bet. A thread waiting for data on a FIFO...
I decided to verify my theory by entering:
finish
The program started again, but gdb never stopped...
Am I correct thinking that proves that this FIFO never receives any data and that's where my process is stuck? (i.e. when data is received that function returns)
"Am I correct thinking that proves that this FIFO never receives any data"
Yes: if your description is accurate, then it is likely that thread 9 is stuck waiting for data.
"and that's where my process is stuck?"
That we can't tell. A thread waiting on a FIFO could be expecting data from an external source (it is somewhat unusual and inefficient to have a FIFO that transfers data within a single process), and if the other end of the FIFO is connected to some other process, then the fact that a thread was stuck there proves nothing.
P.S. One of the common ways that mixing FIFOs and threads causes hangs is also using fork or other functions which create subprocesses. It is exceedingly difficult to make such code correct in a multithreaded environment.

Will non-blocking I/O be put to sleep during copying data from kernel to user?

I ask this question because I am looking at multiplexing I/O in Go, which is using epollwait.
When a socket is ready, a goroutine will be woken up and begin to read the socket in non-blocking mode. If the read system call can still block while copying data from the kernel to user space, I assume the kernel thread the goroutine is attached to will be put to sleep as well.
I am not sure of that, hoping someone can help correct me if I am wrong.
I can't quite parse what you've written.
I'll try to make a sheer guess and conjecture that you might be overlooking the fact that the write(2) and read(2) syscalls (and those of their ilk, such as send(2) and recv(2)) on sockets put into non-blocking mode are free to consume (and return, respectively) less data than requested.
In other words, a write(2) call on a non-blocking socket told to write 1 megabyte of data will consume just as much data as currently fits into the associated kernel buffer and return immediately, signalling that it consumed only that much data. The next immediate call to write(2) will likely return EWOULDBLOCK.
The same goes for the read(2) call: if you pass it a buffer large enough to hold 1 megabyte of data, and tell it to read that number of bytes, the call will only drain the contents of the kernel buffer and return immediately, signaling how much data it actually copied. The next immediate call to read(2) will likely return EWOULDBLOCK.
So, any attempt to get or put data to the socket succeeds almost immediately: either after the data has been shoveled between the kernel's buffer and user space, or right away with the EAGAIN error code (on Linux, EAGAIN and EWOULDBLOCK are the same value).
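The short-write behaviour described above can be observed with a small experiment. This is a sketch assuming a POSIX system: it uses a local socketpair instead of a network socket, nobody reads from the other end, and the names FillResult and fill_nonblocking_socket are made up for illustration.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <vector>

struct FillResult {
    long accepted;     // total bytes the kernel took before refusing more
    bool short_write;  // true if some write() took fewer bytes than requested
    bool would_block;  // true if the last write() failed with EAGAIN/EWOULDBLOCK
};

// Writes 1 MiB chunks to a non-blocking socket whose peer never reads,
// until the kernel buffer is full. No call ever sleeps waiting for space.
FillResult fill_nonblocking_socket() {
    FillResult r{0, false, false};
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return r;

    // Switch the writing end to non-blocking mode.
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL, 0) | O_NONBLOCK);

    std::vector<char> chunk(1 << 20, 'x');
    for (;;) {
        ssize_t n = write(sv[0], chunk.data(), chunk.size());
        if (n > 0) {
            r.accepted += n;
            if (static_cast<size_t>(n) < chunk.size()) r.short_write = true;
            continue;  // try again right away; the next call may refuse
        }
        r.would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
        break;
    }
    close(sv[0]);
    close(sv[1]);
    return r;
}
```

On a typical system the first write accepts only part of the requested megabyte and the following one fails with EAGAIN. The exact byte count depends on the kernel's socket buffer size, so the sketch reports it instead of assuming a value.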
Sure, there's supposedly a possibility for an OS thread to be suspended right in the middle of performing such a syscall, but this does not count as "blocking in a syscall."
Update to the original answer in response to the following comment of the OP:
<…>
This is what I see in book
"UNIX Network Programming" (Volume 1, 3rd), chapter 6.2:
A synchronous I/O operation causes the requesting process
to be blocked until that I/O operation completes. Using these
definitions, the first four I/O models—blocking, nonblocking, I/O
multiplexing, and signal-driven I/O—are all synchronous because the
actual I/O operation (recvfrom) blocks the process.
It uses "blocks" to describe nonblocking I/O operation. That makes me confused.
I still don't understand why the book uses "blocks the process" if the process is actually not blocked.
I can only guess that the book's author intended to highlight that the process is indeed blocked from the moment it enters a syscall until it returns from it. Reads from and writes to a non-blocking socket do block to transfer the data, if available, between the kernel and the user space. We colloquially say this does not block because we mean "it does not block waiting and doing nothing for an indeterminate amount of time".
The book's author might be contrasting this with so-called asynchronous I/O (called "overlapped" on Windows™), where you basically give the kernel a buffer with or for data and ask it to deal with it completely in parallel with your code, in the sense that the relevant syscall returns right away and the I/O is carried out in the background (with regard to your user-space code).
To my knowledge, Go does not use the kernel's async I/O facilities on any platform it supports. You might look there for the developments regarding Linux and its contemporary io_uring subsystem.
Oh, and one more point. The book might (at that point through the narrative at least) be discussing a simplified "classic" scheme where there are no in-process threads, and the sole unit of concurrency is the process (with a single thread of execution). In this scheme, any syscall obviously blocks the whole process. In contrast, Go works only on kernels which support threads, so in a Go program a syscall never blocks the whole process—only the thread it's called on.
Let me take yet another stab at explaining the problem as—I perceive—the OP stated it.
The problem of serving multiple client requests is not new—one of the more visible first statements of it is "The C10k problem".
To quickly recap it, a single threaded server with blocking operations on the sockets it manages is only realistically able to handle a single client at a time.
To solve it, there exist two straightforward approaches:
Fork a copy of the server process to handle each incoming client connection.
On an OS which supports threads, fork a new thread inside the same process to handle each incoming client.
They have their pros and cons but they both suck with regard to resource usage, and—which is more important—they do not play well with the fact most clients have relatively low rate and bandwidth of I/O they perform with regard to the processing resources available on a typical server.
In other words, when serving a typical TCP/IP exchange with a client, the serving thread most of the time sleeps in the write(2) and read(2) calls on the client socket.
This is what most people mean when talking about "blocking operations" on sockets: if a socket is blocking, an operation on it will block until it can actually be carried out, and the originating thread will be put to sleep for an indeterminate amount of time.
Another important thing to note is that when the socket becomes ready, the amount of work done is typically minuscule compared to the amount of time slept between the wakeups.
While the thread sleeps, its resources (such as memory) are effectively wasted, as they cannot be used to do any other work.
Enter "polling". It combats the problem of wasted resources by noticing that the points of readiness of networked sockets are relatively few and far between, so it makes sense to have lots of such sockets served by a single thread. This keeps the thread almost as busy as theoretically possible, and also allows scaling out when needed: if a single thread is unable to cope with the data flow, add another thread, and so on.
This approach is definitely cool but it has a downside: the code which reads and writes data must be rewritten to use callback style instead of the original plain sequential style. Writing with callbacks is hard: you usually have to implement intricate buffer management and state machines to deal with this.
The Go runtime solves this problem by adding another layer of scheduling for its units of execution flow, goroutines. For goroutines, operations on sockets are always blocking, but when a goroutine is about to block on a socket, this is transparently handled by suspending only the goroutine itself (until the requested operation is able to proceed) and using the thread the goroutine was running on to do other work¹.
This gives the best of both approaches: the programmer may write classic no-brainer sequential callback-free networking code, but the threads used to handle networking requests are fully utilized².
As to the original question of blocking, both the goroutine and the thread it runs on are indeed blocked when the data transfer on a socket is happening, but since what happens is data shoveling between a kernel and a user-space buffer, the delay is most of the time small, and is no different to the classic "polling" case.
Note that performing syscalls (including I/O on non-pollable descriptors) in Go (at least up to and including Go 1.14) does block both the calling goroutine and the thread it runs on, but is handled differently from I/O on pollable descriptors: when a special monitoring thread notices that a goroutine has spent more than a certain amount of time in a syscall (20 µs, IIRC), the runtime pulls the so-called "processor" (a runtime thing which runs goroutines on OS threads) from under the goroutine and tries to make it run another goroutine on another OS thread; if there is a goroutine wanting to run but no free OS thread, the Go runtime creates another one.
Hence "normal" blocking I/O is still blocking in Go in both senses: it blocks both goroutines and OS threads, but the Go scheduler makes sure the program as a whole is still able to make progress.
This could arguably be a perfect case for using true asynchronous I/O provided by the kernel, but it's not there yet.
¹ See this classic essay for more info.
² The Go runtime is certainly not the first one to pioneer this idea. For instance, look at the State Threads library (and the more recent libtask) which implement the same approach in plain C; the ST library has superb docs which explain the idea.

How can I block a single thread for 3 different events (semaphore, pthread condition, and blocking socket recv)?

I have a multi-threaded system in which a main thread has to wait in blocking state for one of the following 4 events to happen:
inter-process semaphore (sem_wait())
pthread condition (pthread_cond_wait())
recv() from socket
timeout expiring
Ideally I'd like a mechanism to unblock the main thread when any of the above occurs, something like a ppoll() with suitable timeout parameter. Non-blocking and polling is out of the picture due to the impact on the CPU usage, while having separate threads blocking on different events is not ideal due to the increased latency (one thread unblocking from one of the events should eventually wake up the main one).
The code will be almost exclusively compiled under Linux with gcc toolchain, if that helps, but some portability would be good, if at all possible.
Thanks in advance for any suggestion
The mechanisms for waiting on multiple types of objects on Unix-like systems are not that great. In general, the idea is to, wherever possible, use file descriptors for IPC rather than multiple different IPC mechanisms.
From your comment, it sounds like you can edit or change the condition variable, but not the code that signals the semaphore. So what I'd recommend is something like the following.
Change the condition variable to either a pipe (for more portability) or an eventfd(2) object (Linux-specific). The notifying thread writes to the pipe whenever it wants to signal the main thread. This will allow you to select(2) or poll(2) or whatever in the main thread on both that pipe and the socket.
Because you're stuck with the semaphore, I think the best option would be to create another thread, whose sole purpose is to wait for the semaphore using sem_wait(), and then write to another pipe or eventfd(2) object when it is notified by whatever process is doing sem_post(). In the main thread, just add this other file descriptor to your select(2) set.
So you'll have three descriptors: one for the socket, one taking the place of the condition variable, and one which is written to when the semaphore is incremented. You can then wait on all three using your favorite I/O multiplexing method, and include directly whatever timeout you'd like.
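The three-descriptor arrangement above could be sketched roughly as follows, assuming Linux or another POSIX system with unnamed semaphores. The names Wake, wait_for_event, forward_semaphore and demo_wait are hypothetical, plain pipes are used for portability (an eventfd(2) would work the same way on Linux), and a process-local semaphore stands in for the inter-process one.

```cpp
#include <poll.h>
#include <semaphore.h>
#include <sys/socket.h>
#include <unistd.h>

#include <thread>

enum class Wake { Timeout, Socket, Notify, Semaphore, Error };

// Blocks until one of the three descriptors is readable or the timeout
// expires. The caller is expected to read() the descriptor afterwards
// to drain it before waiting again.
Wake wait_for_event(int sock_fd, int notify_fd, int sem_fd, int timeout_ms) {
    pollfd fds[3] = {{sock_fd, POLLIN, 0},
                     {notify_fd, POLLIN, 0},
                     {sem_fd, POLLIN, 0}};
    int n = poll(fds, 3, timeout_ms);
    if (n < 0) return Wake::Error;
    if (n == 0) return Wake::Timeout;
    if (fds[0].revents & POLLIN) return Wake::Socket;
    if (fds[1].revents & POLLIN) return Wake::Notify;
    if (fds[2].revents & POLLIN) return Wake::Semaphore;
    return Wake::Error;
}

// Body of the helper thread: turns one sem_post() into a byte on a pipe.
void forward_semaphore(sem_t* sem, int write_fd) {
    while (sem_wait(sem) != 0) {
        // sem_wait() was interrupted by a signal; retry.
    }
    char byte = 1;
    if (write(write_fd, &byte, 1) != 1) {
        // A real program would report this; the sketch ignores it.
    }
}

// Small demonstration: optionally posts the semaphore, then waits.
Wake demo_wait(bool post_semaphore) {
    int sock[2], notify[2], semp[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sock) != 0) return Wake::Error;
    if (pipe(notify) != 0 || pipe(semp) != 0) return Wake::Error;

    sem_t sem;
    sem_init(&sem, 0, 0);
    std::thread helper(forward_semaphore, &sem, semp[1]);

    if (post_semaphore) sem_post(&sem);
    Wake w = wait_for_event(sock[0], notify[0], semp[0],
                            post_semaphore ? 2000 : 50);

    if (!post_semaphore) sem_post(&sem);  // let the helper thread finish
    helper.join();
    sem_destroy(&sem);
    for (int fd : {sock[0], sock[1], notify[0], notify[1], semp[0], semp[1]}) {
        close(fd);
    }
    return w;
}
```

Here demo_wait(true) wakes up because of the semaphore, while demo_wait(false) falls through to the timeout. A thread that previously signalled the condition variable would instead write one byte to the write end of the notify pipe.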

How Do Callbacks work in Non-blocking Design?

I looked at a few other questions but didn't quite find what I was looking for. I'm using Scala, but my question is very high level and so is hopefully language-agnostic.
A regular scenario:
Thread A runs a function and there is some blocking work to be done (say a DB call).
The function has some non-blocking code (eg. Async block in Scala) to cause some sort of 'worker' Thread B (in a different pool) to pick up the I/O task.
The method in Thread A completes returning a Future which will eventually contain the result and Thread A is returned to its pool to quickly pick up another request to process.
Q1. Some thread somewhere usually has to wait?
My understanding of non-blocking architectures is that the common approach is to still have some Thread waiting/blocking on the I/O work somewhere - it's just a case of having different pools which have access to different cores so that a small number of request processing threads can manage a large number of concurrent requests without ever waiting on a CPU core.
Is this a correct general understanding?
Q2. How does the callback work?
In the above scenario - Thread B that is doing the I/O work will run the callback function (provided by Thread A) if/when the I/O work has completed - which completes the Future with some Result.
Thread A is now off doing something else and has no association any more with the original request. How does the Result in the Future get sent back to the client socket? I understand that different languages have different implementations of such a mechanism but at a high level my current assumption is that (regardless of the language/framework) some framework/container objects must always be doing some sort of orchestration so that when a Future task is completed the Result gets sent back to the original socket handling the request.
I have spent hours trying to find articles which will explain this, but every article seems to deal only with really low-level details. I know I'm missing some details, but I am having difficulty asking my question because I'm not quite sure which parts I'm missing :)
"My understanding of non-blocking architectures is that the common approach is to still have some Thread waiting/blocking on the I/O work somewhere"
If a thread is getting blocked somewhere, it is not really a non-blocking architecture. So no, that's not really a correct understanding of it. That doesn't mean that this is necessarily bad. Sometimes you just have to deal with blocking (using JDBC, for example). It would be better to push it off into a fixed thread pool designated for blocking, rather than allowing the entire application to suffer thread starvation.
"Thread A is now off doing something else and has no association any more with the original request. How does the Result in the Future get sent back to the client socket?"
Using Futures, it really depends on the ExecutionContext. When you create a Future, where the work is done depends on the ExecutionContext.
val f: Future[?] = ???
val g: Future[?] = ???
f and g are created immediately, and the work is submitted to a task queue in the ExecutionContext. We cannot guarantee which will actually execute or complete first in most cases. What you do with the values matters as well. Obviously, if you use an Await to wait for the completion of the Futures, then we block the current thread. If we map them and do something with the values, then we again need an ExecutionContext to submit the task to. This gives us a chain of tasks that are asynchronously submitted and re-submitted to the executor for execution every time we manipulate the Future.
Eventually there needs to be some onComplete at the end of that chain to pass that value along to something, whether that is writing to a stream or something else; i.e., it is probably out of the hands of the original thread.
Q1: No, at least not at the user code level. Hopefully your async I/O ultimately comes down to an async kernel API (e.g. select()). Which in turn will be using DMA to do the I/O and trigger an interrupt when it's done. So it's async at least down to the hardware level.
Q2: Thread B completes the Future. If you're using something like onComplete, then thread B will trigger that (probably by creating a new task and handing that task off to a thread pool to pick it up later) as part of the completing call. If a different thread has called Await to block on the Future, it will trigger that thread to resume. If nothing has accessed the Future yet, nothing in particular happens - the value sits there in the Future until something uses it. (See PromiseCompletingRunnable for the gritty details - it's surprisingly readable).

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (or maybe this is a just a dumb example; in such a case please point me to a sensible one)
Thanks in advance.
Consider this:
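#include <semaphore.h>

/* Assumed to be initialised elsewhere with sem_init(&g_shared_variable_mutex, 0, 1). */
sem_t g_shared_variable_mutex;
int g_shared_variable;

void update_shared_variable(void) {
    sem_wait( &g_shared_variable_mutex );  /* blocks only while another thread is in here */
    g_shared_variable++;
    sem_post( &g_shared_variable_mutex );
}

/* The do_thing_* functions stand for independent work that touches no shared state. */
void thread1(void) {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable(); // may block
}

void thread2(void) {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable(); // may block
}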
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not all of the code that runs will be blocking. For example, if you had a queue, and two threads were reading from that queue, you would make sure that the two threads never read from the queue at the same time, so that part would be blocking, but that's the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can be run asynchronously.
The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
You only need to lock access to the file (the shared resource).
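A minimal sketch of that situation, with several assumptions: heavy_work and run_workers are hypothetical names, a std::mutex is used in place of a semaphore, and an in-memory string stream stands in for the file. The point is only that the lock is held around the write and not around the computation.

```cpp
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical CPU-bound job; it touches no shared state.
long heavy_work(int id) {
    long sum = 0;
    for (long i = 0; i < 1000000; ++i) sum += i % (id + 2);
    return sum;
}

// Runs n workers in parallel and returns everything they wrote to the
// shared "file" (an in-memory stream in this sketch).
std::string run_workers(int n) {
    std::ostringstream file;  // the shared resource
    std::mutex file_mutex;    // guards `file` only
    std::vector<std::thread> workers;

    for (int id = 0; id < n; ++id) {
        workers.emplace_back([&file, &file_mutex, id] {
            long result = heavy_work(id);  // runs concurrently, no lock held

            // Only this short section is serialised between the workers.
            std::lock_guard<std::mutex> lock(file_mutex);
            file << "worker " << id << ": " << result << '\n';
        });
    }
    for (auto& t : workers) t.join();
    return file.str();
}
```

The order of the output lines depends on which worker finishes first, but each line is written whole because the writes never overlap.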
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers like Intel's will convert suitable for loops to threads automatically for you. In that particular circumstance no semaphores are needed because of the iterations' data independence.
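A hand-written version of such a loop might look like the sketch below. The function name parallel_square and the choice of squaring each element are illustrative assumptions. Each thread writes only to its own elements of the output, which is why no semaphore or mutex appears.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Squares every element of `in`, splitting the iterations across threads.
// Thread t handles indices t, t + nthreads, t + 2 * nthreads, and so on.
std::vector<int> parallel_square(const std::vector<int>& in, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    std::vector<int> out(in.size());
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&in, &out, t, nthreads] {
            for (std::size_t i = t; i < in.size(); i += nthreads) {
                out[i] = in[i] * in[i];  // no other thread touches out[i]
            }
        });
    }
    for (auto& th : pool) th.join();  // join() makes the results visible here
    return out;
}
```

For example, parallel_square({1, 2, 3, 4, 5}, 2) yields {1, 4, 9, 16, 25} regardless of how the two threads are scheduled.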
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
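The empty/filled handshake described above can be sketched with POSIX semaphores. The function name run_pipeline is hypothetical, and "add one" and "double" are arbitrary stand-ins for steps A and B; the sketch assumes a platform with unnamed semaphores, such as Linux.

```cpp
#include <semaphore.h>

#include <cstddef>
#include <thread>
#include <vector>

// Passes each input value through step A (first thread) and step B
// (second thread) using a single shared buffer slot.
std::vector<int> run_pipeline(const std::vector<int>& input) {
    sem_t empty, filled;
    sem_init(&empty, 0, 1);   // the buffer starts out free
    sem_init(&filled, 0, 0);  // nothing to read yet
    int buffer = 0;
    std::vector<int> output;

    std::thread stage_a([&] {
        for (int x : input) {
            sem_wait(&empty);   // wait until B has finished with the buffer
            buffer = x + 1;     // step A
            sem_post(&filled);  // tell B the buffer is ready
        }
    });
    std::thread stage_b([&] {
        for (std::size_t i = 0; i < input.size(); ++i) {
            sem_wait(&filled);             // wait until A has written
            output.push_back(buffer * 2);  // step B
            sem_post(&empty);              // hand the buffer back to A
        }
    });

    stage_a.join();
    stage_b.join();
    sem_destroy(&empty);
    sem_destroy(&filled);
    return output;
}
```

With the input {1, 2, 3} this produces {4, 6, 8}. With a single buffer slot the two stages mostly take turns, so a real pipeline would usually use several slots to let A run ahead of B.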
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, it would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability built in by design.

Resources