Blocking I/O operation on a single-core machine - multithreading

I am trying to understand what happens to a thread while it is waiting for an HTTP response from a remote server.
Let's say that at one point in time n threads are running. The OS, based on its thread-scheduling algorithm, will try to run every thread (say, in round-robin fashion). Suppose one of those n threads has initiated an HTTP request and is waiting for the response from the remote server. Will this thread keep getting its turn on the CPU core? Or is there some interrupt-like mechanism that notifies the thread when it is ready to run? If such a mechanism exists, what is the benefit of asynchronous programming, at least from a CPU-utilization perspective?
Is the above language dependent? If yes, what is the difference between Java vs Node.js vs Python ...?

I am trying to understand what happens to a thread when it is waiting
for an HTTP response from a remote server.
Well, the thread will wait for the underlying TCP socket to receive data. HTTP is a high-level protocol that runs on top of a (blocking or non-blocking) TCP connection. So the thread doesn't really wait for an "HTTP response" as such; it waits for data to become available on the socket so it can read it.
Will this thread keep getting its turn on the CPU core?
If the thread waits for a TCP socket to become readable, the OS doesn't schedule that thread to run until some data has been received; only then will the OS schedule the thread at some point in the future. A blocked thread is never scheduled to run - the OS sees no reason to do so, given that the thread has nothing to do.
Is the above thing dependent on language? If yes, what is the
difference between Java vs Node.js vs Python ...
No. Each OS provides a C/C++ API for applications to consume - Windows provides Win32, while Linux provides POSIX. Every programming language wraps and binds these APIs, and every "high-level" call (such as connecting a socket) eventually calls down into the operating system's APIs.

My understanding is that the asynchronous keyword is used so that your program continues executing instead of waiting for the forked task to complete. Even on single-core processors, as was the case with early computers, we were able to multitask, so we can deduce that the CPU allocated its time among tasks as judiciously as it could. Using async therefore lets your thread of execution keep running without waiting for the blocking task to complete; otherwise, even though the CPU takes turns executing threads, a single-threaded program would simply block.
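As a rough illustration, here is a minimal Node.js sketch (the URL example.com is just a placeholder) showing that an asynchronous HTTP request does not block the single JS thread: the OS waits on the socket, and the callback runs only when the event loop is notified that data has arrived.
const https = require('https');

// Non-blocking request: the call returns immediately and the OS tracks the socket.
https.get('https://example.com', (res) => {
  res.on('data', () => { /* consume the body */ });
  res.on('end', () => console.log('response finished with status', res.statusCode));
});

// This line runs right away, while the response is still in flight.
console.log('request sent; the thread is not blocked and keeps executing');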

Related

When a workerThread is created in nodejs, does it run on the same core on which the nodejs process is running?

Let's assume I have a nodejs server program with one API, and it does some manipulation on a video file sent via the HTTP request.
const saveVideoFile = (req, res) => {
  processAndSaveVideoFile(); // can run for minimum of 10 minutes
  res.send({ status: "video is being processed" });
};
I decided to make use of a workerThread to do this processing, as my machine has 3 cores (core1, core2, core3) and there is no hyperthreading enabled here.
Assume that my nodejs program is running on core1. When I fire up a single workerThread, will the workerThread run on core2/core3, or on core1?
I read that a workerThread is not the same as a childProcess. A childProcess forks a new process, which allows the childProcess to be scheduled on whichever core is free (core2 or core3).
I read that a workerThread shares memory with the mainThread. Let's assume that I create 2 workerThreads (wt1, wt2). Will my nodejs program, wt1, and wt2 all run on the same core, i.e. core1?
Also, in nodejs we have the event loop (main thread) and other threads doing background operations, i.e. I/O. Is it correct to assume that all of these are utilizing the resources of a single core (core1)? If this is the case, is creating and using additional workerThreads overkill on the nodejs server?
Below is an excerpt from this blog
We can run things in parallel in Node.js. However, we need not to
create threads. The operating system and the virtual machine
collectively run the I/O in parallel and the JS code then runs in a
single thread when it is time to send the data back to the JavaScript
code.
I keep reading this same information about nodejs in many articles and video presentations. But what I do not understand is this:
The operating system and the virtual machine collectively run the I/O in parallel
How can the operating system run the I/O requests from a nodejs program in parallel without using any childProcess or threads spawned from nodejs? And if those I/O requests from the nodejs program are running in parallel, does that mean all 3 cores (core1, core2, core3) will be utilized?
There is a lot of content on nodejs, but it doesn't clear up the doubts in the questions above. If you have an idea of how these things actually work, please share the details.
A worker thread in node.js is an actual OS thread running in a different instance of V8. As such, it's entirely up to the operating system to decide how to allocate it among the available CPU cores. If there are cores with available time, it will generally not be run on the same core as the main nodejs thread while that thread is busy, because the OS spreads busy threads across the various cores.
But, again, this is entirely up to the OS and is not something that nodejs controls, and the exact strategy for which cores are used varies by OS. In all modern operating systems, though, the design goal is that available cores are used for the threads that are currently executing. If there are more threads active at once than there are cores, the threads are time-sliced and all the cores stay busy.
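For reference, a minimal sketch of what starting such a worker looks like (the CPU-bound loop is just a stand-in for real work); which core the worker lands on is decided entirely by the OS scheduler, not by node.js:
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  // Main thread: spawn a worker running this same file.
  const worker = new Worker(__filename);
  worker.on('message', (msg) => console.log('result from worker:', msg));
} else {
  // Worker thread: CPU-bound work runs here without blocking the main event loop.
  let sum = 0;
  for (let i = 0; i < 1e8; i++) sum += i;
  parentPort.postMessage(sum);
}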
Also, in nodejs we have the event loop (main thread) and other threads doing background operations, i.e. I/O. Is it correct to assume that all of these are utilizing the resources of a single core (core1)? If this is the case, is creating and using additional workerThreads overkill on the nodejs server?
No, it is not correct to assume those threads all use the same core.
A workerThread in nodejs has its own event loop. For the most part, it does not share memory. In fact, if you want to share memory, you have to very specifically allocate shared memory (a SharedArrayBuffer) and pass it to the workerThread.
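A minimal sketch of that, using a trivial 4-byte buffer purely for illustration - the SharedArrayBuffer is the one piece of memory both threads actually see, while everything else passed to the worker is copied:
const { Worker, isMainThread, workerData } = require('worker_threads');

if (isMainThread) {
  const shared = new SharedArrayBuffer(4);                 // explicitly allocated shared memory
  const view = new Int32Array(shared);
  const worker = new Worker(__filename, { workerData: shared });
  worker.on('exit', () => console.log('value written by worker:', Atomics.load(view, 0)));
} else {
  const view = new Int32Array(workerData);                 // same memory, seen from the worker
  Atomics.store(view, 0, 42);
}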
Is it overkill? Well, it depends upon what you're doing. There are very useful things to do with workerThreads, and there are things for which they would not be necessary.
The operating system and the virtual machine collectively run the I/O in parallel
I/O in node.js is either asynchronous at the OS level (such as networking) or run in separate threads (such as disk I/O). That means it runs separately from the main node.js thread that runs your Javascript and can run in parallel with it, synchronizing only at the completion of an event. "Parallel" in this case means that both make progress at the same time. If there are multiple cores, then they can truly be running at exactly the same time. If there were only one core, then the OS would time-slice between the various threads and they would both make progress (in an interleaved fashion that seems parallel, but really they are taking turns).
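A small sketch of that overlap, using an ordinary disk read (the file being read is just this script itself, purely for illustration):
const fs = require('fs');

// The read is handed off (to libuv's thread pool); the callback is queued on
// the event loop only when the read has completed.
fs.readFile(__filename, (err, data) => {
  if (err) throw err;
  console.log('read finished, bytes:', data.length);
});

// This runs immediately, while the read is still in flight.
console.log('the JS thread keeps executing while the file is being read');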
How can the operating system run the I/O requests from a nodejs program in parallel without using any childProcess or threads spawned from nodejs? And if those I/O requests from the nodejs program are running in parallel, does that mean all 3 cores (core1, core2, core3) will be utilized?
The OS has its own threads for managing things like a network interface or a disk interface. The job of those threads is to interface with the hardware and bring data to the appropriate application, or take data from the application and send it to the hardware. These are OS-level threads that exist independently of node.js. Yes, other cores can be used by those OS-level threads. It is also important to realize that many operations, such as networking, are inherently non-blocking: if you're waiting for some data to arrive on a network interface, you don't need to have a thread doing something the whole time.
I also want to add that your questions appear to combine several different things. Mentioned in your questions are:
Worker Threads
Internal node.js threads
Operating system threads
These are all different things.
A worker thread is a new thread you can start to run specific pieces of Javascript in another thread so you can have more than one Javascript thread running at the same time. In node.js, this is done by creating a whole new instance of V8, setting up a whole new global environment and loaded modules environment and using almost entirely separate memory.
Internal node.js threads are used by node.js as part of implementing its event loop and its standard library. Specifically, disk I/O and some crypto operations are run in internal native threads and they communicate with your Javascript via events/callbacks through the event loop.
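A tiny sketch of that hand-off, using a standard-library call that node.js runs on its internal thread pool (the password and salt values here are placeholders):
const crypto = require('crypto');

// pbkdf2 executes on one of node's internal libuv threads; the derived key
// comes back to your Javascript as a callback via the event loop.
crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512', (err, key) => {
  if (err) throw err;
  console.log('key derived:', key.toString('hex').slice(0, 16), '...');
});

console.log('hashing is running on an internal thread; the event loop stays free');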
Operating system threads are threads that the OS uses to implement its own system APIs. Since the OS is responsible for lots of things, these threads can have many different uses. Depending upon the native implementation, they may be used to facilitate things like disk I/O or networking I/O. These threads are the OS's responsibility to create and use, and are not directly controlled by node.js.
Some additional questions asked in comments:
what is the difference b/w workerThread & childProcess concept in nodejs? is childProcess = workerThread without sharedMemory ?
A child process can be any type of program - it does not have to be a node.js program. A worker thread is node.js code.
A worker thread can share memory if sharedMemory is specifically allocated and shared with the worker thread and if it is carefully managed for concurrency issues.
Copying memory back and forth between a worker thread and the main thread is more efficient than doing so with a child process.
If main program exits, worker threads will exit. If main program exits, child process can be configured to exit or to continue.
If worker thread calls process.exit(), the main thread will exit too. If child program exits, it cannot cause main program to exit without main program's cooperation.
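Putting the two side by side, a minimal sketch (the script name ./task.js is hypothetical, just to show the calls):
const { Worker } = require('worker_threads');
const { fork } = require('child_process');

// Worker thread: another V8 instance inside this same process; it must be node.js code.
const worker = new Worker('./task.js');

// Child process: a completely separate OS process. fork() runs another node.js
// script, but spawn()/exec() could launch any program at all.
const child = fork('./task.js');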
How is nodejs able to magically interact with OS-level threads without nodejs itself creating any threads? I need additional details on this; your explanation is the common one present in most places, including the blog I shared.
nodejs just calls an OS API. It's the OS API that manages communicating with its own threads (if threads are needed for that specific OS API). How it does that communication internally is implementation dependent and varies by OS. Which OS APIs use threads and which don't also varies by OS.

ZeroMQ/Python - CPU affinity hiccup?

I have the following strange situation.
We have a process, call it Distributor, that receives tasks over ZeroMQ/TCP from a Client and accumulates them in a queue. There is a Worker process, which talks to the Distributor over ZeroMQ/IPC. The Distributor forwards each incoming task to the Worker and waits for an answer. As soon as the Worker answers, the Distributor sends it another task (if one was received in the meantime) and returns the answer to the Client (over a separate ZeroMQ/TCP connection). If a task is not processed within 10 ms, it is dropped from the queue.
With 1 Worker, the system is capable to process ~3,500 requests/sec. The client sends 10,000 requests/sec, so 6,500 requests are dropped.
But - when I'm running some unrelated process on the server that consumes 100% of a CPU (a busy-wait loop, or whatever) - then, strangely, the system can suddenly process ~7,000 requests/sec. When that process is stopped, throughput drops back to 3,500. The server has 4 cores.
The same happens when running 2, 3 or 4 Workers (connected to the same Distributor), with slightly different numbers.
The Distributor is written in C++. The Worker is written in Python and uses the pyzmq binding. The worker process does simple arithmetic and does not depend on any external I/O other than the Distributor.
There is a theory that this has to do with ZeroMQ using threads on separate CPUs when the server is free, and the same CPU when it's busy. If this is the case, I would appreciate an idea of how to configure the thread/CPU affinity of ZeroMQ so that it works correctly (without running a busy loop in the background).
Is there any ZeroMQ setting that might explain / fix this?
EDIT:
This doesn't happen with a Worker written in C++.
This was indeed a CPU affinity problem. It turns out that when using ZeroMQ in a setting where a worker processes an input and then waits for the next one, if a context switch causes it to be moved to another core, a lot of time is wasted on copying the ZeroMQ data.
Running the worker with
taskset -c 1 python worker.py
solves the problem.

QSerialPort - Is it possible to read() and write() on separate threads?

We have a DLL that provides an API for a USB device we make, which can appear as a USB CDC COM port. We actually use a custom driver on Windows for best performance, along with async I/O, but we have also used serial-port async file I/O in the past with reasonable success.
Latency is very important in this API when it is communicating with our device, so we have structured our library so that when applications make API calls to execute commands on the device, those commands turn directly into writes on the API caller's thread so that there is no waiting for a context switch. The library also maintains a listening thread which is always waiting using wait objects on an async read for new responses. These responses get parsed and inserted into thread-safe queues for the API user to read at their convenience.
So basically, we do most of our writing in the API caller's thread, and all of our reading in a listening thread. I have tried porting a version of our code over to using QSerialPort instead of native serial file i/o for Windows and OSX, but I am running into an error whenever I try to write() from the caller's thread (the QSerialPort is created in the listening thread):
QObject: Cannot create children for a parent that is in a different thread.
which seems to be due to the creation of another QObject-based WriteOverlappedCompletionNotifier for the notifiers pool used by QSerialPortPrivate::startAsyncWrite().
Is the current 5.2 version of QSerialPort limited to only doing reads and writes on the same thread? This seems very unfortunate as the underlying operating systems do not have any such thread limitations for serial port file i/o. As far as I can tell, the issue mainly has to do with the fact that all of QSerialPort's notifier classes are based on QObject.
Does anyone have a good work around to this? I might try building my own QSerialPort that uses notifiers not based on QObject to see how far that gets me. The only real advantage QObject seems to be giving here is in the destruction of the notifiers when the port closes.
Minimal Impact Solution
You're free to inspect the QSerialPort and QIODevice code and see what would need to change to make the write method(s) thread-safe for access from one thread only. The notifiers don't need to be children of the QSerialPort at all, they could be added to a list of pointers that's cleaned up upon destruction.
My guess is that perhaps no other changes are necessary to the mainline code, and only mutex protection is needed for access to error state, but you'd need to confirm that. This would have lowest impact on your code.
If you care about release integrity, you should be compiling Qt yourself anyway, and you should be having it as a part of your own source code repository, too. So none of this should be any problem at all.
On the Performance
"those commands turn directly into writes on the API caller's thread so that there is no waiting for a context switch" Modern machines are multicore and multiple threads can certainly run in parallel without any context switching. The underlying issue is, though: why bother? If you need hard-realtime guarantees, you need a hard-realtime system. Otherwise, nothing in your system should care about such minuscule latency. If you're doing this only to make the GUI feel responsive, there's really no point to such overcomplication.
A Comms Thread Approach
What I do, with plenty of success, and excellent performance, is to have the communications protocol and the communications port in the same, dedicated thread, and the users in either the GUI thread, or yet other thread(s). The communications port is generally a QIODevice, like QTcpSocket, QSerialPort, QLocalSocket, etc. Since the communications protocol object is "just" a QObject, it can also live, with the port, in the GUI thread for demonstration purposes - it's designed fully asynchronously anyway, and doesn't block for anything but the most trivial of computations.
The communications protocol is queuing multiple requests for execution. Even on a single-core machine, once the GUI thread is done submitting all of the requests, the further execution is all in the communications thread.
The QSerialPort implementation uses asynchronous OS APIs. There's little to no benefit to further processing those async replies on separate threads. Those operations have very low overhead and you will not gain anything measurable in your latency by trying to do so. Remember: this is not your code, but merely code that pushes bytes between buffers. Yes, the context switch overhead may be there on heavily loaded or single-core systems, but unless you can measure the difference between its presence and absence, you're fighting imaginary problems.
It is possible to use any QObject from multiple threads, of course, as long as you serialize the access to it via the event queue mutex. This is done for you whenever you use the QMetaObject::invokeMethod or signal-slot connections.
So, add a trivial wrapper around QSerialPort that exposes the write as a thread-safe method. Internally, it should use a signal-slot connection. You can call this thread-safe write from any thread. The overhead in such a call is a mutex lock and 2+n malloc/free calls, where n is the non-zero number of arguments.
In your wrapper, you can also process the readyRead signal, and emit a signal with received data. That signal can be processed by a QObject living in another thread.
Overall, if you do the measurements correctly, and if your port thread's implementation is correct, you should find no benefit whatsoever to all this complication.
If your communications protocol does heavy data processing, this should be factored out. It could go into a separate QObject that can then run on its own thread. Or, it can be simply done using dedicated functors that are executed by QtConcurrent::run.
What if you use QSerialPort to open and configure the serial port, and QSocketNotifier to monitor for read activity (and other QSocketNotifier instances for write completion and error handling, if necessary)?
QSerialPort::handle should give you the file descriptor you need. On Windows, if that function returns a Windows HANDLE, you can use _open_osfhandle to get a file descriptor.
As a follow up, shortly after this discussion I did implement my own thread-safe serial port code for POSIX systems using select() and the like and it is working well on multiple threads in conjunction with Qt and non-Qt applications alike. Basically, I have abandoned using QtSerialPort at all.

When a process or thread gets blocked, does it wait forever for a notification or just sleep for a while?

I stumbled into this question when I was reading “JVMs typically implement blocking by suspending the blocked thread and rescheduling it later” from http://www.ibm.com/developerworks/java/library/j-jtp04186/?S_TACT=105AGX52&S_CMP=cn-a-j
When we say a process or thread gets blocked while doing I/O operations (read, write) or while acquiring some exclusive resource (lock, synchronized), when does it get to execute again? Does it wait constantly until it gets a notification from somewhere, or does it simply give up its turn and run again after a while?
Does this depend on the platform - the OS or the JVM?
That devolves to the underlying OS, which must provide threading support to the VM - it has to be that way so that the Java app can coexist harmoniously with all the other processes and threads that are typically loaded on an OS: browsers, sidebars, antivirus, video/audio players, torrent clients, the OS's internal threads, etc.
The code of a blocked thread gets no CPU cycles at all. A thread in that state is just an unused-for-now stack allocation and an extra struct/class pointer in a container in the kernel, waiting for something else to change its state. If it remains blocked for an extended time, its stack may even get swapped out on a busy system.
So yes, they constantly wait until they get a notification from somewhere.

Why doesn't the operating system perform round-robin between the two threads when it only has 1 CPU?

Problem: the operating system does not perform round-robin between the two threads and the system just hangs.
Our system is implemented as an HTTP native-code module in C++ for IIS on Windows Server 2008 R2 64-bit.
Scenario:
Request 1 arrives at the web server; a new thread (t1) is started by IIS. The thread is executing.
Request 2 arrives before request 1 is finished. IIS starts a new thread (t2). This thread goes into a loop waiting for a shared resource to become available. This behavior is programmed by us. Since thread two (t2) spins in a loop, it starts consuming 100% of the CPU.
Problem: the operating system does not perform round-robin between the two threads and the system just hangs. If it switched execution to the first thread, the shared resource would be released and the second thread could run as well.
Even stranger: this behavior only occurs when the machine has 1 CPU. If we add another CPU to the machine, it works perfectly, switching between the two threads as expected. Nothing hangs.
A workaround (and better programming, too) that makes it work with only 1 CPU is to put a "sleep(100)" in the loop that checks for availability of the shared resource.
Why doesn't the operating system perform round-robin between the two threads when it only has 1 CPU? Is it related to VMware?
This thread goes into a loop waiting for a shared resource to be available
This sounds like the wrong way to synchronize things; you need to signal between the threads so that the OS gets a hint to perform the context switch.
