Order of request execution in Node.js [duplicate] - node.js

I searched for an answer to this question, but people seem to emphasize only non-blocking I/O.
Say I have a very simple application that responds to the client with the text "Hello World". It still needs some time to execute, no matter how quick it is. If two requests come in at exactly the same time, how does Node.js make sure both are processed with one thread?
I read the blog post Understanding the node.js event loop, which says "Of course, on the backend, there are threads and processes for DB access and process execution". That statement is about I/O, but I also wonder whether there is a separate thread that handles the request queue. If so, can I say that the Node.js single-thread concept only applies to developers who build applications on Node.js, while Node.js itself actually runs on multiple threads behind the scenes?

The operating system gives each socket connection a send queue and a receive queue. That is where the bytes sit until something at the application layer handles them. If a connection's receive queue fills up, the client on that connection cannot send more data until space becomes available in the queue. This is why an application should handle requests as fast as possible.
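To make the queueing concrete, here is a minimal Python sketch (not Node.js itself, just the same OS behaviour seen from one thread). Two clients connect and send before the server calls accept() even once; the kernel holds both connections and their bytes until the single thread gets around to them. The function name, the backlog of 8 and the message texts are made up for illustration.

```python
import socket


def serve_two_pending_requests():
    """Handle two already-queued connections one after another on one thread."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(8)               # backlog the kernel may hold before accept()
    address = listener.getsockname()

    # Both clients connect and send before the server accepts anything.
    clients = []
    for name in (b"first", b"second"):
        client = socket.create_connection(address)
        client.sendall(b"hello from " + name)
        clients.append(client)

    # The single "server thread" now drains what the kernel queued for it.
    requests = []
    for _ in clients:
        conn, _peer = listener.accept()
        requests.append(conn.recv(1024))  # bytes were waiting in the receive queue
        conn.sendall(b"Hello World")
        conn.close()

    responses = [client.recv(1024) for client in clients]
    for client in clients:
        client.close()
    listener.close()
    return requests, responses


if __name__ == "__main__":
    print(serve_two_pending_requests())
```

Neither client notices that the server was busy: the connection handshake and the first bytes are handled by the kernel, and the application only decides how quickly the queues are emptied.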
If you are on a *nix system you can use netstat to view the current number of bytes in the send and receive queues. In this example, there are 0 bytes in the receive queue and 240 bytes in the send queue (waiting to be sent out by the OS).
Proto  Recv-Q  Send-Q  Local Address  Foreign Address  State
tcp    0       240     x.x.x.x:22     x.x.x.x:*        LISTEN
On Linux you can check the default size and max allowed size of the send/receive queues with the proc file system:
Receive:
cat /proc/sys/net/core/rmem_default
cat /proc/sys/net/core/rmem_max
Send:
cat /proc/sys/net/core/wmem_max
cat /proc/sys/net/core/wmem_default
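The same sizes can also be read per socket from inside a program. This is a small Python sketch using getsockopt() with SO_RCVBUF and SO_SNDBUF; the helper name is made up, and the numbers depend on the OS and its configuration (for TCP sockets on Linux they may come from TCP-specific settings rather than the files above), so treat the output as informational only.

```python
import socket


def socket_buffer_sizes(kind=socket.SOCK_STREAM):
    """Return (receive, send) buffer sizes in bytes for a fresh socket."""
    with socket.socket(socket.AF_INET, kind) as sock:
        recv_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        send_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return recv_size, send_size


if __name__ == "__main__":
    print("TCP:", socket_buffer_sizes(socket.SOCK_STREAM))
    print("UDP:", socket_buffer_sizes(socket.SOCK_DGRAM))
```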

Related

What happens to the data that will be returned when thread is waiting for other threads

Let's say I create a great many threads on a single-core CPU. Each thread performs an I/O operation, for example reading data from a database or another microservice.
What happens if I create thousands of threads that each read something from a DB?
How does this communication work?
I assume that a thread sends a request to a DB or makes an HTTP call to another service, and after that the CPU is used by another thread. How is this communication handled? Does the OS handle messages for the other threads and wait until those threads are scheduled on the CPU to pass them the data?
Let's say I make 1000 calls in 1000 threads and each response is 1 MB of data. Where is this data buffered until the correct thread becomes active? (For example, we are spawning the tenth thread and have already received a response for the first one.)
Or maybe someone could point me to some good articles on this topic?
Every time a thread makes an I/O request the OS (kernel) queues that I/O and puts the thread to sleep (assuming we're talking about a synchronous I/O call).
"Queues that I/O" means setting up some link between the socket through which the I/O is performed and the network card queue, and setting up an internal OS buffer to hold request and response data.
When a response arrives at the network card, the OS adds the data to the socket's buffer and, typically, wakes up the thread that made the associated I/O request.
Note that while an HTTP or DB query response can be 1 MB, it is usually sent over a TCP/IP connection, and the underlying network usually has a much smaller MTU (about 1500 bytes on typical Ethernet). The TCP/IP implementation on the server therefore slices the response into many small packets and sends them one after another.
If 1000 responses arrive at the same time, and the hardware cannot handle such a load, each server will have to send its packets more slowly, but the OS will still likely handle all such "streams" of responses in parallel.
I assume that a thread sends a request to a DB or makes an HTTP call to another service, and after that the CPU is used by another thread. How is this communication handled? Does the OS handle messages for the other threads and wait until those threads are scheduled on the CPU to pass them the data?
It depends on the exact communication method used. Most commonly, it will be some kind of byte stream connection such as a TCP connection. In that case, the thread typically makes a blocking read operation that causes the kernel to mark that thread as waiting for I/O. It attaches the thread to a data structure associated with the TCP connection and does whatever is needed to make that I/O take effect.
When a response is received, the kernel code notices the thread waiting for activity. It then marks that original thread ready-to-run and the scheduler will eventually schedule it. When it runs, it resumes in the kernel's blocking I/O code, but this time there is data waiting for it, so it returns to user space and resumes execution.
Let's say I make 1000 calls in 1000 threads and each response is 1 MB of data. Where is this data buffered until the correct thread becomes active? (For example, we are spawning the tenth thread and have already received a response for the first one.)
It depends on exactly what communication method is used. If it's a TCP connection, then there are buffers associated with that connection. If it uses shared memory, then the other process just writes into that shared memory page.
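Both situations described above can be reproduced with a local socket pair. This is a rough Python sketch, with socketpair() standing in for a real DB or HTTP connection; the 4096-byte payload and the function names are arbitrary choices for illustration. In the first case the "response" is already sitting in the kernel buffer before the reading thread exists; in the second the thread blocks in recv() and is woken when data arrives.

```python
import socket
import threading
import time


def read_all(reader, expected_len, results):
    """Blocking read loop: recv() sleeps in the kernel until bytes are available."""
    data = b""
    while len(data) < expected_len:
        chunk = reader.recv(65536)
        if not chunk:  # peer closed early
            break
        data += chunk
    results.append(data)


def demo_buffering():
    writer, reader = socket.socketpair()
    payload = b"x" * 4096
    results = []

    # Case 1: data arrives first and waits in the socket buffer.
    writer.sendall(payload)
    thread = threading.Thread(target=read_all, args=(reader, len(payload), results))
    thread.start()
    thread.join()

    # Case 2: the thread blocks first, and the later send wakes it up.
    thread = threading.Thread(target=read_all, args=(reader, len(payload), results))
    thread.start()
    time.sleep(0.1)  # give the thread a moment to block (best effort only)
    writer.sendall(payload)
    thread.join()

    writer.close()
    reader.close()
    return results


if __name__ == "__main__":
    print([len(r) for r in demo_buffering()])
```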

Can I use SO_REUSEPORT to distribute a single UDP flow to multiple receiver threads?

My Linux application needs to receive a single UDP flow with modestly-sized packets (~1 KB) at a rate on the order of ~600,000 packets per second. My current implementation is naive: it has a single thread that simply calls recv() repeatedly, placing the received data in a queue to be processed by another thread. Therefore, the receiver thread is only in charge of pulling in the packets.
In some initial testing that I've done, I'm only able to receive between 200,000 and 300,000 packets per second before the thread reaches full utilization of its CPU core. This obviously isn't good enough to meet the goal of ~600,000 packets per second.
Ideally, I would find some way of distributing the packet reception load across multiple threads. In looking for a solution to the problem, I came across the SO_REUSEPORT socket option, which allows multiple TCP/UDP sockets to be bound to the same IP/port combination. At first, this seemed to be exactly what I wanted.
However, the article also points out this detail:
Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port. This means, for example, that if a client uses the same socket to send a series of datagrams to the server port, then those datagrams will all be directed to the same receiving server (as long as it continues to exist). This eases the task of conducting stateful conversations between the client and server.
So if I only have a single UDP flow, this hashing scheme would direct all of the packets to the same receiver thread, thwarting my attempt to parallelize the work. The question, then, is: is there a way to receive a single flow of UDP packets from multiple threads, using SO_REUSEPORT or some other mechanism?
Note that my application can handle reordering of packets; the protocol that the datagrams are formatted with contains sequencing information that I can use to reorder them properly afterward.
If you haven't found a solution in the last 3 years, take a look at SO_ATTACH_REUSEPORT_CBPF. We had exactly the same issue and solved it by attaching a simple BPF program that distributes datagrams randomly mod n.
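To see why the default behaviour defeats parallelism for a single flow, and why a random pick fixes it, here is a toy Python model of the two selection rules. It is only a simulation: it is not the kernel's hash and not a BPF program, and the addresses, ports and function names are hypothetical.

```python
import random
from collections import Counter


def pick_by_flow_hash(flow, n_sockets):
    """Default-style selection: same 4-tuple, same socket, every time."""
    return hash(flow) % n_sockets


def pick_randomly(n_sockets, rng):
    """Selection in the spirit of 'random mod n': ignores the flow entirely."""
    return rng.randrange(n_sockets)


def simulate(n_packets=10000, n_sockets=4, seed=0):
    # One single flow (hypothetical peer and local address/port).
    flow = ("10.0.0.1", 40000, "10.0.0.2", 5000)
    rng = random.Random(seed)
    by_hash = Counter(pick_by_flow_hash(flow, n_sockets) for _ in range(n_packets))
    by_random = Counter(pick_randomly(n_sockets, rng) for _ in range(n_packets))
    return by_hash, by_random


if __name__ == "__main__":
    hashed, randomized = simulate()
    print("hash-based:", dict(hashed))
    print("random    :", dict(randomized))
```

With the hash rule every packet of the flow lands on one socket, while the random rule spreads them across all sockets, at the cost of per-socket ordering. That matches the question's note that reordering is acceptable.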

epoll: must I use multi-threading

I've got a basic knowledge from here about epoll. I know that epoll can monitor multiple FDs and handle them.
My question is: can a heavy event block the server so I must use multithreading?
For example, the server's epoll instance is monitoring 2 sockets, A and B. Now A starts to send a lot of messages to the server, so the server starts to read them. One second later, B starts to send messages too while A is still sending. In this case, do I need to create a thread for these reads? If I don't, does it mean that the server has no chance to get the messages from B until A finishes sending?
If you can process incoming messages fast enough (no blocking calls, no heavy computations), you don't need a separate thread. Otherwise, you would benefit from going multi-threaded.
In any case, it helps to understand what happens when you have only one thread and can't process messages fast enough. With TCP, the receive buffers fill up and flow control makes the sending machines reduce their transmission rate. With UDP, some incoming packets will be dropped.
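The single-threaded case can be sketched in Python with the selectors module, which uses epoll on Linux. Two socket pairs stand in for clients A and B, and the loop does one bounded read per ready socket, so neither sender starves the other. The names, the chunk size and the number of chunks are arbitrary assumptions for the illustration.

```python
import selectors
import socket


def serve_two_senders(chunks_per_sender=5, chunk=b"m" * 1024):
    sel = selectors.DefaultSelector()
    a_send, a_recv = socket.socketpair()
    b_send, b_recv = socket.socketpair()
    for name, sock in (("A", a_recv), ("B", b_recv)):
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=name)

    # Both senders queue their messages and then close their end.
    for _ in range(chunks_per_sender):
        a_send.sendall(chunk)
        b_send.sendall(chunk)
    a_send.close()
    b_send.close()

    totals = {"A": 0, "B": 0}
    order = []  # which sender each read served, in sequence
    open_sockets = 2
    while open_sockets:
        for key, _events in sel.select():
            data = key.fileobj.recv(1024)  # one bounded read, then move on
            if data:
                totals[key.data] += len(data)
                order.append(key.data)
            else:  # empty read: the sender closed the connection
                sel.unregister(key.fileobj)
                key.fileobj.close()
                open_sockets -= 1
    sel.close()
    return totals, order


if __name__ == "__main__":
    print(serve_two_senders())
```

Because each pass reads a limited amount from every ready socket, B's messages are picked up while A still has data pending. The loop only falls behind if the per-message processing itself is slow, which is the case where extra threads help.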

Winsock send() blocks server?

I have read that the send() function on Winsock blocks until the ACK for the last packet is received. I am now playing with a server for a turn-based role-playing game. Everything is handled by one thread (for 64 sockets). A request is received, handled, and a response is written to the socket(s). This process cannot be interrupted.
Is it possible to handle, say 1000 clients (one thread for every 64 sockets) with this method?
Wouldn't it block the whole server if a send() takes too long to complete or the client maliciously does not send the ACK or the connection gets interrupted?
Shall I split the logic of networking and request handling into 2 threads? If so the thread handling the network transfers could still be blocked by a send() or recv().
Or would it be best to use overlapped I/O?
send() blocks only if the socket is running in blocking mode and the socket's outbound buffer fills up with queued data. If you are managing multiple sockets in the same thread, do not use blocking mode: if one receiver does not read data in a timely manner, it can affect all of the connections on that thread. Use non-blocking mode instead. send() will then fail with WSAEWOULDBLOCK when the socket has entered a state where blocking would occur, and you can use select() to detect when the socket can accept new data again. A better option is to use overlapped I/O or I/O Completion Ports instead. Submit outbound data to the OS and let the OS handle all of the waiting for you, notifying you when the data has eventually been accepted/sent. Do not submit new data for a given socket until you receive that notification. For scalability to a large number of connections, I/O Completion Ports are generally a better choice.
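The non-blocking send plus select() pattern looks roughly like this in Python. Note the assumptions: this runs against a local socket pair rather than Winsock, Python reports the "would block" condition as BlockingIOError, and the function name and chunk size are invented for the sketch.

```python
import select
import socket


def fill_then_wait_for_writable():
    sender, receiver = socket.socketpair()
    sender.setblocking(False)
    chunk = b"x" * 65536

    # Keep sending until the outbound buffer is full. A blocking socket
    # would stall the whole thread at this point instead of raising.
    queued = 0
    while True:
        try:
            queued += sender.send(chunk)
        except BlockingIOError:
            break

    # The slow receiver finally reads everything that was queued.
    drained = 0
    while drained < queued:
        drained += len(receiver.recv(65536))

    # select() now reports that the sender can accept new data again.
    _, writable, _ = select.select([], [sender], [], 1.0)
    can_send_again = sender in writable

    sender.close()
    receiver.close()
    return queued, drained, can_send_again


if __name__ == "__main__":
    print(fill_then_wait_for_writable())
```

In a real server the thread would go on serving its other sockets between the failed send and the select() notification, instead of waiting on the one slow client.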
No, it doesn't work like that. From the MSDN documentation on send:
The successful completion of a send function does not indicate that the data was successfully delivered and received to the recipient. This function only indicates the data was successfully sent.

UNIX socket magic. Recommended for high performance application?

I'm looking at using sendmsg() to transfer an accept()ed socket between processes. In short, I'm trying to build a simple load balancer that can deal with a large number of connections without having to buffer the stream data.
Is this a good idea when dealing with a large number (let's say hundreds) of concurrent TCP connections? If it matters, my system is Gentoo Linux.
You can share the file descriptor as per the previous answer here.
Personally, I've always implemented servers using pre-fork. The parent sets up the listening socket, spawns (pre-forks) children, and each child does a blocking accept. I used pipes for parent <-> child communication.
Until someone does a benchmark and establishes how "hard" it is to send a file descriptor, this remains speculation (someone might pop up: "Hey, sending the descriptor like that is dirt-cheap"). But here goes.
You will (likely, read above) be better off if you just use threads. You can have the following workflow:
Start a pool of threads that just wait around for work. Alternatively, you can just spawn a new thread when a request arrives (it's cheaper than you think).
Use epoll(7) to wait for traffic (wait for connections + interesting traffic)
When interesting traffic arrives you can just dispatch a "job" to one of the threads.
Now, this does circumvent the whole descriptor sending part. So what's the catch? The catch is that if one of the threads crashes, the whole process crashes. So it is up to you to benchmark and decide what's best for your server.
Personally I would do it the way I outlined it above. Another point: if the workers are children of the process doing the accept, sending the descriptor is unnecessary.
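For reference, the descriptor passing that the question asks about is sendmsg() with SCM_RIGHTS ancillary data over a UNIX domain socket. Below is a minimal Python sketch of it. To keep it self-contained, everything runs in one process and a socket pair stands in for the accept()ed TCP connection; in a real load balancer the two ends of the channel would live in different processes. All function and variable names are made up.

```python
import array
import socket


def send_fd(channel, fd):
    """Send one file descriptor over a UNIX domain socket."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    channel.sendmsg([b"F"], ancillary)  # at least one byte of real data is sent


def recv_fd(channel):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = channel.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    return fds[0]


def demo_fd_passing():
    # Stand-in for a client and the connection the balancer accept()ed.
    client, accepted = socket.socketpair()
    # Channel between the (hypothetical) balancer and worker.
    balancer_end, worker_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    send_fd(balancer_end, accepted.fileno())
    accepted.close()  # the balancer no longer needs its own copy

    # The worker wraps the received descriptor and talks to the client directly.
    worker_conn = socket.socket(fileno=recv_fd(worker_end))
    worker_conn.sendall(b"handled by worker")
    reply = client.recv(1024)

    for sock in (worker_conn, client, balancer_end, worker_end):
        sock.close()
    return reply


if __name__ == "__main__":
    print(demo_fd_passing())
```

This only shows the mechanism; it says nothing about how expensive the transfer is, so the benchmarking advice above still applies.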
