Handling multiple requests with express - node.js

I have a background in Java and I am relatively new to node. I am trying to understand how node, despite being single threaded, can still handle multiple requests at the same time.
I have read about the single thread and the event loop, as well as the related stackoverflow questions, but I am still not sure I have understood it correctly, hence this question.
I have a simple http service that takes an id as an input. There can be multiple requests at almost the same time with the same id, and of course also other requests at almost the same time with other ids.
When the service is called, the following happens:
Lookup id in DB (in a blocking manner, i.e. await)
If the DB lookup did not find a result, insert id in DB
Let's say there are two requests at almost the same time, with the same id.
My question is whether the following is possible:
Request 1 makes the lookup in the DB -> no result
Request 2 makes the lookup in the DB -> no result
Request 1 inserts a new row
Request 2 inserts a new row
The blocking manner of the lookup makes me guess the answer is "no, that is not possible", but then I read that await does not block the single thread. What makes me want to answer "yes, it is possible" is that I do not understand how several requests could be handled concurrently otherwise.
Thanks,
-Louise

As far as I can determine the answer is "yes, that is possible". The "await" on the call to the DB ensures that the query has finished before we continue to the next line of code, but it does not block the thread.
The thread continues with other tasks while awaiting the DB operation to finish, and those other tasks might be handling another request. This means that a race condition can happen between multiple requests.
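A minimal sketch of that interleaving, using a fake in-memory "DB" whose calls yield to the event loop (the helper names findById/insert are illustrative, not a real driver API):

```javascript
// Minimal sketch: two concurrent handlers race on the same id.
const rows = []

const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms))

async function findById (id) {
  await delay(10)                      // simulate a network round-trip
  return rows.find(r => r.id === id) || null
}

async function insert (id) {
  await delay(10)
  rows.push({ id })
}

async function handleRequest (id) {
  const existing = await findById(id)  // await frees the thread...
  if (existing === null) {
    await insert(id)                   // ...so another request can interleave here
  }
}

// Two "simultaneous" requests with the same id:
Promise.all([handleRequest(42), handleRequest(42)])
  .then(() => console.log(rows.length))  // 2 -> duplicate row, the race happened
```

The usual fix is to push the race into the database itself: a unique constraint on the id column, or an upsert-style statement (e.g. INSERT ... ON CONFLICT DO NOTHING in PostgreSQL), so that two concurrent handlers cannot both insert.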

Related

How many threads does Node really have?

As Node uses a JavaScript engine internally, it has only one thread. But checking online, some people say two threads, some say one thread per core, and some say something else. Can anyone clarify?
And if it is single threaded, how many requests can it handle concurrently? If 1000 requests come in at the same time and each request takes 100 ms, then the 1000th one would take 100 s (100 × 1000 = 100,000 ms = 100 s).
So does that mean Node is not good for a large number of users?
Based on your writing, I can tell you know something about programming, logic, and basic math, but you are just getting into JavaScript (not "java-script") now.
That background makes you an excellent fit for server-side JavaScript with Node.js, and I foresee you becoming great at it.
What is clear to me is that you are confusing parallelism with concurrency, and this post is going to be very useful for you.
But a TLDR could be something like this:
Parallelism: two or more processes/threads running at the same time.
Concurrency: the single main process/thread does not wait for an IO operation to end; it keeps doing other things and gets back to it whenever the IO operation ends.
IO (Input/Output) operations involve interactions with the OS (Operating System) by using the DISK or the NETWORK, for example.
In nodejs, those tasks are asynchronous, and that's why they ask for a callback, or they are Promise based.
const fs = require('fs')

// Called once per readFile, when that file's IO finishes
function myCallbackFunc (err, data) {
  if (err) {
    return console.error(err)
  }
  console.log(data)
}

fs.readFile('./some-large-file', myCallbackFunc)
fs.readFile('./some-tiny-file', myCallbackFunc)
A simple, theoretical way to put it: in the example above you have a main thread taking care of your code, and another one (which you don't control at all) observing the asynchronous requests; that second one calls myCallbackFunc whenever the IO process, which happens concurrently, ends.
So YES, nodejs is GREAT for a large number of requests.
Of course, that single process still shares its computational power across all the concurrent requests it is taking care of.
Meaning that if, within the callback above, you do some heavy computational task that takes a while to execute, the second callback has to wait for the first one to end.
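A tiny experiment that makes this visible (plain Node, no external dependencies): a synchronous busy loop in a callback delays a timer that was already due, because nothing else can run until the main thread is released.

```javascript
// Demonstration: CPU-bound work in a callback delays every other callback.
const order = []

// This timer is due after 0 ms...
setTimeout(() => order.push('timer'), 0)

// ...but this synchronous "heavy computation" holds the single main
// thread for ~50 ms, so the timer cannot fire until it finishes.
const start = Date.now()
while (Date.now() - start < 50) { /* busy wait */ }
order.push('busy loop done')

// By the time the event loop runs again, the overdue timer fires first,
// then this immediate reports the observed order.
setImmediate(() => console.log(order.join(' -> ')))  // busy loop done -> timer
```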
In that case, if you are running on your own server/container, you can get real parallelism by forking the process with the Cluster native module that already ships with Node.js :D
Node has one main thread (that's why it is called single-threaded), which receives all requests and hands work off to internal threads (not the main thread; smaller worker threads) in a thread pool. Node puts incoming requests in an event queue, and the server continuously runs an internal event loop that checks whether any request is placed in the event queue. If not, it waits indefinitely for incoming requests; otherwise it picks one request from the queue.
It then checks thread availability in the internal thread pool; if a thread is available, it picks one and assigns the request to it.
That thread is responsible for taking the request, processing it, performing blocking IO operations, preparing the response, and sending it back to the event loop.
The event loop, in turn, sends that response to the respective client.

Play-Scala Async for Dummies

I tried researching and understanding the async and non-blocking ability of play.
What I understand(may be wrong):
Action.async leads to a Future[Result], a placeholder for a result that is yet to be received. A request comes in and a method handles it (say a database query), but as the DB call is made, the thread is freed up to take another request. So how does the system not lose track of that DB call when a thread is no longer with it?
Once the result is received does an available thread then pick up on that result and responds with it?
Usually when learning new concepts I'd like an animated or layman's terms type video that visually shows threads and requests.
Also, isn't there waiting that has to be done on every request anyway? Is the point just that resources are not tied up while that waiting is being done?
Thanks in advance!

How does Cassandra handle the blocking execute statement in the DataStax Java driver

Blocking execute method from com.datastax.driver.core.Session
public ResultSet execute(Statement statement);
Comment on this method:
This method blocks until at least some result has been received from
the database. However, for SELECT queries, it does not guarantee that
the result has been received in full. But it does guarantee that some
response has been received from the database, and in particular
guarantee that if the request is invalid, an exception will be thrown
by this method.
Non-blocking execute method from com.datastax.driver.core.Session
public ResultSetFuture executeAsync(Statement statement);
This method does not block. It returns as soon as the query has been
passed to the underlying network stack. In particular, returning from
this method does not guarantee that the query is valid or has even
been submitted to a live node. Any exception pertaining to the failure
of the query will be thrown when accessing the {#link
ResultSetFuture}.
I have two questions about them; it would be great if you could help me understand them.
Let's say I have 1 million records and I want all of them to arrive in the database (without any loss).
Question 1: If I have n threads, each with the same number of records to send, and all of them keep sending insert queries to Cassandra using the blocking execute call, will increasing n also speed up the time I need to insert all records into Cassandra?
Will this cause performance problems for Cassandra? Does Cassandra have to make sure that, for every single inserted record, all the nodes in the cluster know about the new record immediately, in order to maintain data consistency? (I assume a Cassandra node won't even think about using the local machine time to control the record insertion time.)
Question 2: With non-blocking execute, how can I be sure that all of the insertions succeed? The only way I know is waiting on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is non-blocking execute more likely to fail than blocking execute?
Thank you very much for your help.
If I have n threads, each with the same number of records to send, and all of them keep sending insert queries to Cassandra using the blocking execute call, will increasing n also speed up the time I need to insert all records into Cassandra?
To some extent. Let's divorce the client implementation details a bit and look at things from the perspective of the number of concurrent requests, since you don't need a thread for each ongoing request if you use executeAsync. In my testing I have found that while there is a lot of value in a high number of concurrent requests, there is a threshold beyond which returns diminish or performance starts to degrade. My general rule of thumb is (number of nodes × native_transport_max_threads (default: 128) × 2), but you may find more optimal results with more or less.
The idea here is that there is not much value in enqueuing more requests than Cassandra will handle at a time. By limiting the number of in-flight requests, you avoid unnecessary congestion on the connections between your driver client and Cassandra.
Question 2: With non-blocking execute, how can I be sure that all of the insertions succeed? The only way I know is waiting on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is non-blocking execute more likely to fail than blocking execute?
Waiting on the ResultSetFuture via get is one route, but if you are developing a fully async application, you want to avoid blocking as much as possible. Using guava, your two best weapons are Futures.addCallback and Futures.transform.
Futures.addCallback allows you to register a FutureCallback that gets executed when the driver has received the response. onSuccess gets executed in the success case, onFailure otherwise.
Futures.transform allows you to effectively map the returned ResultSetFuture into something else. For example, if you only want the value of one column, you could use it to transform a ListenableFuture&lt;ResultSet&gt; into a ListenableFuture&lt;String&gt;, without having to block on the ResultSetFuture and then extract the String value yourself.
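For readers coming from Node rather than Java: Futures.transform corresponds to what .then does with JavaScript Promises, mapping the eventual result set to just the value you want without blocking. A sketch with a stubbed result set (the shape of resultSet.rows here is illustrative, not a real driver API):

```javascript
// Hypothetical: map an eventual result set to a single column value.
function firstName (resultSetPromise) {
  return resultSetPromise.then(resultSet => resultSet.rows[0].name)
}

// Usage with a stubbed "ResultSet" standing in for a driver response:
firstName(Promise.resolve({ rows: [{ name: 'alice' }] }))
  .then(name => console.log(name))  // prints "alice"
```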
In the context of writing a dataloader program, you could do something like the following:
1. To keep things simple, use a Semaphore or some other construct with a fixed number of permits (that will be your maximum number of in-flight requests). Whenever you go to submit a query using executeAsync, acquire a permit. You should really only need one thread (but you may want a pool of #-of-cpu-cores size) that acquires permits from the Semaphore and executes queries; it will just block on acquire until a permit is available.
2. Use Futures.addCallback on the future returned from executeAsync. The callback should call Semaphore.release() in both the onSuccess and onFailure cases. By releasing a permit, it allows your thread in step 1 to continue and submit the next request.
3. To further improve throughput, consider using BatchStatement and submitting requests in batches. This is a good option if you keep your batches small (50-250 is a good number) and if all the inserts in a batch share the same partition key.
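Steps 1 and 2 above, transposed from Java/guava into a plain JavaScript sketch: a counter plays the role of the Semaphore's permits, and the completion callback "releases" a permit and submits the next query. All names are illustrative; tasks is any array of functions returning promises (e.g. wrappers around executeAsync).

```javascript
// Throttle an array of promise-returning tasks to at most
// maxInFlight concurrent executions (the "permit" pattern).
function runThrottled (tasks, maxInFlight) {
  let inFlight = 0
  let peak = 0          // track the observed maximum concurrency
  const queue = tasks.slice()
  const results = []

  return new Promise((resolve, reject) => {
    function submitNext () {
      if (queue.length === 0 && inFlight === 0) {
        return resolve({ results, peak })
      }
      // "acquire" permits while any are free and work remains
      while (inFlight < maxInFlight && queue.length > 0) {
        const task = queue.shift()
        inFlight++
        peak = Math.max(peak, inFlight)
        task().then(result => {
          results.push(result)
          inFlight--   // "release" the permit on completion...
          submitNext() // ...and let the next query go out
        }, reject)
      }
    }
    submitNext()
  })
}

// Example: 20 fake "inserts" throttled to 4 in flight
const fakeInsert = (i) => () => new Promise(r => setTimeout(() => r(i), 5))
runThrottled(Array.from({ length: 20 }, (_, i) => fakeInsert(i)), 4)
  .then(({ results, peak }) => console.log(results.length, peak))  // 20 4
```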
Besides the above answer,
It looks like execute() calls executeAsync(statement).getUninterruptibly(), so whether you manage your own n-thread pool using execute() and block each thread until its execution completes (up to a max of n running threads), or call executeAsync() on all records, Cassandra-side performance should be roughly the same, depending on execution time/count + timeouts.
The executions all run on connections borrowed from a pool. Each execution has a streamId on the client side, and the future notifies you when the response for that streamId comes back. Throughput is limited by the total requests per connection on the client side and by the read threads on each node picked to execute your request; anything above that is buffered in a queue (not blocked), bounded by the connection's maxQueueSize and maxRequestsPerConnection, and anything beyond that should fail. The beauty of this is that executeAsync() does not run on a new thread per request/execution.
So there has to be a limit on how many requests can be in flight, whether via execute() or executeAsync(); with execute() you naturally stay within these limits.
Performance-wise, you will start seeing a penalty beyond what each node can handle, so execute() with a well-sized pool makes sense to me. Even better, use a reactive architecture to avoid creating many threads that do nothing but wait, since a large number of threads causes wasted context switching on the client side. For a smaller number of requests, executeAsync() is better, as it avoids thread pools.
// Internally (driver source), executeAsync roughly does:
DefaultResultSetFuture future = new DefaultResultSetFuture(..., makeRequestMessage(statement, null));
new RequestHandler(this, future, statement).sendRequest();

a synchronization issue between requests in express/node.js

I've come up with a tricky synchronization issue in node.js, for which I've not been able to find an elegant solution:
I set up an express/node.js web app for retrieving statistics from a one-row database table.
If the table is empty, populate it by a long calculation task
If the record in table is older than 15 minutes from now, update it by a long calculation task
Otherwise, respond with a web page showing the record in DB.
The problem is, when multiple users issue requests simultaneously and the record is old, the long calculation task is executed once per request instead of just once.
Is there any elegant way that only one request triggers the calculation task, and all others wait for the updated DB record?
Yes, it is called locks.
Put an additional column in your table, say lock, of timestamp type. Once a process starts working with the record, put now + timeout into it (as a rule of thumb I choose the timeout to be 2x the average processing time). When the process stops processing, set that column back to NULL.
At the beginning of processing, check that column. If lock > now, another process is still working on the record, so return some status code to the client, such as 409 Conflict (don't force the client to wait; that's a bad user experience, since they don't know what's going on unless processing time is really short). Otherwise, start processing (ideally in a separate thread/process so the user won't have to wait; respond with an appropriate status code like 202 Accepted).
The now + timeout value is needed in case your processing process crashes (so we avoid deadlocks). Also remember that you have to "check and set" this lock column in a transaction, because of race conditions (which might be quite difficult if you are working with MongoDB-like databases).
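The check-and-set described above, sketched against a hypothetical in-memory record standing in for the DB row (a real implementation must do the check and the update atomically, e.g. inside a transaction or a conditional UPDATE):

```javascript
// Timestamp-based lock on a single record (all names are illustrative).
const LOCK_TIMEOUT_MS = 2 * 60 * 1000  // rule of thumb: ~2x avg processing time

const row = { stats: null, updatedAt: 0, lock: null }

// Returns true if we acquired the lock and may start processing;
// false means another process holds it -> respond 409 Conflict.
function tryAcquireLock (now = Date.now()) {
  if (row.lock !== null && row.lock > now) {
    return false
  }
  row.lock = now + LOCK_TIMEOUT_MS  // also re-acquires an expired (crashed) lock
  return true
}

function releaseLock () {
  row.lock = null
}

// First request wins the lock, a concurrent one is told to back off:
console.log(tryAcquireLock())  // true
console.log(tryAcquireLock())  // false
releaseLock()
console.log(tryAcquireLock())  // true
```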

Single threaded and Event Loop in Node.js

First of all, I am a beginner trying to understand what Node.js is. I have two questions.
First Question
Felix's article says "there can only be one callback firing at the same time. Until that callback has finished executing, all other callbacks have to wait in line".
Then, consider about the following code (copied from nodejs official website)
var http = require('http');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(8124, "127.0.0.1");
If two client requests are received simultaneously, it means the following workflow:
The first http request event is received, then the second request event is received.
As soon as the first event is received, its callback function starts executing.
Meanwhile, the callback function for the second event has to wait.
Am I right? If so, how does Node.js cope when there are thousands of client requests within a very short time?
Second Question
The term "Event Loop" is mostly used in Node.js topic. I have understood "Event Loop" as the following from http://www.wisegeek.com/what-is-an-event-loop.htm;
An event loop - or main loop, is a construct within programs that
controls and dispatches events following an initial event.
The initial event can be anything, including pushing a button on a
keyboard or clicking a button on a program (in Node.js, I think the
initial events will be http request, db queries or I/O file access).
This is called a loop, not because the event circles and happens
continuously, but because the loop prepares for an event, checks the
event, dispatches an event and repeats the process all over again.
I have a conflict with the second paragraph, especially the phrase "repeats the process all over again". I accept that the http.createServer code from the question above is an "event loop", because it repeatedly listens for http request events.
But I don't know whether to classify the following code as event-driven or as an event loop. It does not repeat anything; the callback function simply fires after the db query finishes.
database.query("SELECT * FROM table", function (rows) {
  var result = rows;
});
Please, let me hear your opinions and answers.
Answer one: your logic is correct. The second event will wait, and it will execute when its turn in the callback queue comes.
Also, remember that there is no such thing as "simultaneously" in the technical world. Everything has a very specific place and time.
The way node.js manages thousands of connections is that there is no need to keep a thread idling while some database call is blocking the logic, or while another IO operation is in progress (streams, for example). It can "serve" the first request, possibly creating more callbacks, and proceed to the others.
Because there is no way to block execution (short of nonsense like while(true) and similar), node becomes extremely efficient at spreading actual resources over the application logic.
Threads are expensive, and a server's thread capacity is directly related to available memory. So most classic web applications suffer simply because RAM is spent on threads that sit idle while a database query (or similar) is in progress. In node, that's not the case.
Still, node lets you create multiple processes (via child_process, or the cluster module built on it), which expands the possibilities even further.
Answer two: there is no "loop" of the kind you might be thinking about in your own code, repeatedly checking for new connections or received data; that is handled by async mechanisms behind the scenes.
So from the application's point of view there is no 'main loop', and from the developer's point of view everything is event-driven (rather than an event loop).
In the case of http.createServer, you bind a callback as the response to requests. All socket operations and IO happen behind the scenes, as do the HTTP handshake and the parsing of headers, queries, parameters, and so on. Once that behind-the-scenes job is done, the data is kept and your callback is pushed onto the event loop with it. When the event loop is free and its turn comes, your callback executes in the node.js application context with the data prepared behind the scenes.
With a database request, same story: the driver prepares and sends the query (possibly async again, internally), and then calls you back once the database responds and the data has been prepared for the application context.
To be honest, all you need with node.js is to understand the concept of events, not their implementation.
And the best way to do it - experiment.
1) Yes, you are right.
It works because everything you do with node is primarily I/O bound.
When a new request (event) comes in, it's put into a queue. At initialization time, Node allocates a thread pool responsible for I/O-bound work, like network/socket calls, database access, etc.; this is what keeps those calls from blocking.
Now, your "callbacks" (or event handlers) are extremely fast, because most of what you are doing is most likely CRUD and I/O operations, not CPU-intensive work.
These callbacks therefore give the feeling of being processed in parallel, but they actually are not: the actual parallel work is done via the thread pool (with multi-threading), while the callbacks per se just receive the results from those threads so that processing can continue and a response can be sent back to the client.
You can easily verify this: if your callbacks are heavy CPU tasks, you can be sure you will not be able to process thousands of requests per second, and it scales down really badly compared to a multi-threaded system.
2) You are right, again.
Unfortunately, due to all these abstractions, you have to dive deeper in order to understand what's going on in the background. However, yes, there is a loop.
In particular, Node.js is implemented on top of libuv, which is interesting to read about.
But I don't know whether to classify the following code as event-driven or as an event loop. It does not repeat anything; the callback function simply fires after the db query finishes.
Event-driven is the term you normally use when there is an event loop, and it means an app driven by events such as click-on-button, data-arrived, etc. Normally you associate a callback with such events.
