node.js C addon queueing by uv_queue_work

I have created a C node.js addon using libuv to make the addon asynchronous.
I have made several queues (event loops) for this.
The code is like this; loopArray is used to store those loops:
// ... variable declarations

void AsyncWork(uv_work_t* req) {
    // ...
}

void AsyncAfter(uv_work_t* req) {
    // ...
}

Handle<Value> RunCallback(const Arguments& args) {
    // ... some preparation work
    int loopNumber = rand() % 10;
    int status = uv_queue_work(loopArray[loopNumber], &baton->request,
                               AsyncWork, AsyncAfter);
    uv_run(loopArray[loopNumber]);
    return Undefined();
}

extern "C" {
    static void Init(Handle<Object> target) {
        for (int i = 0; i < 10; i++) {
            loopArray[i] = uv_loop_new();
        }
        target->Set(String::NewSymbol("callback"),
                    FunctionTemplate::New(RunCallback)->GetFunction());
    }
}

NODE_MODULE(addon, Init)
The problem is that, even though I created 10 queues for the CPU-demanding tasks, node.js does not switch between tasks while processing one of the queues. Is it due to the single-threaded nature of node.js?
If so, would uv_thread_create help the situation?
I cannot find any code sample for it, so I am not sure how to use it.
Thanks!

That is the main idea behind node's architecture: using function call(back)s and a main event loop to run them, instead of using threads to process multiple jobs in parallel.
If what you want to do is process a queue of jobs, the best way to do it is one job at a time. Utilizing multiple CPU cores on a system is done with multiple node instances instead of threads; node provides the child_process and cluster modules for this.
When you create more threads than you have cores, say 10 threads on a system with 8 CPU cores, you kill performance by giving unnecessary work to the operating system's scheduler. This is a very important point to take into account: if you have 8 cores, you should not run more than 8 CPU-bound threads in parallel if you want maximum performance.
With node, we don't try to create multiple queues or threads in one process. Instead, we employ multiple node processes, again at most one process per core.
If you are going to process a queue that is already there, you do not need your C module to be asynchronous at all.
We want asynchronous behavior when jobs come from outside, like HTTP requests on a web server. On a web server, jobs arrive in a way we cannot control: people and other machines connect to our server whenever they want, and we want to answer each of them as quickly as possible. For this, we do not want any request to block the others; we need to handle as many requests as we can in parallel.
If you are running over the rows of a database table or making some calculation over a long list of parameters, however, you are in a very different kind of business. Your job queue is already in front of you, waiting for your management; the jobs are not arriving in a way you have no control over. In this kind of business, the most efficient approach is to run the jobs one after another without any switching between them. Parallelism only pays off when you have multiple cores, and the best practice for employing them in node is to use multiple node processes, as sketched below.
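A minimal sketch of that pattern, using node's built-in cluster module (the port number and per-worker handler here are illustrative assumptions):

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    // Fork at most one worker process per core.
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }
} else {
    // Each worker runs its own event loop and processes one job at a time;
    // incoming connections are balanced between the workers.
    http.createServer(function (req, res) {
        res.end('handled by worker ' + process.pid + '\n');
    }).listen(8000);
}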

Related

What would be the right way to go for my scenario: a thread array, a thread pool, or tasks?

I am working on a small microfinance application that processes financial transactions. The frequency of these transactions is quite high, which is why I am planning to make it a multi-threaded application that can process multiple transactions in parallel.
I have already designed all the workers to be thread safe.
What I need help with is how to manage these threads. Here are some of my options:
1. Make a specified number of thread pool threads at startup and keep them running in an infinite loop, where they keep looking for new transactions and start processing if any are found.
example code:
void Start_Job() {
    for (int l_ThreadId = 0; l_ThreadId < PaymentNoOfWorkerThread; l_ThreadId++)
    {
        ThreadPool.QueueUserWorkItem(Execute, (object)l_TrackingId);
    }
}

void Execute(object l_TrackingId)
{
    while (true)
    {
        var new_txns = Get_New_Txns(); // get new txns if any; returns a queue
        while (new_txns.count > 0) {
            process_txn(new_txns.Dequeue());
        }
        Thread.Sleep(some_time);
    }
}
2. Look for new transactions and assign a thread pool thread to each transaction (my understanding is that these threads are reused for new txns after their execution completes).
example code:
void Start_Job() {
    while (true) {
        var new_txns = Get_New_Txns(); // get new txns if any; returns a queue
        for (int l_ThreadId = 0; l_ThreadId < new_txns.count; l_ThreadId++)
        {
            ThreadPool.QueueUserWorkItem(Execute, (object)new_txns.Dequeue());
        }
        Thread.Sleep(some_time);
    }
}

void Execute(object txn)
{
    process_txn(txn);
}
3. Do the above, but with tasks.
Which option would be most efficient and well suited for my application?
Thanks in advance :)
ThreadPool.QueueUserWorkItem is an older API and you shouldn't be using it directly anymore. Tasks are the way to go, and the thread pool is managed automatically for you.
What suits your application best depends on what happens in process_txn and is subjective, so this is a very generic guideline:
If process_txn is a compute-bound operation, for example it performs only CPU-bound calculations, then look at the Task Parallel Library. It will help you use the CPU cores more efficiently.
If process_txn is less CPU- and more IO-bound, meaning it reads/writes files or a database or connects to some other remote service, then what you should look at is asynchronous programming: make sure your IO operations are all asynchronous, which means your threads are never blocked on IO. This will make your service more scalable. Also, depending on what your queue is, see if you can await on the queue asynchronously, so that none of your application threads are blocked just waiting on the queue.

How can an async approach to a REST API reduce thread count?

Many people say that modern REST APIs should be "async", and as the main argument they say that on some platforms, for example Java, the "blocking" way of doing things produces many threads, while the "async" way allows you to limit the thread count and its overhead.
What I don't understand is how this is achieved.
Consider an app in a framework like vert.x (it actually doesn't matter; you can think of NodeJS as well), and say 1,000,000 concurrent connections for a service which makes a request to a database. The framework allows each request to be processed asynchronously across the long I/O operations, so the database exchange looks syntactically asynchronous in the business-logic code. BUT. As I understand it, the DB request is not made in a vacuum: it is processed in some other thread, and that thread actually blocks until the DB request is finished. So despite the request's business logic looking async and non-blocking, the long operations called from that logic are actually blocking somewhere under the hood of the framework, and the more such operations are in flight, the more threads are consumed anyway (for NodeJS, think of the threads created in the C++ code of the framework itself).
So as I see the big picture: in the async approach there is one thread that processes all the requests, which is fine, but there is a bunch of threads doing the actual I/O work in the background anyway, and if their count is not limited, the number of threads will be the same as in the blocking approach, plus one. On the other hand, if you limit the number of background threads programmatically, then what are the benefits compared to the blocking approach, which combines a queue for user requests with a limit on the number of request-processing threads?
Since you're asking a fairly low-level question, I'll answer with a low-level answer. I hope you're comfortable with C.
First, a disclaimer: I'll be talking mostly about networking code, because the only widely used database I know of that uses file I/O is SQLite. Since you're asking about Postgres, I can assume you're interested in how socket I/O (be it a TCP socket or a Unix local socket) can work with only one thread.
At the core of almost all async systems and libraries is a piece of code that looks like this:
while (1)
{
    read_fd_set = active_fd_set;

    // This blocks until we receive a packet or until the timeout expires:
    select(FD_SETSIZE, &read_fd_set, NULL, NULL, timeout);

    // Process timed events:
    timeout = process_timeout();

    // Process I/O:
    for (i = 0; i < FD_SETSIZE; ++i) {
        if (FD_ISSET(i, &read_fd_set)) {
            if (i == sock) {
                /* Connection arriving on the listening socket */
                int new;
                size = sizeof(clientname);
                new = accept(sock, (struct sockaddr *) &clientname, &size);
                FD_SET(new, &active_fd_set);
            }
            else {
                /* Data arriving on an already-connected socket */
                if (read_from_client(i) < 0) {
                    close(i);
                    FD_CLR(i, &active_fd_set);
                }
            }
        }
    }
}
(code example paraphrased from a GNU socket programming example)
As you can see, the code above uses no threading whatsoever. Yet it can handle many connections simultaneously. If you take a look at the for loop, it is also obvious that it is basically a simple state machine that processes sockets one at a time if they have any packets waiting to be read (if not, they are skipped by the if (FD_ISSET...) check).
Non-I/O events can logically only come from timed events. And that's where the timeout management (details not shown for clarity) comes in. All I/O related stuff (basically almost all your async code) gets called back from the read_from_client() function (again, details omitted for clarity).
There is zero code running in parallel.
Where does the parallelization come from?
Basically from the server you're connecting to. Most databases support some form of parallelism. Some support multithreading. Some even support node.js- or vert.x-style parallelism by supporting asynchronous disk I/O (like Postgres). Some database configurations allow a higher level of parallelism by storing data on more than one server via partitioning and/or sharding and/or master/slave servers.
That's where the big parallelism comes from: parallel computing. Most databases have very strong support for read parallelism but weaker support for write parallelism (master/slave setups, for example, allow you to write only to the master database). But this is still a big win, because most apps read more data than they write.
Where does disk parallelism come from?
The hardware. Mostly this has to do with DMA, which can transfer data without the CPU. DMA is not one thing; it is more like a concept. Different systems, like the PCI bus, SATA, USB, and even the CPU's RAM bus itself, have various kinds of DMA to transfer data directly to RAM (and in the case of RAM, to transfer data higher up into the various levels of CPU cache) or to a faster buffer.
While waiting for the DMA to complete, the CPU is not doing anything. And while it is doing nothing, if there happens to be a network packet coming in or a setTimeout() expiring, the code that handles them can be executed on the CPU, all while a file is being read into RAM.
But Node.js docs keep mentioning I/O threads
Only for disk I/O. It's not impossible to do async disk I/O with a single thread. Tcl has done that for years, and many other programming languages and frameworks have too. It's just very, very messy, since BSD does it differently from Linux, which does it differently from Windows, and even OSX may be subtly different from BSD even though it is derived from it, etc.
For the sake of simplicity and solid reliability, node developers have opted to process disk I/O in separate threads.
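Those threads come from libuv's thread pool, which also backs a few CPU-heavy library calls. Here is a minimal sketch you can run to observe the pool; crypto.pbkdf2 is used only because it is scheduled on the same pool as disk I/O (the default pool size of 4 and the UV_THREADPOOL_SIZE environment variable are documented libuv/node behavior):

var crypto = require('crypto');
var start = Date.now();

// Queue 8 jobs on a pool that has 4 threads by default:
// they complete in two waves of 4, roughly equal in duration.
for (var i = 0; i < 8; i++) {
    crypto.pbkdf2('secret', 'salt', 1000000, 64, 'sha512', function() {
        console.log('job done after ' + (Date.now() - start) + ' ms');
    });
}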
Note that even for socket I/O it is not as simple as the code example I gave above. Since select() has some limitations (for example, you're forced to loop over ALL sockets to check for incoming data, even though most won't have any), people have come up with better APIs. And obviously different OSes do it differently. That is why there are a lot of libraries created to handle cross-platform event processing, like libevent and libuv (the one node.js uses).
OK. But postgres still runs on my PC
Asynchronous, event-oriented systems do not automagically give you performance superpowers. What they DO give you is choice: the app server is blazing fast, so where you put your database servers and what database you use is up to you.
OK. But I can do this with threads. Why async?
Benchmarks.
Since 1999, many people have run many benchmarks, and in the majority of cases single-threaded (or low-thread-count), event-oriented systems have outperformed simple multithreaded systems. That was especially true in the old days of single-CPU, single-core servers, and it is still partly true now (since cores are still limited).
That is why Apache was rewritten into Apache2 to use a thread pool of async listeners, and why Nginx was written from scratch to use a thread pool of async code.
Yes, on modern servers you'd ideally still want some threads in order to use all your CPUs. The alternative is a process pool, like the one the cluster module provides in node.js. But you'd want the number of threads/processes to be constant, or as constant as possible, to avoid the overhead of context switching and thread creation.
This is true for some async frameworks, where the JDBC client is still synchronous.
When querying the DB in Vert.x, you reuse the same application threads.
Please see the following example:
@Test
public void testMultipleThreads() throws InterruptedException {
    Vertx vertx = Vertx.vertx();
    System.out.println("Before starting server: " + Thread.activeCount());

    // Start server
    vertx.createHttpServer()
        .requestHandler(httpServerRequest -> {
            // System.out.println("Request");
            httpServerRequest.response().end();
        })
        .listen(8080, o -> {
            System.out.println("Server ready");
        });

    // Start counting threads
    vertx.setPeriodic(500, (o) -> {
        System.out.println(Thread.activeCount());
    });

    // Create requests
    HttpClient client = vertx.createHttpClient();
    int loops = 1_000_000;
    CountDownLatch latch = new CountDownLatch(loops);
    for (int i = 0; i < loops; i++) {
        client.getNow(8080, "localhost", "/", httpClientResponse -> {
            // System.out.println("Response received");
            latch.countDown();
        });
    }
    latch.await();
}
You'll notice that the number of threads doesn't change, no matter how many connections you serve. You can also add the Vert.x JDBC client to test it.

Node.js multithreading using threads-a-gogo

I am implementing a REST service for financial calculations, so each request is a CPU-intensive task, and I think that the best place to create threads is in the following function:
exports.execute = function(data, params, f, callback) {
    var queriesList = [];
    var resultList = [];

    for (var i = 0; i < data.lista.length; i++) {
        var query = (function(cod) {
            return function(callbackFlow) {
                params.paramcodneg = cod;
                doCdaQuery(params, function(err, result) {
                    if (err) {
                        return callback({ERROR: err}, null);
                    }
                    f(data, result, function(ret) {
                        resultList.push(ret);
                        callbackFlow();
                    });
                });
            };
        })(data.lista[i]);
        queriesList.push(query);
    }

    flow.parallel(queriesList, function() {
        callback(null, resultList);
    });
};
I don't know which is better: running flow.parallel in a separate thread, or running each function of the queriesList in its own thread. Which is best? And how do I use the threads-a-gogo module for that?
I've tried, but couldn't write the right code for it.
Thanks in advance.
Kleyson Rios.
I'll admit that I'm relatively new to node.js and I haven't yet used threads a gogo, but I have had some experience with multi-threaded programming, so I'll take a crack at answering this question.
Creating a thread for every single query (I'm assuming these queries are CPU-bound calculations rather than IO-bound calls to a database) is not a good idea. Creating and destroying threads is an expensive operation, so creating and destroying a group of threads for every request that requires calculation is going to be a huge drag on performance. Too many threads also cause more overhead as the processor switches between them. There isn't any advantage to having more worker threads than processor cores.
Also, if each query doesn't take that much processing time, there will be more time spent creating and destroying the thread than running the query. Most of the time would be spent on threading overhead. In this case, you would be much better off using a single-threaded solution using flow or async, which distributes the processing over multiple ticks to allow the node.js event loop to run.
Single-threaded solutions are the easiest to understand and debug, but if the queries are preventing the main thread from getting other stuff done, then a multi-threaded solution is necessary.
The multi-threaded solution you propose is pretty good. Running all the queries in a separate thread prevents the main thread from bogging down. However, there isn't any point in using flow or async in this case. These modules simulate multi-threading by distributing the processing over multiple node.js ticks, and tasks run in parallel don't execute in any particular order; the tasks are still running in a single thread. Since you're processing the queries in their own thread, and they're no longer interfering with the node.js event loop, just run them one after another in a loop. Since all the action is happening in a thread without a node.js event loop, using flow or async there just introduces more overhead for no additional benefit.
A more efficient solution is to have a thread pool hanging out in the background and throw tasks at it. The thread pool would ideally have the same number of threads as processor cores, and would be created when the application starts up and destroyed when the application shuts down, so the expensive creating and destroying of threads only happens once. I see that Threads a Gogo has a thread pool that you can use, although I'm afraid I'm not yet familiar enough with it to give you all the details of using it.
I'm drifting into territory I'm not familiar with here, but I believe you could do it by pushing each query individually onto the global thread pool and when all the callbacks have completed, you'll be done.
The Node.flow module would be handy here, not because it would make processing any faster, but because it would help you manage all the query tasks and their callbacks. You would use a loop to push a bunch of parallel tasks on the flow stack using flow.parallel(...), where each task would send a query to the global threadpool using threadpool.any.eval(), and then call ready() in the threadpool callback to tell flow that the task is complete. After the parallel tasks have been queued up, use flow.join() to run all the tasks. That should run the queries on the thread pool, with the thread pool running as many tasks as it can at once, using all the cores and avoiding creating or destroying threads, and all the queries will have been processed.
Other requests would also be tossing their tasks onto the thread pool as well, but you wouldn't notice that because the request being processed would only get callbacks for the tasks that the request gave to the thread pool. Note that this would all be done on the main thread. The thread pool would do all the non-main-thread processing.
You'll need to do some threads a gogo and node.flow documentation reading and figure out some of the details, but that should give you a head start. Using a separate thread is more complex than using the main thread, and making use of a thread pool is even more complex, so you'll have to choose which one is best for you. The extra complexity might or might not be worth it.
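For readers without Threads a Gogo at hand, here is a minimal sketch of the same thread-pool idea using node's built-in worker_threads module instead (a later addition to node; the inline worker body and the task numbers are illustrative stand-ins for the actual query logic):

const { Worker } = require('worker_threads');
const os = require('os');

// Inline worker: receives a number and does some CPU-bound stand-in work.
const workerSource = `
    const { parentPort } = require('worker_threads');
    parentPort.on('message', (n) => {
        let acc = 0;
        for (let i = 0; i < n; i++) acc += Math.sqrt(i);
        parentPort.postMessage(acc);
    });
`;

// A fixed pool, one worker per core; each worker takes the next task
// off the queue as soon as it finishes the previous one.
function runPool(poolSize, tasks, done) {
    const results = [];
    let next = 0, finished = 0;
    for (let w = 0; w < poolSize; w++) {
        const worker = new Worker(workerSource, { eval: true });
        const feed = () => {
            if (next >= tasks.length) { worker.unref(); return; }
            const idx = next++;
            worker.once('message', (res) => {
                results[idx] = res;
                finished++;
                feed();
                if (finished === tasks.length) done(results);
            });
            worker.postMessage(tasks[idx]);
        };
        feed();
    }
}

runPool(os.cpus().length, [1e7, 2e7, 3e7, 4e7, 5e7], (results) => {
    console.log('all queries done:', results);
});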

Coldfusion limit to total number of threads

I've got some code that tries to create 100 threaded HTTP calls. It seems to be getting capped at about 40.
When I do threadJoin, I'm only getting 38-40 sets of results from my HTTP calls, despite the loop running from 1 to 100.
// thread http calls
pages = 100;
for (page = 1; page <= pages; page++) {
    thread name="req#page#" {
        grabber.setURL('http://site.com/search.htm');
        // request headers
        grabber.addParam(type="url", name="page", value="#page#");
        results = grabber.send().getPrefix();
        arrayAppend(VARIABLES.arrResults, results.fileContent);
    }
}

// rejoin threads
for (page = 1; page <= pages; page++) {
    threadJoin('req#page#', 10000);
}
Is there a limit to the number of threads that CF can create? Is it to do with Java running in the background? Or can it not handle that many HTTP requests?
Or is there just a much better way for me to do this than threaded HTTP calls?
The result you're seeing is likely because your variables aren't thread safe.
grabber.addParam(type="url",name="page",value="#page#");
That line accesses Variables.Page, which is shared by all of the spawned threads. Because threads start at different times, the value of page is often different from the value you think it is; this leads to multiple threads having the same value for page.
Instead, if you pass page as an attribute to the thread, then each thread will have its own version of the variable, and you will end up with 100 unique values. (1-100).
Additionally, you're writing to a shared variable:
arrayAppend(VARIABLES.arrResults, results.fileContent);
ArrayAppend is not thread safe and you will be overwriting versions of VARIABLES.arrResults with other versions of itself, instead of appending each bit.
You want to set the result to a thread variable, and then access that once the joins are complete.
thread name="req#page#" page=Variables.page {
    grabber.setURL('http://site.com/search.htm');
    // request headers
    grabber.addParam(type="url", name="page", value="#Attributes.page#");
    results = grabber.send().getPrefix();
    thread.Result = results.fileContent;
}
And the join:
// rejoin threads
for (page = 1; page <= pages; page++) {
    threadJoin('req#page#', 10000);
    arrayAppend(VARIABLES.arrResults, CFThread['req#page#'].Result);
}
In the ColdFusion Administrator, there's a setting for how many threads will run concurrently; mine defaults to 10. The rest apparently are queued. As Phantom42 mentions, you can raise the number of running CF threads; however, with 100 or more threads you may run into other problems.
In a 32-bit process, your whole process can only use 2GB of memory. Each thread uses up an amount of stack memory, which isn't part of the heap. We've had problems with running out of memory with high numbers of threads, as your Java binary + heap + non-heap (PermGen) + (threads * 512k) can easily go over the 2GB limit.
You'd also have to allow enough threads to handle your code above as well as other requests coming into your app, which may bog down the app as a whole.
I would suggest changing your code to create N threads, each of which does more than one request. It's more work, but it breaks the N requests = N threads problem. There are a couple of approaches you can take:
If you think that each request is going to take roughly the same time, you can split up the work and give each thread a portion to work on before you start each one up.
Or, each thread picks a URL off a list and processes it, and you then join all N threads. You'd need to put locking around whatever counter you use to track progress, though.
Check your Maximum number of running JRun threads setting in ColdFusion Administrator under the Request Tuning tab. The default is 50.

What happens when a single request takes a long time with these non-blocking I/O servers?

With Node.js, or eventlet, or any other non-blocking server, what happens when a given request takes a long time? Does it then block all other requests?
Example: a request comes in and takes 200ms to compute. This will block other requests, since e.g. nodejs uses a single thread.
Meaning your 15K requests per second will go down substantially because of the actual time it takes to compute the response for a given request.
But this just seems wrong to me, so I'm asking what really happens, as I can't imagine that is how things work.
Whether or not it "blocks" is dependent on your definition of "block". Typically block means that your CPU is essentially idle, but the current thread isn't able to do anything with it because it is waiting for I/O or the like. That sort of thing doesn't tend to happen in node.js unless you use the non-recommended synchronous I/O functions. Instead, functions return quickly, and when the I/O task they started complete, your callback gets called and you take it from there. In the interim, other requests can be processed.
If you are doing something computation-heavy in node, nothing else is going to be able to use the CPU until it is done, but for a very different reason: the CPU is actually busy. Typically this is not what people mean when they say "blocking", instead, it's just a long computation.
200ms is a long time for something to take if it doesn't involve I/O and is purely doing computation. That's probably not the sort of thing you should be doing in node, to be honest. A solution more in the spirit of node would be to have that sort of number crunching happen in another (non-javascript) program that is called by node, and that calls your callback when complete. Assuming you have a multi-core machine (or the other program is running on a different machine), node can continue to respond to requests while the other program crunches away.
There are cases where a cluster (as others have mentioned) might help, but I doubt yours is really one of those. Clusters really are made for when you have lots and lots of little requests that together are more than a single core of the CPU can handle, not for the case where you have single requests that take hundreds of milliseconds each.
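A minimal sketch of that idea using node's child_process module; ./crunch stands in for a hypothetical external number-crunching program that takes its input from argv and prints its result to stdout:

var http = require('http');
var execFile = require('child_process').execFile;

var handler = function(req, res) {
    // Hand the heavy computation to an external program; node's event
    // loop stays free to serve other requests while it runs.
    execFile('./crunch', [req.url], function(err, stdout) {
        if (err) {
            res.statusCode = 500;
            res.end('crunch failed\n');
            return;
        }
        res.end(stdout);
    });
};

http.createServer(handler).listen(8000);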
Everything in node.js runs in parallel internally. However, your own code runs strictly serially. If you sleep for a second in node.js, the server sleeps for a second. It's not suitable for requests that require a lot of computation. I/O is parallel, and your code does I/O through callbacks (so your code is not running while waiting for the I/O).
On most modern platforms, node.js does use threads for I/O. It uses libuv, which uses threads where that works best on the platform.
You are exactly correct. Node.js developers must be aware of that, or their applications will perform terribly when long-running code is not asynchronous.
Everything that is going to take a 'long time' needs to be done asynchronously.
This is basically true, at least if you don't use the new cluster feature that balances incoming connections between multiple, automatically spawned workers. However, if you do use it, most other requests will still complete quickly.
Edit: Workers are processes.
You can think of the event loop as 10 people waiting in line to pay their bills. If somebody is taking too much time to pay his bill (thus blocking the event loop), the other people will just have to hang around waiting for their turn to come.. and waiting...
In other words:
Since the event loop is running on a single thread, it is very important that we do not block its execution by doing heavy computations in callback functions or synchronous I/O. Going over a large collection of values/objects or performing time-consuming computations in a callback function prevents the event loop from further processing other events in the queue.
Here is some code to actually see the blocking / non-blocking in action:
With this example (a long CPU-computing task, no I/O):
var http = require('http');

var handler = function(req, res) {
    console.log('hello');
    for (var i = 0; i < 10000000000; i++) { var a = i + 5; }
    res.end('done');
};

http.createServer(handler).listen(80);
If you do 2 requests in the browser, only a single hello will be displayed in the server console until the first loop finishes, meaning that the second request cannot be processed, because the first one blocks the Node.js thread.
If we do an I/O task instead (writing 2 GB of data to disk; it took a few seconds during my test, even on an SSD):
var http = require('http');
var fs = require('fs');

var buffer = Buffer.alloc(2 * 1000 * 1000 * 1000);
var first = true;
var done = false;

var write = function() {
    fs.writeFile('big.bin', buffer, function() { done = true; });
};

var handler = function(req, res) {
    if (first) {
        first = false;
        res.end('Starting write..');
        write();
        return;
    }
    if (done) {
        res.end('write done.');
    } else {
        res.end('writing ongoing.');
    }
};

http.createServer(handler).listen(80);
Here we can see that the few-second-long I/O write is non-blocking: if you do other requests in the meantime, you will see "writing ongoing." in response! This confirms the well-known non-blocking-I/O behavior of Node.js.
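As a complement, one common way to keep the event loop responsive during a CPU-bound task like the first example is to split the work into chunks and yield between them. A hedged sketch (the chunk size is an arbitrary illustrative choice):

var http = require('http');

var handler = function(req, res) {
    var i = 0, total = 10000000000, chunk = 10000000;
    var step = function() {
        var end = Math.min(i + chunk, total);
        for (; i < end; i++) { var a = i + 5; }
        if (i < total) {
            setImmediate(step); // yield so other requests can be served
        } else {
            res.end('done');
        }
    };
    step();
};

http.createServer(handler).listen(80);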
