Child event order in Node.js

I have an API whose request cycle works like this:
do some logic, using 1 second of CPU time
wait for network I/O, which also takes 1 second.
So normally this API needs about 2 seconds to respond.
Then I ran a test.
I started 10 requests at the same time.
EVERY ONE OF THEM took more than 10 seconds to respond.
This test shows that
Node finishes the CPU-costly part of all 10 requests first.
WHY?
Why doesn't it respond to a request immediately after that request's I/O is done?
Thanks for the comments. I think I need to explain my concern.
My concern is what happens when the request count is not 10 but 100 simultaneous requests:
all of them will time out!
If Node responded to the child I/O events immediately, I think at least 20% of them would not time out.
It seems to me Node needs some event-priority mechanism.
router.use('/test/:id', function (req, res) {
    var id = req.param('id');
    console.log('start cpu code for ' + id);
    for (var x = 0; x < 10000; x++) {
        for (var x2 = 0; x2 < 30000; x2++) {
            x2 -= 1;
            x2 += 1;
        }
    }
    console.log('cpu code over for ' + id);
    request('http://terranotifier.duapp.com/wait3sec/' + id, function (a, b, data) {
        // how can I make this code run immediately after the server responds to me?
        console.log('IO over for ' + data);
        res.send('over');
    });
});

Node.js is single-threaded, so as long as a long-running routine is executing it cannot process any other code. The offending piece of code here is your double for loop, which takes up a lot of CPU time.
To understand what you're seeing, let me first explain how the event loop works.
Node.js's event loop evolved out of JavaScript's event loop, which evolved out of the web browser's event loop. The browser event loop was originally implemented not for JavaScript but to allow progressive rendering of images. The event loop looks a bit like this:
,-> is there anything from the network?
| | |
| no yes
| | |
| | '-----------> read network data
| V |
| does the DOM need updating? <-------------'
| | |
| no yes
| | |
| | v
| | update the DOM
| | |
'------'--------------'
When JavaScript was added, script processing was simply inserted into the event loop:
,-> is there anything from the network?
| | |
| no yes
| | |
| | '-----------> read network data
| V |
| any javascript to run? <------------------'
| | |
| no yes
| | '-----------> run javascript
| V |
| does the DOM need updating? <-------------'
| | |
| no yes
| | |
| | v
| | update the DOM
| | |
'------'--------------'
When the JavaScript engine is run outside of the browser, as in Node.js, the DOM-related parts are simply removed and the I/O becomes generalized:
,-> any javascript to run?
| | |
| no yes
| | |
| | '--------> RUN JAVASCRIPT
| V |
| is there any I/O <------------'
| | |
| no yes
| | |
| | v
| | read I/O
| | |
'------'--------------'
Note that all of your JavaScript code is executed in the RUN JAVASCRIPT step.
So, what happens with your code when you make 10 connections?
connection1: node accepts your request, processes the double for loops
connection2: node is still processing the for loops, the request gets queued
connection3: node is still processing the for loops, the request gets queued
(at some point the for loop for connection 1 finishes)
node notices that connection2 is queued so connection2 gets accepted,
process the double for loops
...
connection10: node is still processing the for loops, the request gets queued
(at this point node is still busy processing some other for loop,
probably for connection 7 or something)
request1: node is still processing the for loops, the request gets queued
request2: node is still processing the for loops, the request gets queued
(at some point all the connections' for loops finish)
node notices that response from request1 is queued so request1 gets processed,
console.log gets printed and res.send('over') gets executed.
...
request10: node is busy processing some other request, request10 gets queued
(at some point request10 gets executed)
This is why you see Node taking 10 seconds to answer 10 requests. It's not that the requests themselves are slow, but that their responses are queued behind all the for loops, and the for loops get executed first (because they are already in the current turn of the event loop).
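The queueing described above is easy to reproduce in miniature (a sketch of my own, not part of the original code; setImmediate stands in for a network response that arrives while synchronous code is running):

```javascript
// Miniature version of the behaviour described above: the "I/O" callback is
// queued right away, but it cannot run until the synchronous CPU work has
// finished and control returns to the event loop.
const order = [];

setImmediate(() => order.push('io callback')); // like a network response arriving

// simulate the double for loop
let sum = 0;
for (let i = 0; i < 1e6; i++) sum += i;
order.push('cpu work');

setImmediate(() => {
  console.log(order); // [ 'cpu work', 'io callback' ] - the CPU work always wins
});
```

No matter how quickly the "response" is queued, it only runs once the synchronous code yields.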
To counter this, you should make the for loops asynchronous to give Node a chance to process the event loop. You can write them in C and use C to run an independent thread for each of them. Or you can use one of the thread modules from npm to run JavaScript in separate threads. Or you can use worker-threads, a web-worker-like API implemented for Node.js. Or you can fork a cluster of processes to execute them. Or you can simply slice them up with setTimeout if parallelism is not critical:
router.use('/test/:id', function (req, res) {
    var id = req.param('id');
    console.log('start cpu code for ' + id);
    function async_loop (count, callback, done_callback) {
        if (count) {
            callback();
            // note: done_callback must be passed along on each recursion
            setTimeout(function () { async_loop(count - 1, callback, done_callback); }, 1);
        }
        else if (done_callback) {
            done_callback();
        }
    }
    var outer_loop_done = 0;
    var x1 = 0, x2 = 0;
    async_loop(10000, function () {
        x1++;
        async_loop(30000, function () {
            x2++;
        }, function () {
            if (outer_loop_done) {
                console.log('cpu code over for ' + id);
                request('http://terranotifier.duapp.com/wait3sec/' + id,
                    function (a, b, data) {
                        console.log('IO over for ' + data);
                        res.send('over');
                    }
                );
            }
        });
    }, function () {
        outer_loop_done = 1;
    });
});
The above code will process a response from request() as soon as possible, rather than waiting for all the async_loops to run to completion. It does this without threads (so there is no parallelism), simply by interleaving work on the event queue.
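The same idea can be packaged as a small reusable helper (my own sketch, not from the original answer; it uses setImmediate instead of setTimeout(..., 1) so each chunk is queued for the next event-loop turn without the ~1 ms timer clamp):

```javascript
// Run `total` iterations of `eachFn` in slices of `chunkSize`, yielding to the
// event loop between slices so that queued I/O callbacks can run in between.
function runChunked(total, chunkSize, eachFn, doneFn) {
  let i = 0;
  (function chunk() {
    const end = Math.min(i + chunkSize, total);
    for (; i < end; i++) eachFn(i);      // a bounded slice of CPU work
    if (i < total) setImmediate(chunk);  // yield, continue with the next slice
    else doneFn();                       // all iterations finished
  })();
}

// usage: sum 0..999 in slices of 100
let total = 0;
runChunked(1000, 100, (i) => { total += i; }, () => {
  console.log('sum =', total); // sum = 499500
});
```

With a helper like this the double for loop from the question becomes two nested runChunked calls, and any response from request() can be handled between slices.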

Related

Why does doubling the CPU limit lead to only a 20% time cost improvement?

I use Python 3 to do some encrypted calculation with Microsoft SEAL and am looking for some performance improvement.
I do it by:
create a shared memory to hold the plaintext data
(Use numpy array in shared memory for multiprocessing)
start multiple processes with multiprocessing.Process (there is a param controlling the number of processes, thus limiting the cpu usage)
processes read from shared memory and do some encrypted calculation
wait for calculation ends and join processes
I run this program on a 32U64G x86 Linux server; the CPU model is: Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz.
I notice that if I double the number of processes there is only about 20% time cost improvement.
I've tried three kinds of process nums:
| process nums | 7 | 13 | 27 |
| time ratio | 0.8 | 1 | 1.2 |
Why is this improvement disproportionate to the resources I use (CPU and memory)?
Conceptual explanations or specific Linux command lines are both welcome.
Thanks.
FYI:
My code of sub processes is like:
def sub_process_main(encrypted_bytes, plaintext_array, result_queue):
    # init
    # status_sign
    while shared_int > 0:
        # seal load and some other calculation
        encrypted_matrix_list = seal.Ciphertext.load(encrypted_bytes)
        shared_plaintext_matrix = seal.Encoder.encode(plaintext_array)
        # ... do something
        for some loop:
            time1 = time.time()
            res = []
            for i in range(len(encrypted_matrix_list)):
                enc = seal.evaluator.multiply_plain(encrypted_matrix_list[i], shared_plaintext_matrix[i])
                res.append(enc)
            time2 = time.time()
            print(f'time usage: {time2 - time1}')
        # ... do something
    result_queue.put(final_result)
I actually printed the time for every part of my code, and here is the cost for this part of it.
| process nums | 13 | 27 |
| occurrence | 1791 | 864 |
| total time | 1698.2140 | 1162.8330 |
| average | 0.9482 | 1.3459 |
I've monitored some metrics but I don't know if there are any abnormal ones.
13 cores:
top
pidstat
vmstat
27 cores:
top (Why is this using all cores rather than exactly 27 cores? Does it have anything to do with Hyper-Threading?)
pidstat
vmstat

poll(2) on the read fd of pipe(2) and the fd from inotify_init() results in endless EINTR

Update
Possible bug in Firefox - https://bugzilla.mozilla.org/show_bug.cgi?id=1288293
Old Post
I am writing an inotify file watcher.
My main thread first creates a pipe with pipe(2), then creates a polling thread and sends that thread the read fd of the pipe.
int mypipe[2];
if (pipe(mypipe) == -1) {
    exit(EXIT_FAILURE);
}
int mypipe_fd = mypipe[0];
My polling thread then starts an infinite poll, watching the inotify fd inotify_fd and the pipe's mypipe_fd with poll like this:
int inotify_fd = inotify_init1(0);
if (inotify_fd == -1) {
    perror("inotify_init1");
    exit(EXIT_FAILURE);
}
// IN_ALL_EVENTS = IN_ACCESS | IN_MODIFY | IN_ATTRIB | IN_CLOSE_WRITE | IN_CLOSE_NOWRITE | IN_OPEN | IN_MOVED_FROM | IN_MOVED_TO | IN_CREATE | IN_DELETE | IN_DELETE_SELF | IN_MOVE_SELF
if (inotify_add_watch(inotify_fd, path_to_desktop, IN_ALL_EVENTS) == -1) {
    perror("inotify_add_watch");
    exit(EXIT_FAILURE);
}
int mypipe_fd = BLAH; // this is the read end of the pipe created on the main thread
struct pollfd fds[2];
fds[0].fd = mypipe_fd;
fds[1].fd = inotify_fd;
fds[0].events = POLLIN;
fds[1].events = POLLIN;
if (poll(fds, 2, -1) == -1) {
    exit(errno);
}
This exits because poll endlessly returns -1 with errno 4 (EINTR). The same happens with select, on either of the fds or both.
I did more tests: even a plain read(mypipe_fd, .., ..) or read(inotify_fd, .., ..) continuously gives me EINTR as well. Mind-boggling! Does anyone know what could be causing this? This behavior is seen on Ubuntu 14.01 and openSUSE 42.1 (the ones I tested on).

NodeJS, Promises and performance

My question is about performance in my Node.js app...
If my program runs 12 iterations of 1,250,000 each = 15,000,000 iterations altogether, it takes the dedicated servers at Amazon the following time to process:
r3.large: 2 vCPU, 6.5 ECU, 15 GB memory --> 123 minutes
4.8xlarge: 36 vCPU, 132 ECU, 60 GB memory --> 102 minutes
I have some code similar to the code below...
start();

function start() {
    for (var i = 0; i < 12; i++) {
        function2(); // Iterates over a collection which contains data split up into
                     // date intervals. This function is actually also recursive, since
                     // it runs through the data many times (max 50-100) due to the
                     // different interval sizes...
    }
}

function function2() {
    return new Promise(function (resolve) {
        for (var i = 0; i < 1250000; i++) {
            function3(); // Iterates through all possible combinations and calls
                         // function3 with all given values/combinations
        }
        resolve();
    });
}

function function3() {
    return new Promise(function (resolve) {
        // Makes some calculations based on the given values/combination, and then
        // returns the result to function2, which in the end decides which
        // result/combination was the best...
        resolve();
    });
}
That is about 0.411 milliseconds (411 microseconds) per iteration!
When I look at performance and memory usage in the task bar, the CPU is not running at 100% but more like 50% the entire time.
The memory usage starts very low but keeps growing by gigabytes every minute until the process is done, and the (allocated) memory is only released when I press CTRL+C in the Windows CMD. So it's like the Node.js garbage collection doesn't work optimally, or maybe it's simply the design of my code again...
When I execute the app I use the memory option like this:
node --max-old-space-size="50000" server.js
PLEASE tell me everything you think I can do to make my program FASTER!
Thank you all so much!
It's not that the garbage collector doesn't work optimally - it's that it doesn't work at all, because you don't give it any chance to.
When developing the tco module, which does tail-call optimization in Node, I noticed a strange thing: it seemed to leak memory and I didn't know why. It turned out that it was caused by a few console.log()
calls in various places that I used for testing, to see what was going on - because seeing the result of a recursive call millions of levels deep took some time, and I wanted to see something while it was working.
Your example is pretty similar to that.
Remember that Node is single-threaded. While your computations run, nothing else can - including the GC. Your code is completely synchronous and blocking: even though it generates millions of promises, it is blocking because it never reaches the event loop.
Consider this example:
var a = 0, b = 10000000;
function numbers() {
while (a < b) {
console.log("Number " + a++);
}
}
numbers();
It's pretty simple - you want to print 10 million numbers. But when you run it, it behaves very strangely: it prints numbers up to some point and then stops for several seconds, then keeps going, or maybe starts thrashing if you're using swap, or maybe gives you the error that I got right after seeing Number 8486:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
What's going on here is that the main thread is blocked in a synchronous loop where it keeps creating objects but the GC has no chance to release them.
For such long running tasks you need to divide your work and get into the event loop once in a while.
Here is how you can fix this problem:
var a = 0, b = 10000000;
function numbers() {
var i = 0;
while (a < b && i++ < 100) {
console.log("Number " + a++);
}
if (a < b) setImmediate(numbers);
}
numbers();
It does the same thing - it prints numbers from a to b - but in batches of 100, and then schedules itself to continue at the end of the event loop.
Output of $(which time) -v node numbers1.js 2>&1 | egrep 'Maximum resident|FATAL'
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Maximum resident set size (kbytes): 1495968
It used 1.5GB of memory and crashed.
Output of $(which time) -v node numbers2.js 2>&1 | egrep 'Maximum resident|FATAL'
Maximum resident set size (kbytes): 56404
It used 56MB of memory and finished.
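Since the question is specifically about promises: the same batching idea can also be written with promises and async/await on a current Node version (a sketch of my own, not part of the original answer; tick() wraps setImmediate in a promise so the loop yields to the event loop between batches):

```javascript
// Resolves on the next event-loop turn, letting pending I/O and the GC run.
const tick = () => new Promise((resolve) => setImmediate(resolve));

async function numbers(a, b, batchSize = 100) {
  let processed = 0;
  while (a < b) {
    for (let i = 0; i < batchSize && a < b; i++) {
      processed++;             // stands in for console.log("Number " + a)
      a++;
    }
    await tick();              // yield between batches instead of blocking
  }
  return processed;
}

numbers(0, 10000).then((n) => console.log('processed', n)); // processed 10000
```

The behaviour is the same as the setImmediate version above: work happens in bounded batches, and nothing is held alive across a single giant synchronous loop.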
See also those answers:
How to write non-blocking async function in Express request handler
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js

How many processes and threads will be created?

I have this code and am trying to understand how many processes and threads will be created by it:
pid_t pid;
pid = fork();
if (pid == 0) { /* child process */
    fork();
    thread_create(. . .);
}
fork();
I think it creates 2 threads, from the thread_create inside the if block, and 8 processes? But I'm not sure if that's right.
Actually, there should be 8 threads and 6 processes.
Here are the diagrams to make it clear:
1) after first fork():
|------------------- child of p0 [p1]
---|------------------- parent [p0]
2) after second fork():
|--------------- child of p1 [p2]
|---|--------------- [p1]
---|------------------- [p0]
3) after pthread_create():
----------- thread 1 of p2 [p2t1]
|---/----------- thread 0 of p2 [p2t0]
| ----------- thread 1 of p1 [p1t1]
|---|---/----------- thread 0 of p1 [p1t0]
---|------------------- [p0]
4) after third fork():
|------------ child of p2 [p5]
| ------ [p2t1]
|-|-----/------ [p2t0]
| |---------- child of p1 [p4]
| | ------ [p1t1]
|---|---|---/------ [p1t0]
| |------------ child of p0 [p3]
---|-----|------------ [p0]
Important: remember that the fork(2) call clones just the thread that executed it, so process 4 [p4] has only one thread (the same applies to process 5 [p5]).
One extra process will get created each time fork is called.
On first call to fork, parent process P creates sub-process SP1.
After fork, parent process calls fork again (skipping the if), creating sub-process SP2.
SP1 after fork calls fork inside if, creates sub-sub-process SSP1.
SP1 then spawns a thread.
SP1 leaves the if. and calls fork again, creating sub-sub-process SSP2.
SSP1 spawns a thread.
SSP1 leaves the if, and calls fork, creating sub-sub-sub-process SSSP.
So, processes created: SP1, SP2, SSP1, SSP2, SSSP = 5 processes.
If you count the original process P, there are 6 processes.
Only SP1 and SSP1 spawn threads, so there are 2 threads created. If you count all the main threads of all the processes, there are 7 or 8 threads, depending on whether or not you count the original process P.
Here is an illustration of the processes and threads being created, correlated with the code:
P
pid t pid; |
pid = fork(); +------SP1
if (pid == 0) { | |
fork(); | +---------------SSP1
thread create(...); | |-SP1's thread |-SSP1's thread
} | | |
fork(); +-SP2 +-SSP2 +-SSSP
| | | | | |
shouldn't it be 2 threads and 6 processes?
M
| ↘
M A
| |↘
M A* B*
| | |
| ↘ | ↘ |↘
M C A D B E
where * represents a thread.
Total processes forked = 5
Threads created = 2

Disseminating a token in Alloy

I'm following an example in Daniel Jackson's excellent book (Software Abstractions), specifically the example in which he sets up a token ring in order to elect a leader.
I'm attempting to extend this example (ring election) so that the token, instead of being limited to one process, is passed around to all members within the provided time (and each member is elected exactly once, not multiple times). However, mostly due to my inexperience with Alloy, I'm having trouble figuring out the best way to do this. Initially I thought I could play with some of the operators (changing -'s to +'s), but I don't seem to be quite hitting the nail on the head.
Below is the code from the example. I've marked up a few areas with questions; any and all help is appreciated. I'm using Alloy 4.2.
module chapter6/ringElection1 --- the version up to the top of page 181
open util/ordering[Time] as TO
open util/ordering[Process] as PO
sig Time {}
sig Process {
succ: Process,
toSend: Process -> Time,
elected: set Time
}
// ensure processes are in a ring
fact ring {
all p: Process | Process in p.^succ
}
pred init [t: Time] {
all p: Process | p.toSend.t = p
}
//QUESTION: I'd thought that within this predicate and the following fact, that I could
// change the logic from only having one election at a time to all being elected eventually.
// However, I can't seem to get the logic down for this portion.
pred step [t, t': Time, p: Process] {
let from = p.toSend, to = p.succ.toSend |
some id: from.t {
from.t' = from.t - id
to.t' = to.t + (id - p.succ.prevs)
}
}
fact defineElected {
no elected.first
all t: Time-first | elected.t = {p: Process | p in p.toSend.t - p.toSend.(t.prev)}
}
fact traces {
init [first]
all t: Time-last |
let t' = t.next |
all p: Process |
step [t, t', p] or step [t, t', succ.p] or skip [t, t', p]
}
pred skip [t, t': Time, p: Process] {
p.toSend.t = p.toSend.t'
}
pred show { some elected }
run show for 3 Process, 4 Time
// This generates an instance similar to Fig 6.4
//QUESTION: here I'm attempting to assert that ALL Processes have an election,
// however the 'all' keyword has been deprecated. Is there an appropriate command in
// Alloy 4.2 to take the place of this?
assert OnlyOneElected { all elected.Time }
check OnlyOneElected for 10 Process, 20 Time
This network protocol is precisely about how to elect a single process as the leader, so I don't really understand the meaning of your idea of having "all processes elected eventually".
Instead of all elected.Time, you can equivalently write elected.Time = Process (since the type of elected is Process -> Time). This just says that elected.Time (the set of processes elected at any time step) is exactly the set of all processes, which, obviously, doesn't mean that "only one process is elected", as the name of your assertion suggests.
