How to interpret Node.js profiling results?

I have a simple Node.js application (app A) on Windows that listens on a port; as soon as it receives a request, it posts to another server (app B) and records the response in MongoDB.
App A (single-threaded, no cluster implemented yet) processes around 35 requests per second (measured using locust.io). Below is the profiling information for app A. 97.8% of the time is spent in shared libraries, of which 93.5% is due to ntdll.dll. Is this normal, or a potential bottleneck that can be fixed?
[Summary]:
ticks total nonlib name
6023 2.0% 87.8% JavaScript
0 0.0% 0.0% C++
502 0.2% 7.3% GC
300434 97.8% Shared libraries
834 0.3% Unaccounted
[Shared libraries]:
ticks total nonlib name
287209 93.5% C:\windows\SYSTEM32\ntdll.dll
12907 4.2% C:\Program Files\nodejs\node.exe
144 0.0% C:\windows\system32\KERNELBASE.dll
133 0.0% C:\windows\system32\KERNEL32.DLL
25 0.0% C:\windows\system32\WS2_32.dll
15 0.0% C:\windows\system32\mswsock.dll
1 0.0% C:\windows\SYSTEM32\WINMM.dll
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
287209 93.5% C:\windows\SYSTEM32\ntdll.dll
6705 2.3% C:\Program Files\nodejs\node.exe
831 12.4% LazyCompile: <anonymous> C:\opt\acuity\proxy\nodejs\node_modules\mongoose\node_modules\mongodb-core\lib\topologies\server.js:786:54
826 99.4% LazyCompile: *Callbacks.emit
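For context, here is a minimal sketch of the kind of pass-through app described above (hypothetical: the actual code is not shown in the question, so the host name, port, and schema below are made up):

// Hypothetical sketch of "app A": accept a request, forward it to app B,
// record app B's response in MongoDB, then reply to the caller.
const http = require('http');
const mongoose = require('mongoose');

mongoose.connect('mongodb://localhost/proxy'); // assumed connection string

// Assumed schema; the real one is not shown in the question.
const Response = mongoose.model('Response', new mongoose.Schema({ body: String }));

http.createServer((req, res) => {
  let payload = '';
  req.on('data', (chunk) => { payload += chunk; });
  req.on('end', () => {
    // Forward the payload to app B (host/port are placeholders).
    const fwd = http.request(
      { host: 'app-b', port: 8080, method: 'POST', path: '/' },
      (bRes) => {
        let body = '';
        bRes.on('data', (chunk) => { body += chunk; });
        bRes.on('end', () => {
          // Record app B's response, then answer the original caller.
          Response.create({ body: body }, () => res.end(body));
        });
      }
    );
    fwd.end(payload);
  });
}).listen(3000);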

In a typical application (a mixture of CPU-bound and I/O-bound workloads), I would say the more the process is able to run within the CPU slots allotted to it, the better; that way we would see more CPU consumption in user space as opposed to kernel space.
In Node.js, I/O is deferred until it is ready to be actioned, and when it becomes ready we can see increased activity in OS space. If the massive usage of ntdll.dll is there to carry out I/O, I would say it is not a concern, but rather an indication of a well-performing system.
Do you have a heavy (bottom-up) profile showing the split-up within ntdll.dll? If the ticks point to Win32 APIs / helper functions that facilitate I/O, then I would say that is a good sign for your system.
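A note that may help interpret this on Windows: while the libuv event loop is waiting for completed I/O it parks inside the kernel (the GetQueuedCompletionStatus wait ends up in ntdll.dll), so a mostly idle or I/O-bound process tends to accumulate exactly this kind of ntdll.dll-heavy profile. One practical way to tell idle waiting from a real bottleneck is to measure event-loop lag while the load test runs. A minimal sketch (the 200 ms interval and the 20 ms threshold are arbitrary choices, not part of the original question):

// Rough event-loop lag sampler: if JavaScript work or GC blocks the loop,
// the timer fires late and the measured lag grows; if lag stays near zero
// while ntdll.dll ticks are high, the process is mostly just waiting on I/O.
const INTERVAL_MS = 200;
let last = process.hrtime();
setInterval(() => {
  const [s, ns] = process.hrtime(last);
  last = process.hrtime();
  const lagMs = (s * 1e3 + ns / 1e6) - INTERVAL_MS;
  if (lagMs > 20) {
    console.log('event loop lag ~' + lagMs.toFixed(1) + ' ms');
  }
}, INTERVAL_MS);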

Related

Nodejs profiling, is epoll_pwait affecting the performance?

I am developing a Node.js application that expects MIDI input and sends MIDI output.
In order to measure and improve the performance of the application, following this guide, I have extracted the CPU usage profile while using the application.
This is an extract of the data obtained:
[Summary]:
ticks total nonlib name
495 1.7% 2.0% JavaScript
24379 85.3% 96.9% C++
50 0.2% 0.2% GC
3430 12.0% Shared libraries
272 1.0% Unaccounted
Now, the part that I find suspicious is the following:
ticks parent name
24080 84.3% epoll_pwait
Apparently a big percentage of the ticks belongs to the same function.
According to this documentation:
Events are received from the event queue (e.g. kernel) via the event
provider (e.g. epoll_wait)
So, from my point of view, the event-loop thread uses that function to poll for events while idle. That would mean that a high percentage of ticks in epoll_pwait indicates the event-loop thread is rarely blocked, and that would be good for performance.
Using the top command I can see that the CPU usage of the application is low (approx. 3%).
The question is, are epoll_pwait calls affecting performance? If so, can I improve this somehow?
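One way to sanity-check this reasoning from inside the process is to compare CPU time actually consumed with wall-clock time; if the loop spends most of its life parked in epoll_pwait, CPU time grows far more slowly than wall-clock time. A sketch (process.cpuUsage() is available from Node 6.1 onwards; the 5-second reporting interval is arbitrary):

// Periodically report how busy the process really is.
const startCpu = process.cpuUsage();
const startTime = Date.now();
setInterval(() => {
  const cpu = process.cpuUsage(startCpu);        // { user, system } in microseconds
  const wallMs = Date.now() - startTime;
  const busyPct = ((cpu.user + cpu.system) / 1000 / wallMs) * 100;
  console.log('CPU busy: ' + busyPct.toFixed(1) + '%');
}, 5000);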

nodejs high cpu usage syscall

I have a Node app that makes excessive use of the CPU.
I have used the --prof option to profile the cause.
The profiler indicates that:
JavaScript uses 20% of all ticks
C++ uses 67% of all ticks
GC uses 8% of all ticks
[C++]: displays 39% syscall
Diving into the C++ entry:
32.9% v8::internal::Runtime_SetProperty(int, v8::internal::Object**,v8::internal::Isolate*)
And 6.4% handleApiCall
I can't paste the whole log here, but I was wondering how I can understand and identify the root cause of the CPU usage from the log.
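For what it's worth, Runtime_SetProperty is essentially V8's slow path for property stores, so a large share of ticks there usually points at property assignments that keep falling off the inline-cache fast path (for example objects whose shapes keep changing, or objects pushed into dictionary mode by many dynamically named keys). A hedged sketch of the kind of pattern that tends to trigger it, and a common alternative (the loop counts are arbitrary):

// Assigning many dynamically named keys forces the object into slow
// (dictionary) mode, so each store tends to go through the runtime.
var bag = {};
for (var i = 0; i < 1e6; i++) {
  bag['key_' + i] = i;            // slow-path property stores
}

// Using a Map for dynamic keys avoids the hidden-class churn.
var map = new Map();
for (var j = 0; j < 1e6; j++) {
  map.set('key_' + j, j);
}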

GHC per thread GC strategy

I have a Scotty API server which constructs an Elasticsearch query, fetches results from ES, and renders the JSON.
In comparison to other servers like Phoenix and Gin, I'm getting higher CPU utilization and throughput for serving ES responses by using BloodHound, but Gin and Phoenix were magnitudes better than Scotty in memory efficiency.
Stats for Scotty
wrk -t30 -c100 -d30s "http://localhost:3000/filters?apid=1&hfa=true"
Running 30s test @ http://localhost:3000/filters?apid=1&hfa=true
30 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 192.04ms 305.45ms 1.95s 83.06%
Req/Sec 133.42 118.21 1.37k 75.54%
68669 requests in 30.10s, 19.97MB read
Requests/sec: 2281.51
Transfer/sec: 679.28KB
These stats are from my Mac with GHC 7.10.1 installed.
Processor: 2.5 GHz i5
Memory: 8 GB 1600 MHz DDR3
I am quite impressed by the lightweight thread-based concurrency of GHC, but memory efficiency remains a big concern.
Profiling memory usage yielded the following stats:
39,222,354,072 bytes allocated in the heap
277,239,312 bytes copied during GC
522,218,848 bytes maximum residency (14 sample(s))
761,408 bytes maximum slop
1124 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 373 colls, 373 par 2.802s 0.978s 0.0026s 0.0150s
Gen 1 14 colls, 13 par 0.534s 0.166s 0.0119s 0.0253s
Parallel GC work balance: 42.38% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.008s elapsed)
MUT time 31.425s ( 36.161s elapsed)
GC time 3.337s ( 1.144s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.765s ( 37.314s elapsed)
Alloc rate 1,248,117,604 bytes per MUT second
Productivity 90.4% of total user, 84.2% of total elapsed
gc_alloc_block_sync: 27215
whitehole_spin: 0
gen[0].sync: 8919
gen[1].sync: 30902
Phoenix never took more than 150 MB, while Gin used even less memory.
I believe that GHC uses a mark-and-sweep strategy for GC. I also believe it would have been better to use a per-thread incremental GC strategy akin to the Erlang VM, for better memory efficiency.
And, interpreting Don Stewart's answer to a related question, there must be some way to change the GC strategy in GHC.
I also noted that memory usage remained stable and pretty low when the concurrency level was low, so I think memory usage balloons only when concurrency is pretty high.
Any ideas/pointers to solve this issue?
http://community.haskell.org/~simonmar/papers/local-gc.pdf
This paper by Simon Marlow describes per-thread local heaps, and claims that this was implemented in GHC. It's dated 2011. I can't be sure if this is what the current version of GHC actually does (i.e., did this go into the release version of GHC, is it still the current status quo, etc.), but it seems my recollection wasn't completely made up.
I will also point out the section of the GHC manual that explains the settings you can twiddle to adjust the garbage collector:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html#rts-options-gc
In particular, by default GHC uses a 2-space collector, but adding the -c RTS option (passed on the command line as +RTS -c, which may require building with -rtsopts) makes it use a slightly slower 1-space collector, which should eat less RAM. (I'm entirely unclear on which generation(s) this information applies to.)
I get the impression Simon Marlow is the guy who does most of the RTS stuff (including the garbage collector), so if you can find him on IRC, he's the guy to ask if you want the direct truth...

memory usage more than 100%

I am using an ARM processor and a Qt-based GUI application.
There is an issue of a slow process.
Mem: 36272K used, 24692K free, 0K shrd, 188K buff, 19544K cached
CPU: 6.1% usr 1.3% sys 0.0% nic 92.4% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 0.25 0.18 0.07 1/43 553
PID : 512
PPID : 1
USER : root
STAT : S
VSZ : 62368
%MEM : 102.0
CPU : 0
%CPU : 5.5
COMMAND : ./gopaljeearm -qws -nomouse
This is the status when I use the top command.
There is a very nice answer for Android applications which in turn should be applicable to most Linux applications. Quoting...
Note that memory usage on modern operating systems like Linux is an
extremely complicated and difficult to understand area. In fact the
chances of you actually correctly interpreting whatever numbers you
get is extremely low.
You can read the rest of it here.
Another nice read is ELC: How much memory are applications really using? from LWN.

uptime vs. top CPU usage: what should I believe, and why the difference?

I'm having some performance issue on my embedded device:
# uptime
14:59:39 up 5:37, load average: 1.60, 1.50, 1.53
Very bad for a single-core system ... :-p! However, if I check with the top utility, I always see an idle time of around 80%!
Mem: 49020K used, 75960K free, 0K shrd, 0K buff, 21476K cached
CPU: 12.5% usr 4.8% sys 0.0% nic 81.7% idle 0.0% io 0.9% irq 0.0% sirq
Load average: 1.30 1.42 1.51 1/80 18696
After reading some articles, I am more inclined to believe the uptime command. But why this difference? Is my CPU really idle??!
Load is not just a measure of how many processes are in the R state (runnable, could use CPU time); it also counts processes in the D state (uninterruptible sleep, usually waiting for I/O). You likely have a process in the D state which is contributing to load but not using CPU. This command will show you all the current processes which are contributing to load:
ps aux | awk '$8~/[RD]/'
Have a look at that output and see if you have commands in the D state (in the 8th column).
You'd better learn what 'load average' stands for.
In short, it is the number of processes waiting for some resource, and the resource may be the CPU, an HDD, a serial port, ...
The load average seems a little high; that could mean the CPU is busy with things like I/O (disk/network) or thread management (you may have too many threads running).
