I am running a Node.js application that experiences a performance problem under certain loads. I am trying to use the V8 profiler to find out where the problem might be, basically following this guide.
I've generated a log file during the problem load using node --prof app.js, and analyzed it with node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt. This all seems to work, but almost all the ticks appear to be spent in the node executable itself:
[Summary]:
ticks total nonlib name
3887 5.8% 38.2% JavaScript
5590 8.4% 55.0% C++
346 0.5% 3.4% GC
56296 84.7% Shared libraries
689 1.0% Unaccounted
and:
[Shared libraries]:
ticks total nonlib name
55990 84.2% /usr/bin/node
225 0.3% /lib/x86_64-linux-gnu/libc-2.19.so
68 0.1% /lib/x86_64-linux-gnu/libpthread-2.19.so
7 0.0% /lib/x86_64-linux-gnu/libm-2.19.so
4 0.0% [vdso]
2 0.0% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
What does this mean? What is the app spending all its time doing? How can I find the performance problem?
I would suggest trying VTune Amplifier as an alternative to the V8 profiler. I was able to identify and fix the time-consuming spot in my code with it. You can download a free trial version here and just follow these step-by-step instructions. I hope it helps.
Situation
I want to run a script from the terminal that prints the CPU usage percentage per CPU. I wrote this simple script:
import psutil
values = psutil.cpu_percent(percpu=True)
print(f"{'CPU':>3} {'CPU%':>6}")
print(10*'=')
for cpu in range(1, psutil.cpu_count()+1):
print(f"{cpu:<3} {values[cpu-1]:>5}%")
When I run this in a Jupyter Notebook cell I get what I want: the currently used percentage per CPU. Something like this:
CPU CPU%
==========
1 100.0%
2 1.5%
3 1.3%
4 0.8%
5 0.1%
6 0.9%
7 1.0%
8 0.2%
Issue
Now I put the above code in a Python file get_cpu_perc.py and run the script from my terminal:
python3 get_cpu_perc.py
What gets printed to the terminal is always:
CPU CPU%
==========
1 0.0%
2 0.0%
3 0.0%
4 0.0%
5 0.0%
6 0.0%
7 0.0%
8 0.0%
Question
What is happening here? Why does it never show the percentages when I run the script from the terminal?
psutil needs some time to measure your CPU usage: with interval=None, cpu_percent() compares the CPU times elapsed since the previous call (or since the module was imported, for the first call), and that first call returns a meaningless 0.0. In your Jupyter notebook there is more time between the import and the readout, which is why you get real values there.
You can make the call block for a short amount of time so that psutil actually measures your CPU over that interval:
# listen to the cpu for 0.1 seconds before returning the value
values = psutil.cpu_percent(interval=0.1, percpu=True)
Then the values should fill up.
We have built a GraphQL API, with our services written in Node.js using Apollo Server. We are experiencing high CPU usage whenever requests per second reach 20. We profiled with flame graphs and Node's built-in profiler. Attaching the result of the built-in profiler:
[Summary]:
ticks total nonlib name
87809 32.1% 95.8% JavaScript
0 0.0% 0.0% C++
32531 11.9% 35.5% GC
182061 66.5% Shared libraries
3878 1.4% Unaccounted
[Shared libraries]:
ticks total nonlib name
138326 50.5% /usr/bin/node
30023 11.0% /lib/x86_64-linux-gnu/libc-2.27.so
12466 4.6% /lib/x86_64-linux-gnu/libpthread-2.27.so
627 0.2% [vdso]
567 0.2% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
52 0.0% /lib/x86_64-linux-gnu/libm-2.27.so
The flame graph results also corroborate the above: we didn't see any JavaScript function consuming a lot of CPU.
Why is /usr/bin/node consuming so much CPU? Does it have something to do with the way the code has been written, or is this generally the case?
Also, to give a little info about what our GraphQL API does: upon receiving a request, it makes 3 to 5 downstream API calls depending on the request, and doesn't do any CPU-intensive work on its own.
Versions:
Node version: 10.16.3
graphql-modules/core: 0.7.7
apollo-datasource-rest: 0.5.0
apollo-server-express: 2.6.8
Any help is really appreciated here.
I'm interested in profiling my Node.js application.
I've started it with --prof flag, and obtained a v8.log file.
I've taken the windows-tick-processor and obtained a supposedly human readable profiling log.
At the bottom of the question are a few small excerpts from the log file, which I am completely failing to understand.
I get the statistical, tick-based approach. I don't understand what total vs nonlib means.
Also I don't understand why some things are prefixed with LazyCompile, Function, Stub or other terms.
The best answer I could hope for is the complete documentation/guide to the tick-processor output format, completely explaining every term, structure etc...
Barring that, I just don't understand what lazy-compile is. Is it compilation? Doesn't every function get compiled exactly once? Then how can compilation possibly be a significant part of my application execution? The application ran for hours to produce this log, and I'm assuming the internal JavaScript compilation takes milliseconds.
This suggests that lazy-compile is something that doesn't happen once per function, but happens during some kind of code evaluation? Does this mean that everywhere I've got a function definition (for example a nested function), the internal function gets "lazy-compiled" each time?
I couldn't find any information on this anywhere, and I've been googling for DAYS...
Also I realize there are a lot of profiler flags. Additional references on those are also welcome.
[JavaScript]:
ticks total nonlib name
88414 7.9% 20.1% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
22797 2.0% 5.2% LazyCompile: *keys native v8natives.js:333
14524 1.3% 3.3% LazyCompile: Socket._flush C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:365
12896 1.2% 2.9% LazyCompile: BasicSerializeObject native json.js:244
12346 1.1% 2.8% LazyCompile: BasicJSONSerialize native json.js:274
9327 0.8% 2.1% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:194
7606 0.7% 1.7% LazyCompile: *parse native json.js:55
5937 0.5% 1.4% LazyCompile: *split native string.js:554
5138 0.5% 1.2% LazyCompile: *Socket.send C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:346
4862 0.4% 1.1% LazyCompile: *sort native array.js:741
4806 0.4% 1.1% LazyCompile: _.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4481 0.4% 1.0% LazyCompile: ~_.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4296 0.4% 1.0% LazyCompile: stringify native json.js:308
3796 0.3% 0.9% LazyCompile: ~b native v8natives.js:1582
3694 0.3% 0.8% Function: ~recursivePropertiesCollector C:\n\dev\SCNA\infra\lib\node-js\utils\object-utils.js:90
3599 0.3% 0.8% LazyCompile: *BasicSerializeArray native json.js:181
3578 0.3% 0.8% LazyCompile: *Buffer.write buffer.js:315
3157 0.3% 0.7% Stub: CEntryStub
2958 0.3% 0.7% LazyCompile: promise.promiseDispatch C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:516
88414 7.9% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
88404 100.0% LazyCompile: *generateId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:51
88404 100.0% LazyCompile: *register C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:72
52703 59.6% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:216
52625 99.9% LazyCompile: *_.each._.forEach C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\underscore\underscore.js:76
52625 100.0% LazyCompile: ~usingEventHandlerMapping C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:214
35555 40.2% LazyCompile: *once C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:88
29335 82.5% LazyCompile: ~startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
25687 87.6% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130
1908 6.5% LazyCompile: ~b native v8natives.js:1582
1667 5.7% LazyCompile: _fulfilled C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:795
4645 13.1% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:160
4645 100.0% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:171
1047 2.9% LazyCompile: *startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
1042 99.5% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130
Indeed, you are right in your assumption about the time actually spent compiling the code: it takes milliseconds (which can be seen with the --trace-opt flag).
Now, about that LazyCompile. Here is a quotation from Vyacheslav Egorov's (a former V8 dev) blog:
If you are using V8's tick processors keep in mind that LazyCompile: prefix does not mean that this time was spent in compiler, it just means that the function itself was compiled lazily.
An asterisk before a function name means that time is being spent in the optimized version of the function; a tilde means the non-optimized version.
Concerning your question about how many times a function gets compiled: the JIT (the so-called full-codegen) creates a non-optimized version of each function when it is executed for the first time. Later on, an arbitrary (well, to some extent) number of recompilations can happen, due to optimizations and bail-outs. You won't see any of this in this kind of profiling log.
The Stub prefix, to the best of my understanding, means the execution was inside a C-Stub, which is part of the runtime and is compiled along with the other parts of the engine (i.e. it is not JIT-compiled JS code).
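As a rough illustration of the asterisk/tilde distinction, here is a minimal, hypothetical sketch (the file name hot.js and the function names are made up, not taken from the question). A function that is executed many times will typically get optimized and show up with an asterisk in the tick-processor output, while a function that runs only a few times tends to stay non-optimized and keep the tilde:
// hot.js -- hypothetical example: run `node --prof hot.js` and feed the
// resulting log to a tick processor (e.g. node --prof-process on recent
// Node versions). squareSum is hot, so V8 will usually optimize it and the
// profile shows it with an asterisk (e.g. "LazyCompile: *squareSum ..."),
// while coldPath runs once and typically keeps the tilde ("~coldPath").
function squareSum(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += i * i;
  return sum;
}

function coldPath() {
  return squareSum(10);
}

let total = coldPath();
for (let i = 0; i < 1e5; i++) {
  total += squareSum(1000);
}
console.log(total);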
total vs. nonlib:
These columns simply mean that x% of the total / non-library ticks were spent there. (I can refer you to a discussion here.)
Also, you may find https://github.com/v8/v8/wiki/Using%20V8%E2%80%99s%20internal%20profiler useful.
While profiling a Node.js program, I see that 61% of the ticks are attributed to 'Unknown' (see below). What can this be? What should I look for?
Regards,
Coen
Statistical profiling result from node, (14907 ticks, 9132 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
9132 61.3%
[Shared libraries]:
ticks total nonlib name
1067 7.2% 0.0% C:\Windows\SYSTEM32\ntdll.dll
55 0.4% 0.0% C:\Windows\system32\kernel32.dll
[JavaScript]:
ticks total nonlib name
1381 9.3% 10.0% LazyCompile: *RowDataPacket.parse D:\MI\packet.js:9
......
Are you loading any modules that have compiled (native) dependencies?
Basically, "Unknown" means "unaccounted for" (check tickprocessor.js for more explanation). For example, the GC will print entries like "scavenge,begin,...", but those are unrecognized by logreader.js.
It would help to know what profiling library you're using to parse the v8.log file.
Update
The node-tick package hasn't been updated for over a year and is probably missing a lot of recent prof commands. Try using node-profiler instead; it's created by one of Node's maintainers. And if you want the absolute best results, you'll need to build it using node-gyp.
Update
I've parsed the v8.log output using the latest from node-profiler (the latest on master, not the latest tag) and posted the results at http://pastebin.com/pdHDPjzE
Allow me to point out a couple of key entries, which appear about halfway down:
[GC]:
ticks total nonlib name
2063 26.2%
[Bottom up (heavy) profile]
6578 83.4% c:\node\node.exe
1812 27.5% LazyCompile: ~parse native json.js:55
1811 99.9% Function: ~<anonymous> C:\workspace\repositories\asyncnode_MySQL\lib\MySQL_DB.js:41
736 11.2% Function: ~Buffer.toString buffer.js:392
So 26.2% of all ticks were spent in garbage collection, which is much higher than it should be. It does correlate well with how much time is spent in Buffer.toString, though: if that many Buffers are being created and then converted to strings, both need to be GC'd when they leave scope.
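As a rough, hypothetical sketch of that pattern (not taken from the code being profiled): every chunk allocates a Buffer, Buffer.toString() allocates a string, and JSON.parse() allocates an object graph, and since nothing is retained, the GC has to clean all of it up again.
// gc-pressure.js -- hypothetical illustration, not the OP's code.
// Each iteration allocates a Buffer, a string and a parsed object,
// none of which are kept around, so they all become garbage immediately.
const payload = JSON.stringify({ id: 1, values: [1, 2, 3, 4, 5] });

function handleChunk(chunk) {
  const text = chunk.toString('utf8'); // allocates a new string
  return JSON.parse(text);             // allocates a new object graph
}

let last;
for (let i = 0; i < 1e6; i++) {
  const chunk = Buffer.from(payload);  // allocates a new Buffer
  last = handleChunk(chunk);           // everything above is now garbage
}
console.log(last.id);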
Also, I'm curious why so much time is spent in LazyCompile for json.js. Or rather, why would json.js even be necessary in a Node application?
To help you performance-tune your application, I'm including a few links below that give good instructions on what to do and what to look for.
Nice slide deck with the basics:
https://mkw.st/p/gdd11-berlin-v8-performance-tuning-tricks/#1
More advanced examples of optimization techniques:
http://floitsch.blogspot.com/2012/03/optimizing-for-v8-introduction.html
Better use of closures:
http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html
Now, as for why you couldn't achieve the same output: if you built and used node-profiler and its provided nprof from master and it still doesn't work, then I'll assume it has something to do with being on Windows. Think about filing a bug on GitHub and see if the maintainer will help you out.
You are using a 64-bit version of Node.js to run your application and a 32-bit build of the d8 shell to process your v8.log.
Using either a 32-bit version of Node.js with ia32 as the build target for the d8 shell, or a 64-bit version of Node.js with x64 as the d8 shell build target, should solve your problem.
Try building V8 with profiling support turned on:
scons prof=on d8
Make sure the Node you run with --prof uses a V8 version corresponding to the V8 you built.
Then tools/linux-tick-processor path/to/v8.log should show you the full profile info.
I am using Google CPU Profiling tool.
http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html
In the documentation it says:
Analyzing Text Output
Text mode has lines of output that look like this:
14 2.1% 17.2% 58 8.7% std::_Rb_tree::find
Here is how to interpret the columns:
1. Number of profiling samples in this function
2. Percentage of profiling samples in this function
3. Percentage of profiling samples in the functions printed so far
4. Number of profiling samples in this function and its callees
5. Percentage of profiling samples in this function and its callees
6. Function name
But I am not able to understand which columns tell me the exact or percentage CPU usage of a function.
How do I get the CPU usage of a function using the Google profiler?
The text-mode output will have a lot of such lines. For example, collect a profile:
$ CPUPROFILE=a.pprof LD_PRELOAD=./libprofiler.so ./a.out
The program a.out is the same as here: Kcachegrind/callgrind is inaccurate for dispatcher functions?
Then analyze it with pprof's top command:
$ pprof ./a.out a.pprof
Using local file ./a.out.
Using local file a.pprof.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 185 samples
76 41.1% 41.1% 76 41.1% do_4
51 27.6% 68.6% 51 27.6% do_3
37 20.0% 88.6% 37 20.0% do_2
21 11.4% 100.0% 21 11.4% do_1
0 0.0% 100.0% 185 100.0% __libc_start_main
0 0.0% 100.0% 185 100.0% dispatcher
0 0.0% 100.0% 34 18.4% first2
0 0.0% 100.0% 42 22.7% inner2
0 0.0% 100.0% 68 36.8% last2
0 0.0% 100.0% 185 100.0% main
So, what we have here: the total sample count is 185, and the frequency is the default (1 sample every 10 ms, or 100 samples per second), so the total runtime is ~1.85 seconds.
The first column is the number of samples taken while a.out was working in the given function. Dividing it by the frequency gives an estimate of the total time spent in that function, e.g. do_4 ran for 76 / 100 ≈ 0.8 seconds.
The second column is the sample count of the given function divided by the total count, i.e. the percentage of the total program runtime spent in that function. So do_4 is the most expensive function (41% of total program time) and do_1 accounts for only 11% of the runtime. I think this is the column you are interested in.
The third column is the running sum of the second column over the current and preceding lines; so we can see that the two most expensive functions, do_4 and do_3, together accounted for about 69% of the total runtime (41.1% + 27.6%).
The 4th and 5th columns are like the first and second, but they count not only samples in the given function itself, but also samples in all the functions called from it, both directly and indirectly. You can see that main and everything called from it is 100% of the total runtime (because main is the program itself, the root of the program's call tree), and last2 with its children is 36.8% of the runtime (in my program its children are half of the calls to do_4 and half of the calls to do_3, i.e. (41.1% + 27.6%) / 2 ≈ 34%, plus some time spent in the function itself).
PS: there are some other useful pprof commands, like callgrind or gv, which show a graphical representation of the call tree with the profiling information added.