How to read nodejs internal profiler tick-processor output - node.js

I'm interested in profiling my Node.js application.
I've started it with --prof flag, and obtained a v8.log file.
I've run it through the windows-tick-processor and obtained a supposedly human-readable profiling log.
At the bottom of the question are a few small excerpts from the log file, which I am completely failing to understand.
I get the ticks statistical approach. I don't understand what total vs nonlib means.
Also I don't understand why some things are prefixed with LazyCompile, Function, Stub or other terms.
The best answer I could hope for is the complete documentation/guide to the tick-processor output format, completely explaining every term, structure etc...
Barring that, I just don't understand what lazy-compile is. Is it compilation? Doesn't every function get compiled exactly once? Then how can compilation possibly be a significant part of my application execution? The application ran for hours to produce this log, and I'm assuming the internal JavaScript compilation takes milliseconds.
This suggests that lazy-compile is something that doesn't happen once per function, but happens during some kind of code evaluation? Does this mean that everywhere I've got a function definition (for example a nested function), the internal function gets "lazy-compiled" each time?
I couldn't find any information on this anywhere, and I've been googling for DAYS...
Also I realize there are a lot of profiler flags. Additional references on those are also welcome.
[JavaScript]:
ticks total nonlib name
88414 7.9% 20.1% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
22797 2.0% 5.2% LazyCompile: *keys native v8natives.js:333
14524 1.3% 3.3% LazyCompile: Socket._flush C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:365
12896 1.2% 2.9% LazyCompile: BasicSerializeObject native json.js:244
12346 1.1% 2.8% LazyCompile: BasicJSONSerialize native json.js:274
9327 0.8% 2.1% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:194
7606 0.7% 1.7% LazyCompile: *parse native json.js:55
5937 0.5% 1.4% LazyCompile: *split native string.js:554
5138 0.5% 1.2% LazyCompile: *Socket.send C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:346
4862 0.4% 1.1% LazyCompile: *sort native array.js:741
4806 0.4% 1.1% LazyCompile: _.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4481 0.4% 1.0% LazyCompile: ~_.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4296 0.4% 1.0% LazyCompile: stringify native json.js:308
3796 0.3% 0.9% LazyCompile: ~b native v8natives.js:1582
3694 0.3% 0.8% Function: ~recursivePropertiesCollector C:\n\dev\SCNA\infra\lib\node-js\utils\object-utils.js:90
3599 0.3% 0.8% LazyCompile: *BasicSerializeArray native json.js:181
3578 0.3% 0.8% LazyCompile: *Buffer.write buffer.js:315
3157 0.3% 0.7% Stub: CEntryStub
2958 0.3% 0.7% LazyCompile: promise.promiseDispatch C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:516
88414 7.9% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
88404 100.0% LazyCompile: *generateId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:51
88404 100.0% LazyCompile: *register C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:72
52703 59.6% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:216
52625 99.9% LazyCompile: *_.each._.forEach C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\underscore\underscore.js:76
52625 100.0% LazyCompile: ~usingEventHandlerMapping C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:214
35555 40.2% LazyCompile: *once C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:88
29335 82.5% LazyCompile: ~startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
25687 87.6% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130
1908 6.5% LazyCompile: ~b native v8natives.js:1582
1667 5.7% LazyCompile: _fulfilled C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:795
4645 13.1% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:160
4645 100.0% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:171
1047 2.9% LazyCompile: *startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
1042 99.5% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130

Indeed, you are right in your assumption about the time actually spent compiling the code: it takes milliseconds (which can be seen with the --trace-opt flag).
Now, about that LazyCompile. Here is a quotation from the blog of Vyacheslav Egorov (a former V8 developer):
If you are using V8's tick processors keep in mind that LazyCompile:
prefix does not mean that this time was spent in compiler, it just
means that the function itself was compiled lazily.
An asterisk before a function name means that time is being spent in the optimized version of the function; a tilde (~) means the non-optimized version.
Concerning your question about how many times a function gets compiled: the JIT (the so-called full-codegen) creates a non-optimized version of each function when it is executed for the first time. Later on, an arbitrary (well, to some extent) number of recompilations can happen, due to optimizations and bail-outs, but you won't see any of that in this kind of profiling log.
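To watch this happening for yourself, you can run a small script with V8's optimization-tracing flags. A minimal sketch (the file name hot.js and the workload are made up for illustration):
// hot.js: sum() first runs in its non-optimized (full-codegen) form,
// and once it becomes hot V8 recompiles it with the optimizing compiler,
// so its ticks can show up under both the ~ and * variants in the tick log.
function sum(arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) {
    total += arr[i];
  }
  return total;
}

var data = [];
for (var i = 0; i < 1000; i++) {
  data.push(i);
}

var result = 0;
for (var j = 0; j < 100000; j++) {
  result = sum(data);
}
console.log(result);
Run it with node --trace-opt --trace-deopt hot.js to see the (re)compilations as they happen, or with node --prof hot.js to get a log for the tick processor.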
The Stub prefix, to the best of my understanding, means the execution was inside a C stub, which is part of the runtime and is compiled along with the rest of the engine (i.e. it is not JIT-compiled JavaScript code).
total vs. nonlib:
These columns show what percentage of all ticks (total) and of ticks falling outside shared-library code (nonlib) was spent in that entry. For example, getUniqueId above accounts for 7.9% of all ticks, but 20.1% of the non-library ticks.
Also, you can find https://github.com/v8/v8/wiki/Using%20V8%E2%80%99s%20internal%20profiler useful.
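For completeness, the basic workflow for producing and processing such a log on a recent Node, which bundles the tick processor (app.js stands in for your entry point; the log file name is whatever --prof produced, and older versions write v8.log instead):
node --prof app.js
node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt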

Related

Printing python CPU count not working from terminal

Situation
I want to run a script from the terminal that prints the used CPU percentage per CPU. I wrote this simple script:
import psutil
values = psutil.cpu_percent(percpu=True)
print(f"{'CPU':>3} {'CPU%':>6}")
print(10*'=')
for cpu in range(1, psutil.cpu_count()+1):
    print(f"{cpu:<3} {values[cpu-1]:>5}%")
When I run this in a Jupyter Notebook cell I get what I want, which is the currently used percentage per CPU. Something like this:
CPU CPU%
==========
1 100.0%
2 1.5%
3 1.3%
4 0.8%
5 0.1%
6 0.9%
7 1.0%
8 0.2%
Issue
Now I put the above code in a python file get_cpu_perc.py and run the script from my terminal
python3 get_cpu_perc.py
What is getting printed on the terminal is always
CPU CPU%
==========
1 0.0%
2 0.0%
3 0.0%
4 0.0%
5 0.0%
6 0.0%
7 0.0%
8 0.0%
Question
What is happening here? Why is it never showing the percentage when I run the script from the terminal?
psutil needs a time interval to measure CPU usage against. When cpu_percent is called without an interval, it compares CPU times against the previous call or, for the very first call, against the moment psutil was imported. In a Jupyter notebook enough time has usually passed between the import and the readout to give meaningful values; in a freshly started script it has not, so the first call reports 0.0.
You can have cpu_percent block for a short interval so that psutil gets a window to 'listen' to your CPU:
# listen to the cpu for 0.1 seconds before returning the value
values = psutil.cpu_percent(interval=0.1, percpu=True)
Then the values should fill up.
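For completeness, a minimal version of the full script with the interval applied (0.1 seconds is an arbitrary choice; a longer interval gives smoother numbers):
import psutil

# Block for 0.1 s so psutil has a measurement window to compare against;
# without an interval, the first reading after import is always 0.0.
values = psutil.cpu_percent(interval=0.1, percpu=True)

print(f"{'CPU':>3} {'CPU%':>6}")
print(10 * '=')
for cpu in range(1, psutil.cpu_count() + 1):
    print(f"{cpu:<3} {values[cpu - 1]:>5}%")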

Optimal number of threads for GNU parallel

I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it!
I am using a loop which iterates over my read files and generates the desired output. The command that is executed for each read looks something like this:
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq
As you can see I specified 8 threads, which is the number of threads my virtual machine has.
My question now is this following:
If I use GNU parallel with a command like this:
cat reads| parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?
Or do I need 24 threads (3*8 threads) to properly execute this command?
I'm sorry if this is a basic question; I am very new to the field and any help is much appreciated!
The best advice is simply: Try different values and measure.
In parallelization there are sooo many factors that can affect the results: Disk I/O, shared CPU cache, and shared RAM bandwidth just to name three.
top is your friend when measuring. If you can manage to get all CPUs to have <5% idle you are unlikely to go any faster - no matter what you do.
top - 14:49:10 up 10 days, 5:48, 123 users, load average: 2.40, 1.72, 1.67
Tasks: 751 total, 3 running, 616 sleeping, 8 stopped, 4 zombie
%Cpu(s): 17.3 us, 6.2 sy, 0.0 ni, 76.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 31.239 total, 1.441 free, 21.717 used, 8.081 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.088 used. 4.706 avail Mem
This machine is 76.2% idle. If your processes use loads of CPU then starting more processes in parallel here may help. If they use loads of disk I/O it may or may not help. Only way to know is to test and measure.
top - 14:51:00 up 10 days, 5:50, 124 users, load average: 3.41, 2.04, 1.78
Tasks: 759 total, 8 running, 619 sleeping, 8 stopped, 4 zombie
%Cpu(s): 92.8 us, 6.9 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
GiB Mem : 31.239 total, 1.383 free, 21.772 used, 8.083 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.087 used. 4.649 avail Mem
This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
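A sketch of that "try different values and measure" loop, assuming the same file layout as in the question (run it on a subset of your reads first, since it repeats the whole workload for each -j value):
for j in 1 2 3 4; do
  cat reads | parallel -j $j --joblog star_j$j.log STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
done
# Compare the JobRuntime column of star_j1.log .. star_j4.log to see
# which -j gives the lowest average processing time per job.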
And yes: GNU Parallel is likely to speed-up bioinformatics (being written by a fellow bioinformatician).
Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html download: https://doi.org/10.5281/zenodo.1146014). Read at least chapters 1 and 2. It should take you less than 20 minutes. Your command line will love you for it.

GraphQL API with backend built in node js consuming high CPU Usage

We have built a GraphQL API, with our services written in Node.js and leveraging Apollo Server. We are experiencing high CPU usage whenever requests per second reach 20. We profiled with flame graphs and Node's built-in profiler. Attaching the result of the built-in profiler:
[Summary]:
ticks total nonlib name
87809 32.1% 95.8% JavaScript
0 0.0% 0.0% C++
32531 11.9% 35.5% GC
182061 66.5% Shared libraries
3878 1.4% Unaccounted
[Shared libraries]:
ticks total nonlib name
138326 50.5% /usr/bin/node
30023 11.0% /lib/x86_64-linux-gnu/libc-2.27.so
12466 4.6% /lib/x86_64-linux-gnu/libpthread-2.27.so
627 0.2% [vdso]
567 0.2% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
52 0.0% /lib/x86_64-linux-gnu/libm-2.27.so
Results from the flame graph agree with the above: we didn't see any JavaScript function consuming a lot of CPU.
Why is /usr/bin/node consuming so much CPU? Does it have something to do with the way the code has been written, or is this the general trend?
Also, to give a little info about what our GraphQL API does: upon receiving a request, it makes 3 to 5 downstream API calls depending on the request, and doesn't do any CPU-intensive work on its own.
Versions:
Node version:- 10.16.3
graphql-modules/core:- 0.7.7
apollo-datasource-rest:- 0.5.0
apollo-server-express:- 2.6.8
Any help is really appreciated here.

Profiling node application: most time spent in node itself

I am using a node application that is experiencing a performance problem under certain loads. I am attempting to use the V8 profiler to find out where the problem might be, basically following this guide.
I've generated a log file during the problem load using node --prof app.js, and analyzed it with node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt. This all seems to work fine, but it seems that almost all the ticks are spent in the node executable itself:
[Summary]:
ticks total nonlib name
3887 5.8% 38.2% JavaScript
5590 8.4% 55.0% C++
346 0.5% 3.4% GC
56296 84.7% Shared libraries
689 1.0% Unaccounted
and:
[Shared libraries]:
ticks total nonlib name
55990 84.2% /usr/bin/node
225 0.3% /lib/x86_64-linux-gnu/libc-2.19.so
68 0.1% /lib/x86_64-linux-gnu/libpthread-2.19.so
7 0.0% /lib/x86_64-linux-gnu/libm-2.19.so
4 0.0% [vdso]
2 0.0% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
What does this mean? What is the app spending all its time doing? How can I find the performance problem?
I would suggest trying VTune Amplifier as an alternative to the V8 profiler. I was able to identify and fix the time-consuming places in my code with it. You can download a free trial version and follow the step-by-step instructions. Hope it will help you.

How to find CPU Usage in google profiler

I am using Google CPU Profiling tool.
http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html
The documentation says:
Analyzing Text Output
Text mode has lines of output that look like this:
14 2.1% 17.2% 58 8.7% std::_Rb_tree::find
Here is how to interpret the columns:
1. Number of profiling samples in this function
2. Percentage of profiling samples in this function
3. Percentage of profiling samples in the functions printed so far
4. Number of profiling samples in this function and its callees
5. Percentage of profiling samples in this function and its callees
6. Function name
But I am not able to understand which column tells me the exact or percentage CPU usage of a function.
How do I get the CPU usage of a function using the Google profiler?
The output will have a lot of lines like that. For example, collect a profile:
$ CPUPROFILE=a.pprof LD_PRELOAD=./libprofiler.so ./a.out
The program a.out is the same as here: Kcachegrind/callgrind is inaccurate for dispatcher functions?
Then analyze it with pprof top command:
$ pprof ./a.out a.pprof
Using local file ./a.out.
Using local file a.pprof.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 185 samples
76 41.1% 41.1% 76 41.1% do_4
51 27.6% 68.6% 51 27.6% do_3
37 20.0% 88.6% 37 20.0% do_2
21 11.4% 100.0% 21 11.4% do_1
0 0.0% 100.0% 185 100.0% __libc_start_main
0 0.0% 100.0% 185 100.0% dispatcher
0 0.0% 100.0% 34 18.4% first2
0 0.0% 100.0% 42 22.7% inner2
0 0.0% 100.0% 68 36.8% last2
0 0.0% 100.0% 185 100.0% main
So, what do we have here: the total sample count is 185, and the frequency is the default (1 sample every 10 ms, i.e. 100 samples per second), so the total runtime is ~1.85 seconds.
The first column is the number of samples taken while a.out was executing in the given function. Dividing it by the frequency gives an estimate of the total time spent in that function, e.g. do_4 ran for ~0.76 sec.
The second column is the sample count of the given function divided by the total count, i.e. the percentage of total program run time spent in this function. So do_4 is the most expensive function (41% of total program time) and do_1 is only 11% of the runtime. I think this is the column you are interested in.
The third column is the cumulative sum of the second column over the current and preceding lines; so we can see that the two most expensive functions, do_4 and do_3, together account for ~68.6% of the total run time (41.1% + 27.6%).
The 4th and 5th columns are like the first and second, but they count not only samples in the given function itself, but also samples in all functions called from it, directly or indirectly. You can see that main and everything called from it account for 100% of the total run time (because main is the program itself, the root of its call tree), and last2 with its children accounts for 36.8% of the runtime (in my program its children are half of the calls to do_4 and half of the calls to do_3: (41.1% + 27.6%) / 2 ≈ 34%, plus some time in the function itself).
PS: there are other useful pprof output modes, like callgrind or gv, which show a graphical representation of the call tree with the profiling information added.
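For example, assuming the same a.out and a.pprof as above (the gv mode needs graphviz and gv installed, and the callgrind output can be opened in KCachegrind):
$ pprof --text ./a.out a.pprof
$ pprof --gv ./a.out a.pprof
$ pprof --callgrind ./a.out a.pprof > a.callgrind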
