We have built a GraphQL API, with our services written in Node.js, leveraging Apollo Server. We are experiencing high CPU usage whenever requests per second reach 20. We profiled with flame graphs and Node's built-in profiler. Attaching the result of the built-in profiler:
[Summary]:
ticks total nonlib name
87809 32.1% 95.8% JavaScript
0 0.0% 0.0% C++
32531 11.9% 35.5% GC
182061 66.5% Shared libraries
3878 1.4% Unaccounted
[Shared libraries]:
ticks total nonlib name
138326 50.5% /usr/bin/node
30023 11.0% /lib/x86_64-linux-gnu/libc-2.27.so
12466 4.6% /lib/x86_64-linux-gnu/libpthread-2.27.so
627 0.2% [vdso]
567 0.2% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
52 0.0% /lib/x86_64-linux-gnu/libm-2.27.so
Results from the flame graph also corroborate the above: we didn't see any JavaScript function consuming a lot of CPU.
Why is /usr/bin/node consuming so much CPU? Does it have something to do with the way the code is written, or is this the general trend?
Also, to give a little info about what our GraphQL API does: upon receiving a request, it makes 3 to 5 downstream API calls depending on the request, and doesn't do any CPU-intensive work of its own.
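For context, each resolver has roughly this shape (a minimal sketch assuming apollo-datasource-rest's RESTDataSource; the service names and URLs are hypothetical):

const { RESTDataSource } = require('apollo-datasource-rest');

// Hypothetical wrapper around one downstream service.
class UsersAPI extends RESTDataSource {
  constructor() {
    super();
    this.baseURL = 'http://users.internal/'; // made-up URL
  }
  getUser(id) {
    return this.get(`users/${id}`); // one of the 3 to 5 downstream calls
  }
}

// The resolver is pure I/O fan-out; no CPU-intensive work of its own.
const resolvers = {
  Query: {
    user: (_, { id }, { dataSources }) => dataSources.usersAPI.getUser(id),
  },
};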
Versions:
Node version: 10.16.3
graphql-modules/core: 0.7.7
apollo-datasource-rest: 0.5.0
apollo-server-express: 2.6.8
Any help is really appreciated.
Related
I have a Node application that is experiencing a performance problem under certain loads. I am attempting to use the V8 profiler to find out where the problem might be, basically following this guide.
I've generated a log file during the problem load using node --prof app.js, and analyzed it with node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt. This all seems to work fine, but it seems that almost all the ticks are spent in the node executable itself:
[Summary]:
ticks total nonlib name
3887 5.8% 38.2% JavaScript
5590 8.4% 55.0% C++
346 0.5% 3.4% GC
56296 84.7% Shared libraries
689 1.0% Unaccounted
and:
[Shared libraries]:
ticks total nonlib name
55990 84.2% /usr/bin/node
225 0.3% /lib/x86_64-linux-gnu/libc-2.19.so
68 0.1% /lib/x86_64-linux-gnu/libpthread-2.19.so
7 0.0% /lib/x86_64-linux-gnu/libm-2.19.so
4 0.0% [vdso]
2 0.0% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
What does this mean? What is the app spending all its time doing? How can I find the performance problem?
I would suggest trying VTune Amplifier as an alternative to the V8 profiler. I was able to identify and fix the time-consuming spot in my code. You can download a free trial version here and just follow these step-by-step instructions. Hope it helps.
I've just tested the performance of a simple Nest controller that returns text on a GET request (no database).
And the same simple GET controller (middleware) with Express.
I used the wrk tool to test performance.
The result: plain Express is about 2x faster than NestJS.
Why does NestJS create so much overhead?
UPDATE - 17.03.2020
We are now running benchmarks for every new PR. One of the latest benchmarks can be found here: https://github.com/nestjs/nest/runs/482105333
Req/sec Trans/sec
Nest-Express 15370 3.17MB
Nest-Fastify 30001 4.38MB
Express 17208 3.53MB
Fastify 33578 4.87MB
That means Nest + FastifyAdapter is now almost 2 times faster than express.
UPDATE - 22.09.2018
Benchmarks directory has been added to the repository: https://github.com/nestjs/nest/blob/master/benchmarks/all_output.txt (you can run benchmarks on your machine as well).
UPDATE - 24.06.2018
Nest v5.0.0 supports fastify. Fastify + Nest integration is even more performant than plain(!) express.
The following list shows what Nest does on top of a plain Express route handler (a sketch covering these points follows the list):
it surrounds your route handler body with try..catch blocks
it makes every route handler async
it creates a global express router
it creates a separated router for each controller
it binds error-handling middleware
it binds body-parser middleware (both json and extended urlencoded)
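To make that baseline concrete, here is a minimal sketch of what covering those points in plain Express might look like (my reconstruction, not the actual benchmark code; Express 4.x assumed):

const express = require('express');

const app = express();
const router = express.Router(); // a separate router, as Nest creates one per controller

// body-parser middleware (both json and extended urlencoded)
app.use(express.json());
app.use(express.urlencoded({ extended: true }));

// the handler made async and its body wrapped in try..catch
router.get('/', async (req, res, next) => {
  try {
    res.status(200).send('Hello world');
  } catch (err) {
    next(err);
  }
});

app.use(router);

// error-handling middleware bound last
app.use((err, req, res, next) => res.status(500).send('Internal server error'));

app.listen(3000);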
All of the mentioned things reflect a real-world example (probably 99.9% of Express apps have to do this as well; it's unavoidable). It means that if you want to compare Express and Nest performance, you should at least cover the above points. A comparison with the example below:
app.get('/', (req, res, next) => res.status(200).send('Hello world'));
is unfair in this case, because it's not enough. When I covered these points, this is what I got (express 4.16.2):
Running 10s test @ http://localhost:3000
1024 connections
Stat Avg Stdev Max
Latency (ms) 225.67 109.97 762
Req/Sec 4560 1034.78 5335
Bytes/Sec 990 kB 226 kB 1.18 MB
46k requests in 10s, 9.8 MB read
Additionally, Nest has to (see the sketch after this list):
recognize whether a result is a Promise/Observable/plain value
based on the result type, use send() or json() (+1 condition)
add 3 conditions (if statements) to check pipes, interceptors and guards
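A rough sketch of that dispatch logic (the idea only, not Nest's actual implementation):

// Hypothetical: choose send() vs json() based on what the handler returned.
function reply(res, result) {
  if (result && typeof result.subscribe === 'function') {
    result.subscribe((value) => reply(res, value)); // Observable (simplified: first value)
  } else if (result && typeof result.then === 'function') {
    result.then((value) => reply(res, value)); // Promise
  } else if (result !== null && typeof result === 'object') {
    res.json(result); // plain object
  } else {
    res.send(String(result)); // plain value
  }
}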
Here's the output for Nest (4.5.8):
Running 10s test @ http://localhost:3000
1024 connections
Stat Avg Stdev Max
Latency (ms) 297.79 55.5 593
Req/Sec 3433.2 367.84 3649
Bytes/Sec 740 kB 81.9 kB 819 kB
34k requests in 10s, 7.41 MB read
This implies that Nest's performance is around 79% of Express's (-21%). This is due to the reasons set out above and, moreover, because Nest is compatible with Node 6.11.x, which means it can't use async/await under the hood; it has to use generators.
What conclusion should be drawn from those stats? None, because we don't build applications that only return plain strings without any asynchronous work. Hello-world comparisons mean nothing; they're only a titbit :)
PS. I used the autocannon library https://github.com/mcollina/autocannon
autocannon -c 1024 -t30 http://localhost:3000
I'm just playing with the Snap framework and wanted to see how it performs against other frameworks (under completely artificial circumstances).
What I have found is that my Snap application tops out at about 1500 requests/second (the app is simply snap init; snap build; ./dist/app/app, i.e. no code changes to the default app created by snap):
$ ab -n 20000 -c 500 http://127.0.0.1:8000/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests
Server Software: Snap/0.9.5.1
Server Hostname: 127.0.0.1
Server Port: 8000
Document Path: /
Document Length: 721 bytes
Concurrency Level: 500
Time taken for tests: 12.845 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 17140000 bytes
HTML transferred: 14420000 bytes
Requests per second: 1557.00 [#/sec] (mean)
Time per request: 321.131 [ms] (mean)
Time per request: 0.642 [ms] (mean, across all concurrent requests)
Transfer rate: 1303.07 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 44 287.6 0 3010
Processing: 6 274 153.6 317 1802
Waiting: 5 274 153.6 317 1802
Total: 20 318 346.2 317 3511
Percentage of the requests served within a certain time (ms)
50% 317
66% 325
75% 334
80% 341
90% 352
95% 372
98% 1252
99% 2770
100% 3511 (longest request)
I then fired up a Grails application, and it seems like Tomcat (once the JVM warms up) can take a bit more load:
$ ab -n 20000 -c 500 http://127.0.0.1:8080/test-0.1/book
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests
Server Software: Apache-Coyote/1.1
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /test-0.1/book
Document Length: 722 bytes
Concurrency Level: 500
Time taken for tests: 4.366 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 18700000 bytes
HTML transferred: 14440000 bytes
Requests per second: 4581.15 [#/sec] (mean)
Time per request: 109.143 [ms] (mean)
Time per request: 0.218 [ms] (mean, across all concurrent requests)
Transfer rate: 4182.99 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 67 347.4 0 3010
Processing: 1 30 31.4 21 374
Waiting: 0 26 24.4 20 346
Total: 1 97 352.5 21 3325
Percentage of the requests served within a certain time (ms)
50% 21
66% 28
75% 35
80% 42
90% 84
95% 230
98% 1043
99% 1258
100% 3325 (longest request)
I'm guessing that part of this could be the fact that Tomcat seems to reserve a lot of RAM and can keep/cache some methods. During this experiment, Tomcat was using in excess of 700MB of RAM, while Snap barely approached 70MB.
Questions I have:
Am I comparing apples and oranges here?
What steps would one take to optimise Snap for throughput/speed?
Further experiments:
Then, as suggested by mightybyte, I started experimenting with the +RTS -A4M -N4 options. The app was able to serve just over 2000 requests per second (about a 25% increase).
I also removed the nested templating and served a document (same size as before) from the top level tpl file. This increased the performance to just over 7000 requests a second. The memory usage went up to about 700MB.
I'm by no means an expert on the subject, so I can only really answer your first question: yes, you are comparing apples and oranges (and also bananas without realizing it).
First off, it looks like you are attempting to benchmark different things, so naturally, your results will be inconsistent. One of these is the sample Snap application and the other is just "a Grails application". What exactly are each of these things doing? Are you serving pages? Handling requests? The difference in applications will explain the differences in performance.
Secondly, the difference in RAM usage also shows a difference in what these applications are doing. Haskell web frameworks are very good at handling heavy load without much RAM, whereas other frameworks, like Tomcat as you saw, will be limited in their performance when RAM is limited. Try limiting both applications to 100MB and see what happens to your performance difference.
If you want to compare the different frameworks, you really need to run a standard application to do that. Snap did this with a Pong benchmark. The results of an old test (from 2011 and Snap 0.3) can be seen here. This paragraph is extremely relevant to your situation:
If you’re comparing this with our previous results you will notice that we left out Grails. We discovered that our previous results for Grails may have been too low because the JVM had not been given time to warm up. The problem is that after the JVM warms up for some reason httperf isn’t able to get any samples from which to generate a replies/sec measurement, so it outputs 0.0 replies/sec. There are also 1000 connreset errors, so we decided the Grails numbers were not reliable enough to use.
As a comparison, the Yesod blog has a Pong benchmark from around the same time that shows similar results. You can find that here. They also link to their benchmark code if you would like to run a more similar comparison; it is available on GitHub.
The answer by jkeuhlen makes good observations relevant to your first question. As to your second question, there are definitely things you can play with to tune performance. If you look at Snap's old raw result data, you can see that we ran the application with +RTS -A4M -N4. The -N4 option tells the GHC runtime to use 4 threads. (Note that you have to build the application with -threaded for this to work.) The -A4M option sets the size of the garbage collector's allocation area. Our experiments showed that these two had the biggest impact on performance. But that was done a long time ago, and GHC has changed a lot since then, so you should play around with them and find what works best for you. This page has in-depth information about other command-line options for controlling GHC's runtime if you wish to experiment further.
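Concretely, a build-and-run sequence might look like this (a sketch; on newer GHCs you also need -rtsopts for the executable to accept RTS flags at runtime):

ghc --make -threaded -rtsopts Main.hs
./Main +RTS -A4M -N4 -RTS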
A little work was done last year on updating the benchmarks. If you're interested in that, look around the different branches in the snap-benchmarks repository. It would be great to get more help on a new set of benchmarks.
I'm interested in profiling my Node.js application.
I've started it with --prof flag, and obtained a v8.log file.
I've taken the windows-tick-processor and obtained a supposedly human-readable profiling log.
At the bottom of the question are a few small excerpts from the log file, which I am completely failing to understand.
I get the statistical, tick-based approach. I don't understand what total vs nonlib means.
Also I don't understand why some things are prefixed with LazyCompile, Function, Stub or other terms.
The best answer I could hope for is the complete documentation/guide to the tick-processor output format, completely explaining every term, structure etc...
Barring that, I just don't understand what lazy-compile is. Is it compilation? Doesn't every function get compiled exactly once? Then how can compilation possibly be a significant part of my application execution? The application ran for hours to produce this log, and I'm assuming the internal JavaScript compilation takes milliseconds.
This suggests that lazy-compile is something that doesn't happen once per function, but happens during some kind of code evaluation? Does this mean that everywhere I've got a function definition (for example a nested function), the internal function gets "lazy-compiled" each time?
I couldn't find any information on this anywhere, and I've been googling for DAYS...
Also I realize there are a lot of profiler flags. Additional references on those are also welcome.
[JavaScript]:
ticks total nonlib name
88414 7.9% 20.1% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
22797 2.0% 5.2% LazyCompile: *keys native v8natives.js:333
14524 1.3% 3.3% LazyCompile: Socket._flush C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:365
12896 1.2% 2.9% LazyCompile: BasicSerializeObject native json.js:244
12346 1.1% 2.8% LazyCompile: BasicJSONSerialize native json.js:274
9327 0.8% 2.1% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:194
7606 0.7% 1.7% LazyCompile: *parse native json.js:55
5937 0.5% 1.4% LazyCompile: *split native string.js:554
5138 0.5% 1.2% LazyCompile: *Socket.send C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\zmq\lib\index.js:346
4862 0.4% 1.1% LazyCompile: *sort native array.js:741
4806 0.4% 1.1% LazyCompile: _.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4481 0.4% 1.0% LazyCompile: ~_.each._.forEach C:\n\dev\SCNA\infra\node_modules\underscore\underscore.js:76
4296 0.4% 1.0% LazyCompile: stringify native json.js:308
3796 0.3% 0.9% LazyCompile: ~b native v8natives.js:1582
3694 0.3% 0.8% Function: ~recursivePropertiesCollector C:\n\dev\SCNA\infra\lib\node-js\utils\object-utils.js:90
3599 0.3% 0.8% LazyCompile: *BasicSerializeArray native json.js:181
3578 0.3% 0.8% LazyCompile: *Buffer.write buffer.js:315
3157 0.3% 0.7% Stub: CEntryStub
2958 0.3% 0.7% LazyCompile: promise.promiseDispatch C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:516
88414 7.9% LazyCompile: *getUniqueId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:16
88404 100.0% LazyCompile: *generateId C:\n\dev\SCNA\infra\lib\node-js\utils\general-utils.js:51
88404 100.0% LazyCompile: *register C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:72
52703 59.6% LazyCompile: * C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:216
52625 99.9% LazyCompile: *_.each._.forEach C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\underscore\underscore.js:76
52625 100.0% LazyCompile: ~usingEventHandlerMapping C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:214
35555 40.2% LazyCompile: *once C:\n\dev\SCNA\infra\lib\node-js\events\pattern-dispatcher.js:88
29335 82.5% LazyCompile: ~startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
25687 87.6% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130
1908 6.5% LazyCompile: ~b native v8natives.js:1582
1667 5.7% LazyCompile: _fulfilled C:\n\dev\SCNA\runtime-environment\load-generator\node_modules\q\q.js:795
4645 13.1% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:160
4645 100.0% LazyCompile: ~terminate C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:171
1047 2.9% LazyCompile: *startAction C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-driver.js:201
1042 99.5% LazyCompile: ~onActionComplete C:\n\dev\SCNA\runtime-environment\load-generator\lib\vuser-driver\mdrv-logic.js:130
Indeed, you are right in your assumption about the time actually spent compiling the code: it takes milliseconds (which can be seen with the --trace-opt flag).
Now, about that LazyCompile. Here is a quotation from Vyacheslav Egorov's (a former V8 dev) blog:
If you are using V8's tick processors keep in mind that LazyCompile:
prefix does not mean that this time was spent in compiler, it just
means that the function itself was compiled lazily.
An asterisk before a function name means that time is being spent in the optimized version of the function; a tilde means the non-optimized version.
Concerning your question about how many times a function gets compiled: the JIT (the so-called full-codegen) creates a non-optimized version of each function when it gets executed for the first time, but later on an arbitrary (well, to some extent) number of recompilations can happen, due to optimizations and bail-outs. You won't see any of this in this kind of profiling log.
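If you want to see those recompilations and bail-outs as they happen, V8's tracing flags will log them (the exact output format varies between versions):

node --trace-opt --trace-deopt app.js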
The Stub prefix, to the best of my understanding, means the execution was inside a C-Stub, which is part of the runtime and gets compiled along with other parts of the engine (i.e. it is not JIT-compiled JS code).
total vs. nonlib:
These columns simply mean that x% of total/non-library time was spent there. (I can refer you to a discussion here.)
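As a worked example, take the second summary in this thread (GC ticks are a subset of the other rows, so they are not added into the total): total ticks = 3887 + 5590 + 56296 + 689 = 66462, so JavaScript's 3887 ticks are 3887 / 66462 ≈ 5.8% (total); excluding the 56296 shared-library ticks leaves 10166 ticks, and 3887 / 10166 ≈ 38.2% (nonlib).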
Also, you can find https://github.com/v8/v8/wiki/Using%20V8%E2%80%99s%20internal%20profiler useful.
While profiling a Node.js program, I see that 61% of the ticks are caused by 'Unknown' (see below). What can this be? What should I look for?
gr,
Coen
Statistical profiling result from node, (14907 ticks, 9132 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
9132 61.3%
[Shared libraries]:
ticks total nonlib name
1067 7.2% 0.0% C:\Windows\SYSTEM32\ntdll.dll
55 0.4% 0.0% C:\Windows\system32\kernel32.dll
[JavaScript]:
ticks total nonlib name
1381 9.3% 10.0% LazyCompile: *RowDataPacket.parse D:\MI\packet.js:9
......
Are you loading any modules that have compiled (native) dependencies?
Basically, "Unknown" means "unaccounted for" (check tickprocessor.js for more explanation). For example, the GC will print entries like "scavenge,begin,...", but those are unrecognized by logreader.js.
It would help to know what profiling library you're using to parse the v8.log file.
Update
The node-tick package hasn't been updated for over a year and is probably missing a lot of recent prof commands. Try using node-profiler instead; it's created by one of Node's maintainers. And if you want the absolute best results, you'll need to build it using node-gyp.
Update
I've parsed the v8.log output using the latest from node-profiler (the latest on master, not the latest tag) and posted the results at http://pastebin.com/pdHDPjzE
Allow me to point out a couple of key entries, which appear about halfway down:
[GC]:
ticks total nonlib name
2063 26.2%
[Bottom up (heavy) profile]
6578 83.4% c:\node\node.exe
1812 27.5% LazyCompile: ~parse native json.js:55
1811 99.9% Function: ~<anonymous> C:\workspace\repositories\asyncnode_MySQL\lib\MySQL_DB.js:41
736 11.2% Function: ~Buffer.toString buffer.js:392
So 26.2% of all script time was spent in garbage collection, which is much higher than it should be. It does correlate well with how much time is spent in Buffer.toString, though: if that many Buffers are being created and then converted to strings, both need to be GC'd when they leave scope.
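For illustration, a hypothetical pattern that produces exactly this kind of churn (socket and handle are made-up names):

socket.on('data', (chunk) => {
  const text = chunk.toString('utf8'); // short-lived string copy of the Buffer
  const obj = JSON.parse(text);        // short-lived object graph (hence json.js in the profile)
  handle(obj);                         // chunk, text, and obj all become garbage soon after
});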
Also, I'm curious why so much time is spent in LazyCompile for json.js. Or, more to the point, why would json.js even be necessary in a node application?
To help you performance-tune your application, I'm including a few links below that give good instructions on what to do and what to look for.
Nice slide deck with the basics:
https://mkw.st/p/gdd11-berlin-v8-performance-tuning-tricks/#1
More advanced examples of optimization techniques:
http://floitsch.blogspot.com/2012/03/optimizing-for-v8-introduction.html
Better use of closures:
http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html
Now, as far as why you couldn't achieve the same output: if you built and used node-profiler and its provided nprof from master and it still doesn't work, then I'll assume it has something to do with being on Windows. Think about filing a bug on GitHub and see if he'll help you out.
You are using a 64-bit version of Node.js to run your application and a 32-bit build of the d8 shell to process your v8.log.
Using either a 32-bit version of Node.js with ia32 as the build target for the d8 shell, or a 64-bit version of Node.js with x64 as the d8 shell build target, should solve your problem.
Try building v8 with profiling support on:
scons prof=on d8
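If your Node.js is 64-bit, build d8 for the same architecture as well; with the old SCons build I believe the flag was arch=x64 (treat this as an assumption):

scons prof=on arch=x64 d8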
Make sure you run node --prof with a node version whose V8 corresponds to the version of v8 you built.
Then tools/linux-tick-processor path/to/v8.log should show you the full profile info.