Node.js response speed and nginx

I've just started testing Node.js, and I'd like some help understanding the following behavior:
Example #1:
var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('foo');
}).listen(1001, '0.0.0.0');
Example #2:
var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write('foo');
  res.end('bar');
}).listen(1001, '0.0.0.0');
When testing response time in Chrome:
Example #1: 6-10 ms
Example #2: 200-220 ms
But if I test both examples through an nginx proxy_pass:
server {
    listen 1011;
    location / {
        proxy_pass http://127.0.0.1:1001;
    }
}
I get this:
Example #1: 4-8 ms
Example #2: 4-8 ms
I am not an expert on either Node.js or nginx; can someone explain this?
Node.js v0.8.1
nginx v1.2.2
Update:
Thanks to Hippo, I ran tests with ab on my server with and without nginx and got the opposite results.
I also added proxy_cache off to the nginx config:
server {
    listen 1011;
    location / {
        proxy_pass http://127.0.0.1:1001;
        proxy_cache off;
    }
}
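
For what it's worth, a hedged aside (an assumption on my part, not something verified in this thread): nginx buffers proxied responses by default (proxy_buffering on), which would coalesce the two upstream writes of example #2 into a single client-facing write. Toggling that directive is one way to test whether buffering explains the difference:

server {
    listen 1011;
    location / {
        proxy_pass http://127.0.0.1:1001;
        proxy_cache off;
        proxy_buffering off;  # hypothesis test: does example #2 slow down again without buffering?
    }
}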
Example #1, direct:
ab -n 1000 -c 50 http://127.0.0.1:1001/
Server Software:
Server Hostname: 127.0.0.1
Server Port: 1001
Document Path: /
Document Length: 65 bytes
Concurrency Level: 50
Time taken for tests: 1.018 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 166000 bytes
HTML transferred: 65000 bytes
Requests per second: 981.96 [#/sec] (mean)
Time per request: 50.919 [ms] (mean)
Time per request: 1.018 [ms] (mean, across all concurrent requests)
Transfer rate: 159.18 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       3
Processing:     0   50  44.9     19     183
Waiting:        0   49  44.8     17     183
Total:          1   50  44.7     19     183
Example #1, via nginx:
ab -n 1000 -c 50 http://127.0.0.1:1011/
Server Software: nginx/1.2.2
Server Hostname: 127.0.0.1
Server Port: 1011
Document Path: /
Document Length: 65 bytes
Concurrency Level: 50
Time taken for tests: 1.609 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 187000 bytes
HTML transferred: 65000 bytes
Requests per second: 621.40 [#/sec] (mean)
Time per request: 80.463 [ms] (mean)
Time per request: 1.609 [ms] (mean, across all concurrent requests)
Transfer rate: 113.48 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       3
Processing:     2   77  44.9     96     288
Waiting:        2   77  44.8     96     288
Total:          3   78  44.7     96     288
Example #2, direct:
ab -n 1000 -c 50 http://127.0.0.1:1001/
Server Software:
Server Hostname: 127.0.0.1
Server Port: 1001
Document Path: /
Document Length: 76 bytes
Concurrency Level: 50
Time taken for tests: 1.257 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 177000 bytes
HTML transferred: 76000 bytes
Requests per second: 795.47 [#/sec] (mean)
Time per request: 62.856 [ms] (mean)
Time per request: 1.257 [ms] (mean, across all concurrent requests)
Transfer rate: 137.50 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:     0   60  47.8     88     193
Waiting:        0   60  47.8     87     193
Total:          0   61  47.7     88     193
Example #2, via nginx:
ab -n 1000 -c 50 http://127.0.0.1:1011/
Server Software: nginx/1.2.2
Server Hostname: 127.0.0.1
Server Port: 1011
Document Path: /
Document Length: 76 bytes
Concurrency Level: 50
Time taken for tests: 1.754 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 198000 bytes
HTML transferred: 76000 bytes
Requests per second: 570.03 [#/sec] (mean)
Time per request: 87.715 [ms] (mean)
Time per request: 1.754 [ms] (mean, across all concurrent requests)
Transfer rate: 110.22 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.4      0       2
Processing:     1   87  42.1     98     222
Waiting:        1   86  42.3     98     222
Total:          1   87  42.0     98     222
Now the results look more logical, but there is still a strange delay when calling res.write().
I guess it was (it certainly looks like) a stupid question, but I still get a huge difference in browser response time with this server configuration (CentOS 6) on this particular server (a VPS).
On my home computer (Ubuntu 12, albeit with older versions), testing from localhost, everything works fine.

Peeking into http.js reveals that case #1 has special handling in Node.js itself, some kind of shortcut optimization, I guess.
var hot = this._headerSent === false &&
          typeof(data) === 'string' &&
          data.length > 0 &&
          this.output.length === 0 &&
          this.connection &&
          this.connection.writable &&
          this.connection._httpMessage === this;

if (hot) {
  // Hot path. They're doing
  //   res.writeHead();
  //   res.end(blah);
  // HACKY.
  if (this.chunkedEncoding) {
    var l = Buffer.byteLength(data, encoding).toString(16);
    ret = this.connection.write(this._header + l + CRLF +
                                data + '\r\n0\r\n' +
                                this._trailer + '\r\n', encoding);
  } else {
    ret = this.connection.write(this._header + data, encoding);
  }
  this._headerSent = true;
} else if (data) {
  // Normal body write.
  ret = this.write(data, encoding);
}

if (!hot) {
  if (this.chunkedEncoding) {
    ret = this._send('0\r\n' + this._trailer + '\r\n'); // Last chunk.
  } else {
    // Force a flush, HACK.
    ret = this._send('');
  }
}

this.finished = true;
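
In other words, writeHead() + end(data) can go out as a single socket write, while write() + end() produces separate writes with chunked encoding. If the case #2 latency matters, here is a minimal sketch of a workaround, under the assumption (not confirmed in this thread) that the delay comes from the body being split across small TCP segments, with Nagle's algorithm holding back the trailing chunk: buffer the pieces yourself and send one Content-Length response, and/or disable Nagle on the connection.

var http = require('http');

http.createServer(function (req, res) {
  // Assumption: the slowdown in case #2 comes from chunked encoding plus
  // Nagle's algorithm delaying the small trailing TCP segment; disabling
  // Nagle on the connection is one way to test that theory.
  res.connection.setNoDelay(true);

  var body = 'foo' + 'bar'; // buffer the pieces yourself...
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Content-Length': Buffer.byteLength(body) // ...and skip chunked encoding
  });
  res.end(body); // single write, so the hot path above applies
}).listen(1001, '0.0.0.0');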

I took your example files and used ab (ApacheBench) as a proper tool for benchmarking HTTP server performance:
Example 1:
Concurrency Level: 50
Time taken for tests: 0.221 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 104000 bytes
HTML transferred: 3000 bytes
Requests per second: 4525.50 [#/sec] (mean)
Time per request: 11.049 [ms] (mean)
Time per request: 0.221 [ms] (mean, across all concurrent requests)
Transfer rate: 459.62 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.7      0       4
Processing:     1   11   6.4     10      32
Waiting:        1   11   6.4     10      32
Total:          1   11   6.7     10      33
Example 2:
Concurrency Level: 50
Time taken for tests: 0.256 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 107000 bytes
HTML transferred: 6000 bytes
Requests per second: 3905.27 [#/sec] (mean)
Time per request: 12.803 [ms] (mean)
Time per request: 0.256 [ms] (mean, across all concurrent requests)
Transfer rate: 408.07 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       4
Processing:     1   12   7.0     12      34
Waiting:        1   12   6.9     12      34
Total:          1   12   7.1     12      34
Note:
The second example is as fast as the first one. The small differences are probably caused by the additional function call in the code and by the fact that the document size is larger than in the first example.

Related

Can't explain this Node clustering behavior

I'm learning about threads and how they interact with Node's native cluster module. I saw some behavior I can't explain, and I'd like some help understanding it.
My code:
process.env.UV_THREADPOOL_SIZE = 1;
const cluster = require('cluster');

if (cluster.isMaster) {
  cluster.fork();
} else {
  const crypto = require('crypto');
  const express = require('express');
  const app = express();

  app.get('/', (req, res) => {
    crypto.pbkdf2('a', 'b', 100000, 512, 'sha512', () => {
      res.send('Hi there');
    });
  });

  app.listen(3000);
}
I benchmarked this code with one request using ApacheBench.
ab -c 1 -n 1 localhost:3000/ yielded these connection times:
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   605  605   0.0    605     605
Waiting:      605  605   0.0    605     605
Total:        605  605   0.0    605     605
So far so good. I then ran ab -c 2 -n 2 localhost:3000/ (doubling the number of calls from the first benchmark). I expected the total time to double, since I limited the libuv thread pool to one thread per child process and I only started one child process. But nothing really changed. Here are those results:
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:   608  610   3.2    612     612
Waiting:      607  610   3.2    612     612
Total:        608  610   3.3    612     612
For extra info: when I increase the number of calls further with ab -c 3 -n 3 localhost:3000/, I start to see a slowdown.
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   599  814 352.5    922    1221
Waiting:      599  814 352.5    922    1221
Total:        599  815 352.5    922    1221
I'm running all this on a quad-core Mac using Node v14.13.1.
tl;dr: how did my benchmark not use up all my threads? I forked one child process with one thread in its libuv pool, so the single call in the first benchmark should have been all it could handle without taking longer. And yet the second test (the one that doubled the number of calls) took the same amount of time as the first.
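
A minimal sketch for probing this, independent of HTTP and cluster (my addition, not part of the original question; note that per the Node docs, setting UV_THREADPOOL_SIZE from inside the process works on Linux/macOS but not on Windows, where it must be set before the process starts): time several concurrent pbkdf2 jobs directly. If the pool really has one thread, each job should finish roughly one full hash duration after the previous one; if the completion times overlap, the pool is wider than expected.

// Sketch: observe libuv threadpool parallelism with N concurrent pbkdf2 jobs.
process.env.UV_THREADPOOL_SIZE = 1; // must be set before the pool spins up
const crypto = require('crypto');

const start = Date.now();
for (let i = 0; i < 3; i++) {
  crypto.pbkdf2('a', 'b', 100000, 512, 'sha512', () => {
    // With a 1-thread pool, these should log at ~600, ~1200, ~1800 ms.
    console.log(`job ${i} done at ${Date.now() - start} ms`);
  });
}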

How to parse IP and port from an HTTP tracker response

I am sending a request to a tracker and get this response:
d8:completei2e10:downloadedi1e10:incompletei1e8:intervali1971e12:min intervali985e5:peers18:\235'\027\253\000\000\331e57\374-\033"\022,\270\302e
How do I get the peer list, or each peer's IP and port, from this response?
The response from the tracker is bencoded.
Adding some whitespace for clarity:
d
8:complete
i2e
10:downloaded
i1e
10:incomplete
i1e
8:interval
i1971e
12:min interval
i985e
5:peers
18:\235'\027\253\000\000\331e57\374-\033"\022,\270\302
e
The key peers, which has an 18-byte binary string as its value, contains peers in the 'compact=1' form specified in [BEP 23 - Tracker Returns Compact Peer Lists] and also in [the wiki].
Every peer is represented by 6 bytes: a 4-byte IPv4 address followed by a 2-byte port in big-endian order, so the 18-byte string holds 3 peers.
\235 '  \027 \253 \000 \000 => 157  39  23 171   0   0 (0*256+0 = 0)        => 157.39.23.171:0
\331 e  5    7    \374 -    => 217 101  53  55 252  45 (252*256+45 = 64557) => 217.101.53.55:64557
\033 "  \022 ,    \270 \302 =>  27  34  18  44 184 194 (184*256+194 = 47298) => 27.34.18.44:47298
(\235 is octal for 157, ' has ASCII value 39, etc.)
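
As a hedged illustration (my sketch in Node.js, with hypothetical names; it assumes you already have the raw "peers" value as a Buffer from a bencode decoder of your choice), the decoding could look like this:

// Sketch: decode a compact peer list (BEP 23) from a Buffer.
function parseCompactPeers(peersBuf) {
  const peers = [];
  for (let i = 0; i + 6 <= peersBuf.length; i += 6) {
    // 4 bytes of IPv4 address...
    const ip = `${peersBuf[i]}.${peersBuf[i + 1]}.${peersBuf[i + 2]}.${peersBuf[i + 3]}`;
    // ...followed by a 2-byte big-endian port.
    const port = peersBuf.readUInt16BE(i + 4);
    peers.push({ ip, port });
  }
  return peers;
}

// The 18-byte string from the answer above, written as octal/hex byte values:
const peersBuf = Buffer.from([
  0o235, 0x27, 0o027, 0o253, 0x00, 0x00,
  0o331, 0x65, 0x35, 0x37, 0o374, 0x2d,
  0o033, 0x22, 0o022, 0x2c, 0o270, 0o302,
]);
console.log(parseCompactPeers(peersBuf));
// => [ { ip: '157.39.23.171', port: 0 },
//      { ip: '217.101.53.55', port: 64557 },
//      { ip: '27.34.18.44',   port: 47298 } ]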

OpenMPI cannot fully utilize 10 GE

I tried to perform a data exchange between two machines connected by 10GE. The amount of data is large enough (8 GB) to expect network utilization near the maximum, but surprisingly I observed completely different behavior.
To check the throughput I used two different programs, nethogs and nload, and both of them show that network utilization is much lower than expected. Moreover, the results are unpredictable: sometimes the in and out channels are utilized simultaneously, but sometimes transmission and reception are separated as if the channel were half-duplex. Sample output of nload:
Device enp1s0f0 [192.168.0.11] (1/1):
Incoming: Curr: 0.00 GBit/s  Avg: 2.08 GBit/s  Min: 0.00 GBit/s  Max: 6.32 GBit/s  Ttl: 57535.38 GByte
Outgoing: Curr: 0.00 GBit/s  Avg: 2.09 GBit/s  Min: 0.00 GBit/s  Max: 6.74 GBit/s  Ttl: 57934.64 GByte
The code I use is here:
#include <boost/mpi.hpp>
#include <complex>
#include <cstdint>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    boost::mpi::environment env{};
    boost::mpi::communicator world{};
    boost::mpi::request reqs[2];

    int k = 10;
    if (argc > 1)
        k = std::atoi(argv[1]);
    uint64_t n = (1ul << k);

    std::vector<std::complex<double>> sv(n, world.rank());
    std::vector<std::complex<double>> rv(n);

    // Each of the two ranks exchanges its buffer with the other.
    int dest = world.rank() == 0 ? 1 : 0;
    int src = dest;

    world.barrier();
    reqs[0] = world.irecv(src, 0, rv.data(), n);
    reqs[1] = world.isend(dest, 0, sv.data(), n);
    boost::mpi::wait_all(reqs, reqs + 2);
    return 0;
}
And here is the command I use to run on cluster:
mpirun --mca btl_tcp_if_include 192.168.0.0/24 --hostfile ./host_file -n 2 --bind-to core /path/to/shared/folder/mpi_exp 29
The argument 29 here means that 2^(29+4) bytes = 8 GB will be sent (n = 2^29 elements, each std::complex<double> being 16 bytes).
What I have done:
Verified that there is no hardware problem by successfully saturating the channel with netcat.
Checked with tcpdump that the size of TCP packets during the communication is unstable and rarely reaches the maximum size (in the netcat case it is stable).
Checked with strace that the socket operations are correct.
Checked the TCP parameters in sysctl; they are OK.
Could you please advise me why OpenMPI doesn't work as expected?
EDIT (14.08.2018):
I was finally able to dig further into this problem. Below is the output of the OSU bandwidth benchmark (run without any MCA options):
# OSU MPI Bandwidth Test v5.3
# Size Bandwidth (MB/s)
1 0.50
2 0.98
4 1.91
8 3.82
16 6.92
32 10.32
64 22.03
128 43.95
256 94.74
512 163.96
1024 264.90
2048 400.01
4096 533.47
8192 640.02
16384 705.02
32768 632.03
65536 667.29
131072 842.00
262144 743.82
524288 654.09
1048576 775.50
2097152 759.44
4194304 774.81
Actually, I think this poor performance is caused by the CPU being the bottleneck. Each MPI process is single-threaded by default, and a single thread is simply not able to saturate the 10GE channel.
I know it is possible to communicate from several threads by enabling multithreading when building OpenMPI, but that approach adds complexity at the application level.
So is it possible to have multithreaded sending/receiving inside OpenMPI itself, at the level responsible for point-to-point data transfer?
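
One avenue worth testing (my suggestion, not something confirmed in this thread, and not a threading change per se): the TCP BTL has an MCA parameter, btl_tcp_links, which in Open MPI builds that support it opens several TCP connections per peer pair, and can help a single-threaded rank drive more of a fat pipe. A hedged variation of the original command:

mpirun --mca btl_tcp_if_include 192.168.0.0/24 --mca btl_tcp_links 4 \
       --hostfile ./host_file -n 2 --bind-to core /path/to/shared/folder/mpi_exp 29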

Go server performance is the same when adding more cores

I am trying to understand how a Go server scales when adding more cores, but I can't see any improvement and I don't know why.
There does not seem to be any change when increasing the core count. Do I need to do something in the code to tell it to use more than one core? Would that help performance?
The code I am using for the test is a simple server that outputs "Hello World":
package main

import (
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Hello World"))
    })
    http.ListenAndServe(":80", nil)
}
I am running the tests in VirtualBox.
These results are with 1 core:
$ nproc
1
Testing with ab with 1 core:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result from ab with 1 core:
Concurrency Level: 1000
Time taken for tests: 1.467 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 6815.42 [#/sec] (mean)
Time per request: 146.726 [ms] (mean)
Time per request: 0.147 [ms] (mean, across all concurrent requests)
Transfer rate: 851.93 [Kbytes/sec] received
Testing with wrk with 1 core:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result from wrk with 1 core:
Running 5s test @ http://127.0.0.1:80/
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.32ms   15.79ms  279.10ms   77.01%
    Req/Sec    24.61k     1.89k    27.77k    64.58%
  121709 requests in 5.01s, 14.86MB read
Requests/sec:  24313.72
Transfer/sec:      2.97MB
Changing to 2 cores:
$ nproc
2
Testing with ab with 2 cores:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result from ab with 2 cores:
Concurrency Level: 1000
Time taken for tests: 1.247 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 8021.12 [#/sec] (mean)
Time per request: 124.671 [ms] (mean)
Time per request: 0.125 [ms] (mean, across all concurrent requests)
Transfer rate: 1002.64 [Kbytes/sec] received
Testing with wrk with 2 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result with wrk with 2 cores:
Running 5s test @ http://127.0.0.1:80/
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    37.04ms    5.67ms   64.92ms   79.73%
    Req/Sec    26.98k     1.97k    29.71k    66.00%
  134040 requests in 5.06s, 16.36MB read
Requests/sec:  26481.38
Transfer/sec:      3.23MB
Testing with wrk with 2 cores and 2 threads:
$ wrk -t2 -c1000 -d5s http://127.0.0.1:80/
Results with wrk with 2 cores and 2 threads:
Running 5s test @ http://127.0.0.1:80/
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.32ms   29.21ms  505.22ms   98.47%
    Req/Sec    13.48k     2.11k    18.16k    63.00%
  134121 requests in 5.03s, 16.37MB read
Requests/sec:  26680.46
Transfer/sec:      3.26MB
Changing to 4 cores:
$ nproc
4
Testing with ab with 4 cores:
$ ab -n 10000 -c 1000 http://127.0.0.1/
Result with ab with 4 cores:
Concurrency Level: 1000
Time taken for tests: 1.301 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 7683.90 [#/sec] (mean)
Time per request: 130.142 [ms] (mean)
Time per request: 0.130 [ms] (mean, across all concurrent requests)
Transfer rate: 960.49 [Kbytes/sec] received
Testing with wrk with 4 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result with wrk with 4 cores:
Running 5s test @ http://127.0.0.1:80/
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    36.84ms    5.78ms   58.23ms   77.43%
    Req/Sec    26.69k     2.06k    30.19k    64.00%
  132604 requests in 5.06s, 16.19MB read
Requests/sec:  26207.42
Transfer/sec:      3.20MB
Testing with wrk with 4 cores and 4 threads:
$ wrk -t4 -c1000 -d5s http://127.0.0.1:80/
Results with wrk with 4 cores and 4 threads:
Running 5s test @ http://127.0.0.1:80/
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.58ms   26.65ms  508.77ms   98.44%
    Req/Sec     5.82k     2.21k    10.44k    64.85%
  117089 requests in 5.10s, 14.29MB read
Requests/sec:  22972.33
Transfer/sec:      2.80MB
I don't know if I can use Go if it "does not scale" at all with multiple cores. I don't understand how Go works compared to other languages. When I run tests against Facebook's HHVM, it scales out of the box with no problem when adding more cores.
What can I do to see a performance gain in the Go server when adding more cores?
EDIT:
After changing the initial code to:
package main

import (
    "net/http"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(4)
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Hello World"))
    })
    http.ListenAndServe(":80", nil)
}
the results from wrk were different: changing GOMAXPROCS from 1 to 4 resulted in a significant increase.
Testing 1 thread 4 cores:
$ wrk -t1 -c1000 -d5s http://127.0.0.1:80/
Result for 1 thread and 4 cores:
Running 5s test @ http://127.0.0.1:80/
  1 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.00ms    4.33ms   53.58ms   83.83%
    Req/Sec    48.65k     3.30k    55.18k    81.25%
  242131 requests in 5.08s, 29.56MB read
Requests/sec:  47658.92
Transfer/sec:      5.82MB
Testing 4 thread 4 cores:
$ wrk -t4 -c1000 -d5s http://127.0.0.1:80/
Result for 4 thread and 4 cores:
Running 5s test @ http://127.0.0.1:80/
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.47ms    8.49ms   99.35ms   80.88%
    Req/Sec    14.98k     2.98k    27.42k    78.65%
  298885 requests in 5.10s, 36.48MB read
Requests/sec:  58639.84
Transfer/sec:      7.16MB
But the ab results stayed the same. Does anyone know why it does not affect ab? When benchmarking HHVM, the ab results are affected as well, but with Go I get the same results.
$ ab -n 10000 -c 1000 http://127.0.0.1/
Results:
Concurrency Level: 1000
Time taken for tests: 1.410 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1280000 bytes
HTML transferred: 110000 bytes
Requests per second: 7094.18 [#/sec] (mean)
Time per request: 140.961 [ms] (mean)
Time per request: 0.141 [ms] (mean, across all concurrent requests)
Transfer rate: 886.77 [Kbytes/sec] received
You need to tell the Go runtime to use more cores by setting the environment variable GOMAXPROCS to your desired core count. Alternatively, there is also a function, runtime.GOMAXPROCS, to change it at run time.
By default this is set to one; as of Go 1.5 it defaults to the number of cores in your system.
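As a minimal illustration (the binary name ./server is hypothetical), with the older runtimes discussed here you can set it without touching the code:
$ GOMAXPROCS=4 ./server
$ GOMAXPROCS=$(nproc) ./server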

What's the meaning of "Min xfer" and "throughput" in the output of IOzone

I'm a new user of IOzone. When I run IOzone with the command ./iozone -i 0 -i 1 -t 2 -T, it generates the following (partial) result:
Command line used: ./iozone -i 0 -i 1 -t 2 -T
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 2 threads
Each thread writes a 512 Kbyte file in 4 Kbyte records
Children see throughput for 2 initial writers = 650943.69 KB/sec
Parent sees throughput for 2 initial writers = 13090.24 KB/sec
Min throughput per thread = 275299.72 KB/sec
Max throughput per thread = 375643.97 KB/sec
Avg throughput per thread = 325471.84 KB/sec
Min xfer = 356.00 KB
Children see throughput for 2 rewriters = 1375881.50 KB/sec
Parent sees throughput for 2 rewriters = 10523.74 KB/sec
Min throughput per thread = 1375881.50 KB/sec
Max throughput per thread = 1375881.50 KB/sec
Avg throughput per thread = 687940.75 KB/sec
Min xfer = 512.00 KB
Children see throughput for 2 readers = 2169601.25 KB/sec
Parent sees throughput for 2 readers = 27753.94 KB/sec
Min throughput per thread = 2169601.25 KB/sec
Max throughput per thread = 2169601.25 KB/sec
Avg throughput per thread = 1084800.62 KB/sec
Min xfer = 512.00 KB
Children see throughput for 2 re-readers = 2572435.25 KB/sec
Parent sees throughput for 2 re-readers = 26311.78 KB/sec
Min throughput per thread = 2572435.25 KB/sec
Max throughput per thread = 2572435.25 KB/sec
Avg throughput per thread = 1286217.62 KB/sec
Min xfer = 512.00 KB
iozone test complete.
I'm confused about the meaning of "throughput" and "Min xfer"; can someone help me?
By the way, why is the throughput seen by the children different from the throughput seen by the parent? Thanks!
Min xfer refers to the smallest amount of data written at one time. ("Each thread writes a 512 Kbyte file in 4 Kbyte records.")
So if the Min xfer was 512.00 KB, the thread wrote the entire file to disk at once (all the 4 Kbyte records were grouped together).
Children and parent throughput are different due to OS I/O buffering: iozone doesn't force direct (non-buffered) reads or writes in the throughput test. What you're really measuring is your system's buffer cache + disk cache + disk speed combination.
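If you want numbers closer to raw disk speed, one hedged variation (assuming your platform's iozone build supports these options) is to add -I, which requests direct (O_DIRECT) I/O where possible, and -e, which includes fsync/fflush in the timing:
./iozone -i 0 -i 1 -t 2 -T -I -e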