I'm using Artillery to run some performance tests on a node app. I'm not sure how to interpret the results. I get something like
All virtual users finished
Summary report # 11:24:12(+1000) 2019-04-29
  Scenarios launched:  600
  Scenarios completed: 600
  Requests completed:  600
  RPS sent: 19.73
  Request latency:
    min: 1.2
    max: 7.7
    median: 1.7
    p95: 3.1
    p99: 3.8
  Scenario counts:
    0: 600 (100%)
  Codes:
    400: 600
I'm not sure what these results mean, for example:
Request latency
Codes
Scenario counts
On a side note, is there any other, more popular tool that can be used for Node apps?
Read through the Artillery docs page to understand more about the results:
https://artillery.io/docs/getting-started/
In short, "Request latency" is the per-request response time in milliseconds (min, max, median, 95th and 99th percentile), "Scenario counts" shows how many virtual users completed each scenario defined in your script, and "Codes" is the breakdown of HTTP response status codes. In your run, 400: 600 means all 600 requests came back with HTTP 400, so the server is rejecting the test requests and the latency figures mostly measure how quickly it rejects them.
Additionally, you can check out ab and wrk for a deeper analysis of your HTTP endpoints. You'd almost always want to keep an eye on what's happening inside your web server while it is under load; for that, you can take a look at tools like node-clinic and N|Solid.
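To make "Scenario counts" more concrete, here is a minimal sketch of an Artillery script (the target, endpoint and scenario name are placeholders, not taken from your test). Each entry under scenarios gets its own line in that section of the report, and the "Codes" section counts the HTTP status codes those requests return:

config:
  target: "http://localhost:3000"
  phases:
    - duration: 30
      arrivalRate: 20
scenarios:
  - name: "Get status"
    flow:
      - get:
          url: "/api/status"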
Related
We use RestHighLevelClient to query AWS OpenSearch in our service. Recently we have seen some latency issues related to OpenSearch calls so I'm doing stress test to troubleshoot but observed some unexpected behaviors.
In our service, when a request is received we start 5 threads and make one OpenSearch call in each thread, in parallel, in order to achieve latency similar to a single call. During load tests, even when I send traffic at 1 TPS, I see very different latency numbers across the threads of the same request. There are usually one or two threads with huge latency compared to the others, as if those threads were being blocked by something, for example 390 ms, 300 ms, 1.1 s, 520 ms and 30 ms across the five threads. In the meantime I don't see any search latency spike reported by the OpenSearch service; the max SearchLatency stays under 350 ms the whole time.
I read that the low-level REST client used inside RestHighLevelClient manages a connection pool with very small default maxConn values, so I've overridden both DEFAULT_MAX_CONN_PER_ROUTE (to 100) and DEFAULT_MAX_CONN_TOTAL (to 200) when creating the client, but it doesn't seem to be working based on the test results I saw before and after updating these two values.
I'm wondering if anyone has seen similar issues or has any ideas on what could be the reason for this behavior. Thanks!
In brief, I am having trouble supporting more than 5,000 read requests per minute from a data API built on PostgreSQL, Node.js, and node-postgres. The bottleneck appears to be between the API and the DB. Here are the implementation details.
I'm using an AWS PostgreSQL RDS database instance (m4.4xlarge: 64 GB memory, 16 vCPUs, 350 GB SSD, no provisioned IOPS) for a Node.js powered data API. By default the RDS's max_connections=5000. The Node API is load-balanced across two clusters with 4 processes each (2 EC2s with 4 vCPUs, running the API with PM2 in cluster mode). I use node-postgres to bind the API to the PostgreSQL RDS and am attempting to use its connection pooling feature. Below is a sample of my connection pool code:
var Pool = require('pg').Pool; // node-postgres

var pool = new Pool({
    user: settings.database.username,
    password: settings.database.password,
    host: settings.database.readServer,
    database: settings.database.database,
    max: 25,
    idleTimeoutMillis: 1000
});

/* Example of pool usage */
pool.query('SELECT my_column FROM my_table', function(err, result){
    /* Callback code here */
});
Using this implementation and testing with a load tester, I can support about 5000 requests over the course of one minute, with an average response time of about 190ms (which is what I expect). As soon as I fire off more than 5000 requests per minute, my response time increases to over 1200ms in the best of cases and in the worst of cases the API begins to frequently timeout. Monitoring indicates that for the EC2s running the Node.js API, CPU utilization remains below 10%. Thus my focus is on the DB and the API's binding to the DB.
I have attempted to increase (and decrease, for that matter) the node-postgres "max" connections setting, but there was no change in the API response/timeout behavior. I've also tried provisioned IOPS on the RDS, but no improvement. Also, interestingly, I scaled the RDS up to m4.10xlarge (160 GB memory, 40 vCPUs), and while the RDS CPU utilization dropped greatly, the overall performance of the API worsened considerably (it couldn't even support the 5,000 requests per minute that I could with the smaller RDS).
I'm in unfamiliar territory in many respects and am unsure how to best determine which of these moving parts is bottlenecking API performance above 5,000 requests per minute. As noted, I have attempted a variety of adjustments based on reviewing the PostgreSQL configuration documentation and the node-postgres documentation, but to no avail.
If anyone has advice on how to diagnose or optimize I would greatly appreciate it.
UPDATE
After scaling up to m4.10xlarge, I performed a series of load tests, varying the number of requests/min and the max number of connections in each pool. Here are some screen captures of the monitoring metrics:
In order to support more than 5k requests per minute while maintaining the same response times, you'll need better hardware...
The simple math states that:
5,000 requests * 190 ms avg = 950,000 ms of DB time per minute, divided across 16 cores ~ 60,000 ms per core
and since each core only has 60,000 ms available per minute, this basically means your system was highly loaded.
(I'm guessing you had some spare CPU, as some time was lost on networking.)
Now, the really interesting part in your question comes from the scale up attempt: m4.10xlarge (160 GB mem, 40 vCPUs).
The drop in CPU utilization indicates that the scale-up freed up DB time, so you need to push more requests!
2 suggestions:
Try increasing the connection pool to max: 70 and look at the network traffic (depending on the amount of data, you might be hogging the network).
Also, are your requests to the DB asynchronous from the application side? Make sure your app can actually push more requests in parallel; see the sketch below.
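A minimal sketch of firing the DB calls concurrently from Node, assuming a node-postgres version where pool.query returns a promise when no callback is passed (the table/column names are the ones from the question; handleRequest and ids are made up for illustration):

const { Pool } = require('pg');

const pool = new Pool({
  /* same connection settings as in the question */
  max: 70,
  idleTimeoutMillis: 1000
});

async function handleRequest(ids) {
  // Issue all queries at once instead of awaiting them one by one,
  // so the pool (and the database) actually sees concurrent work.
  const results = await Promise.all(
    ids.map(id => pool.query('SELECT my_column FROM my_table WHERE id = $1', [id]))
  );
  return results.map(r => r.rows);
}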
The best way is to make use of a separate Pool for each API call, based on the call's priority:
const highPriority = new Pool({max: 20}); // for high-priority API calls
const lowPriority = new Pool({max: 5}); // for low-priority API calls
Then you just use the right pool for each of the API calls, for optimum service/connection availability.
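For example, a rough sketch assuming an Express-style app (the routes are placeholders, and both pools still need the same connection settings as in the question, omitted here for brevity):

const express = require('express'); // assumed web framework, not from the original answer
const { Pool } = require('pg');

const app = express();
const highPriority = new Pool({ max: 20 /* plus the usual connection settings */ });
const lowPriority = new Pool({ max: 5 /* plus the usual connection settings */ });

// Latency-sensitive, user-facing call gets the larger pool
app.get('/api/orders', function (req, res) {
  highPriority.query('SELECT my_column FROM my_table', function (err, result) {
    if (err) return res.status(500).end();
    res.json(result.rows);
  });
});

// Low-priority reporting call gets the smaller pool
app.get('/api/reports', function (req, res) {
  lowPriority.query('SELECT my_column FROM my_table', function (err, result) {
    if (err) return res.status(500).end();
    res.json(result.rows);
  });
});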
Since you are interested in read performance, you can set up replication between two (or more) PostgreSQL instances, and then use pgpool-II to load balance between the instances.
Scaling horizontally means you won't start hitting the max instance sizes at AWS if you decide next week you need to go to 10,000 concurrent reads.
You also start to get some HA in your architecture.
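If you go that route, here is a rough sketch of the relevant pgpool-II settings (the hostnames are placeholders and exact parameter names vary between pgpool versions, so treat this as a starting point rather than a complete config):

# pgpool.conf (excerpt)
load_balance_mode = on                      # spread read queries across backends
backend_hostname0 = 'primary.example.com'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'replica1.example.com'
backend_port1 = 5432
backend_weight1 = 1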
--
Many times people will use pgbouncer as a connection pooler even if they already have one built into their application code. pgbouncer works really well and is typically easier to configure and manage than pgpool, but it doesn't do load balancing, so I'm not sure how much it would help you in this scenario.
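If you do give pgbouncer a try, a minimal pgbouncer.ini sketch looks like this (the hostname, port, pool mode and sizes are placeholders, not recommendations):

; pgbouncer.ini (excerpt)
[databases]
mydb = host=db.example.com port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25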
How can I limit/control my concurrent requests with Rest Assured? The REST API I'm currently testing is limited to 3 concurrent requests per account at a time, which means all the other tests get a 429 response code.
An example would be much appreciated.
As far as I know, Rest Assured does not have such functionality. You could easily do it using a test runner, e.g. TestNG, by managing the parallel threads:
name: SingleSuite
threadCount: 3
parallel: methods
tests:
  - name: Regression
    classes:
      - test.ApiTest
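The same suite expressed as a conventional testng.xml, in case you prefer the XML format over the YAML suite file (the class name is the same placeholder as above):

<suite name="SingleSuite" parallel="methods" thread-count="3">
  <test name="Regression">
    <classes>
      <class name="test.ApiTest"/>
    </classes>
  </test>
</suite>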
If you want to run several accounts simultaneously, you should think about making your Rest Assured usage thread-safe (or check whether that has been fixed in the latest versions).
I cannot figure out the cause of the bottleneck on this site: response times become very bad once about 400 users are reached. The site is on Google Compute Engine, using an instance group with network load balancing. We created the project with Sails.js.
I have been doing load testing with Google Container Engine using Kubernetes, running the locust.py script.
The main results for one of the tests are:
RPS: 30
Spawn rate: 5 p/s
TOTAL USERS: 1000
AVG (response time): 27,500 ms (27.5 seconds!)
The response time initially is great, below one second, but when it starts reaching about 400 users the response time starts to jump massively.
I have tested obvious factors that can influence that response time, results below:
Compute Engine instances
(2 x standard-n2, 200 GB disk, 7.5 GB RAM per instance):
Only about 20% CPU utilization
Outgoing network bytes: 340k bytes/sec
Incoming network bytes: 190k bytes/sec
Disk operations: 1 op/sec
Memory: below 10%
MySQL:
Max_used_connections : 41 (below total possible)
Connection errors: 0
All other results for MySQL also seem fine, no reason to cause bottleneck.
I tried the same test with a freshly created Sails.js project, and it did better, but still had terrible results: about 5-second response times at around 2000 users.
What else should I test? What could be the bottleneck?
Are you doing any file reading/writing? This is a major obstacle in Node.js and will always cause some issues. Caching read files, or removing the need for such code, should be done as much as possible. In my own experience, serving files like images, CSS and JS through my Node server started causing trouble as the number of concurrent requests increased. The solution was to serve all of this through a CDN.
Another problem could be the MySQL driver. We had some problems with connections not being closed correctly (not using Sails.js, but I think it used the same driver at the time I encountered this), and they would cause problems on the MySQL server, resulting in long delays when fetching data from the database. You should time/track the MySQL queries and make sure they aren't being delayed.
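A quick-and-dirty sketch of timing each query with the node mysql driver (the connection settings, the timedQuery helper and the 200 ms threshold are made up for illustration, not taken from the question):

var mysql = require('mysql');

var pool = mysql.createPool({
  host: 'localhost',
  user: 'app',
  password: 'secret',
  database: 'mydb'
});

function timedQuery(sql, params, cb) {
  var start = Date.now();
  pool.query(sql, params, function (err, rows) {
    var elapsed = Date.now() - start;
    if (elapsed > 200) {
      // Log anything slower than the threshold so delayed queries stand out
      console.warn('Slow query (' + elapsed + ' ms): ' + sql);
    }
    cb(err, rows);
  });
}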
Lastly, it could be some issue specific to Sails.js and Google Compute Engine. You should make sure there aren't any open issues on either of these about the same problem you are experiencing.
I am trying to scale websites on Windows Azure. So far I've tested WordPress, Ghost (blog) and a plain HTML site, and it's all the same: if I scale them up (add instances), they don't get any faster. I am sure I must be doing something wrong...
This is what I did:
I created a new shared website, with a plain HTML Bootstrap template on it. http://demobootstrapsite.azurewebsites.net/
Then I installed ab.exe from the Apache Project on a hosted bare metal server (4 Cores, 12 GB RAM, 100 MBit)
I ran the test two times. The first time with a single shared instance and the second time with two shared instances using this command:
ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This means ab.exe is going to create 10000 requests with 100 parallel threads.
I expected the response times of the test with two shared instances to be significantly lower than the response times with just one shared instance. But the mean time per request even rose slightly, from 1452.519 ms with one shared instance to 1460.631 ms with two shared instances. Later I even ran the site on 8 shared instances with no effect at all. My first thought was that maybe the shared instances were the problem, so I put the site on a standard VM and ran the test again. But the problem remained the same; adding more instances didn't make the site any faster (it was even a bit slower).
Later I watched a video with Scott Hanselman and Stefan Schackow in which they explained the Azure scaling features. Stefan says that Azure has a kind of "sticky load balancing" which always redirects a client to the same instance/VM to avoid compatibility problems with stateful applications. So I checked the web server logs and found a log file for every instance, all of about the same size, which usually means every instance was used during the test.
PS: During the test run I checked the response time of the website from my local computer (on a different network than the server) and the response times were about 1.5 s.
Here are the test results:
######################################
1 instance result
######################################
PS C:\abtest> .\ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking demobootstrapsite.azurewebsites.net (be patient)
Finished 10000 requests
Server Software: Microsoft-IIS/8.0
Server Hostname: demobootstrapsite.azurewebsites.net
Server Port: 80
Document Path: /
Document Length: 16396 bytes
Concurrency Level: 100
Time taken for tests: 145.252 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 168800000 bytes
HTML transferred: 163960000 bytes
Requests per second: 68.85 [#/sec] (mean)
Time per request: 1452.519 [ms] (mean)
Time per request: 14.525 [ms] (mean, across all concurrent requests)
Transfer rate: 1134.88 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 14 8.1 16 78
Processing: 47 1430 93.9 1435 1622
Waiting: 16 705 399.3 702 1544
Total: 62 1445 94.1 1451 1638
Percentage of the requests served within a certain time (ms)
50% 1451
66% 1466
75% 1482
80% 1498
90% 1513
95% 1529
98% 1544
99% 1560
100% 1638 (longest request)
######################################
2 instances result
######################################
PS C:\abtest> .\ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking demobootstrapsite.azurewebsites.net (be patient)
Finished 10000 requests
Server Software: Microsoft-IIS/8.0
Server Hostname: demobootstrapsite.azurewebsites.net
Server Port: 80
Document Path: /
Document Length: 16396 bytes
Concurrency Level: 100
Time taken for tests: 146.063 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 168800046 bytes
HTML transferred: 163960000 bytes
Requests per second: 68.46 [#/sec] (mean)
Time per request: 1460.631 [ms] (mean)
Time per request: 14.606 [ms] (mean, across all concurrent requests)
Transfer rate: 1128.58 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 14 8.1 16 78
Processing: 31 1439 92.8 1451 1607
Waiting: 16 712 402.5 702 1529
Total: 47 1453 92.9 1466 1622
Percentage of the requests served within a certain time (ms)
50% 1466
66% 1482
75% 1482
80% 1498
90% 1513
95% 1529
98% 1544
99% 1560
100% 1622 (longest request)
"Scaling" the website in terms of resources adds more capacity to accept more requests, and won't increase the speed at which a single capacity instance can perform when not overloaded.
For example; assume a Small VM can accept 100 requests per second, processing each request at 1000ms, (and if it was 101 requests per second, each request would start to slow down to say 1500ms) then scaling to more Small VMs won't increase the speed at which a single request can be processed, it just raises us to accepting 200 requests per second under 1000ms each (as now both machines are not overloaded).
For per-request performance; the code itself (and CPU performance of the Azure VM) will impact how quickly a single request can be executed.
Given that the question omits the most important detail of such a test, the bandwidth available between the test client and the site, it sounds to me like you are merely testing your Internet connection. The reported transfer rate of about 1135 KB/s works out to roughly 9 Mbit/s, and 10 Mbit/s is a very common rate.
No, it doesn't scale.
I usually run Log Parser against the IIS logs that were generated at the time of the load test and calculate the RPS and latency (the time-taken field) from them. This helps isolate whether the slowness comes from the network, from server-side processing, or from the load test tool's own reporting.
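For example, something along these lines (the u_ex*.log file pattern and the 60-second bucket are assumptions about your IIS setup, not part of the original answer):

LogParser.exe -i:IISW3C -o:CSV "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 60) AS Minute, COUNT(*) AS Requests, AVG(time-taken) AS AvgLatencyMs FROM u_ex*.log GROUP BY Minute ORDER BY Minute"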
Some ideas:
Is Azure throttling to prevent a DOS attack? You are making a hell of a lot of requests from one location to a single page.
Try Small sized Web Sites rather than shared. Capacity and Scaling might be quite different. Load of 50 requests/sec doesn't seem terrible for a shared service.
Try to identify where that time is going. 1.4s is a really long time.
Run load tests from several different machines simultaneously, to determine if there's throttling going on or you're affected by sticky load balancing or other network artefacts.
You said it's ok under load of about 10 concurrent requests at 50 requests/second. Gradually increase the load you're putting on the server to determine the point at which it starts to choke. Do this across multiple machines too.
Can you log on to Web Sites? Probably not ... see if you can replicate the same issues on a Cloud Service Web Role and analyze from there using Performance Monitor and typical IIS tools to see where the bottleneck is, or if it's even on the machine versus Azure network infrastructure.
Before you load test the websites, you should do a baseline test with a single instance, say with 10 concurrent threads, to check how the website handles when not under load. Then use this base line to understand how the websites behave under load.
For example, if the baseline shows the website responds in 1.5s to requests when not under load, and again with 1.5s under load, then this means the website is able to handle the load easily. If under load the website takes 3-4s using a single instance, then this means it doesn't handle the load so well - try to add another instance and check if the response time improves.
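Using the same ab.exe syntax as in the question, the baseline run could look like this (only the concurrency level changes; the request count is arbitrary):

.\ab.exe -n 1000 -c 10 http://demobootstrapsite.azurewebsites.net/

Then compare its "Time per request" against the -c 100 run; if both sit around 1.5 s, the latency is there even without heavy load and is not a scaling problem.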
You can test for free here:
http://tools.pingdom.com/fpt/#!/ELmHA/http://demobootstrapsite.azurewebsites.net/
http://tools.pingdom.com/