Why does Windows Azure not scale?

I am trying to scale websites on Windows Azure. So far I've tested WordPress, Ghost (blog), and a plain HTML site, and it's always the same: if I scale them up (add instances), they don't get any faster. I am sure I must be doing something wrong...
This is what I did:
I created a new shared website with a plain HTML Bootstrap template on it: http://demobootstrapsite.azurewebsites.net/
Then I installed ab.exe from the Apache project on a hosted bare-metal server (4 cores, 12 GB RAM, 100 MBit/s).
I ran the test twice, first with a single shared instance and then with two shared instances, using this command:
ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This tells ab.exe to issue 10000 requests over 100 concurrent connections.
I expected the response times of the test with two shared instances to be significantly lower than with just one shared instance. But the mean time per request even rose slightly, from 1452.519 ms with one shared instance to 1460.631 ms with two. Later I ran the site on 8 shared instances, with no effect at all. My first thought was that maybe the shared instances were the problem, so I put the site on a standard VM and ran the test again. The problem remained the same: adding more instances didn't make the site any faster (it even got a bit slower).
Later I watched a video with Scott Hanselman and Stefan Schackow in which they explained the Azure scaling features. Stefan says that Azure has a kind of "sticky load balancing" which always directs a client to the same instance/VM, to avoid compatibility problems with stateful applications. So I checked the web server logs, and I found a log file for every instance, each of about the same size. That usually means every instance was used during the test.
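A side note on that stickiness: Azure Websites implements it with an ARRAffinity cookie set by the load balancer, and ab.exe neither stores nor replays cookies, so every request in the test was free to land on any instance, which matches the per-instance log files. The cookie can be seen with any client that prints response headers, for example:
curl -sI http://demobootstrapsite.azurewebsites.net/ | findstr /i Set-Cookie
(curl piped through findstr is just one way to inspect the headers; any HTTP client that shows Set-Cookie will do.)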
PS: During the test run I checked the response time of the website from my local computer (from a different network than the server), and the response times were about 1.5 s.
Here are the test results:
######################################
1 instance result
######################################
PS C:\abtest> .\ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking demobootstrapsite.azurewebsites.net (be patient)
Finished 10000 requests
Server Software: Microsoft-IIS/8.0
Server Hostname: demobootstrapsite.azurewebsites.net
Server Port: 80
Document Path: /
Document Length: 16396 bytes
Concurrency Level: 100
Time taken for tests: 145.252 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 168800000 bytes
HTML transferred: 163960000 bytes
Requests per second: 68.85 [#/sec] (mean)
Time per request: 1452.519 [ms] (mean)
Time per request: 14.525 [ms] (mean, across all concurrent requests)
Transfer rate: 1134.88 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    14    8.1     16      78
Processing:    47  1430   93.9   1435    1622
Waiting:       16   705  399.3    702    1544
Total:         62  1445   94.1   1451    1638
Percentage of the requests served within a certain time (ms)
50% 1451
66% 1466
75% 1482
80% 1498
90% 1513
95% 1529
98% 1544
99% 1560
100% 1638 (longest request)
######################################
2 instances result
######################################
PS C:\abtest> .\ab.exe -n 10000 -c 100 http://demobootstrapsite.azurewebsites.net/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking demobootstrapsite.azurewebsites.net (be patient)
Finished 10000 requests
Server Software: Microsoft-IIS/8.0
Server Hostname: demobootstrapsite.azurewebsites.net
Server Port: 80
Document Path: /
Document Length: 16396 bytes
Concurrency Level: 100
Time taken for tests: 146.063 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 168800046 bytes
HTML transferred: 163960000 bytes
Requests per second: 68.46 [#/sec] (mean)
Time per request: 1460.631 [ms] (mean)
Time per request: 14.606 [ms] (mean, across all concurrent requests)
Transfer rate: 1128.58 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    14    8.1     16      78
Processing:    31  1439   92.8   1451    1607
Waiting:       16   712  402.5    702    1529
Total:         47  1453   92.9   1466    1622
Percentage of the requests served within a certain time (ms)
50% 1466
66% 1482
75% 1482
80% 1498
90% 1513
95% 1529
98% 1544
99% 1560
100% 1622 (longest request)

"Scaling" the website in terms of resources adds more capacity to accept more requests, and won't increase the speed at which a single capacity instance can perform when not overloaded.
For example: assume a Small VM can accept 100 requests per second, processing each request in 1000 ms (and at 101 requests per second, each request would start to slow down to, say, 1500 ms). Scaling out to more Small VMs won't increase the speed at which a single request is processed; it just raises us to accepting 200 requests per second at under 1000 ms each (as now neither machine is overloaded).
For per-request performance, the code itself (and the CPU performance of the Azure VM) determines how quickly a single request can be executed.
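The numbers in the question bear this out. ab is a closed-loop tool: its mean "Time per request" is, by construction, concurrency divided by throughput, so with throughput pinned at about 69 requests per second, the latency seen at 100 concurrent requests cannot come out any other way:
1 instance:  100 / 68.85 req/s = 1.452 s mean time per request
2 instances: 100 / 68.46 req/s = 1.461 s mean time per request
Until whatever caps throughput at ~69 req/s is removed, adding instances will not move these numbers.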

Given the complete absence from the question of the most important detail of such a test (the network bandwidth between the load generator and the site), it sounds to me like you are merely testing your Internet connection's bandwidth. 10 Mbit/s is a very common rate.
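The arithmetic from the posted results backs this up: 168,800,000 bytes transferred in 145.252 seconds is about 1.16 MB/s, i.e. roughly 9.3 Mbit/s, suspiciously close to a 10 Mbit/s cap, and it stays there whether one, two, or eight instances are serving.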
No, it doesn't scale.

I usually run LogParser against the IIS logs generated at the time of the load test and calculate the RPS and latency (the time-taken field) from those. This helps separate slowness caused by the network from server-side processing time and from the load test tool's own reporting.
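For example, something along these lines computes per-second request counts and latency from the IIS logs (a sketch; the u_ex*.log name pattern depends on your logging setup, and time-taken is the standard W3C field in milliseconds):
LogParser.exe -i:IISW3C -o:CSV "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 1) AS Second, COUNT(*) AS Requests, AVG(time-taken) AS AvgMs, MAX(time-taken) AS MaxMs FROM u_ex*.log GROUP BY Second ORDER BY Second"
If AvgMs stays low while ab reports ~1.5 s per request, the time is being spent on the wire rather than in server processing.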

Some ideas:
Is Azure throttling to prevent a DoS attack? You are making a hell of a lot of requests from one location to a single page.
Try Small-sized Web Sites rather than Shared; capacity and scaling might be quite different. A load of 50 requests/sec doesn't seem terrible for a shared service.
Try to identify where that time is going; 1.4 s is a really long time.
Run load tests from several different machines simultaneously, to determine whether there's throttling going on, or whether you're affected by sticky load balancing or other network artefacts.
You said it's OK under a load of about 10 concurrent requests at 50 requests/second. Gradually increase the load you're putting on the server to determine the point at which it starts to choke; do this across multiple machines too (see the sketch after this list).
Can you log on to Web Sites? Probably not... See if you can replicate the same issues on a Cloud Service Web Role and analyze from there, using Performance Monitor and the usual IIS tools, to see where the bottleneck is, or whether it's even on the machine versus the Azure network infrastructure.
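As a concrete way to do that ramp-up, a loop over increasing concurrency levels is enough (a sketch, assuming a Unix shell on the load generator; the same ab.exe invocations can be scripted in PowerShell):
for c in 1 5 10 25 50 100 200; do
  ab -n 2000 -c $c http://demobootstrapsite.azurewebsites.net/ | grep -E "Requests per second|Time per request"
done
The concurrency level at which "Requests per second" stops growing is where the bottleneck (server, network, or throttling) starts to bind.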

Before you load test the websites, you should run a baseline test against a single instance, say with 10 concurrent requests, to check how the website responds when not under load. Then use this baseline to understand how the websites behave under load.
For example, if the baseline shows the website responds in 1.5 s when not under load, and again in 1.5 s under load, the website is handling the load easily. If under load it takes 3-4 s on a single instance, it isn't handling the load so well; add another instance and check whether the response time improves.
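With the tool already used in the question, such a baseline could be as simple as:
ab.exe -n 1000 -c 10 http://demobootstrapsite.azurewebsites.net/
If the mean "Time per request" is already around 1.5 s at 10 concurrent requests, then the 1.5 s measured at 100 concurrent requests is the floor for a single request rather than a queueing effect, and adding instances will not help.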

You can test your site for free here:
http://tools.pingdom.com/fpt/#!/ELmHA/http://demobootstrapsite.azurewebsites.net/
http://tools.pingdom.com/
Regards,
Valentin

Related

How to interpret the Node.js performance tests run through Artillery

I'm using Artillery to run some performance tests on a Node app. I'm not sure how to interpret the results. I get something like:
All virtual users finished
Summary report # 11:24:12(+1000) 2019-04-29
Scenarios launched: 600
Scenarios completed: 600
Requests completed: 600
RPS sent: 19.73
Request latency:
min: 1.2
max: 7.7
median: 1.7
p95: 3.1
p99: 3.8
Scenario counts:
0: 600 (100%)
Codes:
400: 600
I'm not sure what these results mean, for example:
Request latency
Codes
Scenario counts
On a side note, is there any other, more popular tool that can be used for Node apps?
Read through the Artillery docs page to understand more about the results:
https://artillery.io/docs/getting-started/
In short: "Request latency" is the response-time distribution in milliseconds (min/max/median/p95/p99), "Scenario counts" shows how many times each scenario ran, and "Codes" is the distribution of HTTP response codes. Note that in your output all 600 responses were HTTP 400, so the latencies you measured are for error responses, not successful requests.
Additionally, you can check out ab and wrk for a deeper analysis of your HTTP endpoints. You'd almost always want to keep an eye on what's happening inside your web server while it is under load; for that you can take a look at tools like node-clinic and n-solid.
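For example (a sketch; localhost:3000 and server.js are placeholders for your app's address and entry point):
wrk -t4 -c100 -d30s http://localhost:3000/
ab -n 1000 -c 50 http://localhost:3000/
clinic doctor -- node server.js
The first two drive load against the endpoint from outside; the last, from node-clinic, profiles the server process itself while that load is applied.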

Using the ab command, 100 concurrent requests fail on Linux. How to solve this issue for Linux?

What I have done:
I used the following ab command:
ab -n 1000 -c 100 http://192.168.101.143:8558/num?num=5
Here I am generating 1000 HTTP requests with 100 concurrent connections against port 8558, asking my own web server for the factorial of 5.
On the other side, my web server is waiting for requests at that IP address and port 8558. When I execute the ab command, the server accepts the requests and responds with the factorial of 5, but 10-20 of the requests fail every time on the Linux system.
However, when I run the same server on a native Windows system, it answers all 1000 requests correctly, on time, with 100 concurrent connections and no failures.
Problem:
On Windows, 100 concurrent connections work fine, but on Linux some requests fail. Why?
How do I solve this issue for Linux?
Is there a socket-related issue?
When the server runs on Windows:
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.101.143 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 192.168.101.143
Server Port: 8558
Document Path: /num?num=5
Document Length: 3 bytes
Concurrency Level: 100
Time taken for tests: 1.350 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 123000 bytes
HTML transferred: 3000 bytes
Requests per second: 652.93 [#/sec] (mean)
Time per request: 155.493 [ms] (mean)
Time per request: 1.550 [ms] (mean, across all concurrent requests)
Transfer rate: 75.20 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     2    1.6      1      12
Processing:     4    74  117.5     25    1803
Waiting:        2    56   74.4     16    1281
Total:          5   150  117.8     26    1810
Percentage of the requests served within a certain time (ms)
50% 26
66% 42
75% 78
80% 93
90% 171
95% 218
98% 358
99% 536
100% 1810 (longest request)
When the server runs on Linux:
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.101.143 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 192.168.101.143
Server Port: 8558
Document Path: /num?num=5
Document Length: 3 bytes
Concurrency Level: 100
Time taken for tests: 9.899 seconds
Complete requests: 1000
Failed requests: 13
(Connect: 0, Receive: 0, Length: 13, Exceptions: 0)
Total transferred: 119802 bytes
HTML transferred: 2922 bytes
Requests per second: 102.91 [#/sec] (mean)
Time per request: 120.934 [ms] (mean)
Time per request: 1.299 [ms] (mean, across all concurrent requests)
Transfer rate: 50.19 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     4    3.0      3      21
Processing:     2   346 2206.7      7   18823
Waiting:        0    76  657.4      6    6371
Total:          3    90 2206.7     10   18827
Percentage of the requests served within a certain time (ms)
50% 10
66% 12
75% 15
80% 17
90% 26
95% 39
98% 6341
99% 18820
100% 18827 (longest request)
In this case I was running my web server inside a Linux guest in VirtualBox and executing the ab command from another system. The problem was the port forwarding in VirtualBox (NAT/Bridged/Host-only), which caused some of the requests to fail.
When I instead ran the web server on a host Linux system and executed ab from another machine, I got perfect results with 100 concurrent connections.
So I conclude that for this kind of load testing you have to account for hardware latency, and for port-forwarding issues when the server under test runs in VirtualBox; on a host Linux system there is no such issue.
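A side note on the socket question: the "Length: 13" in the ab output means 13 responses came back with a body length different from the first response, i.e. truncated or error responses. If failures like these ever show up outside a VM, the usual Linux suspects are the per-process file-descriptor limit and the TCP listen backlog, which can be inspected and raised like this (standard commands; the values are examples only):
ulimit -n                                  # max open file descriptors for this shell
sysctl net.core.somaxconn                  # kernel cap on the listen() backlog
sudo sysctl -w net.core.somaxconn=1024     # raise the cap, e.g. for a test
The backlog argument the server passes to listen() must also be raised for the larger cap to matter.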

Why does node.js+mongodb not give 100 req/sec throughput for 100 req sent in a second?

I kept the Node.js server on one machine and the MongoDB server on another machine. The requests were a mixture of 70% reads and 30% writes. I observed that at 100 requests per second the throughput is 60 req/sec, and at 200 requests per second the throughput is 130 req/sec. CPU and memory usage are the same in both cases. If the application can serve 130 req/sec, why did it not serve 100 req/sec in the first case, given that CPU and memory utilization were the same? The machines are running Ubuntu Server 14.04.
Create user threads in JMeter and use "Loop Forever" with a duration of 300 seconds, then collect the values.
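Once such a test plan is saved, it can also be run headless and the results written to a file for analysis (a sketch; test.jmx and results.jtl are placeholder names):
jmeter -n -t test.jmx -l results.jtl
Here -n selects non-GUI mode, -t names the test plan, and -l collects the sample results.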

Azure load balancer unstable performance

I have a cloud service with a single worker role running a self-hosted Web API.
I am trying to consume this Web API from another cloud service (in the same VNet) using the xxx.cloudapp.net address, but performance is very unstable. Sometimes, after a couple hundred requests, HTTP requests freeze for some time. It seems like the Azure load balancer is throttling my requests.
Here is the output from Apache Bench with the freezing reproduced (run from another VM in the same VNet):
ab -c 10 -n 1000 http://xxx.cloudapp.net/ping
<..>
Time taken for tests: 39.970 seconds
Complete requests: 1000
Failed requests: 1
(Connect: 1, Receive: 0, Length: 0, Exceptions: 0)
<..>
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    19  402.6      0    9017
Processing:     0   360 2307.4     15   21046
Waiting:        0   318 2178.6      0   21046
Total:          0   379 2339.6     16   21046
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 16
95% 17
98% 9015
99% 9032
100% 21046 (longest request)
There are no freezes when using the local IP (e.g. 10.0.0.x).
I tried using a web role/IIS, with the same results.
Why is this happening? How can I avoid it? I don't want to use the local IP, because then I would lose the swap feature.
I understand that you are running a perf test from one cloud service to another cloud service.
The Azure load balancer does not throttle. It is an infrastructure component with extremely high limits. What could happen is that you run out of ports: when you create outbound connections, we SNAT to the source VIP, and since you SNAT, you get a limit of 64K ports.
The best way to verify this is to associate a public IP with the VM running the perf test (https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-instance-level-public-ip/) and then run your test again.
Yves
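One detail in the ab output above fits this explanation: the Connect column tops out at just over 9 seconds, which sits right on a TCP SYN retransmission boundary (with a 3-second initial retransmission timer, a connection whose first two SYNs are silently dropped completes after 3 + 6 = 9 s). Silently dropped connection attempts are what SNAT port exhaustion looks like from the client side, whereas genuine server slowness would show up in the Waiting column with low Connect times.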

Node.js max number of open websockets on the server

I am finally trying to benchmark my Node.js application. To benchmark, I open one Node.js instance where I run my server, and one where I run my benchmark.
The benchmark opens a number of websockets; the server waits for all of them to be open, gets the data from them, and only after having all the data does it crunch some numbers.
I have tested it with 2000 websockets and it works fine. Then I tried with 10000, and it took 38 minutes to open all the connections (roughly 4-5 new connections per second).
This being my first Node.js/websocket project, I was wondering: is that a reasonable timing (I guess so), or does 38 minutes mean that something went wrong in my implementation?
SHORT VERSION: is 38 minutes "too long" to open 10000 websockets on a Node.js server?
I am testing on a server with two AMD Opteron(TM) 6272 processors and 96 GiB of DDR3 RAM (12 x 8 GB modules, 667 MHz), but I am not doing any load balancing, i.e. my server (I think) runs on only one core.
