Azure - Extra small instance web role - ready for production?

I'm planning for a website running in Azure. I'm estimating a maximum of 2,000 users a day generating about 20,000 hits.
I know I'm being kind of vague here, but is the Extra Small instance ready for this kind of site? I'm using MVC 3 to create the site. Thanks for any answers.

You'd have to do some load testing to best judge that question. Remember that, to get the benefit of the Windows Azure Compute SLA, you'll need a minimum of 2 instances (so your instances sit in different fault domains and your site keeps running even if one of them recycles due to an OS upgrade, hardware failure, etc.). The question then becomes: can two Extra Small instances handle 20,000 hits daily? That equates to approx. 10K hits per VM instance per day, or about 416 hits per hour, or roughly 7 per minute. And even with one instance, a hit rate of about 14 per minute is fairly low.
Rather than CPU, you might find yourself bottlenecked by bandwidth, since an Extra Small instance only gets about 5 Mbps, vs. around 100 Mbps for a Small instance.
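Here's that back-of-envelope arithmetic as a tiny sketch (the hit counts are the ones quoted above; everything else is just unit conversion):

```typescript
// Back-of-envelope request rates from the ~20,000 hits/day estimate above.
const hitsPerDay = 20_000;

for (const instances of [1, 2]) {
  const perInstancePerMinute = hitsPerDay / instances / 24 / 60;
  console.log(`${instances} instance(s): ~${perInstancePerMinute.toFixed(1)} hits/min each`);
}
// Prints roughly: 1 instance(s): ~13.9 hits/min each
//                 2 instance(s): ~6.9 hits/min each
```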
You might want to run a quick test with something like LoadStorm, which provides Load Testing as a Service. This should give you a good idea of how well the XS will perform under load.
EDIT (March 2012): Extra Small instances are now $0.02 / hour instead of $0.04, so you can run up to 6 XS instances for the same cost as a single Small. This makes the XS option even more compelling. See this blog post for the official announcement of the price drop (which includes Storage cuts as well).

I agree with David that this is very dependent on the load each request generates (in both CPU and bandwidth resources).
I just wanted to share our own experience with the XS instances. We've found that these instances suffer from severe clock drift: http://blog.codingoutloud.com/2011/08/25/azure-faq-how-frequently-is-the-clock-on-my-windows-azure-vm-synchronized/
This could be as much as a minute of difference over the week between NTP syncs. For most applications this isn't necessarily a problem, but we used OAuth 1.0a authentication with an allowed timestamp difference of 30 seconds, which resulted in major headaches on XS. Small and larger instances don't have shared cores and consequently suffer much less clock drift.
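To illustrate why that bit us, here's a minimal sketch of the kind of timestamp-window check an OAuth 1.0a provider applies (the 30-second window is the one from our setup above; the timestamps are made up):

```typescript
// Sketch of an OAuth 1.0a-style timestamp check. With a 30-second window,
// ~60 seconds of clock drift fails the request even with valid credentials.
const ALLOWED_SKEW_SECONDS = 30;

function timestampAccepted(requestTimestamp: number, providerNowSeconds: number): boolean {
  return Math.abs(providerNowSeconds - requestTimestamp) <= ALLOWED_SKEW_SECONDS;
}

const now = Math.floor(Date.now() / 1000);
console.log(timestampAccepted(now, now + 10)); // true  (10 s of skew)
console.log(timestampAccepted(now, now + 60)); // false (a minute of drift on an XS box)
```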

You get a better SLA with 2 small instances rather than 1 larger.
You should also look at your peak load. For example, with 20,000 hits per day, do 50% come between 9 and 10 in the morning?
Instance storage is 20 GB; if this only holds your application code, it should not be a problem.
IO performance is low; if that just means reading your app code the first time it compiles, it should not be a problem.
The CPU is a single 1 GHz core; if you are just serving web pages with little calculation, it should not be a problem. The one time it will be really slow is during a JIT compile.
The memory is 768 MB; this could be a problem, especially if you are caching data.
You save under 2 USD a day by using the extra small instance. But that is a latte every 2 days, so maybe it is worth taking the risk and accepting that you may have to do an extra deploy later.

Related

Self hosted gitlab is slow, how to make it faster?

We've been running self-hosted GitLab on AWS for a couple of years now, and with everything growing (repo size, board size, pipeline size, team size), things have slowed down considerably, to the point where I'm losing my mind.
Just as a reference, here are some loading times. I've checked performance both with the performance bar (entering p+b in GitLab) and with the network tab in the browser, looking at how long requests take to finish.
Loading the board, perf checked with p+b, longest request (/api/graphql): ~5s
Loading merge requests page, networking finished: ~2s
Loading pipelines, networking finished: ~8s
What options do we have to make gitlab faster again?
I've done research and surprisingly not that much comes up (don't you people have the same problem?!). The only remedy I've found was changing the instance types, and those do make a difference. So an option, if you have the money to spare, is to get better machines.
Setup for Performance Tests
The slowest pages to load were the pipelines page and the board, so that is where I conducted the speed tests. I took several measurements and averaged the results. On the board I checked the performance bar (p+b) longest request (/api/graphql), and on the pipelines page I checked the network tab with caching disabled, until all requests were finished.
I conducted the tests on a gitlab instance where it was only me playing around and no other team members, so I cannot tell how much the results degrade when more people are working.
Machine comparison
| Machine Type | Price | vCPU | RAM (GB) | Clock Speed (GHz) | Loading Board | Loading Pipelines |
| --- | --- | --- | --- | --- | --- | --- |
| t2.large | $70 / month | 2 | 8 | 3.3 | 5s | 5s |
| t3.large | $60 / month | 2 | 8 | 3.1 | 5s | 6s |
| t2.2xlarge | $270 / month | 8 | 32 | 3.3 | 2s | 5.5s |
| z1d.large | $135 / month | 2 | 8 | 4.0 | 5s | 3.5s |
| m5zn.xlarge | $240 / month | 4 | 16 | 4.5 | 2s | 3.3s |
It seems that loading the board is rather sensitive to the number of cores or the amount of memory, while loading the pipelines is sensitive to clock speed. I'm not a pro with the different instance types on AWS; maybe there's some other magic ingredient (low-latency networking?) in m5zn that makes it the fastest. Those are just the factors that came to mind.
I also tested changing the disk from gp2 to gp3 with maximum IOPS and throughput, but that didn't change the performance.
Conclusion
For all-round ok-ish performance, choose the m5zn.xlarge instance. It is way above the requirements that GitLab claims are necessary, but it speeds things up significantly.

Node app eating memory incrementally over time

I've just launched two Express servers on DigitalOcean along with an instance of MongoDB. I'm using PM2 to keep both of them running.
When I use htop to see the memory usage, the total usage is usually around 220-235 MB (out of a total of 488 MB). The only thing I can see changing is the blue bars, which I assume are buffer/cache memory; the actual green in-use memory seems to stay about the same.
Looking at DO's graph, however, the memory usage has been climbing slowly over the past 24 hours, by roughly 0.5% of the total per hour. It sometimes drops, but overall it's trending up, and at the moment it has been hovering around 60-65% of total memory for a few hours.
There has been almost no traffic on these Node web servers, yet the memory keeps increasing slowly. So my question is: could this be a memory leak within one of my servers, or is it the nature of the V8 engine to incrementally expand its memory usage?
If you suspect a memory leak, why not check your theory by writing 2-3 heap dumps, 2-3 hours apart? Then you can answer your question with certainty.
You can use this module to write heap dumps to disk and then simply compare them using Chrome Developer Tools. You will also see exactly what is placed inside the heap.
FYI: snapshots comparison from official documentation
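The linked module isn't reproduced here, but as a rough sketch of the approach: on recent Node versions (11.13+) the built-in v8 module can write snapshots directly, which you can then load and diff in Chrome DevTools:

```typescript
// heap-snapshots.ts -- write a heap snapshot every couple of hours so the
// dumps can be diffed in Chrome DevTools (Memory tab, "Comparison" view).
// Assumes Node >= 11.13, where writeHeapSnapshot() is built into the v8 module;
// the third-party module linked above works along the same lines.
import { writeHeapSnapshot } from 'v8';

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

setInterval(() => {
  // Writes Heap-<date>-<pid>.heapsnapshot into the working directory.
  const file = writeHeapSnapshot();
  console.log(`heap snapshot written: ${file}`);
}, TWO_HOURS_MS);
```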

Linux acceptable load average

I have a dedicated Linux server (8 cores, 8 GB RAM) where I run some crawler PHP scripts. The load on the system ends up being around 200, which sounds like a lot. Since I am not using the machine to host content, what could be the side effects of such a high load for the purposes stated above?
Machines were made to work, so there are no issues with a high load average per se. But a high load average is often an indicator of a performance problem. Investigating it is usually application specific, but here is a very general guideline:
Since load average is a combined metric (CPU, IO, etc.), you want to examine each resource separately. I would start by making sure the machine is not thrashing, checking swap usage (vmstat comes in handy) and disk performance (using iostat). You may also check whether your operations are CPU intensive.
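As a first sanity check, it also helps to put the load average in per-core terms. A minimal sketch (shown in Node/TypeScript purely for illustration; uptime or /proc/loadavg gives you the same numbers):

```typescript
// Sketch: normalize the load average by the core count.
// On the 8-core machine above, a load of ~200 is roughly 25 runnable tasks per core.
import { loadavg, cpus } from 'os';

const cores = cpus().length;
const [one, five, fifteen] = loadavg();

console.log(`cores: ${cores}`);
console.log(`load per core: ${(one / cores).toFixed(2)} (1m), ` +
            `${(five / cores).toFixed(2)} (5m), ${(fifteen / cores).toFixed(2)} (15m)`);
```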
You should read the load average as a three-component value (1-minute, 5-minute and 15-minute load respectively).
Take a look at this example from Wikipedia:
For example, one can interpret a load average of "1.73 0.60 7.98" on a single-CPU system as:
during the last minute, the system was overloaded by 73% on average (1.73 runnable processes, so that 0.73 processes had to wait for a turn for a single CPU system on average).
during the last 5 minutes, the CPU was idling 40% of the time on average.
during the last 15 minutes, the system was overloaded 698% on average (7.98 runnable processes, so that 6.98 processes had to wait for a turn for a single CPU system on average).
Full article
Please note that this value depends on the resources of your machine.
Cheers!

Scaling in Windows Azure for IO Performance

Windows Azure advertises three types of IO performance levels:
Extra Small : Low
Small: Moderate
Medium and above: High
So, if I have an IO bound application (rather than CPU or Memory bound) and need at least 6 CPUs to process my work load - will I get better IO performance with 12-15 Extra Smalls, 6 Smalls, or 3 Mediums?
I'm sure this varies based on applications - is there an easy way to go about testing this? Are there any numbers that give a better picture of how much of an IO performance increase you get as you move to large instance roles?
It seems like the IO performance for the smaller roles could be equivalent to the larger ones; they are just the ones that get throttled down first if the overall load becomes too great. Does that sound right?
Windows Azure compute sizes offer approx. 100Mbps per core. Extra Small instances are much lower, at 5Mbps. See this blog post for more details. If you're IO-bound, the 6-Small setup is going to offer far greater bandwidth than 12 Extra-Smalls.
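Here's that comparison as a quick sketch, using the approximate figures above (about 5 Mbps for an Extra Small, about 100 Mbps per core otherwise):

```typescript
// Sketch: aggregate bandwidth per configuration, using the rough per-instance
// figures quoted above (~5 Mbps for an Extra Small, ~100 Mbps per core otherwise).
const configs = [
  { name: '12 x Extra Small', instances: 12, mbpsEach: 5 },
  { name: '6 x Small (1 core)', instances: 6, mbpsEach: 100 },
  { name: '3 x Medium (2 cores)', instances: 3, mbpsEach: 200 },
];

for (const c of configs) {
  console.log(`${c.name}: ~${c.instances * c.mbpsEach} Mbps aggregate`);
}
// ~60 Mbps vs ~600 Mbps vs ~600 Mbps
```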
When you talk about processing your workload, are you working off a queue? If so, multiple worker roles, each a Small instance, could then each work with a 100 Mbps pipe. You'd have to do some benchmarking to determine whether 3 Mediums give you enough of a performance boost to justify the larger VM size, knowing that when the workload is down, your "idle" cost footprint per hour is now 2 cores (Medium, $0.24) vs 1 (Small, $0.12).
As I understand it, the amount of IO allowed per core is constant and is supposed to be dedicated. But I haven't been able to get formal confirmation of this. This is likely different for Extra Small instances, which operate in a shared mode and are not dedicated like the other Windows Azure VM instances.
I'd imagine what you suspect is in fact true: even being IO-bound varies by application. I think you could accomplish your goal of timing by using timers and writing the output to a file in storage that you can then retrieve. Do some math to figure out how many work units per hour you can process by cramming as many as possible through a Small and then a Medium instance. If your work unit size fluctuates drastically, you might have to do some averaging too. I would always prefer smaller instances if possible and just spin up more copies as you need more firepower.
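As a rough sketch of that kind of throughput probe (processWorkUnit is a hypothetical placeholder for whatever one unit of your workload actually does; the original suggestion is in the context of .NET worker roles, this just shows the shape of the measurement):

```typescript
// Sketch: a generic throughput probe to run on each candidate VM size.
async function measureThroughput(
  processWorkUnit: () => Promise<void>, // placeholder for your actual per-item work
  sampleSize = 1000,
): Promise<number> {
  const start = Date.now();
  for (let i = 0; i < sampleSize; i++) {
    await processWorkUnit();
  }
  const elapsedHours = (Date.now() - start) / 3_600_000;
  return sampleSize / elapsedHours; // work units per hour on this instance size
}

// Usage sketch: log the result (or write it to a file/blob) and compare sizes.
measureThroughput(async () => { /* do one unit of work */ })
  .then((rate) => console.log(`~${Math.round(rate)} work units/hour`));
```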

Comparing CPU speed likely improvements for business hardware upgrade justification

I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).
It currently runs daily on:
AMD Opteron 275 @ 2.21 GHz (4 cores)
The app is multithreaded, using 3 threads; the 4th is left for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate, as best I can, how long the same work would take to run on a system configured with the following CPUs:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases; I will recode it to use the available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540 (they support 2 CPUs on a single motherboard). This should make 8 cores, 16 threads (that's how the Nehalem chips work, I believe) available to the operating system, so for my app that's 15 threads for the Monte Carlo simulation.
Any ideas how to do this? Is there a website where I can see single-threaded benchmark data for all 3 CPUs involved? I could then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note that the business is also dictating that the workload for this app will increase about 20-fold over the next 3 months and will need to complete within a 24-hour window.
Any help much appreciated.
I have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 - hopefully they can better explain their benchmarking so I can effectively get a per-core score, which would be much more helpful.
Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. That way you just need to buy a beefy video card.
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
tomshardware.com contains a comprehensive list of CPU benchmarks. However... you can't just divide the scores; you need to find as close to an apples-to-apples comparison as you can get, and you won't quite get one, because the result depends on the mix of instructions in your workload.
I would guess (please don't take this as official; you need real data for this) that you're probably looking at a 1.5x - 1.75x single-threaded speedup if the work is CPU bound and not highly vectorized.
You also need to take into account that:
1) You are using C# and the CLR; unless you've taken steps to prevent it, the GC may kick in and serialize you.
2) The Nehalems have hyper-threads, so you won't see a perfect 16x speedup; more likely you'll see an 8x to 12x speedup depending on how optimized your code is. Be optimistic here, though (just don't expect 16x).
3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads. There may be dragons here (and there usually are).
I would do the envelope calculation as:
15 hours * 3 threads / 1.5x speedup = 30 hours of single-threaded work time on a Nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
This implies a parallel run time, if there is truly a 20x increase, of:
2.5 hours * 20 = 50 hours (best case)
3.75 hours * 20 = 75 hours (worst case)
How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but it likely won't be.
And for gosh sakes, try out the Task Parallel Library in .NET 4.0 (or the .NET 3.5 CTP); it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2 GHz Opteron cores to achieve your goal (despite increasing processing time from 15 hours to 20 hours per day), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per thread, you will still be at the outside edge of your performance envelope, with no room to grow. That also assumes that hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
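For what it's worth, here is the same sizing arithmetic as a small parameterized sketch, so you can plug in your own per-core speedup assumption (the 2X figure from the link above, or 3X as the optimistic case); the function name and parameters are just illustrative:

```typescript
// Sketch: the sizing arithmetic above, parameterized so different per-core
// speedup assumptions can be plugged in.
function nehalemCoresNeeded(
  currentThreads: number,       // 3 threads today
  hoursPerRun: number,          // 15 hours per daily run
  workloadMultiplier: number,   // 20x over the next 3 months
  hoursAvailablePerDay: number, // 20 usable hours per core per day
  perCoreSpeedup: number,       // 2x..3x Nehalem vs the 2.2 GHz Opteron
): number {
  const cpuHoursNow = currentThreads * hoursPerRun;           // 3 * 15 = 45
  const cpuHoursFuture = cpuHoursNow * workloadMultiplier;    // 45 * 20 = 900
  const opteronCores = cpuHoursFuture / hoursAvailablePerDay; // 900 / 20 = 45
  return opteronCores / perCoreSpeedup;
}

console.log(nehalemCoresNeeded(3, 15, 20, 20, 3)); // 15 cores at 3x per-core speedup
console.log(nehalemCoresNeeded(3, 15, 20, 20, 2)); // 22.5 cores at 2x per-core speedup
```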
It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be wary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.
