Scaling in Windows Azure for IO Performance

Windows Azure advertises three types of IO performance levels:
Extra Small : Low
Small: Moderate
Medium and above: High
So, if I have an IO bound application (rather than CPU or Memory bound) and need at least 6 CPUs to process my work load - will I get better IO performance with 12-15 Extra Smalls, 6 Smalls, or 3 Mediums?
I'm sure this varies based on applications - is there an easy way to go about testing this? Are there any numbers that give a better picture of how much of an IO performance increase you get as you move to large instance roles?
It seems like the IO performance for smaller roles could be equivalent to the larger ones, they are just the ones that get throttled down first if the overall load becomes too great. Does that sound right?

Windows Azure compute sizes offer approx. 100Mbps per core. Extra Small instances are much lower, at 5Mbps. See this blog post for more details. If you're IO-bound, the 6-Small setup is going to offer far greater bandwidth than 12 Extra-Smalls.
When you talk about processing your workload, are you working off a queue? If so, multiple worker roles, each being a Small instance, could then each work with a 100Mbps pipe. You'd have to do some benchmarking to determine if 3 Mediums give you enough of a performance boost to justify the larger VM size, knowing that when workload is down, your "idle" cost footprint per hour is now 2 cores (medium, $0.24) vs 1 (small, $0.12).
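As a rough way to compare the three configurations from the question, here's a back-of-the-envelope sketch (Python, purely for illustration; the per-instance bandwidth and pricing figures are the approximate ones quoted in this thread, with a Medium assumed to be 2 cores at ~100 Mbps each, and they may have changed since):

    # Back-of-the-envelope comparison using the approximate Mbps and $/hr
    # figures quoted in this thread. These numbers are assumptions.
    configs = {
        "12 x Extra Small": (12, 5,   0.04),  # count, Mbps each, $/hr each
        "6 x Small":        (6,  100, 0.12),
        "3 x Medium":       (3,  200, 0.24),  # Medium = 2 cores ~ 2 x 100 Mbps
    }

    for name, (count, mbps, cost) in configs.items():
        print(f"{name}: ~{count * mbps} Mbps aggregate, ${count * cost:.2f}/hour")

On these assumptions the 6-Small and 3-Medium options come out with the same aggregate bandwidth and the same hourly cost, so the difference is mostly scaling granularity and the idle footprint mentioned above.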

As I understand it, the amount of IO allowed per core is constant and supposed to be dedicated, but I haven't been able to get formal confirmation of this. This is likely different for Extra Small instances, which operate in a shared mode rather than getting dedicated capacity like the other Windows Azure VM sizes.

I'd imagine what you suspect is in fact true: how IO-bound you are varies by application. I think you could accomplish your goal of timing by using Timers and writing the output to a file on storage you could then retrieve. Do some math to figure out how many work units per hour you can process by cramming as many as possible through a Small and then a Medium instance. If your work unit size fluctuates drastically, you might have to do some averaging too. I would always prefer smaller instances if possible and just spin up more copies as you need more firepower.
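A minimal sketch of that timing idea (Python, only for illustration; process_one_work_unit is a hypothetical stand-in for your real IO-bound work, and your actual role would use whatever timer mechanism it already has):

    import time

    def process_one_work_unit():
        # Hypothetical stand-in for one real IO-bound work item.
        pass

    def measure_throughput(duration_seconds=300):
        """Cram as many work units as possible through this instance size
        and report work units per hour, as suggested above."""
        done = 0
        start = time.monotonic()
        while time.monotonic() - start < duration_seconds:
            process_one_work_unit()
            done += 1
        elapsed = time.monotonic() - start
        return done / elapsed * 3600  # work units per hour

    # Run the same measurement on a Small and on a Medium instance, write the
    # result to storage, and compare the work-units-per-hour-per-dollar.
    print(f"~{measure_throughput():.0f} work units/hour on this instance")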

Related

Is there a way to limit Linux processes' absolute resource-spend, the way Ethereum limits transactions using gas?

Let's say I'm building something like AWS Lambda / Cloudflare Workers, where I allow users to submit arbitrary binaries, and then I run them wrapped in sandboxes (e.g. Docker containers / gVisor / etc), packed multitenant-ly onto a fleet of machines.
Ignore the problem of ensuring the sandboxing is effective for now; assume that problem is solved.
Each individual execution of one of these worker-processes is potentially a very heavy workload (think SQL OLAP reports.) A worker-process may spend tons of CPU, memory, IOPS, etc. We want to allow them to do this. We don't want to limit users to a small fixed slice of a machine, as traditional cgroups limits enable. Part of our service's value-proposition is low latency (rather than high throughput) in answering heavy queries, and that means allowing each query to essentially monopolize our infrastructure as much as it needs, with as much parallelization as it can manage, to get done as quickly as possible.
We want to charge users in credits for the resources they use, according to some formula that combines the CPU-seconds, memory-GB-seconds, IO operations, etc. This will disincentivize users from submitting "sloppy" worker-processes (because a process that costs us more to run, costs them more to submit.) It will also prevent users from DoSing us with ultra-heavy workloads, without first buying enough credits to pay the ensuing autoscaling bills in advance :)
We would also like to enable users to set, for each worker-process launch, a limit on the total credit spend during execution — where if it spends too many CPU-seconds, or allocates too much memory for too long, or does too many IO operations, or any combination of these that adds up to "spending too many credits", then the worker-process gets hard-killed by the host machine. (And we then bill their account for exactly as many credits as the resource-limit they specified at launch, despite not successfully completing the job.) This would protect users (and us) from the monetary consequences of launching faulty/leaky workers; and would also enable us to predict an upper limit on how heavy a workload could be before running it, and autoscale accordingly.
This second requirement implies that we can't do the credit-spend accounting after the fact, async, using observed per-cgroup metrics fed into some time-series server; but instead, we need each worker hypervisor to do the credit-spend accounting as the worker runs, in order to stop it as close to the time it overruns its budget as possible.
Basically, this is, to a tee, a description of the "gas" accounting system in the Ethereum Virtual Machine: the EVM does credit-spend accounting based on a formula that combines resource-costs for each op, and hard-kills any "worker process" (smart contract) that goes over its allocated credit (gas) limit for this launch (tx and/or CALL op) of the worker.
However, the "credit-spend accounting" in the EVM is enabled by instrumenting the VM that executes code such that each VM ISA op also updates a gas-left-to-spend VM register, and aborts VM execution if the gas-left-to-spend ever goes negative. Running native code on bare-metal/regular IaaS VMs, we don't have the ability to instrument our CPU like that. (And doing so through static binary translation would probably introduce far too much overhead.) So doing this the way the EVM does it, is not really an option.
I know Linux does CPU accounting, memory accounting, etc. Is there a way, using some combination of cgroups + gVisor-alike syscall proxying, to approximate the function of the EVM's "tx gas limit", i.e. to enable processes to be hard-killed (instantly/within a few ms of) when they go over their credit limit?
I'm assuming there's no off-the-shelf solution for this (haven't been able to find one after much research.) But are the right CPU counters + kernel data structures + syscalls in place to be able to develop such a solution, and to have it be efficient/low-overhead?
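For what it's worth, a minimal sketch of the polling-watchdog half of this (Python, cgroup v2). The cpu.stat, memory.current, io.stat, cgroup.procs and cgroup.kill files are real cgroup v2 interfaces, but the cgroup path, credit weights, and polling interval below are made-up assumptions, and a userspace poller like this only gets you within a few milliseconds at best; tighter enforcement would need something kernel-side (eBPF, scheduler hooks) rather than this loop:

    import time
    from pathlib import Path

    CG = Path("/sys/fs/cgroup/worker-1234")   # hypothetical per-worker cgroup (v2)

    # Made-up credit weights: per CPU-second, per GB-second, per IO operation.
    CPU_W, MEM_W, IO_W = 1.0, 0.5, 0.0001

    def cpu_seconds():
        for line in (CG / "cpu.stat").read_text().splitlines():
            if line.startswith("usage_usec"):
                return int(line.split()[1]) / 1e6
        return 0.0

    def io_ops():
        ops = 0
        for line in (CG / "io.stat").read_text().splitlines():
            for field in line.split()[1:]:
                key, _, val = field.partition("=")
                if key in ("rios", "wios"):
                    ops += int(val)
        return ops

    def watch(credit_limit, poll=0.005):
        mem_gb_seconds = 0.0
        while True:
            # Integrate memory usage over time (GB-seconds), then total the spend.
            mem_gb_seconds += int((CG / "memory.current").read_text()) / 1e9 * poll
            spent = CPU_W * cpu_seconds() + MEM_W * mem_gb_seconds + IO_W * io_ops()
            if spent > credit_limit:
                (CG / "cgroup.kill").write_text("1")   # hard-kill the whole subtree
                return spent
            time.sleep(poll)

Note that cgroup.kill needs Linux 5.14 or later; on older kernels you'd have to walk cgroup.procs and SIGKILL each PID instead.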

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

"Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a four-core machine is not going to execute your parallel program four times as quickly, in part because of the overhead and coordination described in the previous sections. However, the design of the computer hardware also limits its ability to scale. You can expect a significant improvement in performance, but it won't be 100 percent per additional core, and there will almost certainly be a point at which adding additional cores or CPUs doesn't improve the performance at all."
I read the paragraph above from a book. But I don't get the last sentence.
So, Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?
If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
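For reference, the usual textbook statements of the two laws mentioned above, with s the serial fraction of the work and p the number of processors (this is just the standard form, not anything specific to the quoted book):

    S_{\text{Amdahl}}(p) = \frac{1}{\,s + (1-s)/p\,} \;\xrightarrow[p \to \infty]{}\; \frac{1}{s}
    \qquad\qquad
    S_{\text{Gustafson}}(p) = p - s\,(p-1)

Amdahl bounds the speedup of a fixed-size problem; Gustafson describes the scaled speedup when the problem grows with the machine.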
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with @Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems, and using all those cores is a major concern.
Finally, to answer your question, at what point does adding processors no longer increase execution speed, now that is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, are a growing niche within the software engineering 'profession'.
@High Performance Mark is right. This happens when you are trying to solve a fixed-size problem in the fastest possible way, so that Amdahl's law applies. It does not (usually) happen when you are trying to solve a problem in a fixed amount of time. In the latter case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greater accuracy.
In this situation, Gustafson's law applies.
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup for p processors on a problem of size n is bounded by

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))

Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; it increases the synchronization overhead, etc., for shared-memory algorithms); if we continue adding more processors, at some point the increase in communication time will be larger than the corresponding decrease in computation time.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).
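A hedged sketch of that closed-form calculation, using the σ/ϕ/κ decomposition above and assuming, purely for illustration, that the overhead grows linearly with p, i.e. κ(n,p) = c(n)·p:

    T(n,p) = \sigma(n) + \frac{\varphi(n)}{p} + \kappa(n,p),
    \qquad
    \frac{\partial T}{\partial p} = -\frac{\varphi(n)}{p^{2}} + \frac{\partial \kappa}{\partial p} = 0

    \kappa(n,p) = c(n)\,p \;\Rightarrow\; p^{*} = \sqrt{\varphi(n)/c(n)}

Past p*, the extra overhead outweighs the extra compute and the parallel execution time starts to rise again, which is exactly the turning point described above. Other overhead models give other closed forms, but the recipe is the same: differentiate the execution time with respect to p and find where it stops decreasing.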
The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores beyond small numbers of them. Few applications benefit from 4, fewer from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.
If we're talking x86, that architecture is more or less at its limits. At 3 GHz, electricity travels 10 cm (actually somewhat less) per clock cycle, the die is about 1 cm square, and components have to be able to switch states within that single cycle (1/3,000,000,000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end, the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
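Put slightly more generally (an informal bound, not taken from the answer above): if each core spends a fraction f of its time hitting the shared memory system, that shared resource saturates once p·f ≈ 1, so

    p_{\max} \approx \frac{1}{f} \qquad \left(f = 0.01 \;\Rightarrow\; p_{\max} \approx 100 \text{ cores, as in the 99:1 example}\right)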
With more common programs (Office, IE, etc.) the extra processing power available will hardly be noticed. Some parts of the programs may have smaller pieces written to take advantage of multiple cores, and if you know which ones, you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light, which is 300,000,000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz, but this was with special coolers and, even then, it was slower (processed less) than an Intel i7 running at a significantly lower frequency.
It heavily depends on your program architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only sequentially, adding cores would not improve its performance at all. It might improve other things though like framework internal processing (if you're using a framework).
So the more parallel processing is allowed in your program, the better it scales with more cores. But if your program has limits on parallel processing (by design or by the nature of the data), it will not scale indefinitely. It takes a lot of effort to make a program run on hundreds of cores, mainly because of growing overhead, resource locking, and required data coordination. The most powerful supercomputers are indeed massively multi-core, but writing programs that can utilize them is a significant effort, and they can only show their power on inherently parallel tasks.

Azure - Extra small instance web role - ready for production?

I'm planning for a website running in Azure. I'm estimating a maximum of 2,000 users a day creating about 20,000 hits.
I know I'm kinda vague here, but is the extra small instance ready for this kind of site? I'm using MVC 3 to create the site. Thanks for any answers.
You'd have to do some load-testing to best judge that question. Remember that, to enjoy the benefits of Windows Azure Compute SLA, you'll need a minimum of 2 instances (so now you have instances in different fault domains, so your site remains running even if one of the instances recycles due to OS upgrade, hardware failure, etc.). The question then becomes: can two Extra Small instances handle 20,000 hits daily? This equates to approx. 10K hits per VM instance per day, or 416 hits per hour, or 7 per minute. And... even with one instance, a hit rate of 14 per minute is fairly low.
More than CPU, you might find yourself bottlenecked by bandwidth, since you'll only see about 5Mbps per instance, vs. around 100Mbps per Small instance.
You might want to run a quick test with something like LoadStorm, which provides Load Testing as a Service. This should give you a good idea of how well the XS will perform under load.
EDIT (March 2012): Extra Small instances are now $0.02 / hour vs $0.04, so you could run up to 6 XS instances for the same cost of a single Small. This makes the XS option even more compelling. See this blog post for the official announcement on the price drop (including Storage cuts as well).
I agree with David that this is very dependent on the load per request you generate (both in CPU and bandwidth resources)
I just wanted to share our own experience with the XS instances. We've found that these instances suffer from severe clock drift: http://blog.codingoutloud.com/2011/08/25/azure-faq-how-frequently-is-the-clock-on-my-windows-azure-vm-synchronized/
This could be as much as a minute of difference over the week between NTP syncs. For most applications this isn't necessarily a problem, but we used Oauth1.0a authentication with an allowed timestamp difference of 30 seconds which resulted in major headaches when using XS. The S and larger don't have shared cores and consequently suffer much less clock drift.
You get a better SLA with 2 small instances rather than 1 larger.
You should also look at your peak load. For example with 20,000 hits per day, do 50% come between 9 and 10 in the morning?
Instance storage is 20 GB; if this is just your application code, it should not be a problem.
IO performance is low; if this is just reading your app code the first time it compiles, it should not be a problem.
The CPU is a single 1 GHz core; if this is just web pages and little calculation, it should not be a problem. The time this will be really slow is during a JIT compile.
The memory is 768 MB, which could be a problem, especially if you are caching data.
You save under 2 USD a day using the Extra Small instead of a Small. But that is a latte every 2 days, so maybe it is worth taking the risk and having to do an extra deploy.
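To make the peak-load point above concrete (illustrative arithmetic only, using the 50%-in-one-hour scenario from this answer):

    20{,}000 \times 0.5 = 10{,}000 \text{ hits in one hour} \approx 2.8 \text{ requests/second}
    \qquad\text{vs.}\qquad
    \frac{20{,}000}{86{,}400} \approx 0.23 \text{ requests/second if spread evenly over the day}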

Highly concurrent multi-threaded application requires hardware

I am looking for hardware which must run about 256 computationally intensive real-time concurrent tasks in 24-hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such a CPU exists). If such a CPU doesn't exist, then I need a multi-socket mainboard solution (with multiple CPUs).
Can you please recommend a professional CPU/mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux kernel 2.6.25. No virtualization, no need for huge RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
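A quick sanity check against the requirement in the question, using only the figures quoted above (rough arithmetic, not a measurement):

    256 \times (40\text{-}50)\ \text{MFLOPS} \approx 10.2\text{-}12.8\ \text{GFLOPS required}

    0.9 \times 11 \approx 9.9\ \text{GFLOPS per T2};\qquad 2 \times 9.9 \approx 19.8\ \text{GFLOPS for the two-socket T5140}

So a single T2 is marginal for this workload, but the T5140 with two of them leaves comfortable headroom.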
Rent some Amazon EC2 nodes.
Updated: How about PS3s then? NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture, but these run Linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10 GFLOPS isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what you do when the machine breaks; you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor:
its theoretical peak performance is around 25 GFLOPS, and kernel 2.6.25 already had support for it.
You could try a pre-slim PlayStation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K. You will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs).
NB: with a PlayStation 3, you'd have only 6 available co-processors, but you don't seem to be on a budget with this project.
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8,000.00 for these PCI Express cards.
A PlayStation 3 video game console can be purchased for about US$300.00 and would allow you to prototype your application and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I had also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them. But even so, clocked at 3.5 GHz, they performed better than standard, similarly priced PCs, even considering PS3s are priced 5x higher around here.)

Comparing CPU speed likely improvements for business hardware upgrade justification

I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).
It currently runs daily on:
AMD Opteron 275 @ 2.21 GHz (4 cores)
The app is multithreaded, using 3 threads; the 4th core is reserved for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases; I will recode it to use the available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540 (they support 2 CPUs on a single motherboard). This should make available 8 cores, 16 threads (that's how the Nehalem chips work, I believe) to the operating system. So for my app that's 15 threads for the Monte Carlo simulation.
Any ideas how to do this? Is there a website where I can see benchmark data for all 3 CPUs involved for a single-threaded benchmark? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note that the business is also dictating that the workload for this app will increase about 20 times over the next 3 months and needs to complete within a 24-hour window.
Any help much appreciated.
Have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 hopefully they can better explain their benchmarking so I can effectively get a score per core which would be much more helpful.
Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. This way you just need to buy a fat video card.
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
tomshardware.com contains a comprehensive list of CPU benchmarks. However... you can't just divide them; you need to find as close to an apples-to-apples comparison as you can get, and you won't quite get it, because the result depends on the mix of instructions in your workload.
I would guess (please don't take this as official; you need real data for this) that you're probably looking at a 1.5x - 1.75x single-threaded speedup if the work is CPU-bound and not highly vectorized.
You also need to take into account that you are:
1) You are using C# and the CLR; unless you've taken steps to prevent it, GC may kick in and serialize you.
2) The Nehalems have hyperthreading, so you won't be seeing a perfect 16x speedup; more likely you'll see an 8x to 12x speedup depending on how optimized your code is. Be optimistic here, though (just don't expect 16x).
3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads. There may be dragons here (and there usually are).
I would envelope calc this as:
15 hours * 3 threads / 1.5 x = 30 hours of single threaded work time on a nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.75 hours * 20 = 75 hours (worst case)
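Here's the same envelope calculation as a tiny script, so the 20x factor and the speedup guesses can be varied (the 1.5x and 8x-12x figures are the guesses above, not measurements):

    # Envelope calculation from above, parameterized.
    current_hours, current_threads = 15, 3
    single_thread_speedup = 1.5          # guessed Opteron -> Nehalem per-thread gain
    workload_factor = 20                 # business-projected workload growth

    single_threaded_hours = current_hours * current_threads / single_thread_speedup
    for effective_parallelism in (12, 8):  # best case / worst case with hyperthreading
        hours = single_threaded_hours / effective_parallelism * workload_factor
        print(f"{effective_parallelism}x scaling: ~{hours:.0f} hours for the 20x workload")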
How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but it likely won't be.
And for gosh sakes, try out the Task Parallel Library in .NET 4.0 or the .NET 3.5 CTP; it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2GHz Opteron cores to achieve your goal (despite increasing processing time from 15 hours to 20 hours per day), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per-thread, you will still be at the outside edge of your performance envelope - with no room to grow. That also assumes that hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
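Written as a general formula (the 3X per-thread figure is the assumption used in the paragraph above, not a measured value):

    \text{cores needed} \approx \frac{\text{cpu-time-hours}}{\text{hours per day} \times \text{per-core speedup}}
    = \frac{900}{20 \times 3} = 15

Fifteen busy cores is already right at the 16 hardware threads (8 physical cores) of a dual-X5570 box, which is the "no room to grow" point being made above; at a more conservative 2X per-core speedup the count rises to about 23.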
It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be wary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.
