How to improve download speed on Azure VMs?

My organization is spinning down its private cloud and preparing to deploy a complete analytics and data warehousing solution on Azure VMs. Naturally, we are doing performance testing so we can identify and address any unforeseen issues before fully decommissioning our datacenter servers.
So far, the only significant issue I've noticed in Azure VMs is that my download speeds don't seem to change at all no matter which VM size I test. The speed is always right around 60 Mbps downstream, despite guidance such as this, which indicates I should see improvements in ingress based on VM size.
I have done significant and painstaking research into the issue, but everything I've read so far only really addresses "intra-cloud" optimizations (e.g., ExpressRoute, Accelerated Networking, VNet peering) or communications to specific resources. My requirement is for fast download speeds on the open internet (secure and trusted sites only). To preempt any attempts to question this requirement, I will not specifically address the reasons why I can't consider alternatives such as establishing private/dedicated connections, but suffice it to say, I have no choice but to rule out those options.
That said, I'm open to any other advice! What, if anything, can I do to improve download speeds to Azure VMs when data needs to be transferred to the VM from the open web?
Edit: Corrected comment about Azure guidance.

I finally figured out what was going on. #evilSnobu was onto something -- there was something flawed about my tests. Namely, all of them (seemingly by sheer coincidence) "throttled" my data transfers. I was able to confirm this by examining the network throughput carefully. Since my private cloud environment never provisioned enough bandwidth to hit the 50-60 Mbps ceiling that seems to be fairly common among certain hosts, it never occurred to me that I could be throttled at a higher throughput rate. Real bummer. What this experiment did teach me is that you should NOT assume more bandwidth will solve all your throughput problems. Throttling appears to be exceptionally common, and I would suggest planning for what to do when you encounter it.
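One way to spot this kind of throttling is to sample cumulative bytes received during a test download and look at per-interval throughput rather than the overall average. A minimal sketch (the sample values are illustrative, not real measurements):

```python
# Hypothetical sketch: compute per-interval throughput from
# (elapsed_seconds, cumulative_bytes) samples taken during a download,
# to spot a flat "ceiling" that suggests throttling rather than a
# bandwidth limit.

def interval_throughput_mbps(samples):
    """samples: list of (elapsed_seconds, cumulative_bytes) tuples,
    sorted by time. Returns Mbps for each interval."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        rates.append((b1 - b0) * 8 / (t1 - t0) / 1_000_000)
    return rates

# A transfer that starts fast and then flattens near 60 Mbps on every
# VM size is a hint the far end (or a middlebox) is rate-limiting you.
samples = [(0, 0), (1, 30_000_000), (2, 37_500_000), (3, 45_000_000)]
print(interval_throughput_mbps(samples))  # [240.0, 60.0, 60.0]
```

If the per-interval curve plateaus at the same value across VM sizes, the bottleneck is almost certainly remote, not the VM's NIC.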

Related

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes because apparently Microsoft had a 'blip' on their storage. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how a file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing):
The file shares are not necessarily for your web app to communicate to other resources, but they are on our end, where the app content resides. That is what we meant when we suggested that about storage being unavailable on our file servers. The reason the restarts would be triggered for your app on both instances is because the resources are shared: the underlying storage is the same for both instances. That's the reason that if it goes down on one, the other would also follow eventually. If you really want the availability of the app to be improved, you can always use a traffic manager. However, there is no guarantee that even with traffic manager in place the app doesn't go down, but it improves the overall availability of your app. Also, we have recently rolled out an update to production that should ideally take care of restarts caused by storage blips, but for this feature to kick in you need to make sure there is an ample amount of memory available. We have a couple of options you can set up in order to avoid any unexpected restarts of the app because of a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that we might have enough memory for the overlap recycling feature to kick in.
If you don't want to move to a bigger instance, you can always use the local cache feature as outlined in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing I can think of is that when you've enabled two instances, they run on the same physical server. But that makes very little sense to me.
I have two instances, each with one core and 1.75 GB of memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description), largely based on the Web Apps sales pitch, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space. This file server is a significant single point of failure.
To mitigate this, Azure has created the WEBSITE_LOCAL_CACHE_OPTION, which caches the contents of the file server onto the individual Web App instances: caching in lieu of solid high-availability engineering principles.
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high-availability solution at the backend that was scrapped when ARM came along, so it is very likely that cloud services suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email with a link to the Web Apps page (and this question), quote the passage above, and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines, Azure lets you specify an availability set. An availability set automatically splits VMs into separate update and fault domains, meaning that the servers end up in different server racks, and those server racks won't get updates at the same time. (It is a little more complex than that, but those are the basics!)
Azure Web Apps do use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that holds your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.

Best way to manage big files downloads

I'm looking for the best way to manage the download of my products on the web.
Each of them weighs between 2 and 20 GB.
Each of them is downloaded approximately 1 to 1,000 times a day by our customers.
I've tried to use Amazon S3 but the download speed isn't good and it quickly becomes expensive.
I've tried to use Amazon S3 + CloudFront but files are too large and the downloads too rare: the files didn't stay in the cache.
Also, I can't create torrent files in S3 because the file sizes are too large.
I guess the cloud solutions (such as S3, Azure, Google Drive...) only work well for small files, such as images, CSS, etc.
Now I'm using my own servers. It works quite well, but it is much more complex to manage...
Is there a better way, a perfect way to manage this sort of downloads?
This is a huge problem and we see it when dealing with folks in the movie or media business: they generate HUGE video files that need to be shared on a tight schedule. Some of them resort to physically shipping hard drives.
When "ordered and guaranteed data delivery" is required (e.g. HTTP, FTP, rsync, nfs, etc.) the network transport is usually performed with TCP. But TCP implementations are quite sensitive to packet loss, round-trip time (RTT), and the size of the pipe, between the sender and receiver. Some TCP implementations also have a hard time filling big pipes (limits on the max bandwidth-delay product; BDP = bit rate * propagation delay).
The ideal solution would need to address all of these concerns.
Reducing the RTT usually means reducing the distance between sender and receiver. As a rule of thumb, cutting your RTT in half can double your max throughput (or halve your turnaround time). For context, I'm seeing an RTT from the US East Coast to the US West Coast of ~80-85 ms.
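The RTT rule of thumb follows from the bandwidth-delay product mentioned above: a single TCP connection can't move more than one window of data per round trip, so its ceiling is roughly window size divided by RTT. A quick sketch of that arithmetic:

```python
# Sketch of the TCP bandwidth-delay product rule of thumb: a single
# connection is capped at roughly window_bytes / rtt_seconds, so
# halving the RTT roughly doubles the ceiling for a fixed window.

def max_tcp_throughput_mbps(window_bytes, rtt_seconds):
    return window_bytes * 8 / rtt_seconds / 1_000_000

# A 64 KiB window over an ~80 ms coast-to-coast path:
print(max_tcp_throughput_mbps(65536, 0.080))  # ~6.55 Mbps
# The same window over a ~15 ms CDN path:
print(max_tcp_throughput_mbps(65536, 0.015))  # ~34.95 Mbps
```

The window values are illustrative; modern stacks scale the window, but the window/RTT ceiling still applies per connection.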
Big deployments typically use a content delivery network (CDN) like Akamai or AWS CloudFront, to reduce the RTT (e.g. ~5-15ms). Simply stated, the CDN service provider makes arrangements with local/regional telcos to deploy content-caching servers on-premise in many cities, and sells you the right to use them.
But control over a cached resource's time-to-live (TTL) can depend on your service level agreement ($). And cache memory is not infinite, so idle resources might be purged to make room for newly requested data, especially if the cache is shared with others.
In your case, it sounds to me like you want to meaningfully reduce the RTT while retaining full control of the cache behaviour, so you can set a really long cache TTL. The best price/performance solution IMO is to deploy your own cache servers running CentOS 7 + NGINX with proxy_cache turned on and enough disk space, and deploy a cache server for each major region (e.g. west coast and east coast). Your end users could select the region closest to them, or you could add some code to automatically detect the closest regional cache server.
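The "automatically detect the closest cache server" step can be as simple as probing each region and picking the lowest measured RTT. A minimal sketch (the region names and measurements are illustrative; in practice you'd time a small HTTP HEAD request against each regional endpoint):

```python
# Illustrative sketch: choose the regional cache server with the lowest
# measured RTT. The rtts_ms dict would be filled in by timing a small
# request against each region's endpoint.

def closest_region(rtts_ms):
    """rtts_ms: dict mapping region name -> measured RTT in ms."""
    return min(rtts_ms, key=rtts_ms.get)

measured = {"us-east": 82.0, "us-west": 11.0}  # example measurements
print(closest_region(measured))  # us-west
```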
Deploying these cache servers on AWS EC2 is definitely an option. Your end users will probably see much better performance than by connecting to AWS S3 directly, and there are no BW caps.
The current AWS pricing for your volume is about $0.09/GB for bandwidth out to the internet. Assuming your ~50 files at an average of 10 GB each, that's about $50/month for bandwidth from cache servers to your end users - not bad? You could start with a c4.large for low/average-usage regions ($79/month). Higher-usage regions might cost you ~$150/month (c4.xl), ~$300/month (c4.2xl), etc. You can get better pricing with spot instances, and you can tune performance based on your business model (e.g. VIP vs best-effort).
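A quick back-of-envelope check of that bandwidth figure, under the stated assumptions (~50 files averaging 10 GB each, ~$0.09/GB egress):

```python
# Back-of-envelope egress cost: pushing one full copy of the catalogue
# (n_files * avg_gb_per_file GB) out to the internet at AWS's
# ~$0.09/GB rate. Prices are as quoted above and may change.

def egress_cost_usd(n_files, avg_gb_per_file, price_per_gb=0.09):
    return n_files * avg_gb_per_file * price_per_gb

print(egress_cost_usd(50, 10))  # 45.0  (~$50/month ballpark)
```

Actual monthly cost scales with how many of those downloads miss the cache, so this is a floor, not an exact bill.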
In terms of being able to "fill the pipe" and sensitivity to network loss (e.g. congestion control, congestion avoidance), you may want to consider an optimized TCP stack like SuperTCP (full disclaimer, I'm the director of development). The idea here is to have a per-connection auto-tuning TCP stack with a lot of engineering behind it, so it can fill huge pipes like the ones between AWS regions, and not overreact to network loss as regular TCP often does, especially when sending to Wi-Fi endpoints.
Unlike UDP solutions, it's a single-sided install (<5 min), you don't get charged for hardware or storage, you don't need to worry about firewalls, and it won't flood/kill your own network. You'd want to install it on your sending devices: the regional cache servers and the origin server(s) that push new requests to the cache servers.
An optimized TCP stack can increase your throughput by 25%-85% over healthy networks, and I've seen anywhere from 2X to 10X throughput on lousy networks.
Unfortunately I don't think AWS is going to have a solution for you. At this point I would recommend looking into some other CDN providers like Akamai https://www.akamai.com/us/en/solutions/products/media-delivery/download-delivery.jsp that provide services specifically geared toward large file downloads. I don't think any of those services are going to be cheap though.
You may also want to look into file acceleration software, like Signiant Flight or Aspera (disclosure: I'm a product manager for Flight). Large files (multiple GB in size) can be a problem for traditional HTTP transfers, especially over large latencies. File acceleration software goes over UDP instead of TCP, essentially masking the latency and increasing the speed of the file transfer.
One negative to using this approach is that your clients will need to download special software to download their files (since UDP is not supported natively in the browser), but you mentioned they use a download manager already, so that may not be an issue.
Signiant Flight is sold as a service, meaning Signiant will run the required servers in the cloud for you.
With file acceleration solutions you'll generally see network utilization of about 80 - 90%, meaning 80 - 90 Mbps on a 100 Mbps connection, or 800 Mbps on a 1 Gbps network connection.

Deleting items from Azure queue painfully slow

My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago, it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters and from that data deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
On the SO post How to achieve more than 10 inserts per second with Azure storage tables and on the MSDN blog
http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some info on how to speed up communication with the queue, but those posts only look at insertion of new items. So far, I haven't been able to find anything on why deletion of queue items would be slow. So the questions are:
(1) Does anyone have a general idea why deletion suddenly may be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mention of any service disruption in West Europe (which is where my stuff is situated); can I rely on the status pages?
(3) In the same storage, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They certainly will give you information, but there is always a lag from the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates and in the management portal you are starting to see where they will notify you if your particular deployments might be affected. With that said, it is not unheard of that small issues crop up from time to time that may never be shown on the board as they don't break SLA or are resolved extremely quickly. It's good you checked this though, it's usually a good first step.
3) In general, no: the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages a second across all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see throttling or failures as you approach that limit. Note that, as you saw with the posts on improving insert throughput, these are performance targets, and how your code is written and the configurations you use have a drastic effect on this. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe that once you hit the actual storage limit, all writes will simply fail until more space is available (I've never come close to it, so I'm not 100% sure about that).
Throughput is also limited at the partition level, and for a queue that target is up to 2,000 messages per second, which you clearly aren't reaching. Since you have only a single worker role, I'll take a guess that you don't have that many producers of messages either, at least not enough to get near 2,000 msgs per second.
I'd turn on storage analytics to see if you are getting throttled as well as check out the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) being recorded in the $MetricsMinutePrimaryTransactionQueue table that the analytics turns on. This will help give you an idea of trends over time as well as possibly help determine if it is a latency issue between the worker roles and the storage system.
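Interpreting those two metrics is mostly a matter of comparing them: if AverageE2ELatency is much larger than AverageServerLatency, time is being lost on the network or client side rather than inside the storage service. A sketch of that check (the rows are illustrative; real values come from the analytics table):

```python
# Sketch: compare end-to-end vs server latency from a storage analytics
# metrics row. A large gap points at the network/client; a small gap
# with high server latency points at the storage service itself.
# The threshold is an arbitrary illustrative choice.

def diagnose(row, gap_threshold_ms=100):
    gap = row["AverageE2ELatency"] - row["AverageServerLatency"]
    return "network/client" if gap > gap_threshold_ms else "service"

row = {"AverageE2ELatency": 950, "AverageServerLatency": 12}
print(diagnose(row))  # network/client
```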
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total throughput on the NIC than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance issues for network-bound work when testing. I'd still expect much better throughput than what you are seeing, though.
There is a network between you and Azure storage, which can add latency.
Sudden peaks (e.g. from 20ms to 2s) can happen often, so you need to deal with this in your code.
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too big or the server latency alone is the limiting factor. The former usually indicates network issues, the latter something being wrong with the queue itself.
Usually these latency issues are transient (just temporary), and there is no need to announce them as a service disruption, because they aren't one. If you see consistently bad performance, you should open a support ticket.
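"Dealing with this in your code" usually means a timeout plus retry with exponential backoff around each storage call. A generic sketch (not tied to any particular Azure SDK; `op` stands in for the delete/get call):

```python
import random
import time

# Generic retry-with-exponential-backoff sketch for transient latency
# spikes or failures; `op` is any callable that may raise on a
# transient error (e.g. a queue delete wrapped in a lambda).

def with_backoff(op, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            # jittered exponential backoff: 0.5s, 1s, 2s, ... plus noise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In production you would catch only the SDK's transient error types rather than bare Exception, and log each retry.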

Azure changing hardware

I have a product which uses CPU ID, network MAC, and disk volume serial numbers for validation. Basically when my product is first installed these values are recorded and then when the app is loaded up, these current values are compared against the old ones.
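A minimal sketch of that kind of validation: hash the recorded hardware identifiers together at install time and compare against a freshly computed value at start-up. The identifier values below are made up for illustration; how you actually read the CPU ID, MAC, and volume serial is platform-specific.

```python
import hashlib

# Illustrative fingerprint check: combine the hardware identifiers into
# one hash at install time; at start-up, recompute and compare. Any
# changed identifier changes the hash and fails validation.

def fingerprint(cpu_id, mac, disk_serial):
    raw = f"{cpu_id}|{mac}|{disk_serial}".encode()
    return hashlib.sha256(raw).hexdigest()

stored = fingerprint("0xBFEBFBFF", "00:0d:3a:11:22:33", "VOL-1234")
# Later, at app start-up, with a changed MAC (as happened on the VM):
current = fingerprint("0xBFEBFBFF", "00:0d:3a:44:55:66", "VOL-1234")
print(current == stored)  # False
```

This all-or-nothing scheme is exactly what breaks when a cloud host migrates the VM; tolerating a subset of changed identifiers is one common mitigation.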
Something very mysterious happened recently. Inside of an Azure VM that had not been restarted in weeks, my app failed to load because some of these values were different. Unfortunately the person who caught the error deleted the VM before it was brought to my attention.
My question is, when an Azure VM is running, what hardware resources may change? Is that even possible?
Thanks!
Answering this requires a short rundown of how Azure works.
In each data centre there are thousands of individual machines. Each machine runs a hypervisor, which allows a number of operating systems to share the same underlying hardware.
When you start a role, Azure looks for available resources (disk space, CPU, RAM, etc.) and boots up a copy of the appropriate OS VM in those available resources. I understand from your question that this is a VM role, so this VM is the one you uploaded or created.
As long as your VM is running, the underlying virtual resources provided by the hypervisor are not likely to change. (The caveat to this is that Windows Server 2012's hypervisor can move virtual machines around over the network even while they are running. Whether Azure takes advantage of this, I don't know.)
Now, Azure keeps charging you even when your role has stopped, because it considers your role "deployed". So in theory, those underlying resources still "belong" to your role.
This is not guaranteed, though. Azure could decide to boot up your VM on a different set of virtualized hardware for any number of reasons: hardware failure being at the top of the list, with insufficient capacity second.
It is even possible (though unlikely) for your resources to be provided by different hardware nodes.
An additional point of consideration is that it is Azure policy that disaster recovery (or other major event) may include transferring your roles to run in a separate data centre entirely.
My point is that the underlying hardware is virtual and treating it otherwise is most unwise. Roles are at the mercy of the Azure Management Routines, and we can't predict in advance what decisions they may make.
So the answer to your question is that ALL of the underlying resources may change. And it is very, very possible.

Collecting high volume DNS stats with libpcap

I am considering writing an application to monitor DNS requests for approximately 200,000 developer and test machines. Libpcap sounds like a good starting point, but before I dive in I was hoping for feedback.
This is what the application needs to do:
Inspect all DNS packets.
Keep aggregate statistics. Include:
DNS name.
DNS record type.
Associated IP(s).
Requesting IP.
Count.
If the number of requesting IPs for one DNS name is > 10, then stop keeping the client ip.
The stats would hopefully be kept in memory, and disk writes would only occur when a new, "suspicious" DNS exchange occurs, or every few hours to write the new stats to disk for consumption by other processes.
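The aggregation rule above (per name/record type, count everything but stop keeping requesting IPs past 10 distinct clients) can be sketched in a few lines. The capture and parsing layer is omitted; this is just the in-memory bookkeeping:

```python
from collections import defaultdict

# Sketch of the aggregation described above: per (DNS name, record
# type), count requests and track requesting IPs only until more than
# MAX_CLIENT_IPS distinct clients have asked for that name.

MAX_CLIENT_IPS = 10
stats = defaultdict(lambda: {"count": 0, "clients": set()})

def record(name, rtype, client_ip):
    entry = stats[(name, rtype)]
    entry["count"] += 1
    if entry["clients"] is not None:
        entry["clients"].add(client_ip)
        if len(entry["clients"]) > MAX_CLIENT_IPS:
            entry["clients"] = None  # too many requesters: stop keeping IPs
```

In the real application this table would be the in-memory state that gets flushed to disk periodically, with `record()` fed by the libpcap parsing loop.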
My questions are:
1. Do any applications exist that can do this? The link will be either 100 Mbps or 1 Gbps.
2. Performance is the #1 consideration by a large margin. I have experience writing C for other one-off security applications, but I am not an expert. Do any tips come to mind?
3. How much of an effort would this be for a good C developer, in man-hours?
Thanks!
Jason
I suggest trying something like DNSCAP or even Snort for capturing DNS traffic.
BTW, I think this is more of a superuser.com question than a Stack Overflow one.
