Best way to manage big file downloads - Azure

I'm looking for the best way to manage the download of my products on the web.
Each of them weighs between 2 and 20 GB.
Each of them is downloaded approximately 1 to 1,000 times a day by our customers.
I've tried to use Amazon S3 but the download speed isn't good and it quickly becomes expensive.
I've tried to use Amazon S3 + CloudFront but files are too large and the downloads too rare: the files didn't stay in the cache.
Also, I can't create torrent files in S3 because the file sizes are too large.
I guess the cloud solutions (such as S3, Azure, Google Drive...) only work well for small files, such as images / CSS / etc.
Now, I'm using my own servers. It works quite well but it is really much more complex to manage...
Is there a better way, a perfect way to manage this sort of downloads?

This is a huge problem and we see it when dealing with folks in the movie or media business: they generate HUGE video files that need to be shared on a tight schedule. Some of them resort to physically shipping hard drives.
When "ordered and guaranteed data delivery" is required (e.g. HTTP, FTP, rsync, nfs, etc.) the network transport is usually performed with TCP. But TCP implementations are quite sensitive to packet loss, round-trip time (RTT), and the size of the pipe, between the sender and receiver. Some TCP implementations also have a hard time filling big pipes (limits on the max bandwidth-delay product; BDP = bit rate * propagation delay).
The ideal solution would need to address all of these concerns.
Reducing the RTT usually means reducing the distance between sender and receiver. As a rule of thumb, cutting your RTT in half can double your max throughput (or cut your turnaround time in half). Just for context, I'm seeing an RTT of ~80-85 ms from the US East Coast to the US West Coast.
Big deployments typically use a content delivery network (CDN) like Akamai or AWS CloudFront, to reduce the RTT (e.g. ~5-15ms). Simply stated, the CDN service provider makes arrangements with local/regional telcos to deploy content-caching servers on-premise in many cities, and sells you the right to use them.
But control over a cached resource's time-to-live (TTL) can depend on your service level agreement ($). And cache memory is not infinite, so idle resources might be purged to make room for newly requested data, especially if the cache is shared with others.
In your case, it sounds to me like you want to meaningfully reduce the RTT while retaining full control of the cache behaviour, so you can set a really long cache TTL. The best price/performance solution IMO is to deploy your own cache servers running CentOS 7 + NGINX with proxy_cache turned on and enough disk space, and deploy a cache server for each major region (e.g. west coast and east coast). Your end users could select the region closest to them, or you could add some code to automatically detect the closest regional cache server.
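As a minimal sketch of what such a regional cache could look like (host names, paths and sizes below are placeholders, not a tested config):

    # /etc/nginx/conf.d/download-cache.conf -- illustrative values only
    proxy_cache_path /var/cache/nginx/downloads levels=1:2
                     keys_zone=downloads:100m max_size=500g inactive=365d;

    server {
        listen 80;
        server_name us-west.downloads.example.com;    # placeholder regional host

        location / {
            proxy_pass            https://origin.example.com;   # your origin server
            proxy_cache           downloads;
            proxy_cache_key       $uri;
            proxy_cache_valid     200 365d;   # keep successful responses for a year
            proxy_cache_lock      on;         # only one request populates a new entry
            proxy_cache_use_stale error timeout updating;
        }
    }

With long 'inactive' and 'proxy_cache_valid' values the TTL stays under your control, which is exactly the knob a shared CDN may not give you.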
Deploying these cache servers on AWS EC2 is definitely an option. Your end users will probably see much better performance than by connecting to AWS S3 directly, and there are no BW caps.
The current AWS pricing for your volume is about $0.09/GB for BW out to the internet. Assuming your ~50 files at an average of 10GB, that's about $50/month for BW from cache servers to your end users - not bad? You could start with c4.large for low/average usage regions ($79/month). Higher usage regions might cost you about ~$150/month (c4.xl), ~$300/month (c4.2xl), etc. You can get better pricing with spot instances and you can tune performance based on your business model (e.g. VIP vs Best-Effort).
In terms of being able to "fill the pipe" and sensitivity to network loss (e.g. congestion control, congestion avoidance), you may want to consider an optimized TCP stack like SuperTCP (full disclaimer, I'm the director of development). The idea here is to have a per-connection auto-tuning TCP stack with a lot of engineering behind it, so it can fill huge pipes like the ones between AWS regions, and not overreact to network loss as regular TCP often does, especially when sending to Wi-Fi endpoints.
Unlike UDP solutions, it's a single-sided install (<5 min), you don't get charged for hardware or storage, you don't need to worry about firewalls, and it won't flood/kill your own network. You'd want to install it on your sending devices: the regional cache servers and the origin server(s) that push new requests to the cache servers.
An optimized TCP stack can increase your throughput by 25%-85% over healthy networks, and I've seen anywhere from 2X to 10X throughput on lousy networks.

Unfortunately I don't think AWS is going to have a solution for you. At this point I would recommend looking into some other CDN providers like Akamai https://www.akamai.com/us/en/solutions/products/media-delivery/download-delivery.jsp that provide services specifically geared toward large file downloads. I don't think any of those services are going to be cheap though.

You may also want to look into file acceleration software, like Signiant Flight or Aspera (disclosure: I'm a product manager for Flight). Large files (multiple GB in size) can be a problem for traditional HTTP transfers, especially over large latencies. File acceleration software goes over UDP instead of TCP, essentially masking the latency and increasing the speed of the file transfer.
One negative to using this approach is that your clients will need to download special software to download their files (since UDP is not supported natively in the browser), but you mentioned they use a download manager already, so that may not be an issue.
Signiant Flight is sold as a service, meaning Signiant will run the required servers in the cloud for you.
With file acceleration solutions you'll generally see network utilization of about 80 - 90%, meaning 80 - 90 Mbps on a 100 Mbps connection, or 800 Mbps on a 1 Gbps network connection.

Related

How to improve download speed on Azure VMs?

My organization is spinning down its private cloud and preparing to deploy a complete analytics and data warehousing solution on Azure VMs. Naturally, we are doing performance testing so we can identify and address any unforeseen issues before fully decommissioning our datacenter servers.
So far, the only significant issue I've noticed in Azure VMs is that my download speeds don't seem to change at all no matter which VM size I test. The speed is always right around 60 Mbps downstream. This is despite guidance such as this which indicates I should see improvements in ingress based on VM size.
I have done significant and painstaking research into the issue, but everything I've read so far only really addresses "intra-cloud" optimizations (e.g., ExpressRoute, Accelerated Networking, VNet peering) or communications to specific resources. My requirement is for fast download speeds on the open internet (secure and trusted sites only). To preempt any attempts to question this requirement, I will not specifically address the reasons why I can't consider alternatives such as establishing private/dedicated connections, but suffice it to say, I have no choice but to rule out those options.
That said, I'm open to any other advice! What, if anything, can I do to improve download speeds to Azure VMs when data needs to be transferred to the VM from the open web?
Edit: Corrected comment about Azure guidance.
I finally figured out what was going on. #evilSnobu was onto something -- there was something flawed about my tests. Namely, all of them (seemingly by sheer coincidence) "throttled" my data transfers. I was able to confirm this by examining the network throughput carefully. Since my private cloud environment never provisioned enough bandwidth to hit the 50-60 Mbps ceiling that seems to be fairly common among certain hosts, it didn't occur to me that I could eventually be throttled at a higher throughput rate. Real bummer. What this experiment did teach me is that you should NOT assume more bandwidth will solve all your throughput problems. Throttling appears to be exceptionally common, and I would suggest planning for what to do when you encounter it.

How does a service like Put.io work?

Just got invited to put.io... it's a service that takes a torrent file (or a magnet link) as input and gives you a static file available for download from its own server. I've been trying to understand how a service like this works.
It can't be as simple as torrenting the file and serving it via a CDN... can it? Because the speeds it offers seem insanely fast to me.
Any idea about the bandwidth implications (or the amount used) by the service?
I believe services like this are typically just running one or more BitTorrent clients on beefy machines with a fast link. You only have to download the torrent the first time someone asks for it; then you can cache it for the next person who requests it.
The bandwidth usage is not unreasonable: since you're caching the files, you actually end up using less bandwidth than if you were to, say, simply proxy downloads for people.
I would imagine that using a CDN would not be very common. There's a certain overhead involved in that. You could possibly promote files out of your cache to a CDN once you're certain that they are and will stay popular.
The service I was involved with simply ran 14 instances of libtorrent, each on a separate drive, serving completed files straight off of those drives with nginx. Torrents were requested from the web front end and prioritized before being handed over to the downloader. Each instance would download around 70 or so torrents in parallel.
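As a rough illustration of that last setup (drive paths and the host name are made up), the web-serving side really can be as simple as one nginx location per drive, pointing at that drive's completed-downloads directory:

    # Illustrative only -- one location per libtorrent instance / drive
    server {
        listen 80;
        server_name files.example.com;
        sendfile on;                 # let the kernel stream large files efficiently

        location /d01/ { alias /mnt/drive01/completed/; }
        location /d02/ { alias /mnt/drive02/completed/; }
        # ...and so on for the remaining drives
    }

The BitTorrent clients fill those directories; nginx just hands the finished files out over HTTP.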

Collecting high volume DNS stats with libpcap

I am considering writing an application to monitor DNS requests for approximately 200,000 developer and test machines. Libpcap sounds like a good starting point, but before I dive in I was hoping for feedback.
This is what the application needs to do:
Inspect all DNS packets.
Keep aggregate statistics. Include:
DNS name.
DNS record type.
Associated IP(s).
Requesting IP.
Count.
If the number of requesting IPs for one DNS name is > 10, then stop keeping the client ip.
The stats would hopefully be kept in memory, and disk writes would only occur when a new, "suspicious" DNS exchange occurs, or every few hours to write the new stats to disk for consumption by other processes.
My questions are:
1. Do any applications exist that can do this? The link will be either 100 Mbit/s or 1 Gbit/s.
2. Performance is the #1 consideration by a large margin. I have experience writing C for other one-off security applications, but I am not an expert. Do any tips come to mind?
3. How much of an effort would this be for a good C developer, in man-hours?
Thanks!
Jason
I suggest you try something like DNSCAP or even Snort for capturing DNS traffic.
BTW, I think this is more of a superuser.com question than a Stack Overflow one.
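If you do end up rolling your own on top of libpcap, the capture side is fairly small. Here's a minimal C sketch that opens an interface, installs a BPF filter for DNS traffic and counts matching packets; the DNS parsing and the aggregate statistics described in the question are deliberately left as a stub, and the interface name and reporting interval are arbitrary:

    /* dns_stats.c - minimal libpcap skeleton for DNS capture.
     * Build with: gcc -O2 dns_stats.c -lpcap -o dns_stats */
    #include <pcap.h>
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long packet_count = 0;

    /* Called by pcap_loop() for every packet that matches the filter. */
    static void handle_packet(u_char *user, const struct pcap_pkthdr *hdr,
                              const u_char *bytes)
    {
        (void)user; (void)bytes;
        packet_count++;
        /* TODO: skip the Ethernet/IP/UDP headers, parse the DNS header and
         * question section, and update the per-name statistics here. */
        if (packet_count % 100000 == 0)
            fprintf(stderr, "captured %lu DNS packets (last len=%u)\n",
                    packet_count, hdr->len);
    }

    int main(int argc, char **argv)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        const char *dev = (argc > 1) ? argv[1] : "eth0";  /* capture interface */
        struct bpf_program filter;

        /* 65535-byte snaplen, promiscuous mode, 1000 ms read timeout */
        pcap_t *handle = pcap_open_live(dev, 65535, 1, 1000, errbuf);
        if (handle == NULL) {
            fprintf(stderr, "pcap_open_live(%s): %s\n", dev, errbuf);
            return EXIT_FAILURE;
        }

        /* Only DNS traffic reaches the callback; the kernel drops the rest. */
        if (pcap_compile(handle, &filter, "udp port 53", 1,
                         PCAP_NETMASK_UNKNOWN) == -1 ||
            pcap_setfilter(handle, &filter) == -1) {
            fprintf(stderr, "filter error: %s\n", pcap_geterr(handle));
            return EXIT_FAILURE;
        }

        pcap_loop(handle, -1, handle_packet, NULL);   /* -1 = run until error */
        pcap_close(handle);
        return EXIT_SUCCESS;
    }

If the link really runs anywhere near line rate you may need something like PF_RING or an AF_PACKET ring buffer to avoid drops, but plain libpcap with a tight filter is a reasonable starting point.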

Number of instances needed for Windows Azure application

I'm fairly new to Windows Azure and want to host a survey application that will be filled out by approximately 30,000 users simultaneously.
The application consists of 1 .aspx page that will be sent to the client once, asks 25 questions, and gives a wrap-up of the given answers at the end. When the user has answered a question and hits the 'next question' button, the answer is sent via an .ashx handler to the server. The response is the next question and answers. The wrap-up is sent to the client after a full postback.
The answer is saved in an Azure Table that is partitioned so that each partition can hold a max of 450 users.
I would like to ask if someone can give an estimated guess about how many web-role instances we need to start in order to have this application keep running. (If that is too hard to say, is it more likely to start 5, 50 or 500 instances?)
What is a better way to go: 20 small instances or 5 large instances?
Thanks for your help!
The most obvious answer: you would be best served by testing this yourself and seeing how your application holds up. You can easily get performance counters and other diagnostics out of Windows Azure; for instance, you can connect Microsoft SCOM (System Center Operations Manager) to monitor your environment during the test. Site Hammer is a simple load testing tool for Windows Azure (on the MSDN code gallery).
Apart from this very obvious answer, I will share some guesstimates: given the type of load, you are probably better off with more small instances as opposed to a lower number of large ones, especially since you already have your storage partitioned. If you are really going to have 30K visitors simultaneously and give them a ~15 second interval between reading the questions and posting their answers, you are looking at 2,000 requests per second. 10 nodes should be more than enough to handle that load. Remember that this is just a simple estimate, lacking any form of insight into your architecture, etc. For these types of loads, caching is a very good idea; it will dramatically increase the load each node can handle.
However, the best advice I can give you is to make sure that you are actively monitoring. It takes less than 30 minutes to spin up additional instances, so if you monitor your environment and/or make sure that you are notified whenever it starts to choke, you can easily upgrade your setup. Keep in mind that you do need to contact customer support to be able to go over 20 instances (this is a default limit, in place to protect you from over-spending).
Aside from the sage advice tijmenvdk gave you, let me add my opinion on instance size. In general, go with the smallest size that will support your app, and then scale out to handle increased traffic. This way, when you scale back down, your minimum compute cost is kept low. If you ran, say, a pair of extra-large instances as your baseline (since you always want minimum two instances to get the uptime SLA), your cost footprint starts at 0.12 x 8 x 2 = $1.92 per hour, even during low-traffic times. If you go with small instances, you'd be at 0.12 x 1 x 2 = $0.24 per hour.
Each VM size has associated CPU, memory, and local (non-durable) disk storage, so pick the smallest size unit that your app works efficiently in.
For load/performance-testing, you might also want to consider a hosted solution such as Loadstorm.
How simultaneous are the requests in reality?
Will they all type the address in at exactly the same time?
That said, profile your app locally; this will enable you to estimate CPU, network and memory usage on Azure. Then, rather than looking at how many instances you need, look at how you can reduce the requirement! Apply these tips, and profile locally again.
Most performance tips have a tradeoff between CPU, memory or bandwidth usage; the idea is to ensure that they scale equally. If your application runs out of memory but you have loads of CPU and network to spare, don't apply tips that trade even more memory for CPU or bandwidth.
For a single-page survey, ensure your HTML, CSS & JS are minified and cacheable.
Combine them if possible, and to get really scalable, push static files (CSS, JS & images) to a CDN. This all reduces the number of requests the webserver has to deal with, and therefore reduces the number of web roles you will need = less network.
How does the .ashx return the response? i.e. is it sending HTML, XML or JSON?
Personally, I'd have it return JSON, as this will require less network bandwidth and most likely less server-side processing = less memory and network.
Use asynchronous APIs to access Azure storage (this uses I/O completion ports to free up the IIS thread to handle more requests until Azure storage comes back = enabling CPU to scale).
tijmenvdk has already mentioned using queues to write. Does the list of questions change? If not, cache it, so that the app only has to read from table storage once on start-up and once for each client for the final wrap-up = saves network and CPU at the expense of memory.
All of these tips are equally applicable to a normal web application, on a single server or web-farm environment.
The point I'm trying to make is that what you can't measure, you can't improve, and measurement, improvement and cost all go hand in hand. Dynamic scaling will reduce costs, but fundamentally, if your application hasn't been measured and its resource usage optimised, asking how many instances you need is pointless.

Which resources should one monitor on a Linux server running a web-server or database

When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?
As many as you can afford to collect, and can then graph, understand and look at. Monitoring resources is useful not only for capacity planning, but also for anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd also want to monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably related to memory usage), and context switches.
I really like munin for graphing things related to hosts.
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services, or page you to alert you to problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat, which is vmstat's nicer-looking cousin. It will show most everything you need to know about a machine's health (see the example command after the list), including:
CPU
Disk
Memory
Network
Swap
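For example, something along these lines (the column selection and 5-second interval are just one reasonable choice):

    # CPU, disk, network, paging, memory and swap columns, refreshed every 5 s
    dstat -c -d -n -g -m -s 5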
"df -h" to make sure that no partition runs full which can lead to all kinds of funky problems, watching the syslog is of course also useful, for that I recommend installing "logwatch" (Logwatch Website) on your server which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk I/O, interrupts and such at the system level. The application level gets more specific, so a webserver might measure hits/second and errors/second (non-200 responses), while a database might measure queries/second, average query fulfillment time, etc.
Beware the afore-mentioned slowquerylog in mysql. It should only be used when trying to figure out why some queries are slow. It has the side-effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
In addition to top and auth.log, I often look at mtop, and enable MySQL's slow query log (see the example below) and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.
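For reference, turning on the slow query log is just a couple of my.cnf lines (the variable names shifted slightly between MySQL versions, so treat this as a sketch), plus mysqldumpslow to summarise the result:

    # my.cnf ([mysqld] section) -- MySQL 5.1+ style names
    slow_query_log      = 1
    slow_query_log_file = /var/log/mysql/mysql-slow.log
    long_query_time     = 2          # log anything slower than 2 seconds

    # summarise the log, sorted by total time
    mysqldumpslow -s t /var/log/mysql/mysql-slow.log

As noted in another answer above, leave it enabled only while you're investigating, since logging adds overhead.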
Network, of course :) Use MRTG to get some nice bandwidth graphs; they're just pretty most of the time... until a spammer finds a hole in your security and it suddenly increases.
Nagios is good for alerting as mentioned, and is easy to get setup. You can then use the mrtg plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.
