Azure CDN - truncating "large files" somehow?

Edit: This is all mysteriously working now, although I wasn't able to figure out what actually caused the issue. Might not have been the CDN at all? Leaving this here for posterity and will update if I ever see this kind of thing happen again...
I've been experimenting with Azure CDN (Microsoft hosted, not Akamai or Verizon) to handle file downloads for a couple of Azure Web Apps. It had been working fine until today, when it began returning truncated versions of a "large file", resulting in a PDF that couldn't be opened (by "large file" I specifically mean a file handled by Azure CDN's Large File Optimisation feature).
The file works fine from the origin URL and is 8.59 MB, but the same file retrieved from the CDN endpoint is exactly 8 MB, which, by a suspicious coincidence, happens to be the chunk size used by the Large File Optimisation feature mentioned above. Relevant part of the documentation:
Azure CDN Standard from Microsoft uses a technique called object chunking. When a large file is requested, the CDN retrieves smaller pieces of the file from the origin. After the CDN POP server receives a full or byte-range file request, the CDN edge server requests the file from the origin in chunks of 8 MB.
... This optimization relies on the ability of the origin server to support byte-range requests
File URLs in question:
Origin
CDN
I've also uploaded the same file directly into the website's filesystem, to rule out the CMS (Umbraco) and its blob-storage filesystem interfering, but the result is exactly the same. Here are the links for reference.
Origin
CDN
In both cases the two files are binary-identical except that the file from the CDN stops abruptly at 8 MB, even though the origin supports byte-range requests (verified with Postman) and the Azure CDN documentation linked above claims that
If not all the chunks are cached on the CDN, prefetch is used to request chunks from the origin
And that
There are no limits on maximum file size.
The same issue has occurred with other files over 8 MB too, although those had previously worked as of last week. No changes to CDN configuration have been made since then.
What I'm thinking is happening is something like:
Client requests file download from CDN
CDN figures out that it's a "large file" and requests the first 8mb chunk from the origin
Origin replies with an 8mb chunk as requested
CDN begins returning 8mb chunk to client
CDN either doesn't request the next chunk, or origin doesn't provide it
Client only receives the first 8mb of the file
Or perhaps I'm barking up the wrong tree. I've already tried turning off compression and am not really sure where else to go from here. This is probably my fault, so have I misconfigured the CDN somehow? I've considered purging the CDN's cache, but I don't really see that as a solution and would rather avoid manual workarounds...

This optimization relies on the ability of the origin server to support byte-range requests; if the origin server doesn't support byte-range requests, requests to download data greater than 8 MB in size will fail.
https://learn.microsoft.com/en-us/azure/cdn/cdn-large-file-optimization
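For anyone debugging something similar, here's a rough sanity check (not from the original post) comparing what the origin and the CDN actually return, and confirming the origin honours Range requests the way the large-file optimisation expects. The URLs are placeholders.

```python
import requests

# Placeholders; substitute the real origin and CDN endpoint URLs.
ORIGIN_URL = "https://example-origin.azurewebsites.net/files/report.pdf"
CDN_URL = "https://example-endpoint.azureedge.net/files/report.pdf"

# What size does the origin claim, and does it advertise range support?
head = requests.head(ORIGIN_URL, allow_redirects=True)
print("Origin Content-Length:", head.headers.get("Content-Length"))
print("Origin Accept-Ranges:", head.headers.get("Accept-Ranges"))

# Ask the origin for the second 8 MB chunk, the way the CDN's large-file
# optimisation would; a 206 Partial Content response means ranges work.
ranged = requests.get(ORIGIN_URL, headers={"Range": "bytes=8388608-16777215"})
print("Origin ranged request:", ranged.status_code, "->", len(ranged.content), "bytes")

# Finally, how many bytes does the CDN actually hand back for a full request?
cdn = requests.get(CDN_URL)
print("CDN full request:", cdn.status_code, "->", len(cdn.content), "bytes")
```

If the ranged request comes back 206 and the origin's Content-Length matches the full file, but the CDN download stops at exactly 8,388,608 bytes, that points at the chunking behaviour rather than the origin.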

Related

Can I serve an already compressed file over Azure CDN Verizon Premium tier?

I have a JS file with an uncompressed size of 5 MB. Compressed, it is 1-2 MB. According to the Azure docs, Verizon Premium does not support compressing files larger than 1 MB.
Can I send it compressed from the origin server, with the respective headers passed on? Or does the file need to be served uncompressed from the origin? I think it's the former, but I just want to confirm.
If the origin has the resource compressed, then it will also be served compressed from the CDN:
enable compression on your origin server. In this case, Azure CDN passes along the compressed files and delivers them to clients that request them.
https://learn.microsoft.com/en-us/azure/cdn/cdn-improve-performance
provided it wasn't previously cached (if it was, the old cached copy will keep being served until it's purged or its TTL expires).
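If your origin is an app rather than a static host, a minimal sketch of serving the pre-compressed file with the headers the CDN needs to pass it through might look like this (Flask used purely for illustration; the file paths are placeholders):

```python
from flask import Flask, request, send_file

app = Flask(__name__)


@app.route("/static/app.js")
def compressed_js():
    # Serve the pre-compressed copy only to clients that accept gzip.
    if "gzip" in request.headers.get("Accept-Encoding", ""):
        resp = send_file("static/app.js.gz", mimetype="application/javascript")
        resp.headers["Content-Encoding"] = "gzip"
    else:
        resp = send_file("static/app.js", mimetype="application/javascript")
    # Vary tells the CDN to keep separate cache entries per encoding.
    resp.headers["Vary"] = "Accept-Encoding"
    resp.headers["Cache-Control"] = "public, max-age=31536000"
    return resp
```

The important parts are Content-Encoding (so the CDN and clients know the payload is already gzipped) and Vary: Accept-Encoding (so the CDN caches compressed and uncompressed variants separately).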

CloudFront report showing 404 URLs

There are URLs in my CloudFront distribution that are returning 404. Once I invalidate them, all is well. I assume that at some point in time the origin server returned a 404, which was cached by CloudFront.
Is there a way to generate a CloudFront report showing the URLs that are marked as missing? (404s)
Is there a way to create an alert for new ones?
On top of alexjs's answer above, one could also set the 404 cache period to 0. In my case that was the desired behavior, as 404s should never happen and usually suggest a server issue.
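For reference, a hedged sketch of doing that with boto3; the distribution ID is a placeholder, and note that update_distribution needs the full config back plus the ETag from the read, and that this example overwrites any existing custom error responses rather than merging them.

```python
import boto3

DISTRIBUTION_ID = "E1EXAMPLE"  # placeholder
cloudfront = boto3.client("cloudfront")

# Read the current config; the ETag is required for the conditional update.
resp = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config = resp["DistributionConfig"]
etag = resp["ETag"]

# Stop caching 404 responses by setting their error-caching TTL to 0.
# (In real use, merge with config["CustomErrorResponses"] instead of replacing it.)
config["CustomErrorResponses"] = {
    "Quantity": 1,
    "Items": [{"ErrorCode": 404, "ErrorCachingMinTTL": 0}],
}

cloudfront.update_distribution(
    Id=DISTRIBUTION_ID,
    DistributionConfig=config,
    IfMatch=etag,
)
```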
As you've observed, CloudFront by default caches errors.
Are the 404s valid URLs? A common issue is some part of an application requesting a URL (e.g. a file) to determine its presence before uploading/publishing. This obviously backfires a bit if you have a caching layer.
Is there a way to generate a CloudFront report showing the URLs that are marked as missing? (404s)
These will be written to CloudFront logs.
Is there a way to create an alert for new ones?
You could use CloudWatch Alarms to alert based on 4xx/5xx error rate, but this wouldn't give you granularity into the URIs themselves.
You could use a Lambda function that is invoked upon log delivery to your given S3 bucket, configured to parse the logs for 4xx responses and then send you a notification (SNS/SES/etc.).
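A rough sketch of that Lambda, assuming standard gzipped CloudFront access logs delivered to S3 and an SNS topic you already own (the topic ARN below is a placeholder):

```python
import gzip
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
# Placeholder ARN; set TOPIC_ARN in the function's environment instead.
TOPIC_ARN = os.environ.get("TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:cf-404s")


def handler(event, context):
    """Triggered by S3 ObjectCreated events on the CloudFront log bucket."""
    not_found = set()
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        lines = gzip.decompress(body).decode("utf-8").splitlines()

        fields = []
        for line in lines:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]  # column names from the log header
                continue
            if line.startswith("#") or not fields:
                continue
            row = dict(zip(fields, line.split("\t")))
            if row.get("sc-status") == "404":
                not_found.add(row.get("cs-uri-stem", "?"))

    if not_found:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="CloudFront 404s detected",
            Message="\n".join(sorted(not_found)),
        )
    return {"count": len(not_found)}
```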

Amazon CloudFront - Invalidating files by regex, e.g. *.png

Is there a way to have Amazon CloudFront invalidation (via the management console), invalidate all files that match a pattern? e.g. images/*.png
context -
I had set cache control for images on my site, but by mistake left the png extension out of the cache directive on Apache. So .gif/.jpg files were cached on users' computers but .png files were not.
So I fixed the apache directive and now my apache server serves png files with appropriate cache control directives. I tested this.
But CloudFront had fetched those png files in the past, so hitting them via CloudFront still returns them with no cache control. End result: still no user caching for those png files.
I tried setting the invalidation in the Amazon CloudFront console as images/*.png. The console said it completed, but I still don't get a cache-control directive on the png files, which makes me believe the invalidation did not happen.
I could set the invalidation for the complete image directory, but I have too many image files and would get charged over $100 for it, so I'm trying to avoid that.
Changing image versions so that CloudFront fetches new copies is a painful exercise in my code; doing it for, say, 500 png files would be a pain, so I'm trying to avoid it.
Listing individual png files is also a pain, so I'm trying to avoid that as well.
Thanks,
-Amit
If your CloudFront distribution is configured in front of an S3 bucket, you can list all of the objects in the S3 bucket, filter them with a regex pattern (e.g., /\.png$/i), then use that list to construct your invalidation request.
That's what I do anyway. I hope this helps! :)
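Something along these lines with boto3; the bucket name and distribution ID are placeholders, and the paths are sent in batches to stay within CloudFront's limits on paths per invalidation.

```python
import re
import time

import boto3

BUCKET = "my-origin-bucket"    # placeholder
DISTRIBUTION_ID = "E1EXAMPLE"  # placeholder
PATTERN = re.compile(r"\.png$", re.IGNORECASE)

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# Collect every matching object key; invalidation paths must start with "/".
paths = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if PATTERN.search(obj["Key"]):
            paths.append("/" + obj["Key"])

# Submit in batches so a single request doesn't exceed CloudFront's path limits.
for i in range(0, len(paths), 3000):
    batch = paths[i:i + 3000]
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": len(batch), "Items": batch},
            "CallerReference": f"png-invalidation-{int(time.time())}-{i}",
        },
    )
```

Bear in mind that per-path invalidations are what incur the cost mentioned in the question, so check the current pricing before running this over a large bucket.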

What is HTTP cache best practices for high-traffic static site?

We have a fairly high-traffic static site (i.e. no server code), with lots of images, scripts, css, hosted by IIS 7.0
We'd like to turn on some caching to reduce server load, and are considering setting the expiry of web content to some time in the future. In IIS, we can do this at a global level via the "Expire web content" section of the common HTTP headers in the IIS response-headers module, perhaps setting content to expire 7 days after serving.
As far as I can tell, all this actually does is set the max-age HTTP response header, which makes sense, I guess.
Now, the confusion:
Firstly, all browsers I've checked (IE9, Chrome, FF4) seem to ignore this and still make conditional requests to the server to see if content has changed. So I'm not entirely sure what the max-age response header will actually affect. Could it be older browsers? Or web caches?
It's possible that we may want to change an image on the site at short notice... I'm guessing that if the max-age is actually honoured by something, then by its very nature it won't check whether that image has changed for 7 days... so that's not what we want either.
I wonder if a best practice is to partition one's site into folders of content that really won't change often and only turn on long-term expiry for those folders? Perhaps varying the querystring to force a refresh of content in those folders if needed (e.g. /assets/images/background.png?version=2)?
Anyway, having looked through the (rather dry!) HTTP specification, and some of the tutorials, I still don't really have a feel for what's right in our situation.
Any real-world experience of a situation similar to ours would be most appreciated!
Browsers fetch the HTML first, then all the resources inside (css, javascript, images, etc).
If you make the HTML expire soon (e.g. 1 hour or 1 day) and then make the other resources expire after 1 year, you can have the best of both worlds.
When you need to update an image, or other resource, you just change the name of that file, and update the HTML to match.
The next time the user gets fresh HTML, the browser will see a new URL for that image, and get it fresh, while grabbing all the other resources from a cache.
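As a sketch of that rename-and-reference approach (the directory names are made up), you could fingerprint each long-lived asset with a content hash and let a build step rewrite the HTML from the resulting manifest:

```python
import hashlib
import pathlib
import shutil

ASSETS = pathlib.Path("site/assets")          # placeholder input directory
OUTPUT = pathlib.Path("site/assets-hashed")   # placeholder output directory
OUTPUT.mkdir(parents=True, exist_ok=True)

manifest = {}
for path in ASSETS.glob("*"):
    if path.is_file():
        # Short content hash; any change to the file changes the URL.
        digest = hashlib.md5(path.read_bytes()).hexdigest()[:8]
        hashed_name = f"{path.stem}.{digest}{path.suffix}"  # e.g. background.3f2a1c9d.png
        shutil.copy2(path, OUTPUT / hashed_name)
        manifest[path.name] = hashed_name

# The manifest maps original names to hashed names so the HTML build step
# can rewrite references like assets/background.png to the hashed URL.
print(manifest)
```

The hashed files can then be served with a far-future max-age, while the HTML itself keeps a short expiry.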
Also, at the time of this writing (December 2015), Firefox limits the maximum number of concurrent connections to a server to six (6). This means if you have 30 or more resources that are all hosted on the same website, only 6 are being downloaded at any time until the page is loaded. You can speed this up a bit by using a content delivery network (CDN) so that everything downloads at once.

How can I prevent Amazon Cloudfront from hotlinking?

I use Amazon CloudFront to host all my site's images and videos, to serve them faster to my users, who are pretty scattered across the globe. I also apply pretty aggressive forward caching to the elements hosted on CloudFront, setting Cache-Control to public, max-age=7776000.
I've recently discovered to my annoyance that third party sites are hotlinking to my Cloudfront server to display images on their own pages, without authorization.
I've configured .htaccess to prevent hotlinking on my own server, but haven't found a way of doing this on CloudFront, which doesn't seem to support the feature natively. And, annoyingly, Amazon's bucket policies, which could be used to prevent hotlinking, only have effect on S3; they have no effect on CloudFront distributions [link]. If you want to take advantage of the policies, you have to serve your content from S3 directly.
Scouring my server logs for hotlinkers and manually changing the file names isn't really a realistic option, although I've been doing this to end the most blatant offenses.
You can forward the Referer header to your origin:
Go to CloudFront settings
Edit Distributions settings for a distribution
Go to the Behaviors tab and edit or create a behavior
Set Forward Headers to Whitelist
Add Referer as a whitelisted header
Save the settings in the bottom right corner
Make sure to handle the Referer header on your origin as well.
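If your origin is an application rather than plain Apache, the origin-side check could be as simple as this hedged sketch (Flask and the allowed domains are illustrative placeholders):

```python
from flask import Flask, abort, request, send_from_directory

app = Flask(__name__)
# Placeholder domains; list every site allowed to embed your media.
ALLOWED_REFERERS = ("https://www.example.com/", "https://example.com/")


@app.route("/media/<path:filename>")
def media(filename):
    referer = request.headers.get("Referer", "")
    # Empty Referer is allowed here so direct visits and privacy-conscious
    # browsers still work; tighten this if you prefer.
    if referer and not referer.startswith(ALLOWED_REFERERS):
        abort(403)
    return send_from_directory("media", filename)
```

Remember that once Referer is whitelisted for forwarding, CloudFront caches per Referer value, which can reduce your cache hit ratio.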
We had numerous hotlinking issues. In the end we created CSS sprites for many of our images, either adding white space to the bottom/sides or combining images together.
We displayed them correctly on our pages using CSS, but any hotlinks would show the images incorrectly unless they copied the CSS/HTML as well.
We've found that they don't bother (or don't know how).
The official approach is to use signed URLs for your media. For each piece of media you want to distribute, you can generate a specially crafted URL that only works within a given constraint of time and source IPs.
One approach for static pages is to generate temporary URLs for the media included in that page that are valid for 2x the page's caching time. Let's say your page's caching time is one day: every two days the links would be invalidated, which forces the hotlinkers to update their URLs. It's not foolproof, as they can build tools to fetch the new URLs automatically, but it should deter most people.
If your page is dynamic, you don't need to worry about trashing your page's cache, so you can simply generate URLs that only work for the requester's IP.
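A hedged sketch of generating such a URL with botocore's CloudFrontSigner; the key pair ID, key file, and distribution URL are all placeholders.

```python
import datetime

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message):
    # Placeholder path to the private key matching your CloudFront key pair.
    with open("private_key.pem", "rb") as key_file:
        private_key = serialization.load_pem_private_key(key_file.read(), password=None)
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())


signer = CloudFrontSigner("APKAEXAMPLE", rsa_signer)  # placeholder key pair ID

# Valid for two days, per the "2x the page's caching time" idea above.
expires = datetime.datetime.utcnow() + datetime.timedelta(days=2)
signed_url = signer.generate_presigned_url(
    "https://dxxxxx.cloudfront.net/media/video.mp4",  # placeholder object URL
    date_less_than=expires,
)
print(signed_url)
```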
As of Oct. 2015, you can use AWS WAF to restrict access to Cloudfront files. Here's an article from AWS that announces WAF and explains what you can do with it. Here's an article that helped me setup my first ACL to restrict access based on the referrer.
Basically, I created a new ACL with a default action of DENY. I added a rule that checks the end of the referer header string for my domain name (lowercase). If it passes that rule, it ALLOWS access.
After assigning my ACL to my CloudFront distribution, I tried to load one of my data files directly in Chrome and got an error, as the ACL blocked the request.
As far as I know, there is currently no solution, but I have a few possibly relevant, possibly irrelevant suggestions...
First: Numerous people have asked this on the Cloudfront support forums. See here and here, for example.
Clearly AWS benefits from hotlinking: the more hits, the more they charge us for! I think we (Cloudfront users) need to start some sort of heavily orchestrated campaign to get them to offer referer checking as a feature.
Another temporary solution I've thought of is changing the CNAME I use to send traffic to CloudFront/S3. So let's say you currently send all your images to:
cdn.blahblahblah.com (which points to some CloudFront/S3 bucket)
You could change it to cdn2.blahblahblah.com and delete the DNS entry for cdn.blahblahblah.com.
As a DNS change, that would knock out all the people currently hotlinking before their traffic got anywhere near your server: the DNS entry would simply fail to look up. You'd have to keep changing the cdn CNAME to make this effective (say once a month?), but it would work.
It's actually a bigger problem than it seems because it means people can scrape entire copies of your website's pages (including the images) much more easily - so it's not just the images you lose and not just that you're paying to serve those images. Search engines sometimes conclude your pages are the copies and the copies are the originals... and bang goes your traffic.
I am thinking of abandoning Cloudfront in favor of a strategically positioned, super-fast dedicated server (serving all content to the entire world from one place) to give me much more control over such things.
Anyway, I hope someone else has a better answer!
This question mentioned image and video files.
Referer checking cannot be used to protect multimedia resources from hotlinking, because some mobile browsers do not send the Referer header when requesting an audio or video file played using HTML5.
I am sure of this for Safari and Chrome on iPhone and Safari on Android.
Too bad! Thank you, Apple and Google.
How about using signed cookies? You can create a signed cookie using a custom policy, which supports the various kinds of restrictions you want to set and also allows wildcards.
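A rough sketch of issuing those cookies with a custom (wildcard) policy via botocore; the key pair ID, key file, and resource pattern are placeholders, and the character substitutions follow CloudFront's URL-safe base64 convention.

```python
import base64
import datetime

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message):
    with open("private_key.pem", "rb") as f:  # placeholder key file
        key = serialization.load_pem_private_key(f.read(), password=None)
    return key.sign(message, padding.PKCS1v15(), hashes.SHA1())


def cf_b64(data: bytes) -> str:
    # CloudFront's flavour of base64: '+' -> '-', '=' -> '_', '/' -> '~'
    return base64.b64encode(data).decode().replace("+", "-").replace("=", "_").replace("/", "~")


KEY_PAIR_ID = "APKAEXAMPLE"  # placeholder
signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)
expires = datetime.datetime.utcnow() + datetime.timedelta(days=2)

# Wildcard resource: every object under /media/ on this (placeholder) distribution.
policy = signer.build_policy("https://dxxxxx.cloudfront.net/media/*", date_less_than=expires)

cookies = {
    "CloudFront-Policy": cf_b64(policy.encode("utf-8")),
    "CloudFront-Signature": cf_b64(rsa_signer(policy.encode("utf-8"))),
    "CloudFront-Key-Pair-Id": KEY_PAIR_ID,
}
print(cookies)
```

Your application then sets these three cookies on responses for the pages that embed the protected media, and CloudFront validates them on each request.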
