I need to serve out large files (10-50 GB) at low volumes from an S3 bucket. I am using CloudFront because I need Lambda@Edge to inspect the requests before they hit the S3 bucket.
From the AWS docs here I understand that I pay for Data Transfer Out to the Internet. I also understand that the largest object size for the CloudFront cache is 20 GB.
My question is: for files smaller than 20 GB, does not caching them have any impact on CloudFront cost/pricing? Does setting the Cache-Control header to no-cache mean that the response bypasses CloudFront? I assume there must be some cost associated with caching/storing the file on the edge servers.
There is a similar question here but the answer doesn't discuss cost/pricing of caching specifically.
For objects smaller than 20 GB, preventing CloudFront from caching them (typically by setting Cache-Control to any combination of private, no-cache, or no-store while leaving Minimum TTL at its default value of 0 -- although there are also other ways) has no impact on pricing, because CloudFront doesn't charge anything for cache storage. While this might be a surprise, remember that responses are only cached in the edges through which they are requested, and that CloudFront's cache is a cache, and thus ephemeral: CloudFront can discard cached objects that don't see frequent traffic.
Preventing caching doesn't bypass CloudFront, since CloudFront is the service that is processing the entire request. It just prevents the response from being stored in the cache as it's being returned to the viewer.
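If you do want specific objects to skip the edge cache anyway, one way is to set Cache-Control as object metadata in S3 so CloudFront sees it on the origin response. Here's a minimal sketch with boto3, assuming a hypothetical bucket, key, and local file name; setting the metadata at upload time also sidesteps the 5 GB limit that applies to rewriting metadata on an existing object with CopyObject:

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical bucket, key, and local file name, for illustration only.
# upload_file handles multipart uploads automatically, so it works for
# the 10-50 GB objects described above.
s3.upload_file(
    Filename="archive-15gb.bin",
    Bucket="my-large-files-bucket",
    Key="downloads/archive-15gb.bin",
    # S3 will return this Cache-Control header; with Minimum TTL = 0,
    # CloudFront still serves (and bills) the transfer but won't keep
    # a copy at the edge.
    ExtraArgs={"CacheControl": "no-store"},
)
```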
You will want to verify that objects larger than 20 GB will work at all. The documentation suggests they will not.
Maximum File Size
The maximum size of a response body that CloudFront will return to the viewer is 20 GB. This includes chunked transfer responses that don't specify the Content-Length header value.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorS3Origin.html#ResponseS3MaxFileSize
Related
We are using a Standard Microsoft Azure CDN to serve images for a web application. These images are requested as /api/img?param1=aaa&param2=bbb, so we cache every unique URL. The cache duration is 7 days. We also override the "Cache-Control" header so that the image is only cached for 1 hour by the client browser.
The problem is, the images do not stay in cache for 7 days. The first day after the images have been requested, they seem to be in the CDN (I verify the X-Cache header and it returns "TCP_HIT"); however, if I make the same requests 2-3 days later, around 25% of the images are not cached anymore (the X-Cache header is "TCP_MISS"). The origin server receives and logs the requests, so I am sure they bypass the CDN.
Is there any explanation for this? Do I have to set additional parameters for images to be cached correctly?
We use the following settings:
Caching rules "Cache every unique URL"
Rules Engine:
if URL path begins with /api/img
then Cache expiration: [cache behaviour] Override, [duration] 7 days
and then Modify response header: Overwrite, "Cache-Control", "public, max-age=3600"
From some folks on the CDN Product Group:
For all but the Verizon Premium SKU, the max-age and the cache expiration are one and the same thing, so the Cache-Control override (1 hour) wins over the 7-day cache expiration rule.
The CDN reserves the right to flush entries from the CDN if they are not used - cache items are evicted using an LRU algorithm.
The Verizon Premium SKU offers the ability to have two different age values: one for browser-to-edge (the "External Max-Age") and one for edge-to-source (the original expiration time, or a forced override time - see the docs).
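A practical way to see which age value is actually governing the edge is to re-request an image over a few days and watch the X-Cache header. Here's a minimal sketch with Python's requests library; the endpoint URL is hypothetical:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical CDN endpoint URL, for illustration only.
url = "https://myendpoint.azureedge.net/api/img?param1=aaa&param2=bbb"

resp = requests.get(url)

# Azure CDN reports the edge cache result in the X-Cache response header:
# "TCP_HIT" means the image came from the edge cache, "TCP_MISS" means the
# request went back to the origin (for example after LRU eviction).
print(resp.status_code, resp.headers.get("X-Cache"))
```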
I have a CloudFront distribution to cache my images. My origin server is NOT S3, it's some server I run.
I use these images in my website (taking advantage of CF caching). Now, to explain the problem, let's assume my home page uses an image called banner.png.
I visit my home page, let's say from Chrome, for the first time - for banner.png it's a cache miss, so it gets fetched from the origin and cached in CF.
After this I visit my page from Firefox, Opera, and Chromium, and GET "banner.png" using Postman - all of these get me the file from the CF cache.
Now I GET "banner.png" using Insomnia (another REST client) - this time CF doesn't serve it from cache, it goes back to the origin to get the image and replies with **"x-cache: RefreshHit from cloudfront"**.
The difference between these two sets of clients is that the first set sends an "Accept-Encoding: gzip" header in the request and the second does not.
In my CF behaviour:
"Cache Based on Selected Request Headers" = NONE
"Compress Objects Automatically" = NO
Any pointers?
CloudFront keeps two different cached copies of an object, based on Accept-Encoding:
One if the request header contains Accept-Encoding: gzip
One if Accept-Encoding has any other value or the header is absent
You can test it using curl: make one request without Accept-Encoding and a second request with Accept-Encoding: gzip, and you'll see a Miss from CloudFront on the second even though the first was already cached. This is expected with CloudFront.
The reason is that CloudFront supports only gzip compression, and it takes this header into consideration to know whether it needs to compress the response or not.
However, your problem seems different: you're seeing a RefreshHit from CloudFront, which happens when the CloudFront TTL/max-age expires and CloudFront makes a conditional GET to the origin to check whether the content has been modified.
Ideally, it should be a Miss from CloudFront if no Accept-Encoding header is present.
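If you want to reproduce the two cache entries without curl, the same check can be scripted. Here's a minimal sketch with Python's requests library; the distribution URL is hypothetical, and note that requests adds its own Accept-Encoding header unless you suppress it:

```python
import requests  # pip install requests

# Hypothetical CloudFront URL, for illustration only.
url = "https://d1234567890.cloudfront.net/banner.png"

# requests sends "Accept-Encoding: gzip, deflate" by default; passing None
# suppresses the header so we can reproduce the clients that omit it.
variants = {
    "with gzip": {"Accept-Encoding": "gzip"},
    "without Accept-Encoding": {"Accept-Encoding": None},
}

for name, headers in variants.items():
    resp = requests.get(url, headers=headers)
    # CloudFront keeps a separate cache entry per variant, so expect a Miss
    # the first time each variant is requested and a Hit on repeats.
    print(f"{name}: x-cache = {resp.headers.get('x-cache')}")
```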
Edit: This is all mysteriously working now, although I wasn't able to figure out what actually caused the issue. Might not have been the CDN at all? Leaving this here for posterity and will update if I ever see this kind of thing happen again...
I've been experimenting with using Azure CDN (Microsoft hosted, not Akamai or Verizon) to handle file downloads for a couple of Azure Web Apps, and it's been working fine until today, when it began returning truncated versions of a "large file", resulting in a PDF file that couldn't be opened (by "large file" I'm specifically referring to Azure CDN's Large File Optimisation feature).
The file works fine from the origin URL and is 8.59mb, but the same file retrieved from the CDN endpoint is exactly 8mb. Which, by a suspicious coincidence, happens to be the same as the chunk size used by the Large File Optimisation feature mentioned above. Relevant part of the documentation:
Azure CDN Standard from Microsoft uses a technique called object chunking. When a large file is requested, the CDN retrieves smaller pieces of the file from the origin. After the CDN POP server receives a full or byte-range file request, the CDN edge server requests the file from the origin in chunks of 8 MB.
... This optimization relies on the ability of the origin server to support byte-range requests
File URLs in question:
Origin
CDN
I've also uploaded the same file directly into the website's filesystem to rule out the CMS (Umbraco) and its blob-storage-filesystem stuff interfering, but it's the exact same result anyway. Here are the links for reference.
Origin
CDN
In both cases the two files are binary identical except that the file from the CDN abruptly stops at 8mb, even though the origin supports byte-range requests (verified with Postman) and the Azure CDN documentation linked above claims that
If not all the chunks are cached on the CDN, prefetch is used to request chunks from the origin
And that
There are no limits on maximum file size.
The same issue has occurred with other files over 8mb too, although these had previously worked as of last week. No changes to CDN configuration had been made since then.
What I'm thinking is happening is something like:
Client requests file download from CDN
CDN figures out that it's a "large file" and requests the first 8mb chunk from the origin
Origin replies with an 8mb chunk as requested
CDN begins returning 8mb chunk to client
CDN either doesn't request the next chunk, or origin doesn't provide it
Client only receives the first 8mb of the file
Or perhaps I'm barking up the wrong tree. Already tried turning off compression, not really sure where else to go from here. This is probably my fault, so have I misconfigured the CDN or something? I've considered purging the CDN's cache but I don't really see that as a solution and would rather avoid manual workarounds...
This optimization relies on the ability of the origin server to support byte-range requests; if the origin server doesn't support byte-range requests, requests to download data greater than 8mb size will fail.
https://learn.microsoft.com/en-us/azure/cdn/cdn-large-file-optimization
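It's worth confirming that the origin really does honor Range headers end to end, not just that it advertises support. Here's a minimal sketch with Python's requests library, using a hypothetical origin URL:

```python
import requests  # pip install requests

# Hypothetical origin URL, for illustration only.
url = "https://example-origin.azurewebsites.net/media/large-document.pdf"

# Ask for only the first byte. An origin that fully supports byte-range
# requests should answer 206 Partial Content with a Content-Range header.
# A 200 with the whole body means Range is being ignored, which breaks the
# CDN's 8 MB object chunking for anything larger than a single chunk.
resp = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True)
print(resp.status_code)
print(resp.headers.get("Accept-Ranges"))
print(resp.headers.get("Content-Range"))
resp.close()
```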
I set up an Azure Verizon Premium CDN a few days ago as follows:
Origin: An Azure web app (.NET MVC 5 website)
Settings: Custom Domain, no geo-filtering
Caching Rules: standard-cache (doesn't care about parameters)
Compression: Enabled
Optimized for: Dynamic site acceleration
Protocols: HTTP, HTTPS, custom domain HTTPS
Rules: Force HTTPS via Rules Engine (if request scheme = http, 301 redirect to https://{customdomain}/$1)
So - this CDN has been running for a few days now, but the ADN reports are saying that nearly 100% (99.36%) of the cache status is "CONFIG_NOCACHE" (Description: "The object was configured to never be cached in accordance with customer-specific configurations residing on the edge servers, so the response was served via the origin server.") A few (0.64%) of them are "NONE" (Description: "The cache was bypassed entirely for this request. For instance, the request was immediately rejected by the token auth module, or the client request method used an uncacheable request method such as "PUT".") Also, in the "Cache Hit" report, it says "0 hits, 0 misses" for every day. Nothing is coming through the "HTTP Large" side, only "ADN".
I couldn't find these exact messages while searching around, but I've tried:
Updating cache-control header to max-age, public (ie: cache-control: public,max-age=1209600)
Updating the cache-control header to max-age (cache-control: max-age=1209600)
Updating the expires header to a date way in the future (expires: Tue, 19 Jan 2038 03:14:07 GMT)
Using different browsers so the request cache info is different. In Chrome, the request is "cache-control: no-cache" in my browser. In Firefox, it'll say "Cache-Control: max-age=0". In any case, I'd assume the users on the website wouldn't have these same settings, right?
Refreshing the page a bunch of times, and looking at the real time report to see hits/misses/cache statuses, and it shows the same thing - CONFIG_NOCACHE for almost everything.
Tried running a "worldwide" speed test on https://www.dotcom-tools.com/website-speed-test.aspx, but that had the same result - a bunch of "NOCACHE" hits.
Tried adding ADN rules to set the internal and external max age to 864000 sec (10 days).
Tried adding an ADN rule to ignore "no-cache" requests and just return the cached result.
So, the message for "NOCACHE" says it's a node configuration issue... but I haven't really even configured it! I'm so confused. It could also be an application issue, but I feel like I've tried all the different permutations of "cache-control" that I can. Here's an example of one file that I'd expect to be cached:
Ultimately, I would hope that most of the requests are being cached, so I'd see most of the requests be "TCP Hit". Maybe that's incorrect? Thanks in advance for your help!
So, I eventually figured out this issue. Apparently the Azure Verizon Premium CDN ADN platform has "bypass cache" enabled by default.
To disable this behavior you need to add additional features to your caching rules.
Example:
IF Always
Features:
Bypass Cache Disabled
Force Internal Max-Age Response 200 864000 Seconds
Ignore Origin No-Cache 200
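Once the rule has propagated (which can take a while on the Verizon platform), a quick sanity check is to fetch the same asset twice and compare timings and headers. Here's a minimal sketch with Python's requests library; the asset URL is hypothetical:

```python
import time
import requests  # pip install requests

# Hypothetical asset on the CDN endpoint, for illustration only.
url = "https://mysite.azureedge.net/Content/site.css"

# Fetch the same object twice and compare timings and response headers.
# Once "Bypass Cache" is disabled and the rule is live, the second request
# should be noticeably faster and the reports should start showing cache
# hits instead of CONFIG_NOCACHE.
for attempt in (1, 2):
    start = time.monotonic()
    resp = requests.get(url)
    elapsed = time.monotonic() - start
    print(f"attempt {attempt}: {resp.status_code} in {elapsed:.3f}s, "
          f"Cache-Control={resp.headers.get('Cache-Control')}")
```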
I am using Amazon CloudFront to deliver some HDS files. I have an origin server which checks the HTTP Referer header and blocks the request if the referer is not allowed.
The problem is that CloudFront is removing the Referer header, so it is not forwarded to the origin.
Is it possible to tell Amazon not to do it?
Within days of writing the answer below, changes have been announced to CloudFront. CloudFront will now pass through headers you select and can add some headers of its own.
However, much of what I stated below remains true. Note that in the announcement, an option is offered to forward all headers, which, as I suggested, would effectively disable caching. There's also an option to forward specific headers, which will cause CloudFront to cache the object against the complete set of forwarded headers -- not just the URI -- meaning that the effectiveness of the cache is somewhat reduced, since CloudFront has no option but to assume that the inclusion of the header might modify the response the server will generate for that request.
Each of your CloudFront distributions now contains a list of headers that are to be forwarded to the origin server. You have three options:
None - This option requests the original behavior.
All - This option forwards all headers and effectively disables all caching at the edge.
Whitelist - This option gives you full control of the headers that are to be forwarded. The list starts out empty and grows as you add more headers. You can add common HTTP headers by choosing them from a list. You can also add "custom" headers by simply entering the name.
If you choose the Whitelist option, each header that you add to the list becomes part of the cache key for the URLs associated with the distribution. Adding a header to the list simply tells CloudFront that the value of the header can affect the content returned by the origin server.
http://aws.amazon.com/blogs/aws/enhanced-cloudfront-customization/
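For reference, the whitelist ends up in the distribution's cache behavior configuration. Here's a minimal sketch of the relevant fragment in the legacy ForwardedValues style (for example as part of the DistributionConfig passed to boto3's create_distribution or update_distribution); the origin ID is hypothetical and all other required distribution fields are omitted:

```python
# Fragment of a CloudFront cache behavior showing the Whitelist option.
# Field names follow the classic CloudFront API; "my-origin" is a
# hypothetical origin ID for illustration only.
default_cache_behavior = {
    "TargetOriginId": "my-origin",
    "ViewerProtocolPolicy": "allow-all",
    "MinTTL": 0,
    "ForwardedValues": {
        "QueryString": False,
        "Cookies": {"Forward": "none"},
        "Headers": {
            "Quantity": 1,
            # Forwarded to the origin and added to the cache key.
            "Items": ["Referer"],
        },
    },
}
```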
CloudFront does remove the Referer header along with several others that are not particularly meaningful -- or whose presence would cause illogical consequences -- in the world of cached content.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html
Just like cookies, if the Referer: header were allowed to remain, such that the origin could see it and react to it, that would imply that the object should be cached based on the request plus the referring page, which would seem to largely defeat the cacheability of objects. Otherwise, if the origin did react to an undesired referer and sent no-cache responses, that would be all well and good until the first legitimate request came in, the response to which would be served to subsequent requesters regardless of their referer, also largely defeating the purpose.
RFC-2616 Section 13 requires that a cache return a response that has been "checked for equivalence with what the origin server would have returned," and this implies that the response be valid based on all headers in the request.
The same thing goes for User-agent and other headers an origin server might use to modify its response... if you need to react to these values at the origin, there's little obvious purpose for serving them with a CDN.
Referring page-based tests are quite a primitive measure, the way many people use them, since headers are so trivial to forge.
If you are dealing with a platform that you don't control, and this is something you need to override (with a dummy value, just to keep the existing system "happy"), then a reverse proxy in front of the origin server could serve such a purpose, with CloudFront using the reverse proxy as its origin.
In today's newsletter Amazon announced that it is now possible to forward request headers with CloudFront. See: http://aws.amazon.com/de/about-aws/whats-new/2014/06/26/amazon-cloudfront-device-detection-geo-targeting-host-header-cors/