What are HTTP cache best practices for a high-traffic static site?

We have a fairly high-traffic static site (i.e. no server code), with lots of images, scripts, css, hosted by IIS 7.0
We'd like to turn on some caching to reduce server load, and are considering setting the expiry of web content to be some time in the future. In IIS, we can do this on a global level via the "Expire web content" section of the common HTTP headers in the IIS response header module. Perhaps setting content to expire 7 days after serving.
All this actually does is set the max-age directive in the Cache-Control response header, so far as I can tell, which makes sense, I guess.
Now, the confusion:
Firstly, all browsers I've checked (IE9, Chrome, FF4) seem to ignore this and still make conditional requests to the server to see if content has changed. So, I'm not entirely sure what the max-age directive will actually affect. Could it be older browsers? Or web caches?
It is possible that we may want to change an image in the site at short notice. I'm guessing that if max-age is actually honoured by something, then by its very nature it won't check whether this image has changed for 7 days, so that's not what we want either.
I wonder if a best practice is to partition one's site into folders of content that really won't change often, and only turn on long-term expiry for these folders? Perhaps varying the query string to force a refresh of content in these folders if needed (e.g. /assets/images/background.png?version=2)?
Anyway, having looked through the (rather dry!) HTTP specification, and some of the tutorials, I still don't really have a feel for what's right in our situation.
Any real-world experience of a situation similar to ours would be most appreciated!

Browsers fetch the HTML first, then all the resources inside (css, javascript, images, etc).
If you make the HTML expire soon (e.g. 1 hour or 1 day) and then make the other resources expire after 1 year, you can have the best of both worlds.
When you need to update an image, or other resource, you just change the name of that file, and update the HTML to match.
The next time the user gets fresh HTML, the browser will see a new URL for that image, and get it fresh, while grabbing all the other resources from a cache.
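The short-lived HTML plus long-lived, renamed resources scheme described above can be sketched as two small helpers. This is only an illustration and not IIS configuration: the function names cacheControlFor and versionedName, the one-hour and one-year lifetimes, and the ".v2" naming pattern are all assumptions made for the example.

```javascript
// Lifetimes in seconds (assumed values for illustration).
const ONE_HOUR = 60 * 60;
const ONE_YEAR = 365 * 24 * 60 * 60;

// Pick a Cache-Control value: short for HTML, long for everything else.
function cacheControlFor(path) {
  const isHtml = path.endsWith("/") || /\.html?$/i.test(path);
  return "public, max-age=" + (isHtml ? ONE_HOUR : ONE_YEAR);
}

// Build a new file name for a changed resource, e.g. for version 2:
// "background.png" becomes "background.v2.png".
function versionedName(fileName, version) {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return fileName + ".v" + version; // no extension to preserve
  return fileName.slice(0, dot) + ".v" + version + fileName.slice(dot);
}
```

When an image changes, the HTML would reference versionedName("background.png", 2) instead of the old name, so browsers holding the old file in cache simply never ask for it again.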
Also, at the time of this writing (December 2015), Firefox limits the number of concurrent connections to a single server to six. This means that if you have 30 or more resources all hosted on the same host, only six are downloaded at any one time. You can speed this up a bit by serving some resources from a content delivery network (CDN) on a different hostname, so that more of them download in parallel.

Related

Is the max-age directive semantically necessary in the HSTS header?

In trying to understand the HSTS mechanism, I could not wrap my head around the max-age directive. Couldn't the presence or absence of the HSTS header be enough to tell the browser whether to use HTTP or HTTPS?
Browsers could remember "forever" that a site should be contacted through HTTPS upon first contact, until the header disappears in a later response. Plus the preload directive is there to support browsers too.
I could not find anything in the specs explaining this. https://datatracker.ietf.org/doc/html/rfc6797
I feel like I'm missing something, like a specific scenario. This is not a criticism; I'd like to understand why this directive is necessary.

It allows it to be rolled out gradually. It is recommended to set it with a very small max-age first and then grow it if there are no issues. This avoids a real risk of DoS-ing yourself for any non-HTTPS sites. While that is becoming rarer, when this first came out that was a real risk as HTTP was still very much the norm.
Say for example you deployed it on https://www.example.com and that web server also responds to (and sets the HSTS header, with includeSubDomains, on) https://example.com. Now let’s say you haven’t set up HTTPS on http://blog.example.com (it’s an unimportant static only domain) or on http://intranet.example.com (it’s not Internet-facing). Without a max-age you potentially just blocked those sites forever until you can deploy HTTPS to them (which can be trickier than just adding a bit of server config).
And without being able to visit the site, the browser also couldn’t see that the header had since been removed, for the reset you suggest. Plus there’s also the fact that not every resource needs to set the HSTS header - just one is sufficient (though best practice is to set it on every HTTPS resource, including redirects), so the absence of the header is not sufficient to reset it. You explicitly need to set max-age=0.
Of course, nowadays, the recommended approach is HTTPS on all subdomains (and this is pretty much becoming the norm as the public Internet is much more HTTPS now than it was - though still not yet the default) and also on intranet sites (though difficult to be sure if that latter really is the norm across companies large and small).
You are right that this could have been implemented by making max-age optional (instead of mandatory as it is now), so that site owners could remove it from the HSTS header once ready to roll it out fully, but having a default max-age of infinity is pretty dangerous - for the same reasons as given above. Having no default and making the implementor think about it hopefully makes them consider the appropriate value, or at the very least makes them realise the commitment they are making.
Preloading is the way to make it permanent, at which point the max-age attribute is redundant for those user agents implementing preload lists (primarily the most popular browsers).
There is the argument that nothing is permanent in this world - domains come and go and are taken on by new parties who may or may not want to use HTTPS (at least initially) - though as I say with HTTPS becoming the norm that’s less of an issue.
Also, clogging up the browser cache with an infinite policy just because someone visited your website once, years ago, seems kinda rude. Browsers could cap it (which they do - more on which later), but it is better to be explicit about the value you want.
Both of the above reasons btw, are reasons I don’t particularly like preloading HSTS either.
It’s also worth noting that browsers often implement a cap on the max-age (usually because it is stored as a 32-bit integer, for example), so sticking an arbitrarily large value in there is not going to do what you think it will. In fact I recall discussion that one browser (Firefox?) didn’t do bounds checking, so setting a value larger than the maximum actually possible overflowed and so prevented the policy being set at all! As I said previously, preload is the way to go to make it permanent.
A max-age of six months or one year is the recommended best practice and after this there are diminishing returns in terms of security of a larger policy anyway. If a visitor is not visiting your site every 6-12 months (at which point they will refresh the HSTS policy for another max-age seconds) then chances are they’re on a new browser or device anyway.
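As a rough sketch of the gradual rollout described above, the header value for each stage could be built like this. The helper name hstsHeader and the particular stage durations (five minutes, one day, one year) are assumptions for illustration, not values taken from the RFC.

```javascript
// Build a Strict-Transport-Security header value.
function hstsHeader(maxAgeSeconds, options = {}) {
  if (!Number.isInteger(maxAgeSeconds) || maxAgeSeconds < 0) {
    throw new RangeError("max-age must be a non-negative integer");
  }
  const parts = ["max-age=" + maxAgeSeconds];
  if (options.includeSubDomains) parts.push("includeSubDomains");
  if (options.preload) parts.push("preload");
  return parts.join("; ");
}

// Hypothetical rollout schedule: five minutes, one day, then one year.
const rolloutStages = [300, 86400, 31536000].map((s) => hstsHeader(s));

// max-age=0 tells the browser to forget the policy.
const resetHeader = hstsHeader(0);
```

Starting with the smallest stage means that a mistake (such as a subdomain that cannot serve HTTPS yet) expires from browsers within minutes rather than months.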

Cache-Control headers in a Cloudflare Workers site

I'm just testing Workers sites with a Hugo-created static site. It already existed, so I used the docs’ instructions for adapting an existing site. The cache-control headers for the woff2 and css files all show up with no-cache, contrary to what I'd have expected based on https://support.cloudflare.com/hc/en-us/articles/200172516#h_a01982d4-d5b6-4744-bb9b-a71da62c160a. The Workers site in question is https://hosts-test-hugo.brycewray.workers.dev/.
I found the following at https://levelup.gitconnected.com/use-cloudflare-javascript-workers-to-deploy-you-static-generated-site-ssg-1c518e078646 but don't know if it's related:
A Cloudflare Worker is a piece of JavaScript code that runs every time you access a specific route on a website proxied by Cloudflare. The code is executed on every request before they reach Cloudflare’s cache. This means Worker responses are not cached (although requests made by the worker to other web services might be cached with the appropriate caching headers).
Does the site need to have a custom domain — i.e., rather than being a “.workers.dev” URL — before it will have normal caching behavior? Is that even related?
[Note: I am posting this here because I’ve been unsuccessful in getting a response on either the Cloudflare community forum or the Cloudflare subreddit — hoping for better results here.]

Thanks again to Cloudflare’s Kenton Varda for his answer here, without which I’d have been totally stuck. I also thank Brian Li for additional and equally valuable help he provided separately.
Adding the following for others who may find it useful . . .
The remaining problem I encountered was that I didn’t know how to assign differing cache-control values (such as one month, or 2592000) to most of the static assets while leaving the HTML at a much smaller value (like 3600 or 0). Then, a few days later, I found the answer within a comment in Issue #81 for the Cloudflare KV-Asset-Handler repository. So, now, I have the following code within my Worker’s index.js file:
options.cacheControl = {
  browserTTL: 0,
  edgeTTL: 0,
  bypassCache: false // default
}
const filesRegex = /(.*\.(ac3|avi|bmp|br|bz2|css|cue|dat|doc|docx|dts|eot|exe|flv|gif|gz|ico|img|iso|jpeg|jpg|js|json|map|mkv|mp3|mp4|mpeg|mpg|ogg|pdf|png|ppt|pptx|qt|rar|rm|svg|swf|tar|tgz|ttf|txt|wav|webp|webm|webmanifest|woff|woff2|xls|xlsx|xml|zip))$/
if (url.pathname.match(filesRegex)) {
  options.cacheControl.edgeTTL = 2592000
  options.cacheControl.browserTTL = 2592000
}
If you look at that linked issue comment and wonder what’s different: the only thing is that I removed html (and, to be safe, htm) from the list of extensions. As a result, my Worker site’s HTML has zero caching while each CSS, font, or image file has a one-month cache-control setting — exactly the desired result. Note: The vast majority of the site’s images are hosted elsewhere, but I do still host a small number for favicons and fallback in general.
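For anyone who wants to check the extension matching outside a Worker, the same decision can be pulled out into a plain function. This is a sketch under assumptions: the name cacheOptionsFor is made up, the extension list is shortened for readability, and unlike the regex above it matches case-insensitively.

```javascript
const ONE_MONTH = 2592000; // 30 days in seconds

// Shortened extension list; html and htm are deliberately absent.
// The "i" flag (case-insensitive) is an addition not present in the original regex.
const staticAssetRegex = /\.(css|gif|ico|jpe?g|js|png|svg|webp|woff2?)$/i;

// Return the cacheControl options to use for a given URL path.
function cacheOptionsFor(pathname) {
  const ttl = staticAssetRegex.test(pathname) ? ONE_MONTH : 0;
  return { browserTTL: ttl, edgeTTL: ttl, bypassCache: false };
}
```

Inside the Worker, the result of cacheOptionsFor(url.pathname) would be assigned to options.cacheControl.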

By default, Workers Sites does not serve a Cache-Control header, but you can customize it to do so. (EDIT to clarify: Workers Sites are cached on Cloudflare's edge by default, and support "etags" for revalidation. The Cache-Control header controls whether they are also cached in the browser without requiring revalidation.)
Note that Workers Sites works very differently from using Cloudflare with a classic origin server. If you're reading something about caching on Cloudflare, but it doesn't specifically mention Workers Sites, then it probably does not apply to Workers Sites.
With Workers Sites, your site is served by a Cloudflare Worker -- code that runs directly on Cloudflare's servers. So, you have no "origin" server behind Cloudflare, and Cloudflare's cache doesn't work in the normal way. The Worker code is completely responsible for serving the content, including setting any headers like Cache-Control.
In fact, when you create a new Workers Sites project using wrangler, the code for this Worker is generated for you -- but you are allowed to edit it! You can customize the code all you want to do whatever you want. The code for the Worker is found in your project directory under workers-site/index.js. The code looks like this -- in fact, it is initialized as a copy of that file from GitHub.
This worker code depends on a library (npm module) called @cloudflare/kv-asset-handler to do most of the work. This library can be customized to handle caching in various ways through the cacheControl option.
But where do you set this option? Well, in your worker code!
Open up workers-site/index.js and look for the part that looks like this:
async function handleEvent(event) {
  const url = new URL(event.request.url)
  let options = {}
  /**
   * You can add custom logic to how we fetch your assets
   * by configuring the function `mapRequestToAsset`
   */
  // options.mapRequestToAsset = handlePrefix(/^\/docs/)
The comment mentions one way that you can use options to customize how your site is served, but you can also set cacheControl here. Try adding this:
options.cacheControl = {
  browserTTL: 3600 // 1 hour
}
Now re-deploy your site, and you should see assets are served with Cache-Control: max-age=3600. Of course, this means that your content may be cached in people's browsers for up to an hour (3600 seconds); you may prefer a longer or shorter period.
Note that if you aren't a programmer, this may all seem a bit daunting. Workers Sites is really designed for people who want to be able to customize how their sites are served by editing JavaScript code. For those not interested in writing code, you will be limited to the default behavior, which may or may not suit your needs.

Varnish cache and Google Tag Manager

I have no experience with Varnish, so please bear with me.
We have inserted Google Tag Manager into a client's site. The Tag Manager injects Google Analytics tracking code (and nothing else) into the page. The client's technical service provider has now complained that the Tag Manager prevents the Varnish cache from working.
My guess is that this has nothing to do with the tag manager as such but is rather caused by the cookies from Google Analytics - apparently in the default configuration pages with cookies are not cached. However since I'm not very familiar with Varnish I cannot speak with any authority in the matter.
So my question is: is there any reason why Google Tag Manager itself (not any tags inside the tag manager) would invalidate a Varnish cache on each request ? A web search turned up nothing specific regarding Varnish and GTM.
Thank you for your time,
Eike

Google Tag Manager will not interfere with the Varnish cache in any way. The reason is that the requests for Google Tag Manager are sent to google-analytics.com, not your website.
The cookies are then set by google-analytics.com and are only sent between the client's browser and google-analytics.com.
This means that Google Tag Manager does not actually have any effect on your website apart from the initial JavaScript being loaded from there.

In fact, Varnish does not check any cookie that is created through JavaScript; it only looks at the Set-Cookie header of the HTTP response.
The problem you may be having is that, if the DataLayer is placed in the HTML code, the values of its variables do not change, because the page containing them is served from the cache.
To solve this problem, make a separate HTTP call (e.g. via AJAX) that is not cached and returns the variables for the DataLayer.
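A minimal sketch of that separate, uncached call might look like the following. The endpoint "/datalayer.json", the event name "dataLayerRefreshed" and both function names are hypothetical. Note that the cache: "no-store" option only bypasses the browser's HTTP cache; the server would still have to exclude this endpoint from Varnish caching.

```javascript
// Merge freshly fetched values into the data layer array.
function applyDataLayerValues(dataLayer, values) {
  dataLayer.push(Object.assign({ event: "dataLayerRefreshed" }, values));
  return dataLayer;
}

// Fetch the per-visitor values without using the browser cache, then apply them.
// "/datalayer.json" is a hypothetical endpoint that the server would have to
// provide and keep out of the Varnish cache.
async function refreshDataLayer(dataLayer, url = "/datalayer.json") {
  const response = await fetch(url, { cache: "no-store" });
  return applyDataLayerValues(dataLayer, await response.json());
}
```

On a page, this would be called as refreshDataLayer(window.dataLayer) once the array exists, so the cached HTML can stay identical for every visitor.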

Firefox or Chrome plugin to block and filter all outgoing connections

In Firefox or Chrome I'd like to prevent a private web page from making outgoing connections, i.e. if the URL starts with http://myprivatewebpage/ or https://myprivatewebpage/ in a browser tab, then that browser tab must be restricted so that it is allowed to load images, CSS, fonts, JavaScript, XmlHttpRequest, Java applets, flash animations and all other resources only from http://myprivatewebpage/ or https://myprivatewebpage/, i.e. an <img src="http://www.google.com/images/logos/ps_logo.png"> (or the corresponding <script>new Image(...)</script>) must not be able to load that image, because it's not on myprivatewebpage. I need a 100% foolproof solution: not even a single resource outside myprivatewebpage can be accessible, not even at low probability. There must be no resource loading restrictions on web pages other than myprivatewebpage, e.g. http://otherwebpage/ must be able to load images from google.com.
Please note that I assume that the users of myprivatewebpage are willing to cooperate to keep the web page private unless it's too much work for them. For example, they would be happy to install a Chrome or Firefox extension once, and they wouldn't be offended if they see an error message stating that access is denied to myprivatewebpage until they install the extension in a supported browser.
The reason why I need this restriction is to keep myprivatewebpage really private, without exposing any information about its use to webmasters of other web pages. If http://www.google.com/images/logos/ps_logo.png was allowed, then the use of myprivatewebpage would be logged in the access.log of Google's ps_logo.png, so Google's webmasters would have some information how myprivatewebpage is used, and I don't want that. (In this question I'm not interested in whether the restriction is reasonable, but I'm only interested in the technical solutions and its strengths and weaknesses.)
My ideas how to implement the restriction:
Don't impose any restrictions, just rely on the same origin policy. (This doesn't provide the necessary protection, the same origin policy lets all images pass through.)
Change the web application on the server so it generates HTML, JavaScript, Java applets, flash animations etc. which never attempt to load anything outside myprivatewebpage. (This is almost impossibly hard to foolproof everywhere on a complicated web application, especially with user-generated content.)
Over-sanitize the web page using an HTML output filter on the server, i.e. remove all <script>, <embed> and <object> tags, restrict the target of <img src=, <link rel=, <form action= etc. and also restrict the links in the CSS files. (This can prevent all unwanted resources if I can remember all HTML tags properly, e.g. I mustn't forget about <video>. But this is too restrictive: it removes all dynamic web page functionality like JavaScript, Java applets and flash animations; without these most web applications are useless.)
Sanitize the web page, i.e. add an HTML output filter into the webserver which removes all offending URLs from the generated HTML. (This is not foolproof, because there can be a tricky JavaScript which generates a disallowed URL. It also doesn't protect against URLs loaded by Java applets and flash animations.)
Install a HTTP proxy which blocks requests based on the URL and the HTTP Referer, and force all browser traffic (including myprivatewebpage, otherwebpage, google.com) through that HTTP proxy. (This would slow down traffic to other than myprivatewebpage, and maybe it doesn't protect properly if XmlHttpRequest()s, Java applets or flash animations can forge the HTTP Referer.)
Find or write a Firefox or Chrome extension which intercepts all outgoing connections, and blocks them based on the URL of the tab and the target URL of the connection. I've found https://developer.mozilla.org/en/Setting_HTTP_request_headers and thinkahead.js in https://addons.mozilla.org/en-US/firefox/addon/thinkahead/ and http://thinkahead.mozdev.org/ . Am I correct that it's possible to write a Firefox extension using that? Is there such a Firefox extension already?
Some links I've found for the Chrome extension:
http://www.chromium.org/developers/design-documents/extensions/notifications-of-web-request-and-navigation
https://groups.google.com/a/chromium.org/group/chromium-extensions/browse_thread/thread/90645ce11e1b3d86?pli=1
http://code.google.com/chrome/extensions/trunk/experimental.webRequest.html
As far as I can see, only the Firefox or Chrome extension is feasible from the list above. Do you have any other suggestions? Do you have some pointers how to write or where to find such an extension?

I've found https://developer.mozilla.org/en/Setting_HTTP_request_headers and thinkahead.js in https://addons.mozilla.org/en-US/firefox/addon/thinkahead/ and http://thinkahead.mozdev.org/ . Am I correct that it's possible to write a Firefox extension using that? Is there such a Firefox extension already?
I am the author of the latter extension, though I have yet to update it to support newer versions of Firefox. My initial guess is that, yes, it will do what you want:
User visits your web page without plugin. Web page contains ThinkAhead block that would send a simple version header to the server, but this is ignored as plugin is not installed.
Since the server does not see that header, it redirects the client to a page to install the plugin.
User installs plugin.
User visits web page with plugin. Page sends version header to server, so server allows access.
The ThinkAhead block matches all pages that are not myprivatewebpage, and does something like set the HTTP status to 403 Forbidden. Thus:
When the user visits any webpage that is in myprivatewebpage, there is normal behaviour.
When the user visits any webpage outside of myprivatewebpage, access is denied.
If you want to catch bad requests earlier, instead of modifying incoming headers, you could modify outgoing headers, perhaps screwing up "If-Match" or "Accept" so that the request is never honoured.
This solution is extremely lightweight, but might not be strong enough for your concerns. This depends on what you want to protect: given the above, the client would not be able to see blocked content, but external "blocked" hosts might still notice that a request has been sent, and might be able to gather information from the request URL.
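To close the gap noted above, a request would have to be cancelled before it leaves the browser, which is what the extension idea in the question amounts to. The decision rule itself is small and can be sketched as a plain function; the names hostOf and shouldBlock are made up for this example, and wiring it into a browser's request-interception API (such as the webRequest API linked in the question) is left out.

```javascript
// The only host that pages on the private site may load resources from.
const PRIVATE_HOST = "myprivatewebpage";

// Extract the hostname, or null if the URL cannot be parsed.
function hostOf(url) {
  try {
    return new URL(url).hostname;
  } catch (e) {
    return null;
  }
}

// Decide whether a request made by a tab should be cancelled.
function shouldBlock(tabUrl, requestUrl) {
  if (hostOf(tabUrl) !== PRIVATE_HOST) return false; // other pages are unrestricted
  return hostOf(requestUrl) !== PRIVATE_HOST; // private page: same host only
}
```

An unparsable request URL from the private page is blocked rather than allowed, so the rule fails closed.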

How can I prevent Amazon Cloudfront from hotlinking?

I use Amazon Cloudfront to host all my site's images and videos, to serve them faster to my users, who are pretty scattered across the globe. I also apply pretty aggressive forward caching to the elements hosted on Cloudfront, setting Cache-Control to public, max-age=7776000.
I've recently discovered to my annoyance that third party sites are hotlinking to my Cloudfront server to display images on their own pages, without authorization.
I've configured .htaccess to prevent hotlinking on my own server, but haven't found a way of doing this on Cloudfront, which doesn't seem to support the feature natively. And, annoyingly, Amazon's Bucket Policies, which could be used to prevent hotlinking, have effect only on S3; they have no effect on CloudFront distributions. If you want to take advantage of the policies you have to serve your content from S3 directly.
Scouring my server logs for hotlinkers and manually changing the file names isn't really a realistic option, although I've been doing this to end the most blatant offenses.

You can forward the Referer header to your origin
Go to CloudFront settings
Edit Distributions settings for a distribution
Go to the Behaviors tab and edit or create a behavior
Set Forward Headers to Whitelist
Add Referer as a whitelisted header
Save the settings in the bottom right corner
Make sure to handle the Referer header on your origin as well.
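Handling the forwarded Referer header on the origin could look roughly like this. It is a sketch under assumptions: ALLOWED_DOMAIN is a placeholder, the function name is made up, and allowing requests with no Referer at all is a policy choice exposed as a parameter.

```javascript
// Domain allowed to embed the images (placeholder value).
const ALLOWED_DOMAIN = "example.com";

// True when the Referer belongs to the allowed domain or one of its subdomains.
function isAllowedReferer(referer, allowEmpty = true) {
  if (!referer) return allowEmpty; // some clients send no Referer at all
  let host;
  try {
    host = new URL(referer).hostname.toLowerCase();
  } catch (e) {
    return false; // malformed Referer
  }
  return host === ALLOWED_DOMAIN || host.endsWith("." + ALLOWED_DOMAIN);
}
```

Comparing the parsed hostname, rather than searching the raw header for the domain name, avoids accepting look-alike hosts such as example.com.evil.net.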

We had numerous hotlinking issues. In the end we created CSS sprites for many of our images, either adding white space to the bottom/sides or combining images together.
We displayed them correctly on our pages using CSS, but any hotlinks would show the images incorrectly unless they copied the CSS/HTML as well.
We've found that they don't bother (or don't know how).

The official approach is to use signed URLs for your media. For each media item that you want to distribute, you can generate a specially crafted URL that works only within a given time window and from given source IPs.
One approach for static pages is to generate temporary URLs for the media included in that page that are valid for twice the page's caching time. Let's say your page's caching time is 1 day. Every 2 days, the links would be invalidated, which forces hotlinkers to update their URLs. It's not foolproof, as they can build tools to get the new URLs automatically, but it should deter most people.
If your page is dynamic, you don't need to worry about invalidating your page's cache, so you can simply generate URLs that only work for the requester's IP.
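The expiry arithmetic for the static-page case can be sketched as follows. This covers only the timing rule; generating the actual signature is left to the AWS tooling, and both function names are hypothetical.

```javascript
// Expiry time (Unix seconds) for a media URL embedded in a cached page:
// twice the page's cache lifetime, counted from when the page is generated.
function mediaUrlExpiry(pageGeneratedAt, pageCacheSeconds) {
  return pageGeneratedAt + 2 * pageCacheSeconds;
}

// True while a URL with the given expiry should still be accepted.
function isStillValid(expiresAt, now) {
  return now < expiresAt;
}
```

Doubling the lifetime means that a copy of the page served from cache just before it expires still contains media URLs that work for one further caching period.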

As of Oct. 2015, you can use AWS WAF to restrict access to Cloudfront files. Here's an article from AWS that announces WAF and explains what you can do with it. Here's an article that helped me setup my first ACL to restrict access based on the referrer.
Basically, I created a new ACL with a default action of DENY. I added a rule that checks the end of the referer header string for my domain name (lowercase). If it passes that rule, it ALLOWS access.
After assigning my ACL to my Cloudfront distribution, I tried to load one of my data files directly in Chrome and I got this error:

As far as I know, there is currently no solution, but I have a few possibly relevant, possibly irrelevant suggestions...
First: Numerous people have asked this on the Cloudfront support forums. See here and here, for example.
Clearly AWS benefits from hotlinking: the more hits, the more they charge us for! I think we (Cloudfront users) need to start some sort of heavily orchestrated campaign to get them to offer referer checking as a feature.
Another temporary solution I've thought of is changing the CNAME I use to send traffic to cloudfront/s3. So let's say you currently send all your images to:
cdn.blahblahblah.com (which redirects to some cloudfront/s3 bucket)
You could change it to cdn2.blahblahblah.com and delete the DNS entry for cdn.blahblahblah.com
As a DNS change, that would knock out all the people currently hotlinking before their traffic got anywhere near your server: the DNS entry would simply fail to look up. You'd have to keep changing the cdn CNAME to make this effective (say once a month?), but it would work.
It's actually a bigger problem than it seems because it means people can scrape entire copies of your website's pages (including the images) much more easily - so it's not just the images you lose and not just that you're paying to serve those images. Search engines sometimes conclude your pages are the copies and the copies are the originals... and bang goes your traffic.
I am thinking of abandoning Cloudfront in favor of a strategically positioned, super-fast dedicated server (serving all content to the entire world from one place) to give me much more control over such things.
Anyway, I hope someone else has a better answer!

This question mentioned image and video files.
Referer checking cannot be used to protect multimedia resources from hotlinking, because some mobile browsers do not send the Referer header when requesting an audio or video file played using HTML5.
I am sure this is the case for Safari and Chrome on iPhone and Safari on Android.
Too bad! Thank you, Apple and Google.

How about using signed cookies? Create a signed cookie using a custom policy, which supports the various kinds of restrictions you may want to set and also supports wildcards.
