Cache-Control headers in a Cloudflare Workers site


I'm just testing Workers sites with a Hugo-created static site. It already existed, so I used the docs’ instructions for adapting an existing site. The cache-control headers for the woff2 and css files all show up with no-cache, contrary to what I'd have expected based on https://support.cloudflare.com/hc/en-us/articles/200172516#h_a01982d4-d5b6-4744-bb9b-a71da62c160a. The Workers site in question is https://hosts-test-hugo.brycewray.workers.dev/.
I found the following at https://levelup.gitconnected.com/use-cloudflare-javascript-workers-to-deploy-you-static-generated-site-ssg-1c518e078646 but don't know if it's related:
A Cloudflare Worker is a piece of JavaScript code that runs every time you access a specific route on a website proxied by Cloudflare. The code is executed on every request before they reach Cloudflare’s cache. This means Worker responses are not cached (although requests made by the worker to other web services might be cached with the appropriate caching headers).
Does the site need to have a custom domain — i.e., rather than being a “.workers.dev” URL — before it will have normal caching behavior? Is that even related?
[Note: I am posting this here because I’ve been unsuccessful in getting a response on either the Cloudflare community forum or the Cloudflare subreddit — hoping for better results here.]

Thanks again to Cloudflare’s Kenton Varda for his answer here, without which I’d have been totally stuck. I also thank Brian Li for additional and equally valuable help he provided separately.
Adding the following for others who may find it useful . . .
The remaining problem I encountered was that I didn’t know how to assign differing cache-control values (such as one month, or 2592000) to most of the static assets while leaving the HTML at a much smaller value (like 3600 or 0). Then, a few days later, I found the answer within a comment in Issue #81 for the Cloudflare KV-Asset-Handler repository. So, now, I have the following code within my Worker’s index.js file:
// Default: no caching (this is what HTML ends up with)
options.cacheControl = {
  browserTTL: 0,
  edgeTTL: 0,
  bypassCache: false // default
}
// Static asset extensions; html and htm are deliberately left out
const filesRegex = /(.*\.(ac3|avi|bmp|br|bz2|css|cue|dat|doc|docx|dts|eot|exe|flv|gif|gz|ico|img|iso|jpeg|jpg|js|json|map|mkv|mp3|mp4|mpeg|mpg|ogg|pdf|png|ppt|pptx|qt|rar|rm|svg|swf|tar|tgz|ttf|txt|wav|webp|webm|webmanifest|woff|woff2|xls|xlsx|xml|zip))$/
if (url.pathname.match(filesRegex)) {
  // One month, in seconds
  options.cacheControl.edgeTTL = 2592000
  options.cacheControl.browserTTL = 2592000
}
If you look at that linked issue comment and wonder what’s different: the only thing is that I removed html (and, to be safe, htm) from the list of extensions. As a result, my Worker site’s HTML has zero caching while each CSS, font, or image file has a one-month cache-control setting — exactly the desired result. Note: The vast majority of the site’s images are hosted elsewhere, but I do still host a small number for favicons and fallback in general.

By default, Workers Sites does not serve a Cache-Control header, but you can customize it to do so. (EDIT to clarify: Workers Sites are cached on Cloudflare's edge by default, and support "etags" for revalidation. The Cache-Control header controls whether they are also cached in the browser without requiring revalidation.)
Note that Workers Sites works very differently from using Cloudflare with a classic origin server. If you're reading something about caching on Cloudflare, but it doesn't specifically mention Workers Sites, then it probably does not apply to Workers Sites.
With Workers Sites, your site is served by a Cloudflare Worker -- code that runs directly on Cloudflare's servers. So, you have no "origin" server behind Cloudflare, and Cloudflare's cache doesn't work in the normal way. The Worker code is completely responsible for serving the content, including setting any headers like Cache-Control.
In fact, when you create a new Workers Sites project using wrangler, the code for this Worker is generated for you -- but you are allowed to edit it! You can customize the code all you want to do whatever you want. The code for the Worker is found in your project directory under workers-site/index.js. The code looks like this -- in fact, it is initialized as a copy of that file from GitHub.
This worker code depends on a library (npm module) called @cloudflare/kv-asset-handler to do most of the work. This library can be customized to handle caching in various ways through the cacheControl option.
But where do you set this option? Well, in your worker code!
Open up workers-site/index.js and look for the part that looks like this:
async function handleEvent(event) {
  const url = new URL(event.request.url)
  let options = {}
  /**
   * You can add custom logic to how we fetch your assets
   * by configuring the function `mapRequestToAsset`
   */
  // options.mapRequestToAsset = handlePrefix(/^\/docs/)
The comment mentions one way that you can use options to customize how your site is served, but you can also set cacheControl here. Try adding this:
options.cacheControl = {
  browserTTL: 3600 // 1 hour
}
Now re-deploy your site, and you should see assets are served with Cache-Control: max-age=3600. Of course, this means that your content may be cached in people's browsers for up to an hour (3600 seconds); you may prefer a longer or shorter period.
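To check the result after re-deploying, it can help to read the header back and pull out max-age. The following is only a minimal sketch: parseCacheControl and maxAgeSeconds are hypothetical helper names, not part of Workers Sites or kv-asset-handler, and the parsing is deliberately simple (it does not handle every quoting rule in the HTTP spec).

```javascript
// Parse a Cache-Control header value into a plain object of directives.
// Directives without a value (e.g. "public") map to true.
// Hypothetical helper, not part of any Cloudflare library.
function parseCacheControl(headerValue) {
  const directives = {}
  for (const part of (headerValue || '').split(',')) {
    const trimmed = part.trim()
    if (!trimmed) continue
    const eq = trimmed.indexOf('=')
    if (eq === -1) {
      directives[trimmed.toLowerCase()] = true
    } else {
      const name = trimmed.slice(0, eq).trim().toLowerCase()
      // Strip optional surrounding double quotes from the value
      const value = trimmed.slice(eq + 1).trim().replace(/^"|"$/g, '')
      directives[name] = value
    }
  }
  return directives
}

// Return max-age in seconds, or null when the directive is absent or invalid.
function maxAgeSeconds(headerValue) {
  const raw = parseCacheControl(headerValue)['max-age']
  if (typeof raw !== 'string' || !/^\d+$/.test(raw)) return null
  return Number(raw)
}
```

In a browser console or a recent Node version you could feed it the value of response.headers.get('cache-control') from a fetch of one of your assets; with the browserTTL above, maxAgeSeconds should report 3600.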
Note that if you aren't a programmer, this may all seem a bit daunting. Workers Sites is really designed for people who want to be able to customize how their sites are served by editing JavaScript code. For those not interested in writing code, you will be limited to the default behavior, which may or may not suit your needs.

Related

Node js - Bundler for http2

I'm currently using Babel to transform ES6 code to ES5 and browserify to bundle it for use in the browser. Now I've begun using an HTTP/2 server (Nginx).
HTTP/2 is more effective when it can load multiple small files instead of one big bundle.
How to best serve multiple js files instead of one big bundle?

I know that SystemJS can load multiple files in development without bundling, and for production you can use a DepCache to define the dependence trees of the modules you are importing
https://github.com/systemjs/systemjs/blob/master/docs/production-workflows.md
This approach would require you to ditch browserify and switch to SystemJS, as browserify only produces bundles.

I see that you haven't received an answer to your question yet, so I'll try to help, even though HTTP/2 is new to me too (which explains the length of my answer :-)).
Good information about HTTP/2 can be found at https://blog.cloudflare.com/http-2-for-web-developers/. To summarize briefly:
stop concatenating files
stop inlining assets
stop sharding domains
continue minifying CSS/JavaScript files
continue loading from CDNs
continue DNS prefetching via <link rel='dns-prefetch' href='...' /> included in <head>
...
I want to add two more points about the importance of setting the HTTP headers Cache-Control and Link:
think about setting caching HTTP headers (especially Cache-Control with max-age, Expires and ETag) on all the content of your page. See details below. I strongly recommend reading the Caching Tutorial.
set the Link HTTP header to use the server push feature of HTTP/2.
Setting the Link HTTP header is important for using the server push feature of HTTP/2 (see here, here). RFC 5988 and Section 19.6.1.2 of RFC 2068 describe the header, which already existed in HTTP 1.1. Everybody knows Content-Type: application/json, but in the same way one can set the lesser-known Link: <...>; rel=prefetch, described here. For example, one can use
Link: </app/script.js>; rel=preload; as=script
Link: </fonts/font.woff>; rel=preload; as=font
Link: </app/style.css>; rel=preload; as=style
Such Link headers, set on the response for an HTML page (like index.html), tell the HTTP server to push the resources together with the response for the HTML page. As a result you save the round trips of the later requests (made after parsing the HTML), and the resources can be displayed immediately. You can consider setting Link headers for all the images on your page to make the page render sooner. See here for additional information with nice pictures that demonstrate the advantage of HTTP/2 server push. If you use PHP, then the code could be interesting for you.
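If you generate these headers in code rather than in the server configuration, the header value can be assembled from a list of assets. This is just a sketch with a hypothetical helper name (buildPreloadLinkHeader); it joins several links into one comma-separated Link value, which is equivalent to sending several Link headers, and whether your server actually turns them into pushes depends on its configuration.

```javascript
// Build a single Link header value from a list of assets to preload.
// Each asset is { href, as }, e.g. { href: '/app/script.js', as: 'script' }.
// Hypothetical helper for illustration only.
function buildPreloadLinkHeader(assets) {
  return assets
    .map(({ href, as }) => `<${href}>; rel=preload; as=${as}`)
    .join(', ')
}

// The three resources from the example above
const linkHeader = buildPreloadLinkHeader([
  { href: '/app/script.js', as: 'script' },
  { href: '/fonts/font.woff', as: 'font' },
  { href: '/app/style.css', as: 'style' },
])
```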
Most web developers perform some optimization steps directly or indirectly. The steps are done either during the build process or by setting HTTP headers in HTTP responses. With HTTP/2 one has to review these steps, switching some off and adding others. I'll try to summarize my results.
you can consider using webpack instead of browserify to exclude some dependencies from bundling. I don't know browserify well enough, but I know that webpack supports externals (see here), which allows some modules to be loaded from a CDN. As a next step you can remove bundling altogether, but still minify your modules and set Cache-Control on all of them.
It's strongly recommended to load the CSS/JS/fonts that you use but didn't develop yourself from a CDN. You should never merge such resources with your own JavaScript files (which is probably what you do with browserify now). Loading Bootstrap CSS from your own server is not a good idea. It's better to follow the advice from here and use a CDN instead of serving all the files locally.
The main reason for using a CDN is easy to understand if you examine the HTTP headers of the response from https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.1/jquery.min.js, for example. You will find something like cache-control: public, max-age=30672000 and expires:Mon, 06 Mar 2017 21:25:04 GMT. Chrome will typically show Status Code:200 (from cache) and you will see no traffic over the wire. If you explicitly reload the page (by pressing F5), you will see a response of 222 bytes with Status Code:304. In other words, the file typically isn't downloaded at all. jQuery 2.2.1 stays the same forever; the next version will have a different URL. Using HTTPS makes sure that the user really loads jQuery 2.2.1. If that's not enough, you can use https://www.srihash.org/ to calculate a sha384 value and use the extended form of <link> or <script>:
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.1/jquery.min.js"
integrity="sha384-8C+3bW/ArbXinsJduAjm9O7WNnuOcO+Bok/VScRYikawtvz4ZPrpXtGfKIewM9dK"
crossorigin="anonymous"></script>
If the user opens your page with such a tag, the sha384 hash will be recalculated and verified (by Chrome and Firefox). If the file is not yet in the local cache, it will still be loaded really quickly. One short remark: loading the same file from https://code.jquery.com/jquery-2.2.1.min.js uses HTTP 1.1 today, but loading it from https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.1/jquery.min.js uses the HTTP/2 protocol. I recommend checking the protocol when choosing a CDN. You can find here the list of CDNs which now support HTTP/2. In the same way, loading Bootstrap from https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css uses HTTP 1.1 today, but loading the same data from https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css uses HTTP/2.
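For files you host yourself, an integrity value of the kind shown in the script tag above can also be computed locally instead of through srihash.org. A minimal Node.js sketch, assuming the file content is already in memory; sriSha384 and scriptTag are hypothetical helper names, and only Node's built-in crypto module is used.

```javascript
const nodeCrypto = require('crypto')

// Compute a Subresource Integrity value ("sha384-" + base64 digest)
// for the given file content (string or Buffer).
function sriSha384(content) {
  const digest = nodeCrypto.createHash('sha384').update(content).digest('base64')
  return `sha384-${digest}`
}

// Render a script tag carrying the integrity attribute.
// Hypothetical helper; src is assumed to be a trusted URL string.
function scriptTag(src, content) {
  return `<script src="${src}" integrity="${sriSha384(content)}" crossorigin="anonymous"></script>`
}
```

In a build step you would typically pass the output of fs.readFileSync for each published file, so the hash always matches the bytes you actually serve.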
I've spent a lot of time on CDNs to make clear that the main advantages of a CDN are the caching headers set on the HTTP response and the use of immutable URLs. You can do the same for your own modules.
One should think about the caching time of every piece of content returned from the server. You can give your modules URLs that contain the version number of the component (like /script/mycomponent1.1.12341) and change the last part of the version number every time the module changes. You can then set a long enough max-age in Cache-Control, and your components will be cached by the client's web browser.
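The versioned-URL idea can be sketched in a few lines. The helper names (bumpLastPart, versionedUrl) are made up for illustration, and the sketch assumes a dotted version string whose last part is a number, as in the example URL above.

```javascript
// Increment the last numeric part of a dotted version string,
// e.g. '1.1.12341' becomes '1.1.12342'. Hypothetical helper.
function bumpLastPart(version) {
  const parts = version.split('.')
  const last = Number(parts[parts.length - 1])
  if (!Number.isInteger(last)) {
    throw new Error(`Cannot bump version: ${version}`)
  }
  parts[parts.length - 1] = String(last + 1)
  return parts.join('.')
}

// Append the version to the module path so every release gets a new URL.
function versionedUrl(basePath, version) {
  return `${basePath}${version}`
}
```

Because the URL changes whenever the module changes, the old URL can safely stay cached for as long as max-age allows.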
Finally, I'd recommend verifying that you have installed the latest versions of OpenSSL and nginx. I also recommend testing your web site at http://www.webpagetest.org/ and https://www.ssllabs.com/ssltest/ to be sure that you haven't forgotten any simple steps.

How to prevent IIS from sending cache headers with ASHX files

My company uses ASHX files to serve some dynamic images. Because the content type is image/jpeg, IIS sends headers with them as would be appropriate for static images.
Depending on settings (I don't know all of the settings involved, hence the question) the headers may be any of:
Last-Modified, ETag, Expires
This causes the browser to treat them as cacheable, which leads to all sorts of bugs with the user seeing stale images.
Is there a setting that I can set somewhere that will cause ASHX files to behave the same way as other dynamic pages, like ASPX files? Short of that, is there a setting that will allow me to, across the board, remove Last-Modified, ETag, Expires, etc. and add a no-cache header instead?

Only solutions I've found were:
1) Adding Response.CacheControl = "no-cache" to each handler.
I don't like this because this requires all of the handlers to change and for all developers to be aware of it.
2) Setting HTTP Header override on a folder where the handlers live
I don't like this one because it requires the handlers to be in their own directory. While this may be good practice in general, unfortunately our application is not structured that way, and I cannot just move them because it would break client-facing links.
If nobody provides a better answer I'll have to accept that these are the only two choices.

Add a random generated string to the request query. This will trick the browser into thinking it is a different call. Example: document.getElementById("myimgcontl").src="myimages.ashx?15923763";.
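A small helper keeps this consistent across pages. This is only a sketch: cacheBustedUrl is a hypothetical name, the "_" parameter name is an arbitrary choice (the example above uses a bare number instead), and URLs containing a #fragment are not handled.

```javascript
// Append a changing token to a URL so the browser treats it as a new resource.
// token defaults to the current time in milliseconds; pass your own for testing.
function cacheBustedUrl(url, token = Date.now()) {
  const separator = url.includes('?') ? '&' : '?'
  return `${url}${separator}_=${token}`
}
```

In the page it would be used as document.getElementById("myimgcontl").src = cacheBustedUrl("myimages.ashx").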

What is HTTP cache best practices for high-traffic static site?

We have a fairly high-traffic static site (i.e. no server code), with lots of images, scripts, css, hosted by IIS 7.0
We'd like to turn on some caching to reduce server load, and are considering setting the expiry of web content to be some time in the future. In IIS, we can do this on a global level via "Expire web content" section of the common http headers in the IIS response header module. Perhaps setting content to expire 7 days after serving.
All this actually does is sets the max-age HTTP response header, so far as I can tell, which makes sense, I guess.
Now, the confusion:
Firstly, all browsers I've checked (IE9, Chrome, FF4) seem to ignore this and still make conditional requests to the server to see if content has changed. So, I'm not entirely sure what the max-age response header will actually affect?! Could it be older browsers? Or web-caches?
It is possible that we may want to change an image in the site at short notice... I'm guessing that if max-age is actually honored by something, then by its very nature it won't check whether this image has changed for 7 days... so that's not what we want either
I wonder if a best practice is to partition one's site into folders of content that really won't change often and only turn on some long-term expiry for these folders? Perhaps to vary the querystring to force a refresh of content in these folders if needed (e.g. /assets/images/background.png?version=2) ?
Anyway, having looked through the (rather dry!) HTTP specification, and some of the tutorials, I still don't really have a feel for what's right in our situation.
Any real-world experience of a situation similar to ours would be most appreciated!

Browsers fetch the HTML first, then all the resources inside (css, javascript, images, etc).
If you make the HTML expire soon (e.g. 1 hour or 1 day) and then make the other resources expire after 1 year, you can have the best of both worlds.
When you need to update an image, or other resource, you just change the name of that file, and update the HTML to match.
The next time the user gets fresh HTML, the browser will see a new URL for that image, and get it fresh, while grabbing all the other resources from a cache.
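Expressed as code, the policy is a single decision per request path. This is a sketch only: cacheControlForPath is a hypothetical name, the 1 hour and 1 year values are the examples from above, and on IIS you would configure the equivalent per folder or file type rather than in JavaScript.

```javascript
const ONE_HOUR = 60 * 60 // 3600 seconds
const ONE_YEAR = 365 * 24 * 60 * 60 // 31536000 seconds

// Pick a Cache-Control value: short for HTML, long for everything else.
// Paths ending in "/" are assumed to resolve to an HTML index page.
function cacheControlForPath(pathname) {
  const isHtml = pathname.endsWith('/') || /\.html?$/i.test(pathname)
  return `max-age=${isHtml ? ONE_HOUR : ONE_YEAR}`
}
```

The long lifetime is only safe because a changed resource gets a new file name, so stale copies are never requested again.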
Also, at the time of this writing (December 2015), Firefox limits the maximum number of concurrent connections to a server to six (6). This means if you have 30 or more resources that are all hosted on the same website, only 6 are being downloaded at any time until the page is loaded. You can speed this up a bit by using a content delivery network (CDN) so that everything downloads at once.

How can I prevent Amazon Cloudfront from hotlinking?

I use Amazon Cloudfront to host all my site's images and videos, to serve them faster to my users, who are pretty scattered across the globe. I also apply pretty aggressive forward caching to the elements hosted on Cloudfront, setting Cache-Control to public, max-age=7776000.
I've recently discovered to my annoyance that third party sites are hotlinking to my Cloudfront server to display images on their own pages, without authorization.
I've configured .htaccess to prevent hotlinking on my own server, but haven't found a way of doing this on Cloudfront, which doesn't seem to support the feature natively. And, annoyingly, Amazon's Bucket Policies, which could be used to prevent hotlinking, take effect only on S3; they have no effect on CloudFront distributions [link]. If you want to take advantage of the policies you have to serve your content from S3 directly.
Scouring my server logs for hotlinkers and manually changing the file names isn't really a realistic option, although I've been doing this to end the most blatant offenses.

You can forward the Referer header to your origin
Go to CloudFront settings
Edit Distributions settings for a distribution
Go to the Behaviors tab and edit or create a behavior
Set Forward Headers to Whitelist
Add Referer as a whitelisted header
Save the settings in the bottom right corner
Make sure to handle the Referer header on your origin as well.
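What "handling the Referer header on your origin" might look like, as a minimal sketch: isAllowedReferer is a hypothetical function, and the allowEmpty switch is an assumption about how you want to treat requests that carry no Referer at all (many legitimate clients omit it).

```javascript
// Decide whether a request's Referer header points at one of our own hosts.
// allowedHosts: e.g. ['example.com'] (subdomains are accepted too).
// allowEmpty: what to return when no Referer header was sent.
function isAllowedReferer(refererHeader, allowedHosts, allowEmpty = true) {
  if (!refererHeader) return allowEmpty
  let host
  try {
    host = new URL(refererHeader).hostname.toLowerCase()
  } catch (err) {
    // Not a parseable absolute URL: treat as not allowed
    return false
  }
  return allowedHosts.some(
    (allowed) => host === allowed || host.endsWith(`.${allowed}`)
  )
}
```

Comparing the parsed hostname, rather than searching the raw header string, avoids being fooled by a third-party URL that merely contains your domain somewhere in its path.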

We had numerous hotlinking issues. In the end we created css sprites for many of our images. Either adding white space to the bottom/sides or combining images together.
We displayed them correctly on our pages using CSS, but any hotlinks would show the images incorrectly unless they copied the CSS/HTML as well.
We've found that they don't bother (or don't know how).

The official approach is to use signed URLs for your media. For each piece of media that you want to distribute, you can generate a specially crafted URL that works only within given time and source-IP constraints.
One approach for static pages is to generate temporary URLs for the media included in that page, valid for twice the page's caching time. Let's say your page's caching time is 1 day. Every 2 days the links would be invalidated, which forces the hotlinkers to update their URLs. It's not foolproof, as they can build tools to fetch the new URLs automatically, but it should deter most people.
If your page is dynamic, you don't need to worry about invalidating the page's cache, so you can simply generate URLs that only work for the requester's IP.
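The time and IP constraints end up in a small JSON policy document that is then signed. The sketch below only builds that policy with the "twice the page's caching time" rule; buildCustomPolicy is a hypothetical helper, the JSON layout follows the CloudFront custom policy format as I understand it (check it against the AWS documentation), and the signing step with your CloudFront key pair is not shown.

```javascript
// Build a CloudFront-style custom policy document as a JSON string.
// resourceUrl: the media URL (may contain wildcards)
// pageCacheSeconds: how long the page embedding the media is cached
// nowSeconds: current Unix time in seconds (passed in to keep this testable)
// sourceIp: optional requester IP to lock the URL to (for dynamic pages)
function buildCustomPolicy(resourceUrl, pageCacheSeconds, nowSeconds, sourceIp) {
  const condition = {
    // Valid for twice the page's caching time
    DateLessThan: { 'AWS:EpochTime': nowSeconds + 2 * pageCacheSeconds },
  }
  if (sourceIp) {
    // /32 restricts the policy to exactly one IPv4 address
    condition.IpAddress = { 'AWS:SourceIp': `${sourceIp}/32` }
  }
  return JSON.stringify({
    Statement: [{ Resource: resourceUrl, Condition: condition }],
  })
}
```

For a page cached for one day, pageCacheSeconds would be 86400 and nowSeconds would be Math.floor(Date.now() / 1000).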

As of Oct. 2015, you can use AWS WAF to restrict access to Cloudfront files. Here's an article from AWS that announces WAF and explains what you can do with it. Here's an article that helped me setup my first ACL to restrict access based on the referrer.
Basically, I created a new ACL with a default action of DENY. I added a rule that checks the end of the referer header string for my domain name (lowercase). If it passes that rule, it ALLOWS access.
After assigning my ACL to my Cloudfront distribution, I tried to load one of my data files directly in Chrome and I got this error:

As far as I know, there is currently no solution, but I have a few possibly relevant, possibly irrelevant suggestions...
First: Numerous people have asked this on the Cloudfront support forums. See here and here, for example.
Clearly AWS benefits from hotlinking: the more hits, the more they charge us for! I think we (Cloudfront users) need to start some sort of heavily orchestrated campaign to get them to offer referer checking as a feature.
Another temporary solution I've thought of is changing the CNAME I use to send traffic to cloudfront/s3. So let's say you currently send all your images to:
cdn.blahblahblah.com (which redirects to some cloudfront/s3 bucket)
You could change it to cdn2.blahblahblah.com and delete the DNS entry for cdn.blahblahblah.com
As a DNS change, that would knock out all the people currently hotlinking before their traffic got anywhere near your server: the DNS entry would simply fail to look up. You'd have to keep changing the cdn CNAME to make this effective (say once a month?), but it would work.
It's actually a bigger problem than it seems because it means people can scrape entire copies of your website's pages (including the images) much more easily - so it's not just the images you lose and not just that you're paying to serve those images. Search engines sometimes conclude your pages are the copies and the copies are the originals... and bang goes your traffic.
I am thinking of abandoning Cloudfront in favor of a strategically positioned, super-fast dedicated server (serving all content to the entire world from one place) to give me much more control over such things.
Anyway, I hope someone else has a better answer!

This question mentioned image and video files.
Referer checking cannot be used to protect multimedia resources from hotlinking, because some mobile browsers do not send the Referer header when requesting an audio or video file played using HTML5.
I am sure of this for Safari and Chrome on iPhone and Safari on Android.
Too bad! Thank you, Apple and Google.

How about using signed cookies? Create a signed cookie using a custom policy, which supports the various kinds of restrictions you may want to set and also supports wildcards.

Implementing HTTP or HTTPS depending on page

I want to implement HTTPS on only a selection of my web pages. I have purchased my SSL certificates etc. and got them working. Despite this, due to speed demands I cannot afford to use HTTPS on every single page.
Instead I want my server to serve up HTTP or HTTPS depending on the page being viewed. An example where this has been done is ‘99designs’.
The problem in slightly more detail:
When my visitors first visit my site they only have access to non-sensitive information and therefore I want them to be presented with simple HTTP.
Then once they log in they are granted access to more sensitive information, e.g. profile information, which is delivered over HTTPS.
Despite being logged in, if the user goes back to a non-sensitive page such as the homepage then I want it delivered using HTTP.
One common solution seems to be using the .htaccess file. The problem is that my site is relatively large, meaning that to use this would require me to write a rule for every page (several hundred) to determine whether it should be served up using HTTP or HTTPS.
And then there is the problem of defining user generated content pages.
Please help,
Many thanks,
David

You've not mentioned anything about the architecture you are using. Assuming that the SSL termination is on the webserver, you should set up separate virtual hosts with completely separate and non-overlapping document trees and, preferably, use a path schema which does not overlap (to avoid little accidents).
