URL Fingerprinting / Aggressive Caching with NGINX + Express

What is the recommended technique for handling aggressive caching and URL fingerprinting in an NGINX (proxy) and Node / Express stack?
Google recommends that you "use fingerprinting to dynamically enable caching" in its best practice guidelines, and this is exactly what I'm trying to achieve.
I've looked at quite a few different approaches to fingerprinting, but I'm struggling to understand under what scenario they will actually generate a new fingerprint and where in the development pipeline fingerprinting is best placed. I had previously assumed that if 'Last-Modified' changes on the file then the server would generate a new fingerprint, but that doesn't seem to be the case so far (unless I've misconfigured something).
Here are a few different approaches:
Runtime Fingerprinting: dactyloscope, static-asset
Build Fingerprinting: asset-rack, node-version-assets
CI Fingerprinting: grunt-fingerprint, grunt-asset-versioning
So a couple of questions I hope someone can answer:
Is fingerprinting even a requirement with ETags in place or are there too many holes in cross-browser support?
Assets should arguably sit on a CDN so is this problem largely deferred to a CDN provider (if so how do you update references without manual involvement)?
How does a new fingerprint get generated without manual cache clear?
Where in the development pipeline should fingerprinting sit? I want to avoid a dependency on the likes of Grunt.js.
I feel like I'm missing something blindingly obvious so if you can answer just one of these questions I'd be really grateful.

Fingerprinting and ETags are separate features for reducing load times.
ETags avoid having to resend an asset if the browser has cached it and the asset has not changed. However, an HTTP round trip is still required: the browser sends an If-None-Match header and gets back 304 Not Modified.
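As a rough illustration of that exchange, here is a minimal Python sketch. The `make_etag` and `respond` helpers and the hash-based ETag scheme are assumptions made for illustration only; NGINX and Express compute their ETags in their own ways.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from the body (hypothetical scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET of `body`."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match == etag:
        # The browser's copy is current, so no body is sent.
        # The request itself still had to be made.
        return 304, headers, b""
    return 200, headers, body
```

Even in the 304 case the browser had to ask the server, and that request is the cost that remains.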
The best way of speeding up an HTTP round trip is to avoid making one at all. When the second page of a website uses the same assets as the first page, and those assets have far-future cache expiry headers, the browser does not need to make even a single round trip for those assets after the first request.
Fingerprinting is the technique of giving each asset a unique name that is derived from its content. Then, when even one bit in an asset (such as a CSS bundle) changes, its name changes, and so a browser will GET the updated asset. And, because fingerprinting uses a cryptographic hash of the contents, the unique name is calculated the same across multiple servers, as long as the asset is identical. Caches everywhere (CDNs, at ISPs, in networking equipment, or in web browsers) are able to keep a copy of each asset, but since the HTML references the unique name of each asset, only the correct version of that asset will ever be served from a cache.
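A minimal Python sketch of deriving such a name. SHA-256 and the `name-<hash>.ext` layout are illustrative assumptions, not necessarily what any of the tools listed in the question do, and `fingerprinted_name` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(path: str, content: bytes, length: int = 12) -> str:
    """Insert a short hash of the content before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}-{digest}{p.suffix}"))
```

Because the name depends only on the bytes of the asset, two servers holding the same file produce the same name, and changing a single byte produces a different one.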
Both ETags and fingerprinting are supported by every browser.
Fingerprinting is not required; it is an optimization. If you are using technologies like Stylus, Browserify, and AngularTemplateCaches that already require a build step, then adding fingerprints is essentially cost-free.
Your HTML pages themselves are not fingerprinted: they keep names like /aboutus rather than something like /aboutus-sfghjs3646dhs73shwsbby3. All the solutions you link to support fingerprinting of JavaScript, CSS, and images, and provide a way to dynamically substitute the fingerprinted name into the HTML. So the HTML will reference /css-hs6hd73ydhs7d7shsh7w until you change a byte in the CSS, and then it will reference /css-37r7dhsh373hd73 (a different file).
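The substitution step can be sketched roughly as follows in Python. This assumes a manifest mapping logical URLs to fingerprinted ones has already been built; `MANIFEST`, `rewrite_html`, and the regex-based approach are hypothetical simplifications, and real tools may hook into the template engine instead.

```python
import re

# Hypothetical manifest: logical URL -> fingerprinted URL.
MANIFEST = {"/css": "/css-hs6hd73ydhs7d7shsh7w"}


def rewrite_html(html: str, manifest: dict) -> str:
    """Swap src/href values found in the manifest for fingerprinted ones."""

    def swap(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        # URLs that are not in the manifest (e.g. /aboutus) pass through.
        return f"{attr}={quote}{manifest.get(url, url)}{quote}"

    return re.sub(r'\b(src|href)=(["\'])(.*?)\2', swap, html)
```

Calling `rewrite_html` on a page that links to /css and /aboutus rewrites only the stylesheet reference; the page link is left alone because it has no manifest entry.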
Fingerprints only need to be generated when the file is modified, which should generally be on server restart or build.
I recommend Asset Rack, which supports lots of asset types, and can serve the fingerprinted assets from RAM or push them to a CDN. It generates all fingerprints each time Express is started up.

Related

How do I know my Subresource Integrity (SRI) tags were not generated from a malicious CDN?

I understand that in Angular, for example, you can add Subresource Integrity (SRI) tags automatically when building the project by using this command:
ng build --subresource-integrity
My question is how do I know when running that command that the hash that was created was not sourced from a malicious CDN?
I feel like it becomes a chicken and egg problem as I might never know?

SRI protects the integrity of loaded resources in a page. It can protect against a compromised store of such resources, but only as long as the page itself is not in that compromised store. The page that loads the resources is special in the sense that it has the opportunity to check SRI, almost like the source of those resources was untrusted, but you need to be able to serve the page from a trusted source.
Note that this has benefits beyond detecting a malicious CDN. What SRI provides is assurance that the resource loaded is the same as what you intended to load when the page was created. A lot of things might happen to resources in something like a CI pipeline before they are deployed: they are stored on intermediate servers during builds, copied to CDNs, and so on. SRI ensures their integrity through all of these steps, all the way to the client browser. But again, only as long as the page that includes those resources is not compromised.
In other words, if the page is stored in the same CDN together with the resources, SRI will not protect against compromise of the CDN. How could it, when you are trusting the CDN to serve all of your content? But it might still protect against compromise of the pipeline through which these files make their way to the CDN.
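For reference, the integrity value itself is just a base64-encoded digest of the file, so it can be recomputed from a copy of the file you trust. A minimal Python sketch, assuming SHA-384 (one of the algorithms SRI allows) and hypothetical helper names:

```python
import base64
import hashlib


def sri_hash(content: bytes) -> str:
    """Return an SRI integrity value: 'sha384-' + base64 of the digest."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def matches(content: bytes, integrity: str) -> bool:
    """Mimic the browser's check: recompute the hash and compare."""
    return sri_hash(content) == integrity
```

If the bytes served later differ from the bytes that were hashed at build time, `matches` returns False, which corresponds to the browser refusing to load the resource.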
Edit: I might have misunderstood the question.
The npm registry signs the packages it serves (originally with PGP), so fetching dependencies is reasonably safe as long as you verify those signatures and only trust keys you have reason to trust.

Varnish - WebTrends statistics

We currently get web analytics for a WordPress site using WebTrends.
If we use a caching mechanism like Varnish, I would assume WebTrends would suddenly report a dramatic reduction in traffic.
Is this correct and, if so, can you avoid this problem and get the correct statistics reported by WebTrends?

In my experience, acceleration caches shouldn't interfere with your Analytics data capture, because the cached content should include all of the on-page data points (such as meta tags) as well as the WT base tag file, which the user's browser will then execute and which will then make the call to the WT data collection server.
By way of a disclaimer, I should add that I haven't got any specific experience with Varnish, but a cache that acts as a barrier to on-page JavaScript executing is basically broken, and I've personally never had a problem with one preventing analytics software from running.
The only conceivable problem I could foresee is a cache that goes to the extent of scanning pages for linked resources (such as the "no javascript" image in the noscript tag), acquiring those resources in advance, and then rewriting the page being served to pull those resources from the cache rather than from the third-party servers. In that case you might end up with spurious "no javascript" records in your data.

Just make sure that your Varnish config is not removing any WebTrends cookies and it should be perfectly OK. By default it does not, but if you use a ready-made WordPress VCL, you may need to exclude these cookies from stripping, along with the WordPress-specific ones, in the configuration.

I need to speed up my site and reduce the number of file calls

My web host is asking me to speed up my site and reduce the number of file calls.
OK, let me explain a little. About 95% of my website's use is as a bridge between my database (on the same hosting) and my Android applications (I have around 30 that need information from my db). The information only goes one way (for now): the app requests a JSON string like the one at this URL:
http://www.guiasitio.com/mantenimiento/applinks/prlinks.php
and this web page is shown in a web view as a welcome message:
http://www.guiasitio.com/movilapp/test.php
This page has some images and jQuery, so I think these are the ones using a lot of memory. They have told me to use some code to cache those files in the visitor's browser to save memory (which I don't really understand). Can someone give me an idea and point me to a tutorial on how to get this done? Can the web view in an Android app cache these files?
All your help is highly appreciated. Thanks

Using a CDN (content delivery network) would be an easy solution if it works well for you. Essentially you are off-loading the work of storing and serving static files (mainly images and CSS files) to another server. In addition to reducing the load on your current server, it will speed up your site because files will be served from a location closer to each site visitor.
There are many good CDN choices. Amazon CloudFront is one popular option, though in my opinion the prize for the easiest service to set up goes to CloudFlare: they offer a free plan, and you simply fill in the details and change the DNS settings on your domain to point to CloudFlare, and you will be up and running.
With some fine-tuning, you can expect to reduce the requests on your server by up to 80%.
I use both Amazon and CloudFlare, with good results. I have found that the main thing to be cautious of is to carefully check all the scripts on your site and make sure they are working as expected. CloudFlare has a simple setting where you can specify the cache settings as well, so there's another detail on your list covered.
Good luck!

How much does a single request to the server cost

I was wondering how much you gain by putting all of your CSS, scripts, and other stuff that needs to be downloaded into one file.
I know that you would gain a lot by using sprites, but at some point it might actually hurt to do that.
For example, my website uses a lot of small icons, and most of the pages have different icons. If I combine all those icons together I might get over 500kb in total, but if I make one sprite per page it is reduced to almost 50kb/page, so that's cool.
But what about scripts (js/css)? How much would I gain by making a script for each page, each of which has just over ~100 lines? Or maybe I wouldn't gain anything at all?
Basically, I want to know how much a single request costs when downloading a file, and whether it is really bad to have many script/image files with today's modern browsers and high-speed connections.
EDIT
Thank you all for your answers. It was hard to choose just one because every answer did answer my question. I chose to reward the one that, in my opinion, answered my question about request cost most directly. I will not accept any single answer as correct because every one of them was.

Multiple requests means more latency, so that will often make a difference. Exactly how costly that is will depend on the size of the response, the performance of the server, where in the world it's hosted, whether it's been cached, etc... To get real measurements you should experiment with your real world examples.
I often use PageSpeed, and generally follow the documented best practices: https://developers.google.com/speed/docs/insights/about.
To try answering your final question directly: additional requests will cost more. It's not necessarily "really bad" to have many files, but it's generally a good idea to combine content into a single file when you can.

Your question isn't answerable in a real generic way.
There are a few reasons to combine scripts and stylesheets.
Browsers using HTTP/1.1 open multiple connections per host (the HTTP/1.1 spec originally suggested 2; modern browsers typically allow around 6). Because almost every site has the actual HTML file and at least one other resource like a stylesheet, script or image, these connections are created right when you load the initial URL, like index.html.
TCP connections are costly to set up. That's why browsers open several connections up front.
Connections are usually limited to a small number, and over HTTP/1.1 each connection can only transfer one file at a time.
That said, you could split your files across multiple hosts (e.g. an additional static.example.com), which increases the number of hosts / connections and can speed up the download. However, this brings additional overhead because of the extra connections and additional DNS lookups.
On the other hand, there are valid reasons to leave your files split.
The most important one is HTTP/2. HTTP/2 uses only a single connection and multiplexes all file downloads over that connection. There are multiple demos online that demonstrate this, e.g. http://www.http2demo.io/
If you leave your files split, they can also be cached separately. If you have just small parts changing, the browser could just reload the changed file and all others would be answered using 304 Not Modified. You should have appropriate caching headers in place of course.
That said, if you have the resources, you could serve all your files separately using HTTP/2 for clients that support it. If you have a lot of older clients, you could fall back to combined files for them when they make requests using HTTP/1.1.

Tricky question :)
Of course, the trivial answer is that more requests take more time, but it is not necessarily that simple.
Browsers open multiple HTTP connections to the same host; see http://sgdev-blog.blogspot.hu/2014/01/maximum-concurrent-connection-to-same.html for details. Because of that, downloading one huge file instead of using parallel downloads is considered a performance bottleneck by http://www.sitepoint.com/seven-mistakes-that-make-websites-slow/
Web servers should use gzip content-encoding whenever possible, so text resources such as HTML, JS and CSS are compressed quite well.
Most of those assets are static content, so a standard web server should use ETag caching on them. That means the next download will be tiny (something like 26 bytes), since the server answers "not changed" instead of sending the 32 KB of JavaScript over again.
Because of the ETag cache, the whole website should be cacheable (I assume you're programming a game or something like that, not some old-school J2EE servlet page).
I would suggest making 2-4 big files and downloading those, if you really want to go for big files.
So to put it together:
If you have only static content, then it is all the same, because ETag caching will short-circuit any real download: the server returns a 304 Not Modified answer.
If you have some generated dynamic content (such as servlet pages), keep the JS and CSS separate, as they can be ETag-cached separately and only the servlet page needs to be downloaded.
Check that your server supports gzip content encoding for compression; this helps a lot :)
If you have multiple pieces of dynamic content (such as multiple dynamically changing images), it makes sense to have them represented as 2-4 separate images to utilize the parallel HTTP connections for download (although I can hardly imagine this use case in real life).
Please ensure that you're not serving static content dynamically. That is, try loading the image in a web browser, open the network traffic view, reload with F5, and check that you get 304 Not Modified from the server instead of 200 OK and real traffic.
The biggest performance optimization is not pulling anything from the server at all, and it comes out of the box if used properly :)

I think @DigitalDan has the best answer.
But the question belies the real one: how do I make my page load faster? Or at least APPEAR to load faster...
I would add something about "above the fold": basically, you want to inline as much as is needed for your page to render the main visible content on the first round trip, as that is what the user perceives as fastest, and make sure nothing else on the page blocks that...
Archibald explains it well:
https://www.youtube.com/watch?v=EVEiIlJSx_Y

How much you gain from any of these approaches will vary with your specific needs, so I will talk about my own case. In my web application we don't combine all files. Instead, we have two types of files: common files that are needed globally across the application, and per-page files that are used only on their own pages. Here is why.
From the request analysis chart for my web application, here is what you need to consider:
The DNS lookup happens only once, as the result is cached after that (and the DNS name might already be cached anyway).
On each request we have:
request start + initial connection + SSL negotiation + time to first byte + content download
The factor that takes up the majority of request time in most cases is the content download, which depends on size. So if I have multiple files that are all needed on every page, I combine them into one file to save the per-request connection overhead. On the other hand, if I have files that are needed only on specific pages, I keep them separate to save content download time on the other pages.
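To make the trade-off concrete, here is a small Python sketch of that sum. All the millisecond figures are made-up placeholders rather than measurements, and the model naively assumes that each request pays every phase and that requests run one after another; connection reuse and parallel downloads would narrow the gap.

```python
from dataclasses import dataclass


@dataclass
class RequestPhases:
    """Per-request timing phases in milliseconds (hypothetical values)."""
    request_start: int
    initial_connection: int
    ssl_negotiation: int
    time_to_first_byte: int
    content_download: int

    def total(self) -> int:
        """Sum of all phases, as in the formula above."""
        return (self.request_start + self.initial_connection
                + self.ssl_negotiation + self.time_to_first_byte
                + self.content_download)


def sequential_total(requests: list) -> int:
    """Total time if the requests are made one after another."""
    return sum(r.total() for r in requests)


# Three small files fetched separately vs. the same bytes in one file.
small = RequestPhases(5, 40, 60, 200, 30)
combined = RequestPhases(5, 40, 60, 200, 90)
separate_ms = sequential_total([small, small, small])
combined_ms = sequential_total([combined])
```

Under these assumptions the fixed phases are paid three times in the separate case and only once in the combined case, while the download time is the same in total.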

This is actually a very relevant question that many web developers face.
I would like to add my answer alongside those of the other contributors.
Introduction before getting to the answer
High-performance websites depend on different factors. Here are some considerations:
Website size
Content type of the website (primarily text, images, video, or a mixture)
Traffic on your website (how many people visit your website on average)
Web host location vs. your primary visitors' location (within your country, your region, or worldwide); it matters a lot if your website is for Europe and your host is in the US.
Web host server (hardware) technology; I prefer SSD disks.
How the web server (software) is set up and optimized
Whether it is a dynamic or static website
If dynamic, how your code and database are structured and designed
By defining your needs you might be able to find the proper strategy.
Regarding your question in general
As regards your website, I recommend you look at Steve Souders' 14 recommendations in his book High Performance Web Sites.
Steve Souders' 14 pieces of advice:
Make fewer HTTP requests
Use a Content Delivery Network (CDN)
Add an Expires Header
Gzip Components
Put Style-sheets at the Top
Put Scripts at the Bottom
Avoid CSS Expressions
Make JavaScript and CSS External if possible
Reduce DNS Lookups
Minify JavaScript
Avoid Redirects
Remove Duplicate Scripts
Configure ETags
Make Ajax Cacheable
Regarding your question
So if we take js/css into consideration, the following will help a lot:
It is better to keep different code in different files.
Example: you might have page1, page2, page3 and page4.
Page1 and page2 use js1 and js2
Page3 uses only js3
Page4 uses all of js1, js2 and js3
So it is a good idea to have the JavaScript in 3 files. You don't want to include everything you have on pages that do not use it.
CSS Sprites
CSS at top and JS at the end
Minifying JavaScript
Put your JavaScript and CSS in external files
CDN: if you use jQuery, for example, do not host a copy on your own website; just use the recommended CDN address.
Conclusion
I am pretty sure there are more details to write about, and not all of this advice needs to be implemented, but it is important to be aware of it. As I mentioned before, I suggest reading this short book, as it gives you more details. And finally, there is no perfect final solution. You need to start somewhere, do your best and improve on it. Nothing is permanent.
Good luck.

The answer to your question is that it really depends.
The ultimate goal of page load optimization is to make your users feel that your page loads fast.
Some suggestions:
Do not merge common library js/css files like jQuery, because they might already have been cached by the browser when the user visited other sites, so they don't even need to be downloaded;
Merge resources, but at least separate the resources required for the first screen from the others, because the earlier users can see some meaningful content, the faster your page feels to them;
If several of your pages share some resources, split the merged files into shared resources and page-specific resources, so that when the user visits a second page the shared ones might already be cached by the browser and the page loads faster;
Users might be on a phone with a slow or inconsistent 3G/4G network, so even 50k of data or 2 more requests can make a noticeable difference to them.

It is really bad to have a lot of 100-line files, and it is also really bad to have just one or two big files for each type (css/js/markup).
Desktops mostly have high-speed connections, while mobile connections also tend to have high latency.
Given all the theory on this topic, I think the best approach should be more practical and less exact, based on actual connection speeds and device types from a statistical point of view.
For example, I think this is the best way to go today:
1) Put everything needed to show the first page/functionality to the user in one file, which should be under 100KB. This is an absolute requirement.
2) After that, split or group the files into sizes such that the latency is no longer noticeable relative to the download time.
To make it simple and concrete: if we assume a time to first byte of around ~200ms, the size of each file should be between ~120KB and ~200KB, which is good for most of today's connections, on average.

Guidelines for "shareable" url security

I'm planning a webapp that will allow users to create resources without signing in. I plan on using the Google Docs / Pastebin style of security by creating unique hard-to-guess URLs. (e.g. example.com/ytasdfweoirue/)
What are some things to watch out for? What guidelines would you use in designing the token generator? What are some things I should consider? Is there a best set of characters to choose from?
My backend will likely be CouchDB, but I'm interested in platform agnostic, general guidelines and problems that might crop up in any platform.

Use a CSPRNG
You should generate the random URL with a cryptographically secure PRNG (CSPRNG), not with your framework's simplest Random() function. (FYI: in theory a .NET GUID is not designed for security; in practice, in a web app you should be fine, but you've been warned.)
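As a sketch of what that looks like in practice, Python's standard `secrets` module draws from the operating system's CSPRNG. The helper names, the 16-byte length, and the example.com base URL are illustrative assumptions.

```python
import secrets


def new_resource_token(nbytes: int = 16) -> str:
    """Return a URL-safe random token built from `nbytes` of CSPRNG output."""
    return secrets.token_urlsafe(nbytes)


def new_resource_url(base: str = "https://example.com") -> str:
    """Build a hard-to-guess URL of the form <base>/<token>/."""
    return f"{base}/{new_resource_token()}/"
```

The token uses only letters, digits, `-` and `_`, so it needs no URL escaping, which also answers the question about which characters to choose from.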
Do not include 3rd party resources in the "hidden" page
Ensure that the pages visitors see do not include any 3rd party resources (JavaScript, images, Flash animations, etc.). Pretty much all of them will leak the current URL via the Referer header, and your hidden URL will be exposed to all those 3rd parties. This is the case even if you are using HTTPS and the included URLs are also using HTTPS.
Do not include links to 3rd party websites; if you have to, then take care of Referers
Again, Referer leaking can be a problem if the page you are serving includes links to 3rd party URLs. In that case you can either redirect them through a common page (if you do so, be careful about Open Redirect vulnerabilities) or you can use a JavaScript trick to strip the Referer.

You don't mention your technology stack, but the best option here sounds like a GUID. Just have your URL:
http://whatever.com/resource/{guid}
GUIDs are long enough to be hard / impossible to guess or enumerate, and you have a pretty strong guarantee that you won't generate two GUIDs that are the same. As long as you aren't in JavaScript, your language should have a GUID generator available as a built-in (.NET) or a library.
Here is the wikipedia page for more discussion: http://en.wikipedia.org/wiki/Globally_unique_identifier
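A minimal Python sketch of this approach, using the standard `uuid` module. Note the assumption baked in here: only random (version 4) identifiers are used, since other versions such as version 1 are derived from timestamps and are easier to predict. The `new_guid_url` helper is hypothetical.

```python
import uuid


def new_guid_url(base: str = "http://whatever.com/resource") -> str:
    """Return a resource URL ending in a random (version 4) UUID."""
    return f"{base}/{uuid.uuid4()}"
```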
