I'm using Workbox to precache and cache resources in my SharePoint project.
Out of the box, Microsoft SharePoint uses a lot of its own JS/CSS that I would like to have cached.
SharePoint renders the src attributes for JS and CSS with a revision ID appended to the query string.
Something like this:
<link rel="stylesheet" type="text/css" href="/_layouts/15/1046/styles/SuiteNav.css?rev=tyIeEoGrLkQjn4siLhDMLw%3D%3DTAG0"/>
<script type="text/javascript" src="/_layouts/15/1046/initstrings.js?rev=mwvYlbIyUbEbxtCpAg383w%3D%3DTAG0"></script>
I would like to expire those resources based on the revision (rev) query string.
Can this be done with anything out of the box in Workbox, or will I need something like a custom plugin?
Can you point me at the documentation or an example?
Thank you in advance
If you're dealing with resources that are not available at build time, because, for instance, those URLs only exist on the SharePoint server, then you'll have to use runtime caching to handle them.
I'd recommend going with a cache-first strategy for resources like those that have revisioning information in their URL, since they'll (hopefully!) only change if their URL changes.
In terms of making sure that you eventually expire the older entries, that's a bit trickier to do aggressively. What I'd suggest instead is a more conservative approach to expiration, with a maximum number of entries in your cache that is some multiple of the expected number of URLs that will be valid at any given time. For instance, if you know that there are likely to be ~10 URLs that are the "most recent" at any given time, then configuring cache expiration to start expiring entries once you've reached ~30 or so items in the cache would be reasonable.
Putting that all together, you could configure Workbox's runtime caching as follows (using the new v3 syntax):
workbox.routing.registerRoute(
  // Customize this pattern as needed.
  new RegExp('/_layouts/15/1046/'),
  workbox.strategies.cacheFirst({
    cacheName: 'sharepoint-assets',
    plugins: [
      new workbox.expiration.Plugin({maxEntries: 30}),
    ],
  })
);
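With that route in your service worker, wiring it up from the SharePoint pages is the usual registration call. Here is a minimal sketch, assuming the service worker file is named sw.js (adjust the path to wherever your Workbox service worker actually lives):
// Register the service worker that contains the runtime caching route above.
// '/sw.js' is an assumed path; use the location of your own Workbox service worker.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js');
}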
Related
We have started implementing CSP headers on some client sites, initially in report-only mode. My reluctance to switch it to enforced mode is that we may break the client's website because we missed something.
Is there a script (similar to cURL) which I can use to check for CSP issues (similar to the Google Webmaster Tools console), so I can scan entire sites quickly?
In addition to being able to crawl, such a script would need to apply the CSP rules and send reports or log violations in some way. I don't know of any such script. There are also differences between browsers in how CSP is implemented, both in CSP level and in various implementation choices, so even if you scan the entire website without findings, there could theoretically still be violations reported from other browsers.
If your main concern is the sources loaded, you could build a script that extracts all the source URLs. This would still be a time-consuming task: writing and fine-tuning a regex to first filter out comments and XML namespace URLs, then applying another regex to find the actual source URLs. The output would have to be checked against your policy, likely a manual process. Finally, you would also need to check script and style files to find potential violations there. I have done this in a script looking for http references in client code; it is a massive task that will likely consume more hours than updating the CSP once violations occur.
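For what it's worth, a minimal Node.js sketch of that extraction step might look like the following; the regex and the file name page.html are only illustrative and would need the fine-tuning described above:
// extract-sources.js - a rough sketch of pulling source URLs out of an HTML page for manual CSP review.
const fs = require('fs');

const html = fs.readFileSync('page.html', 'utf8')
  // Strip comments first so commented-out tags are not reported.
  .replace(/<!--[\s\S]*?-->/g, '');

// Collect src/href values; the result will still need manual filtering (e.g. XML namespace URLs).
const urls = [...html.matchAll(/(?:src|href)\s*=\s*["']([^"']+)["']/gi)].map(m => m[1]);
console.log([...new Set(urls)].join('\n'));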
As you already have a CSP in report-only mode, your confidence in the policy will grow over time as real users interact with the websites. Keep the report-only policy and add a more permissive enforced policy. As confidence grows, make the enforced policy stricter by making it more similar to the report-only version. Whenever violations occur, check whether they can be reproduced or are just something that happens for particular clients due to specific browsers, browser extensions, rewrites by proxies, etc.
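As a concrete illustration of running both policies side by side, the response headers could be set along these lines; the directives and the /csp-report endpoint here are only placeholders, not a recommended policy:
// csp-headers.js - a sketch of serving a strict report-only policy next to a more permissive enforced one.
const http = require('http');

http.createServer((req, res) => {
  // Permissive policy that is actually enforced; tighten it as confidence grows.
  res.setHeader('Content-Security-Policy',
    "default-src 'self' https:; report-uri /csp-report");
  // Stricter candidate policy that only reports violations.
  res.setHeader('Content-Security-Policy-Report-Only',
    "default-src 'self'; script-src 'self'; style-src 'self'; report-uri /csp-report");
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end('<!DOCTYPE html><html><body>Hello</body></html>');
}).listen(8080);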
You can scan an entire site quickly by using a simple JavaScript page:
<!DOCTYPE html>
<html>
<head></head>
<body>
  Current URL: <span id='url'></span><br>
  <iframe id='site' src='about:blank' onload='reload()' width='100%' height='400px'></iframe>
  <script nonce="test">
    var page = 0;
    var pages = [ // URLs from sitemap.xml
      'https://example.com', 'https://example.com/csp/', 'https://example.com/es/',
      'https://example.com/en/', 'https://example.com/csp/about/', 'https://example.com/contacts/'
    ];
    function reload() {
      if (page < pages.length) {
        document.getElementById('url').innerHTML = pages[page];
        document.getElementById('site').src = pages[page];
        page = page + 1;
      } else {
        document.getElementById('url').innerHTML = 'All done';
        document.getElementById('site').onload = null;
        document.getElementById('site').src = 'about:blank';
      }
    }
  </script>
</body>
</html>
Since the scanned website is loaded in an <iframe>, you need to remove the frame-ancestors directive and the X-Frame-Options header if present.
This script can be used in any browser, but onclick/onmouseover/etc. events will not fire, and forms will not be filled in and submitted. Therefore, not all possible CSP violations will be detected by scanning.
You should check violation reports for two weeks or more before switching the CSP to enforce mode.
You can also inject this script into the scanned website itself and avoid the <iframe> (using the page's load event instead of the iframe's onload).
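The violation reports themselves arrive as JSON POSTs to whatever report-uri you configured, so a small endpoint to collect them can be very simple. Here is a sketch; the /csp-report path, port, and logging format are assumptions:
// csp-report-logger.js - a sketch of collecting the violation reports sent to report-uri.
const http = require('http');

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/csp-report') {
    let body = '';
    req.on('data', chunk => { body += chunk; });
    req.on('end', () => {
      // Browsers send a JSON document with a "csp-report" object describing the violation.
      console.log(new Date().toISOString(), body);
      res.writeHead(204);
      res.end();
    });
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(8081);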
I have a static web site hosted on Amazon S3. I regularly update it. However I am finding out that many of the users accessing it are looking at a stale copy.
By the way, the site is http://cosi165a-f2016.s3-website-us-west-2.amazonaws.com and it's generated by a Ruby static site generator called nanoc (very nice, by the way). It compiles the source material for the site (https://github.com/Coursegen/cosi165a-f2016) into the HTML, CSS, JS and other files.
I assume that this has to do with the page freshness, and the fact that the browser is caching pages.
How do I ensure that my users see a fresh page?
One common technique is to keep track of the last timestamp at which you updated the static assets on S3, then use that timestamp as a query string parameter in your HTML.
Like this:
<script src="//assets.com/app.min.js?1474399850"></script>
The browser will still cache that result, but if the timestamp changes, the browser will have to get a new copy.
The technique is called "cachebusting".
There's a grunt module if you use grunt: https://www.npmjs.com/package/grunt-cachebuster. It will calculate the hash of your asset's contents and use that as the filename.
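Along the same lines, a minimal Node.js build step could compute a content hash and rewrite the script tag in your generated HTML. This is just a sketch; the file names under build/ are assumptions about your site generator's output:
// cachebust.js - a sketch of content-hash cache busting for a single asset.
const crypto = require('crypto');
const fs = require('fs');

// Hash the asset's contents so the query string only changes when the file itself changes.
const asset = fs.readFileSync('build/app.min.js');
const hash = crypto.createHash('md5').update(asset).digest('hex').slice(0, 10);

// Rewrite references to the asset in the generated HTML.
const html = fs.readFileSync('build/index.html', 'utf8')
  .replace(/app\.min\.js(\?[^"']*)?/g, `app.min.js?${hash}`);
fs.writeFileSync('build/index.html', html);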
I have a REST API that I will use to render HTML using some basic templating language. I wonder if there is any good platform or service for pre-rendering HTML files and serving them statically, for performance and scalability.
I need to pre-render the pages continuously, e.g. every 24 hours, and it should also be possible to tell the system to re-render a specific page somehow. I'm comfortable in most open-source languages; Node is a favourite.
It seems to me that the most straightforward way to accomplish this is to use two tiers: a rendering server and a cache server. When the cache server starts up, it would crawl through every URL on the rendering server and store the pre-rendered HTML files in its local directory. For simplicity you can mirror the "directory structure" and make the resource paths identical. In other words, for every URL on the rendering server that looks like this:
http://render.xyz/path/to/resource
You create a directory structure /path/to on the cache server and put a file resource in it.
Your end-users don't need to be aware of this architecture. They make requests to the cache server like this:
http://cache.xyz/path/to/resource
The cache server gives them the result they are looking for.
There are many ways to tell the cache server to refresh (re-generate) a page. You could add a "hidden" directory, let's call it .cache-command, and use it to handle refresh requests. For example, to tell the cache server to refresh a resource, you would use a URL like this:
http://cache.xyz/.cache-command/refresh/path/to/resource
When the cache server received that request, it would refresh the resource.
One of the advantages of this approach is that your cache server can be completely independent of the render server. They could be written in different languages, running on different hardware, or they could be part of the same nodejs application. Whatever works best for you.
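As a rough illustration, here is a minimal sketch of such a cache server in Node.js, assuming Node 18+ for the built-in fetch; the origin, port, cache directory and the .cache-command path are the placeholders from the description above, and there is no input validation:
// cache-server.js - a minimal sketch of the two-tier cache server described above.
const http = require('http');
const fs = require('fs');
const path = require('path');

const RENDER_ORIGIN = 'http://render.xyz';       // the rendering server
const CACHE_DIR = path.join(__dirname, 'cache'); // local mirror of pre-rendered pages

// Fetch a page from the rendering server and store it on disk.
async function refresh(resourcePath) {
  const response = await fetch(RENDER_ORIGIN + resourcePath);
  const html = await response.text();
  const filePath = path.join(CACHE_DIR, resourcePath);
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, html);
  return html;
}

http.createServer(async (req, res) => {
  try {
    if (req.url.startsWith('/.cache-command/refresh/')) {
      // e.g. /.cache-command/refresh/path/to/resource
      await refresh(req.url.replace('/.cache-command/refresh', ''));
      res.writeHead(200);
      res.end('refreshed');
      return;
    }
    const filePath = path.join(CACHE_DIR, req.url);
    // Serve from the local mirror, falling back to the render server on a miss.
    const html = fs.existsSync(filePath)
      ? fs.readFileSync(filePath, 'utf8')
      : await refresh(req.url);
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(html);
  } catch (err) {
    res.writeHead(500);
    res.end('error');
  }
}).listen(8080);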
As the title says: is it possible to set caching for external resources with .htaccess?
I have some third-party content on my site, e.g. Google Web Elements and embedded YouTube clips.
I want my Google Page Speed score to get higher.
Error output from Page Speed:
The following resources are missing a cache validator.
http://i.ytimg.com/vi/-MfM1fVSFnM/0.jpg
http://i.ytimg.com/vi/-PxVKNJmw4M/0.jpg
http://i.ytimg.com/vi/3nxENc_msc0/0.jpg
http://i.ytimg.com/vi/5Bra7rbGb7g/0.jpg
http://i.ytimg.com/vi/5P76PKybW5o/0.jpg
http://i.ytimg.com/vi/9l9BzKfI88o/0.jpg
http://i.ytimg.com/vi/E7hvBxMB4XI/0.jpg
http://i.ytimg.com/vi/IiocozLHFis/0.jpg
http://i.ytimg.com/vi/JIHohC8fydQ/0.jpg
http://i.ytimg.com/vi/P66uwFpmQSE/0.jpg
http://i.ytimg.com/vi/TXLTbARnRdU/0.jpg
http://i.ytimg.com/vi/bPBrRzckfEQ/0.jpg
http://i.ytimg.com/vi/dajcIH9YUuI/0.jpg
http://i.ytimg.com/vi/g4roerqw090/0.jpg
http://i.ytimg.com/vi/h1imBHP3DdA/0.jpg
http://i.ytimg.com/vi/hRvW5ndLLEk/0.jpg
http://i.ytimg.com/vi/kzahftbo6Qc/0.jpg
http://i.ytimg.com/vi/lta2U3hkC4k/0.jpg
http://i.ytimg.com/vi/n1o9bGF88HY/0.jpg
http://i.ytimg.com/vi/n3csJN0wXew/0.jpg
http://i.ytimg.com/vi/q0Xu-0moeew/0.jpg
http://i.ytimg.com/vi/tPCDPKirZBM/0.jpg
http://i.ytimg.com/vi/uLxsPImMJmg/0.jpg
http://i.ytimg.com/vi/x33B_iBn2_M/0.jpg
No, it's up to them to cache it.
The best you could do would be to download them onto your server and then serve them, but that would be slower anyway!
Nope, setting cache headers for third-party resources is not possible unless you start passing those resources through your own server as a proxy, which you usually don't want for reasons of speed and traffic.
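If you really did want the proxy route despite those drawbacks, a minimal Node.js sketch might look like this; the /thumb/ path and the one-week cache lifetime are just illustrative choices:
// thumbnail-proxy.js - a sketch of proxying a third-party image so you control its cache headers.
const http = require('http');
const https = require('https');

http.createServer((req, res) => {
  // e.g. /thumb/-MfM1fVSFnM serves the thumbnail at https://i.ytimg.com/vi/-MfM1fVSFnM/0.jpg
  const videoId = req.url.replace('/thumb/', '');
  https.get(`https://i.ytimg.com/vi/${videoId}/0.jpg`, (upstream) => {
    res.writeHead(200, {
      'Content-Type': 'image/jpeg',
      'Cache-Control': 'public, max-age=604800' // one week; adjust as needed
    });
    upstream.pipe(res);
  });
}).listen(8080);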
As far as I can see, there's nothing you can do here.
You could delay your YouTube videos from loading on the page until something like a holding image is clicked. This wouldn't cache those thumbnails when (or if) they are loaded, but they wouldn't detrimentally affect your Page Speed score because they would no longer be loaded on page load.
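A minimal sketch of that click-to-load approach, with a placeholder video ID and holding image:
<!-- The YouTube iframe is only swapped in when the holding image is clicked. -->
<div id="player" data-video-id="-MfM1fVSFnM" style="cursor:pointer">
  <img src="holding-image.jpg" alt="Play video" width="480" height="360">
</div>
<script>
  document.getElementById('player').addEventListener('click', function () {
    var id = this.dataset.videoId;
    this.innerHTML = '<iframe width="480" height="360" ' +
      'src="https://www.youtube.com/embed/' + id + '?autoplay=1" ' +
      'frameborder="0" allowfullscreen></iframe>';
  });
</script>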
How can I check whether a certain page is being accessed by a crawler or a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of work to engineer a solution.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or any other legitimate crawler, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, then you can use that as a starting point when trying to determine whether it's humanly possible to consume the data that fast. This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
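A rough sketch of that timing check in Node.js; the threshold, cookie name, and in-memory storage are all illustrative assumptions rather than a production setup:
// timing-check.js - a sketch of flagging clients that request pages faster than a human could read them.
const http = require('http');
const crypto = require('crypto');

const MIN_SECONDS_PER_PAGE = 5; // assumed minimum "human" reading time per page
const lastSeen = new Map();     // visitor id -> timestamp of the previous request

http.createServer((req, res) => {
  // Identify the visitor by a cookie rather than by IP, since an IP can be shared.
  const match = /visitor=([a-f0-9]+)/.exec(req.headers.cookie || '');
  const id = match ? match[1] : crypto.randomBytes(8).toString('hex');

  const now = Date.now();
  const previous = lastSeen.get(id);
  lastSeen.set(id, now);

  if (previous && (now - previous) / 1000 < MIN_SECONDS_PER_PAGE) {
    res.writeHead(429); // suspiciously fast; throttle, log or challenge as appropriate
    res.end('Slow down');
    return;
  }

  res.writeHead(200, { 'Set-Cookie': `visitor=${id}; HttpOnly` });
  res.end('Normal page content');
}).listen(8080);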
You can use the robots.txt file to block access for crawlers, or you can use JavaScript to detect the browser's user agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved automated crawler should check your site.
I had a similar issue in my web application: I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce the time taken to crawl.
I solved the problem the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search-engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET Javascript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but maybe it is useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and it is very likely they would trigger the onclick event of a button element, but not of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
  class="all rndCorner"
  style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
  onclick="$TodoApp.$AddSampleTree()">
  Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably until now, although crawlers could be changed to be even more clever, maybe after reading this article :D