wget solution to avoid re-downloading files already present in the mirror when the Last-Modified header is not provided - linux

I want to refresh the mirror of a website whose server does not deliver Last-Modified response headers. wget reports:
Last-modified header missing -- time-stamps turned off. Remote file exists and could contain links to other resources -- retrieving.
I do not have admin rights to that server.
I'd be happy to tell wget to skip files whose size is identical to the local copy, but I have not found a way to accomplish that. I want to minimize bandwidth, even at the cost of missing files that changed but kept the same size.
How would I implement such filtering from the CLI?
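Absent a built-in wget option for size-only comparison, one workaround is to drive wget from a small script that compares the remote Content-Length against the local file size and only fetches on a mismatch. Below is a minimal sketch in Python using the requests library; it assumes the server answers HEAD requests with a Content-Length header, and MIRROR_ROOT and URLS are hypothetical placeholders for your mirror layout and URL list.

    #!/usr/bin/env python3
    """Skip downloads whose remote Content-Length matches the local file size (sketch)."""
    import os
    import subprocess

    import requests

    MIRROR_ROOT = "/srv/mirror/example.org"        # hypothetical local mirror root
    URLS = ["https://example.org/files/data.bin"]  # hypothetical list of URLs to refresh

    def needs_download(url: str, local_path: str) -> bool:
        """Return True unless the local file exists and matches the remote size."""
        if not os.path.exists(local_path):
            return True
        head = requests.head(url, allow_redirects=True, timeout=30)
        remote_size = head.headers.get("Content-Length")
        if remote_size is None:
            return True  # server did not report a size; fall back to downloading
        return int(remote_size) != os.path.getsize(local_path)

    for url in URLS:
        local_path = os.path.join(MIRROR_ROOT, url.split("://", 1)[1])
        if needs_download(url, local_path):
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            subprocess.run(["wget", "-O", local_path, url], check=True)

Like the size-only heuristic described above, this will miss files that changed but kept the same byte count; that trade-off is accepted here to save bandwidth.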

Related

How to set cache-control to always check for updates but always fall back to cache if server is unreachable

I'm missing something trying to understand cache-control (e.g., from https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).
How do I set up cache control to accomplish the following (I'll be using an .htaccess file):
1. If the client fetches a file, it should always store it in the cache.
2. When the client needs a file, it should always check whether the file has changed and download a new copy if it has.
3. If the attempt to check fails -- e.g., the server is down or there is no Internet connection -- the client should always use a cached copy if available, no matter how old. Any copy is better than none.
Use Cache-Control: no-cache and set the ETag header.
The resource will be stored in the cache. This is true with any Cache-Control directive other than no-store.
no-cache tells the client that it must check with the server to see if the cached copy is valid. It does this by sending a conditional request, which requires that the cached response have an ETag (or Last-Modified) header.
Using a cached copy of a resource when there's no connectivity is the default behavior. You could prevent this with the must-revalidate directive.
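To see the revalidation flow this answer describes, you can replay it by hand: fetch once, keep the ETag, then send a conditional request and expect 304 Not Modified if nothing changed. A minimal sketch in Python with the requests library; the URL is a placeholder and is assumed to be served with Cache-Control: no-cache and an ETag.

    import requests

    url = "https://example.com/styles.css"  # placeholder resource

    # First fetch: the client stores the body and remembers the validator.
    first = requests.get(url, timeout=10)
    etag = first.headers.get("ETag")
    cached_body = first.content

    # Revalidation: no-cache forces a conditional request before every reuse.
    try:
        check = requests.get(url, headers={"If-None-Match": etag} if etag else {}, timeout=10)
        if check.status_code == 304:
            body = cached_body        # unchanged: reuse the cached copy, no body transferred
        else:
            body = check.content      # changed: take the fresh copy
    except requests.RequestException:
        body = cached_body            # server down / no connectivity: fall back to the stale copy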

Amazon CloudFront - Invalidating files by regex, e.g. *.png

Is there a way to have Amazon CloudFront invalidation (via the management console) invalidate all files that match a pattern, e.g. images/*.png?
Context:
I had set cache control for images on my site, but by mistake left the .png extension out of the cache directive on Apache. So .gif/.jpg files were cached on users' computers but .png files were not.
I have since fixed the Apache directive, and my Apache server now serves .png files with appropriate cache-control directives. I tested this.
But CloudFront had fetched those .png files in the past, so hitting them via CloudFront still returns them with no cache-control headers. End result: still no client-side caching for those .png files.
I tried to set the invalidation in the Amazon CloudFront console as images/*.png. The console said completed, but I still do not get a cache-control directive on the .png files --> makes me believe that the invalidation did not happen.
I can set the invalidation for the complete image directory, but I have too many image files --> I would get charged > $100 for this, so I am trying to avoid it.
Changing image versions so that CloudFront fetches new versions is a painful exercise in my code; doing it for, say, 500 .png files would be a pain --> trying to avoid it.
Listing individual .png files is also a pain --> trying to avoid it as well.
Thanks,
-Amit
If your CloudFront distribution is configured in front of an S3 bucket, you can list all of the objects in the S3 bucket, filter them with a regex pattern (e.g., /\.png$/i), then use that list to construct your invalidation request.
That's what I do anyway. I hope this helps! :)
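A sketch of that approach with boto3; the bucket name, distribution ID, and the images/ prefix are assumptions for illustration.

    import re
    import time

    import boto3

    BUCKET = "my-image-bucket"          # hypothetical S3 bucket behind the distribution
    DISTRIBUTION_ID = "E1EXAMPLE"       # hypothetical CloudFront distribution ID
    PATTERN = re.compile(r"\.png$", re.IGNORECASE)

    s3 = boto3.client("s3")
    cloudfront = boto3.client("cloudfront")

    # List every object key under the prefix and keep only the .png ones.
    paths = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix="images/"):
        for obj in page.get("Contents", []):
            if PATTERN.search(obj["Key"]):
                paths.append("/" + obj["Key"])  # invalidation paths start with "/"

    # Submit one invalidation request covering all matching keys.
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time()),
        },
    )

For very large buckets the path list may need to be split into multiple invalidation batches, and each listed path still counts toward CloudFront's per-path invalidation pricing.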

Vary header when content is not gzipped on IIS 7 as origin for a CDN

I'm trying to set up my IIS server as an origin server for a CDN. I have already solved some issues, for example that IIS doesn't serve gzipped content to proxies (if they send the Via header), and also the frequentHitThreshold problem.
My CDN supplier pointed out another problem with IIS: it doesn't return a "Vary" header if the client doesn't request the content gzipped. According to them, if for some reason the first client that requests the content doesn't want it gzipped, the CDN never requests a new version of the file, since the Vary header doesn't indicate that two different files should be returned depending on "Accept-Encoding".
My only solution so far is to add "Vary: Accept-Encoding" as a custom header, but since IIS automatically adds this Vary header when gzipped content is requested, I end up with duplicate values like "Vary: Accept-Encoding, Accept-Encoding".
Does anyone have a solution to this, or can anyone confirm that it's a real issue?
This is a real issue: the IIS gzip module overwrites existing Vary headers. There is an MS Connect issue you can vote on.
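One way to confirm the behavior against your own origin is to request the same URL with and without gzip and compare the Vary headers. A quick check sketched in Python with the requests library; the URL is a placeholder for a resource on the IIS origin.

    import requests

    url = "http://origin.example.com/static/app.js"  # placeholder resource

    # Simulate the first client not asking for gzip, then a client that does.
    plain = requests.get(url, headers={"Accept-Encoding": "identity"})
    gzipped = requests.get(url, headers={"Accept-Encoding": "gzip"})

    print("Vary without gzip:", plain.headers.get("Vary"))    # a missing or duplicated value here is the reported bug
    print("Vary with gzip:   ", gzipped.headers.get("Vary"))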
This issue is now addressed by an official patch to IIS. For the download and further information, visit http://support.microsoft.com/kb/2877816
Erez Benari, IIS PM

How to prevent IIS from sending cache headers with ASHX files

My company uses ASHX files to serve some dynamic images. Since the content type is image/jpeg, IIS sends headers with them as it would for static images.
Depending on settings (I don't know all of the settings involved, hence the question), the headers may include any of Last-Modified, ETag, or Expires, causing the browser to treat the images as cacheable, which leads to all sorts of bugs where the user sees stale images.
Is there a setting that I can set somewhere that will cause ASHX files to behave the same way as other dynamic pages, like ASPX files? Short of that, is there a setting that will allow me to remove Last-Modified, ETag, Expires, etc. across the board and add a no-cache header instead?
The only solutions I've found were:
1) Adding Response.CacheControl = "no-cache" to each handler.
I don't like this because it requires changing all of the handlers and requires all developers to be aware of it.
2) Setting an HTTP header override on a folder where the handlers live.
I don't like this one because it requires the handlers to be in their own directory. While this may be good practice in general, unfortunately our application is not structured that way, and I cannot just move them because it would break client-facing links.
If nobody provides a better answer I'll have to accept that these are the only two choices.
Add a randomly generated string to the request query. This will trick the browser into thinking it is a different call. Example: document.getElementById("myimgcontl").src = "myimages.ashx?" + Math.random();

Varnish caching too many files and not caching PHP

I'm using Varnish without touching any configuration (just setting it to forward to Apache on port 8080).
But I got two issues:
I visit the URL of an image, delete the image, and visit it again, and it still exists … Varnish cached it … how can I tell Varnish to first check whether the file at least exists before serving it from its cache?
The PHP files are not being cached (I mean, the HTML content generated by the PHP). I always see Age: 0 in the headers … any clue?
Thank you!
I visit the URL of an image, delete the image, and visit it again, and it still exists … Varnish cached it … how can I tell Varnish to first check whether the file at least exists before serving it from its cache?
Eh, the whole purpose of caching is not having to do the same work (like checking for existence and loading a file, or generating a PHP response) over and over again, but to reuse the generated response. Varnish never knew about the existence of some file to begin with (your backend server did that work), so it can never check whether 'the file at least exists'.
There are, however, ways to instruct Varnish not to cache URLs forever. For instance, if your back-end response instructs any cache not to reuse the result (certain HTTP response headers indicate this), Varnish will not cache it. Varnish is also smart enough (by default) not to cache responses with cookies (which probably answers your second question). You can tell Varnish to cache a response only for a certain period (like 30 seconds), so your deletes will be picked up pretty quickly. You could also PURGE URLs from Varnish after you change or delete a file, as sketched below. If your backend server does not communicate this correctly with its response headers, you can override the behavior by writing your own .vcl file.
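For example, if your VCL accepts PURGE requests (this is not enabled out of the box; your vcl_recv has to handle the method, typically restricted to trusted client IPs), the script that deletes a file could evict the matching URL from Varnish. A minimal sketch in Python; the hostname and path are placeholders.

    import requests

    # Hypothetical URL of the image that was just deleted from the backend.
    url = "http://www.example.com/images/photo.jpg"

    # Send the PURGE through Varnish itself so it drops its cached copy.
    # This only works if vcl_recv contains a handler for the PURGE method.
    response = requests.request("PURGE", url)
    print(response.status_code, response.reason)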
The PHP files are not being cached (I mean, the HTML content generated by the PHP). I always see Age: 0 in the headers … any clue?
I can guess: you're setting cookies. But it would really help if you added the response headers to your question.
