CloudFront report showing 404 URLs


There are URLs in my CloudFront distribution that are returning 404. Once I invalidate them, all is well. I assume that at some point the origin server returned 404, which was cached by CloudFront.
Is there a way to generate a CloudFront report showing the URLs that are marked as missing? (404s)
Is there a way to create an alert for new ones?

On top of @alexjs's answer, one could also set the 404 cache period to 0. In my case, that was the desired behavior, as 404s should never happen and usually suggest a server issue.
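A minimal sketch of that setting, assuming the distribution is managed through the CloudFront API, where the error caching period is set per status code via `ErrorCachingMinTTL` inside `CustomErrorResponses`. The helper name is hypothetical, and it only builds the config fragment; applying it would mean merging it into the full distribution config and submitting an update.

```python
def no_cache_404_fragment():
    """Build the CustomErrorResponses fragment that stops CloudFront caching 404s."""
    return {
        "CustomErrorResponses": {
            "Quantity": 1,
            "Items": [
                {
                    "ErrorCode": 404,
                    # 0 seconds: go back to the origin on each request for a missing object
                    "ErrorCachingMinTTL": 0,
                }
            ],
        }
    }


fragment = no_cache_404_fragment()
print(fragment["CustomErrorResponses"]["Items"][0])
```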

As you've observed, CloudFront by default caches errors.
Are the 404s for valid URLs? A common cause is an application that requests a URL (e.g. a file) to check whether it exists before uploading or publishing it. This backfires if you have a caching layer, because the 404 from that check gets cached.
Is there a way to generate a CloudFront report showing the URLs that are marked as missing? (404s)
These will be written to CloudFront logs.
Is there a way to create an alert for new ones?
You could use CloudWatch Alarms to alert based on 4xx/5xx error rate, but this wouldn't give you granularity into the URIs themselves.
You could also use a Lambda function that is invoked when logs are delivered to your S3 bucket, parses them for 4xx responses, and then sends you a notification (SNS/SES/etc.).
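The parsing step of such a function could look roughly like the sketch below. It assumes CloudFront standard logs, where a `#Fields:` header names the columns and data rows are tab-separated. The function name and the shortened sample are illustrative only; downloading and decompressing the log object from S3 and publishing the result to SNS are left out.

```python
from collections import Counter


def count_404_uris(log_text):
    """Return a Counter of URI paths that were served with status 404."""
    fields = []
    counts = Counter()
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            # Header names are space-separated; data rows are tab-separated
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split("\t")))
        if row.get("sc-status") == "404":
            counts[row.get("cs-uri-stem", "")] += 1
    return counts


# Shortened sample with only four of the real columns, for illustration
sample = (
    "#Version: 1.0\n"
    "#Fields: date time cs-uri-stem sc-status\n"
    "2024-01-01\t00:00:01\t/img/a.png\t200\n"
    "2024-01-01\t00:00:02\t/img/missing.png\t404\n"
    "2024-01-01\t00:00:03\t/img/missing.png\t404\n"
)
print(count_404_uris(sample).most_common())
```

Comparing the result against the set of URIs seen in earlier runs would give the "new ones only" alert asked about in the question.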

Related

Bing image search API (cognitive) returns broken links?

I am using Azure cognitive service, more precisely the 'bing image search service'.
I send requests to fetch images in relation to a specific keyword.
For this, I make HTTP REST requests to the right azure endpoint:
'https://api.cognitive.microsoft.com/bing/v7.0/images/search?q=MYKEYWORD'
It works well for a lot of requests and results.
However, in some images in the json response of the service, the field 'contentUrl' gives me a broken link to the website hosting the image (404 or 403 on some different wordpress sites for example).
Therefore, my program which tried to download the image thanks to the 'contentUrl' link crashes (or has to throw at least an exception).
I guess it is because the website changed (by removing the image they were hosting) and bing didn't update its database (or the crawler didn't have time to do so).
Hence I don't know what to do :'(
Any help / advice ?

Yes, you are right: the contentUrl may come from outdated cached content that has not been removed yet.
By default, Bing returns cached content, if available. To prevent Bing
from returning cached content, set the Pragma header to no-cache (for
example, Pragma: no-cache).
You could check the Pragma header in this doc: Headers.
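As a rough sketch, the request could be built like this with only the standard library. The endpoint is the one from the question; the `Ocp-Apim-Subscription-Key` header and the placeholder key are assumptions about how the subscription is passed, and `try_download` is a hypothetical helper that skips a broken `contentUrl` instead of crashing.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"


def build_search_request(keyword, api_key):
    """Build (but do not send) a search request that asks Bing to skip its cache."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"q": keyword})
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,  # assumed subscription header
        "Pragma": "no-cache",
    }
    return urllib.request.Request(url, headers=headers)


def try_download(url, timeout=10):
    """Return the image bytes, or None if the host fails (e.g. 403/404 or timeout)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts are all OSError subclasses
        return None


request = build_search_request("MYKEYWORD", "YOUR-KEY")  # placeholder key
print(request.full_url)
print(request.get_header("Pragma"))
```

Even with `Pragma: no-cache`, a third-party site can remove an image at any time, so tolerating failed downloads is still worthwhile.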

Google indexing Cloudfront distribution

I have a static site through Cloudfront with an S3 origin & custom domain via Route 53. All works well, except that Google has also indexed the Cloudfront distribution url (d123etc.cloudfront.net) as well as my custom domain, leading to duplicate content issues.
I've tried canonical urls, but the distribution remains indexed. It has been suggested to serve up a different robots.txt depending on what domain is being used, which sounds fine, but there is no .htaccess or web server, leaving it to a Lambda Edge function to try and send the different robots.txt.
The problem is that I can't find how in the function to determine if a request is coming from my custom domain or from the direct distribution url. I've tried white-listing the Origin, but it is not sent through when using an S3 origin. I've also tried white-listing the Referer header, but no referrer is sent through when accessing the robots.txt file as it's a direct request.
For the time-being, I'm adding a meta noindex client-side using js on page load (which I realise is too late), and also redirecting client-side to my actual domain in case someone follows the google indexed cloudfront.net domain.
Does anyone know how to detect in Lambda Edge which domain is being used to make the request? Or some other way of blocking Google from indexing the Cloudfront url, just leaving it to index the custom domain.

So I think the way to do this would be to set up a redirect on your hosted web server. If you check the 'host' request header and it contains cloudfront.net, send a 301 response code along with your custom domain name.
S3 has a UI way to do this:
https://medium.com/tensult/how-to-do-site-redirection-using-aws-522a4002c645
It seems you'll need a second bucket behind the same cloudfront url but without the custom domain. Then you can set it to redirect all requests to your custom domain.
Browsers and bots would then stop using the cloudfront.net URL because it no longer returns any content; they would be redirected automatically (without the user really noticing) to your custom domain, and all the links would point to your own domain.
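For the Lambda@Edge route mentioned in the question, the host check and 301 could be sketched as below. This assumes the function is attached as a viewer-request trigger, where the `host` header should still carry the hostname the visitor used (with an S3 origin, later stages see the bucket's hostname instead). `CUSTOM_DOMAIN` and `fake_event` are hypothetical names used for illustration.

```python
CUSTOM_DOMAIN = "www.example.com"  # hypothetical; replace with your own domain


def handler(event, context):
    """Viewer-request handler: redirect cloudfront.net hosts to the custom domain."""
    request = event["Records"][0]["cf"]["request"]
    host = request["headers"]["host"][0]["value"].lower()
    if host.endswith(".cloudfront.net"):
        location = "https://" + CUSTOM_DOMAIN + request["uri"]
        if request.get("querystring"):
            location += "?" + request["querystring"]
        return {
            "status": "301",
            "statusDescription": "Moved Permanently",
            "headers": {"location": [{"key": "Location", "value": location}]},
        }
    # Returning the request unchanged lets CloudFront continue as normal
    return request


def fake_event(host, uri):
    """Build a minimal event shaped like a CloudFront viewer request (test helper)."""
    request = {
        "uri": uri,
        "querystring": "",
        "headers": {"host": [{"key": "Host", "value": host}]},
    }
    return {"Records": [{"cf": {"request": request}}]}


print(handler(fake_event("d123etc.cloudfront.net", "/robots.txt"), None))
```

The same host check could instead return a different robots.txt body for the cloudfront.net hostname, if a redirect is not wanted.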

Azure: WebContentNotFound on refreshing page of SPA deployed as Azure Blob Static Website with CDN

I have a SPA (built with Angular) deployed to Azure Blob Storage. Everything works fine as long as you start from the default domain, but the moment I refresh any of the pages/routes, index.html no longer gets loaded and I instead get the error "the requested content does not exist".
Googling that term results in 3 results total so I'm at a loss trying to diagnose & fix this.

You can simply configure the error page to index.html in your static website:

Actually the issue was I didn't have 404.html defined -- the blob storage for SPA doesn't understand what file to serve for any other routes than the root one. So every other route will go to the 404 file. But in a SPA even the 404 goes through the index file. So all I did is mention index.html as my 404 file and all is well.

For me, adding the index.html page as the Error document page did not help when navigating by URL, as it would still reload the app. I posted an answer elsewhere about using the Angular HashLocationStrategy instead, which does not cause a page reload when the URL is changed manually.
Answer on other SO question

There is a new Static Web Apps solution by Microsoft. It is currently in preview mode, but I think it is the most convenient way to use/deploy a SPA in the Azure infrastructure. You can use your custom domain with free SSL, version control, and set up a route to redirect everything to the index.html (fallback routes: https://learn.microsoft.com/en-us/azure/static-web-apps/routes), for example. See more details here: https://learn.microsoft.com/en-us/azure/static-web-apps/

Generally, you've created a CDN profile and an endpoint, but your content doesn't seem to be available on the CDN. Users who attempt to access your content via the CDN URL receive an HTTP 404 status code. You can follow these methods in troubleshooting Azure CDN endpoints that return a 404 status code
There are several possible causes, including:
The file's origin isn't visible to the CDN.
The endpoint is misconfigured, causing the CDN to look in the wrong place.
The host is rejecting the host header from the CDN.
The endpoint hasn't had time to propagate throughout the CDN.
With a CDN, the initial request for a file is fulfilled from the origin server; subsequent requests, such as when you refresh the page, are served from the CDN cache until the file's time-to-live (TTL) elapses. See Manage expiration of Azure Blob storage in Azure CDN and Control Azure CDN caching behavior with caching rules.
In this case, make sure the static website's blob content is publicly available on the Internet. After that, verify that your origin settings are properly configured: check that the Origin type and Origin hostname values are correct, and that the HTTP and HTTPS ports match the ones your static website is listening on. You can find more details in that troubleshooting link.

TL/DR
You could set the error document (404) to also be your index.html
This is a quick fix that will still return 404, however will also actually follow your deep link.
This isn't a 'fix'. It's more of a hack - the real fix is to add a CDN with some URL redirect rules in front of your hosting server. Here is a great guide: https://antbutcher.medium.com/hosting-a-react-js-app-on-azure-blob-storage-azure-cdn-for-ssl-and-routing-8fdf4a48feeb
Rule itself
But to save you the click, the CDN rule using standard microsoft CDN (the cheaper one) is something like this:
(add the condition with the '+ condition' button)
If URL file extension > less than 1 extension > no case transform
(add the action with '+ Add action' button)
source pattern: '/' > Destination: '/index.html' > preserve unmatched path: no
Explanation
I'll attempt to add an explanation, which I think nobody else has done nicely.
What this rule will do is say any URL request that isn't for a direct file, eg.
example.com/xyz
example.com/user/xyz
example.com/tabs/post/12345
Or ANYTHING without a direct file extension (like '.png' or '.pdf' or '.html')
Then we rewrite the URL to 'index.html', the host file where the SPA has JavaScript to handle deep links for paths like those in the examples. You will therefore not get a 404, and the code will handle the route gracefully.
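The effect of the rule can be sketched in a few lines. This only illustrates the decision the rule makes and is not how the CDN rules engine is implemented; `rewrite_for_spa` is a hypothetical name.

```python
import posixpath


def rewrite_for_spa(path):
    """Send extensionless paths to /index.html; leave real file requests alone."""
    last_segment = path.rsplit("/", 1)[-1]
    _, extension = posixpath.splitext(last_segment)
    return path if extension else "/index.html"


for p in ["/xyz", "/user/xyz", "/tabs/post/12345", "/logo.png", "/"]:
    print(p, "->", rewrite_for_spa(p))
```

One consequence of keying on the extension: a route whose last segment contains a dot (for example `/user/john.doe`) looks like a file and would not be rewritten.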

Is there any way to identify requests coming to custom origin server from CloudFront?

I'm using CloudFront with a custom origin and want to redirect certain requests coming to a web app to CloudFront (clients use direct URLs, which cannot be changed to CloudFront-based URLs). To ensure that the cache on CloudFront is updated properly, I must not redirect requests coming from CloudFront itself. Is there any way to identify such requests on the origin server?
Does CloudFront add any custom headers to requests sent to the origin server? Or is there any other reliable way to determine that a request is coming from CloudFront?

Yes, you can identify requests coming to your origin server from CloudFront by checking the User-Agent header. The user agent would be 'Amazon CloudFront'.

Update
It's an old question, but this update may be useful for someone researching or looking for a newer solution.
AWS has since added a feature called Origin Custom Headers. You can set a header with a secret value and check it on your origin server, either in the web server or in your application.
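On the origin side, the check could be as small as the sketch below. The header name `x-origin-verify` and the secret are hypothetical values you would configure on the distribution yourself; the comparison uses `hmac.compare_digest` so it does not leak timing information.

```python
import hmac

SECRET_HEADER = "x-origin-verify"  # hypothetical header name set on the distribution


def is_from_cloudfront(headers, expected_secret):
    """Return True if the request carries the shared secret header."""
    # Header names are case-insensitive, so normalize before looking up
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(SECRET_HEADER, "")
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())


print(is_from_cloudfront({"X-Origin-Verify": "s3cret"}, "s3cret"))
print(is_from_cloudfront({"User-Agent": "curl/8.0"}, "s3cret"))
```

Requests that fail the check are the ones that arrived directly, so those are the ones to redirect to CloudFront.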

Update
Avinash Bijja correctly pointed out (+1) that the HTTP User-agent header would be 'Amazon CloudFront' for requests coming from Amazon CloudFront servers. Unfortunately this doesn't seem to be explicitly documented indeed, but is implicitly acknowledged by various posts in the respective forum, see e.g. the AWS Team response to User Agent String - does CF overwrite the user agent string?:
You are correct. The User-Agent field is always populated as "Amazon CloudFront".
However, it turns out this is not currently entirely reliable, insofar as CloudFront sends an empty User-Agent to the origin if one is already missing from the originating client request:
I can confirm that CloudFront is not sending a User-Agent to the
origin when the original client does not send a User-Agent. We have
enhancements & fixes to User-Agent handling on our backlog, but no
release dates at this time. I've sent you a PM with further details.
These enhancements & fixes had apparently still not been rolled out as of February 07 2013, but they have been rolled out as of August 05 2013 (thanks webbiedave for the update!).
Initial Answer
Does CloudFront add any custom headers to requests sent to origin
server?
One would think so indeed, but at least they don't appear to be documented where I would have expected it, namely in How CloudFront Processes and Forwards Requests to Your Custom Origin Server. Given you are in control of the origin server, you might just check its HTTP access logs though?
Or is there any other reliable way to determine that requests is
coming from CloudFront?
You'll need to judge the reliability yourself, but The IP address that CloudFront forwards to the origin server is the IP addresses of a CloudFront server, not the IP address of the end user's computer. - consequently you could restrict access to the published Amazon CloudFront Public IP Ranges; however, be aware of the respective disclaimer:
The CloudFront IP addresses change frequently and we cannot guarantee
advance notice of changes. On a best-effort basis, we will provide the
list of current addresses. Customers should not use these addresses
for mission critical applications and must never hard code them in DNS
names. [emphasis mine]
Consequently you'll need to monitor this forum/post to take notice of respective changes as early as possible (if this constraint is acceptable for your use case in the first place of course).
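If the IP approach is acceptable, the membership test itself is simple. The sketch below assumes the ranges come from the machine-readable `ip-ranges.json` file that AWS publishes (rather than the forum post), with entries tagged `"service": "CLOUDFRONT"`. The sample addresses are documentation ranges, not real CloudFront ranges, and the function names are illustrative.

```python
import ipaddress


def cloudfront_networks(ip_ranges):
    """Extract CloudFront networks from a parsed ip-ranges.json document."""
    networks = []
    for entry in ip_ranges.get("prefixes", []):
        if entry.get("service") == "CLOUDFRONT":
            networks.append(ipaddress.ip_network(entry["ip_prefix"]))
    for entry in ip_ranges.get("ipv6_prefixes", []):
        if entry.get("service") == "CLOUDFRONT":
            networks.append(ipaddress.ip_network(entry["ipv6_prefix"]))
    return networks


def is_cloudfront_ip(address, networks):
    """Return True if the address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


# Illustrative data only; load the real file and refresh it regularly
sample_ranges = {
    "prefixes": [
        {"ip_prefix": "203.0.113.0/24", "service": "CLOUDFRONT"},
        {"ip_prefix": "198.51.100.0/24", "service": "EC2"},
    ],
    "ipv6_prefixes": [{"ipv6_prefix": "2001:db8::/32", "service": "CLOUDFRONT"}],
}
networks = cloudfront_networks(sample_ranges)
print(is_cloudfront_ip("203.0.113.7", networks))
```

Because the ranges change without notice, as the disclaimer says, a stale list would misclassify requests, so this works best as a secondary signal.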

CloudFront appears to add a X-Amz-Cf-Id header to every request before forwarding it to the origin. At least, it currently is doing that for me.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#request-custom-headers-behavior

This should probably be a comment on Reza's answer, but I can't do that :).
For completeness, here's the link to the official documentation regarding Forwarding Custom Headers, which currently claims the following.
You can configure CloudFront to include custom headers whenever it forwards a request to your origin. You can specify the names and values of custom headers for each origin, both for custom origins and for Amazon S3 buckets. Custom headers have a variety of uses, such as the following:
You can identify the requests that are forwarded to your custom origin by CloudFront. This is useful if you want to know whether users are bypassing CloudFront or if you're using more than one CDN and you want information about which requests are coming from each CDN. (If you're using an Amazon S3 origin and you enable Amazon S3 server access logging, the logs don't include header information.)

How can I prevent Amazon Cloudfront from hotlinking?

I use Amazon Cloudfront to host all my site's images and videos, to serve them faster to my users which are pretty scattered across the globe. I also apply pretty aggressive forward caching to the elements hosted on Cloudfront, setting Cache-Control to public, max-age=7776000.
I've recently discovered to my annoyance that third party sites are hotlinking to my Cloudfront server to display images on their own pages, without authorization.
I've configured .htaccess to prevent hotlinking on my own server, but haven't found a way of doing this on Cloudfront, which doesn't seem to support the feature natively. And, annoyingly, Amazon's Bucket Policies, which could be used to prevent hotlinking, have effect only on S3, they have no effect on CloudFront distributions [link]. If you want to take advantage of the policies you have to serve your content from S3 directly.
Scouring my server logs for hotlinkers and manually changing the file names isn't really a realistic option, although I've been doing this to end the most blatant offenses.

You can forward the Referer header to your origin
Go to CloudFront settings
Edit Distributions settings for a distribution
Go to the Behaviors tab and edit or create a behavior
Set Forward Headers to Whitelist
Add Referer as a whitelisted header
Save the settings in the bottom right corner
Make sure to handle the Referer header on your origin as well.
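Handling the header on the origin could look like the following sketch. The function name and the `allow_empty` switch are assumptions: many legitimate clients send no Referer at all, so whether to accept an empty one is a policy choice.

```python
from urllib.parse import urlsplit


def referer_allowed(referer, allowed_domain, allow_empty=True):
    """Return True if the Referer belongs to allowed_domain or one of its subdomains."""
    if not referer:
        return allow_empty
    host = (urlsplit(referer).hostname or "").lower()
    allowed_domain = allowed_domain.lower()
    # Match the domain itself or a true subdomain, not merely a shared suffix
    return host == allowed_domain or host.endswith("." + allowed_domain)


print(referer_allowed("https://www.example.com/gallery", "example.com"))
print(referer_allowed("https://hotlinker.test/page", "example.com"))
```

Keep in mind that once Referer is forwarded, it typically becomes part of what CloudFront caches on, so the cache hit ratio may drop.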

We had numerous hotlinking issues. In the end we created css sprites for many of our images. Either adding white space to the bottom/sides or combining images together.
We displayed them correctly on our pages using CSS, but any hotlinks would show the images incorrectly unless they copied the CSS/HTML as well.
We've found that they don't bother (or don't know how).

The official approach is to use signed urls for your media. For each media piece that you want to distribute, you can generate a specially crafted url that works in a given constraint of time and source IPs.
One approach for static pages is to generate temporary URLs for the media included in the page that are valid for twice the page's caching time. Let's say your page's caching time is 1 day. Every 2 days, the links would expire, which forces hotlinkers to update their URLs. It's not foolproof, as they can build tools to fetch the new URLs automatically, but it should deter most people.
If your page is dynamic, you don't need to worry about invalidating your page's cache, so you can simply generate URLs that work only for the requester's IP.
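The time and IP constraints are expressed in a policy document that gets signed. The sketch below builds only that policy and the URL-safe encoding CloudFront expects; the RSA signing step with your CloudFront key pair is deliberately omitted, and the resource URL and IP are made-up examples.

```python
import base64
import json
import time


def build_custom_policy(resource_url, expires_at, source_ip=None):
    """Build a compact custom policy limiting a URL by expiry time and optional IP."""
    condition = {"DateLessThan": {"AWS:EpochTime": int(expires_at)}}
    if source_ip:
        condition["IpAddress"] = {"AWS:SourceIp": source_ip}
    policy = {"Statement": [{"Resource": resource_url, "Condition": condition}]}
    # No whitespace: the exact bytes are what gets signed
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_safe_b64(data):
    """Base64-encode and swap the characters that are not valid in a URL query."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


two_days = 2 * 24 * 60 * 60  # twice a one-day page cache, as in the example above
policy = build_custom_policy(
    "https://cdn.example.com/images/*",  # hypothetical resource
    time.time() + two_days,
    source_ip="203.0.113.7/32",  # hypothetical requester IP
)
print(policy)
print(cloudfront_safe_b64(policy.encode("utf-8")))
```

The encoded policy, a signature over the policy, and the key pair ID would then be appended to the URL as query parameters.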

As of Oct. 2015, you can use AWS WAF to restrict access to Cloudfront files. Here's an article from AWS that announces WAF and explains what you can do with it. Here's an article that helped me set up my first ACL to restrict access based on the referrer.
Basically, I created a new ACL with a default action of DENY. I added a rule that checks the end of the referer header string for my domain name (lowercase). If it passes that rule, it ALLOWS access.
After assigning my ACL to my Cloudfront distribution, I tried to load one of my data files directly in Chrome and I got this error:

As far as I know, there is currently no solution, but I have a few possibly relevant, possibly irrelevant suggestions...
First: Numerous people have asked this on the Cloudfront support forums. See here and here, for example.
Clearly AWS benefits from hotlinking: the more hits, the more they charge us for! I think we (Cloudfront users) need to start some sort of heavily orchestrated campaign to get them to offer referer checking as a feature.
Another temporary solution I've thought of is changing the CNAME I use to send traffic to cloudfront/s3. So let's say you currently send all your images to:
cdn.blahblahblah.com (which redirects to some cloudfront/s3 bucket)
You could change it to cdn2.blahblahblah.com and delete the DNS entry for cdn.blahblahblah.com
As a DNS change, that would knock out all the people currently hotlinking before their traffic got anywhere near your server: the DNS entry would simply fail to look up. You'd have to keep changing the cdn CNAME to make this effective (say once a month?), but it would work.
It's actually a bigger problem than it seems because it means people can scrape entire copies of your website's pages (including the images) much more easily - so it's not just the images you lose and not just that you're paying to serve those images. Search engines sometimes conclude your pages are the copies and the copies are the originals... and bang goes your traffic.
I am thinking of abandoning Cloudfront in favor of a strategically positioned, super-fast dedicated server (serving all content to the entire world from one place) to give me much more control over such things.
Anyway, I hope someone else has a better answer!

This question mentioned image and video files.
Referer checking cannot be used to protect multimedia resources from hotlinking, because some mobile browsers do not send a Referer header when requesting an audio or video file played using HTML5.
I am sure of that about Safari and Chrome on iPhone and Safari on Android.
Too bad! Thank you, Apple and Google.

How about using signed cookies? Create a signed cookie using a custom policy, which supports the various kinds of restrictions you may want to set and also allows wildcards.
