I have a static site served through CloudFront with an S3 origin and a custom domain via Route 53. All works well, except that Google has indexed the CloudFront distribution URL (d123etc.cloudfront.net) as well as my custom domain, leading to duplicate content issues.
I've tried canonical URLs, but the distribution remains indexed. It has been suggested to serve a different robots.txt depending on which domain is being used, which sounds fine, but there is no .htaccess or web server, so it falls to a Lambda@Edge function to send the different robots.txt.
The problem is that I can't find a way, inside the function, to determine whether a request is coming from my custom domain or from the direct distribution URL. I've tried whitelisting the Origin header, but it is not sent through when using an S3 origin. I've also tried whitelisting the Referer header, but no referrer is sent when the robots.txt file is requested directly.
For the time being, I'm adding a meta noindex client-side with JS on page load (which I realise is too late), and also redirecting client-side to my actual domain in case someone follows the Google-indexed cloudfront.net URL.
Does anyone know how to detect in Lambda@Edge which domain is being used to make the request? Or is there some other way to block Google from indexing the CloudFront URL while leaving it to index the custom domain?
So I think the way to do this would be to set up a redirect where the requests are handled. Check the Host header of the request, and if it contains cloudfront.net, send a 301 response pointing to your custom domain name.
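A minimal sketch of that check as a viewer-request Lambda@Edge function (TypeScript; the custom domain www.example.com is a placeholder, and the handler assumes it is attached to the distribution's viewer-request event, where the Host header still reflects the hostname the client actually used):

import { CloudFrontRequestHandler } from 'aws-lambda';

export const handler: CloudFrontRequestHandler = async (event) => {
  const request = event.Records[0].cf.request;
  const host = request.headers['host'][0].value;

  // Requests made via the raw distribution URL arrive with a *.cloudfront.net Host.
  if (host.endsWith('.cloudfront.net')) {
    return {
      status: '301',
      statusDescription: 'Moved Permanently',
      headers: {
        location: [{ key: 'Location', value: 'https://www.example.com' + request.uri }],
      },
    };
  }

  // Requests via the custom domain pass through to the origin untouched.
  return request;
};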
S3 has a UI way to do this:
https://medium.com/tensult/how-to-do-site-redirection-using-aws-522a4002c645
It seems you'll need a second bucket behind the same CloudFront URL but without the custom domain. Then you can set it to redirect all requests to your custom domain.
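If you'd rather script it than click through the console walkthrough in that article, a hedged AWS CLI equivalent of that "redirect all requests" setting would be something like the following (the bucket name and domain are placeholders):

aws s3api put-bucket-website \
  --bucket my-redirect-bucket \
  --website-configuration '{"RedirectAllRequestsTo": {"HostName": "www.example.com", "Protocol": "https"}}'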
Browsers and bots would then stop trying the cloudfront.net URL because it no longer returns content directly; they would automatically (without the user really noticing) be redirected to your domain, and all the links would point to your own domain.
I am fairly new to using CDNs, but I've found that there are two types of CDN.
1. You redirect your DNS to your CDN and it automatically takes over the traffic as a proxy, doing the caching and content delivery. There's no change in URLs and basically no work; it's even hard to tell whether my content is being delivered through the CDN (you have to check headers or use website tools that look for it). A good example is Cloudflare.
2. You do not redirect your DNS. You give it an origin server, then everything gets copied over to the CDN servers and your content is available on the new CDN URLs.
Now, I have a website with a lot of images, and I want to use Microsoft Azure CDN. I created my profile (Standard Microsoft CDN) and created the CDN endpoint. I tested it and it works fine:
https://xxxx.com/images/example.png
https://xxxx.azureedge.net/images/example.png
All good: my image is there, along with the others.
So what comes next? I have an image (an img src tag), for example, pointing to /images/example.png. It seems like I need to change it to https://xxxx.azureedge.net/images/example.png.
My website has a lot of images, and if I have to go and manually redo all the img src tags, that seems like a lot of work. And what happens if I decide to move to another CDN or stop using one? All this leads me to believe I might be missing a point here and not doing this correctly.
Is that the correct way a CDN like this should work? If so, can I get some help on how to achieve it (redoing all my CSS, JS and image URLs) with the minimum amount of labour? I am using the Joomla CMS.
Documentation out there on how to tackle something as basic as this is unbelievably limited.
Basically, you are right. CDN services will "pull" static content (for example images) from your website and then serve it from multiple locations (servers) to your visitors via the CDN URL you're given. For example:
Your origin URL
mydomain.com/image.jpg
CDN URL
mycdn.cdnservice.com/image.jpg
If the URL were the same as your existing URL, it wouldn't really work as a CDN, now would it. There are often options to use your own subdomain, for example cdn.mydomain.com/image.jpg, but it's still a change of URL. Most CMSs will have options, or at least plugins, to set a CDN URL for static assets, which will dynamically rewrite the paths to point at the CDN. If you have set file paths manually, then those will need to be replaced manually as well with the full CDN path.
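As a rough illustration of what such a plugin or option is doing for you (the CDN hostname, asset folders and function name here are just illustrative), the rewrite amounts to something like this TypeScript sketch:

// Prefix root-relative references to common static asset folders with the CDN base.
const CDN_BASE = 'https://xxxx.azureedge.net';

function rewriteAssetUrls(html: string): string {
  return html.replace(
    /(src|href)="(\/(?:images|css|js)\/[^"]+)"/g,
    (_match, attr, path) => `${attr}="${CDN_BASE}${path}"`,
  );
}

// Example: <img src="/images/example.png"> becomes
// <img src="https://xxxx.azureedge.net/images/example.png">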
There are a few hacks, like server rewrites, which might allow you to keep the same URL, but these are not recommended. Generally speaking, using a CDN requires changing the URLs of your static assets.
Option #2 is to use a reverse-proxy CDN service like Cloudflare. This requires changing your nameservers to route ALL your traffic through Cloudflare, and then Cloudflare will act as a CDN for static assets without you having to change URL paths. However, it must be noted that Cloudflare is much more than just a CDN, and you can't really control how your assets are cached on their CDN/servers.
Netlify noob here.
I'm currently migrating an old Ruby on Rails app to use Netlify for a static site. There are some legacy static pages that we want to keep on our old code base, and these legacy static pages make POST requests to our server.
It seems like redirects for POST requests aren't possible (see the W3 documentation for 301/302 redirects: "If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued."), but I was wondering if this is different when you proxy/rewrite the URL.
Currently, we rewrite a user's request to www.domain.com/legacy_slug via Netlify's splat redirect (similar to what the author of this blog post did). Is it possible for this redirect to work as well when the user sends a POST request to www.domain.com, causing it to go to Netlify? Or would I have to change the client's code to POST to <different_subdomain>.domain.com/legacy_slug and migrate the POST endpoint to the different subdomain?
Proxies (https://www.netlify.com/docs/redirects/#proxying) accept POSTs; redirects (https://www.netlify.com/docs/redirects/#basic-redirects) and rewrites (HTTP 200 rules that transform one path into another, both on Netlify-hosted sites) do not.
Kind of a subtle distinction. So I'd send the POST to some other path (not some other domain, just /place-we-post-to on your Netlify site) and use a proxy redirect to get to your remote service (/place-we-post-to https://legacybackend.com 200 in _redirects).
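For reference, that rule as it would appear in the _redirects file (legacybackend.com stands in for your real backend):

# Proxy rewrite: the 200 status means Netlify proxies the request (POST body and all)
# to the remote service instead of answering with a 301/302.
/place-we-post-to  https://legacybackend.com  200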
I'm trying to optimize an HTML webpage, and one of the suggestions from YSlow is:
Use cookie-free domains: There are 11 components that are not cookie-free
So I followed one of the standard solutions I've seen and created a subdomain static.mysite.com and put the images there.
But I'm still getting the exact same problem: a cookie is still being delivered with each image, along with the same YSlow message.
So how do I get this subdomain to be cookie-free?
If you are using a subdomain for cookie-free delivery, then your main page has to use the www prefix; otherwise cookies set on the bare domain are sent to every subdomain.
I had the same problem. The subdomain simply didn't work, so I used a different domain name and it solved the problem.
When the browser makes a request for a static image and sends cookies together with the request, the server doesn't have any use for those cookies. So they only create network traffic for no good reason. You should make sure static components are requested with cookie-free requests. Create a subdomain and host all your static components there.
If your domain is www.example.org, you can host your static components on static.example.org. However, if you've already set cookies on the top-level domain example.org as opposed to www.example.org, then all the requests to static.example.org will include those cookies. In this case, you can buy a whole new domain, host your static components there, and keep this domain cookie-free. Yahoo! uses yimg.com, YouTube uses ytimg.com, Amazon uses images-amazon.com and so on.
Another benefit of hosting static components on a cookie-free domain is that some proxies might refuse to cache the components that are requested with cookies. On a related note, if you wonder if you should use example.org or www.example.org for your home page, consider the cookie impact. Omitting www leaves you no choice but to write cookies to *.example.org, so for performance reasons it's best to use the www subdomain and write the cookies to that subdomain.
Source - http://developer.yahoo.com/performance/rules.html
EDIT
If you set your cookies on a top-level domain (e.g. yourwebsite.com) all of your sub-domains (e.g. static.yourwebsite.com) will also include the cookies that are set. Therefore, in this case, it is required that you use a separate domain name to deliver your static content if you want to use cookie-free domains. However, if you set your cookies on a www subdomain such as www.yourwebsite.com, you can create another subdomain (e.g. static.yourwebsite.com) to host all of your static files which will no longer result in any cookies being sent.
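As a small illustration of that scoping (hypothetical cookie and domain names): the first Set-Cookie header below is sent by the browser with requests to every subdomain, including static.yourwebsite.com, while the second is not sent to static.yourwebsite.com at all.

Set-Cookie: session=abc123; Domain=yourwebsite.com; Path=/
Set-Cookie: session=abc123; Domain=www.yourwebsite.com; Path=/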
For WordPress you can use this config:
define("WP_CONTENT_URL", "http://static.yourwebsite.com");
define("COOKIE_DOMAIN", "www.yourwebsite.com");
Details - https://www.keycdn.com/support/how-to-use-cookie-free-domains/
EDIT 2
You will need to move your static content over to the wp-content folder of your newly created subdomain!
A mirroring CDN can't have the same hostname as your application server, because you need a way for the CDN to explicitly reference the application.
Why, in general, do sites like Facebook run their CDN on a totally separate host rather than just a subdomain like cdn.facebook.com? Example: http://profile.ak.fbcdn.net/hprofile-ak-snc4/173706_6103645_790537_q.jpg
Is the reason that they can construct resource URLs with many different hostnames, to avoid the 4-connections-per-host limit in some browsers?
If your domain is www.example.org, you can host your static components on static.example.org. However, if you've already set cookies on the top-level domain example.org as opposed to www.example.org, then all the requests to static.example.org will include those cookies.
From: http://developer.yahoo.com/performance/rules.html#cookie_free
Because user-generated content can contain nasties that may be able to access data hosted on the primary domain.
It also stops things like cookies and authentication being sent in requests for CDN content.
Preventing users from inserting scripts, and at the same time allowing user submitted html is extremely difficult to do on the server side - ergo we must have sandboxing.
Borrowed from a fairly old whatwg post
I am storing my sitemaps in my web folder. I want web crawlers (Googlebot, etc.) to be able to access the file, but I don't necessarily want all and sundry to have access to it.
For example, this site (stackoverflow.com) has a sitemap index, as specified by its robots.txt file (https://stackoverflow.com/robots.txt).
However, when you type https://stackoverflow.com/sitemap.xml, you are directed to a 404 page.
How can I implement the same thing on my website?
I am running a LAMP website and am using a sitemap index file (so I have multiple sitemaps for the site). I would like to use the same mechanism to make them unavailable via a browser, as described above.
First, decide which networks you want to be able to get your actual sitemap.
Second, configure your web server to allow requests for your sitemap file from those networks, and to send all other requests to your 404 error page.
For nginx, you're looking to stick something like allow 10.10.10.0/24; into a location block for the sitemap file.
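A minimal nginx sketch of that, assuming the sitemap is served at /sitemap.xml and reusing the example network above:

location = /sitemap.xml {
    allow 10.10.10.0/24;            # the networks you decided on in the first step
    deny all;
    error_page 403 =404 /404.html;  # answer everyone else with your 404 page
}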
For Apache, you're looking to use mod_authz_host's Allow directive inside a <Files> section for the sitemap file.
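A comparable Apache sketch, using the Apache 2.2-era mod_authz_host syntax named above (on Apache 2.4 the equivalent is Require ip); the file name and network are placeholders:

<Files "sitemap.xml">
    Order deny,allow
    Deny from all
    Allow from 10.10.10.0/24
    ErrorDocument 403 /404.html
</Files>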
You can check the user-agent header the client sends, and only pass the sitemap to known search bots. However, this is not really safe since the user-agent header is easily spoofed.
Stack Overflow presumably checks two things when deciding who gets access to the sitemaps:
1. The USER_AGENT string
2. The originating IP address
Both will probably be matched against a database of known legitimate bots.
The USER_AGENT string is pretty easy to check in a server-side language; it is also very easy to fake. More info:
For how to check the USER_AGENT string: Way to tell bots from human visitors?
For instructions on IP checking: Google Webmaster Central: How to verify Googlebot
Related: Allowing Google to bypass CAPTCHA verification - sensible or not?
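Putting the two checks together, a hedged Node.js/TypeScript sketch of the IP-verification step (the function name is made up; the pattern, reverse DNS followed by a confirming forward lookup, is the one the Google article above describes):

import { promises as dns } from 'node:dns';

// Verify a claimed Googlebot by its IP: reverse-resolve the IP, check the
// resulting hostname, then forward-resolve that hostname back to the IP.
async function isGooglebot(ip: string): Promise<boolean> {
  try {
    const [hostname] = await dns.reverse(ip);
    if (!/\.(googlebot|google)\.com$/.test(hostname)) return false;
    const addresses = await dns.resolve(hostname);
    return addresses.includes(ip);
  } catch {
    return false;
  }
}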