In AWS Cloudfront, how to invalidate the home page only? - amazon-cloudfront

How do I invalidate only the home page (https://mywebsite/) in CloudFront? For a deeper path we would specify something like images/image1.png or /images/*, but if it is the home page, how do we specify the object path?

There is no way to invalidate only the home page when it is requested as http://example.com or https://example.com/ (etc.). (If it is requested as http://example.com/index.htm, it can be invalidated using an invalidation path of /index.htm.)
I have had this confirmed by Amazon via a support case in which I asked this specific question. They confirmed that the only way to invalidate the home page, when it is requested as http://example.com or https://example.com/ (etc.), is to invalidate the whole distribution/site using an invalidation path of /*. (Trying an invalidation path of / has no effect.)
The good news is that an invalidation path of /* counts as only one path, so its cost (if you have gone above the free allowance of 1,000 paths each month) is the same as that of a more specific path such as /index.htm.
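If you use the AWS CLI, such an invalidation can be created with a single command (a sketch; the distribution ID below is a placeholder for your own):
# EDFDVBD6EXAMPLE is a placeholder distribution ID; "/*" invalidates everything
aws cloudfront create-invalidation --distribution-id EDFDVBD6EXAMPLE --paths "/*"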

Related

new to .htaccess, how to redirect specific page to mainpage

I'm new to the .htaccess file.
My site is hosted on 1and1, and by default it shows www.mydomain.com/defaultsite when nothing is uploaded to my account. Now I've uploaded my WordPress site and have managed to make it go to the index, but if someone enters the URL www.domain.com/defaultsite they will still end up in the wrong place.
How can I handle this with the .htaccess file so that any request for defaultsite takes the user to www.mydomain.com?
I'm not a 1and1 user, but this could be a DNS cache issue. First, check your document root for the presence of a directory called defaultsite. If it exists, remove it. If not, you can try redirecting requests for it with mod_rewrite. Insert this rule immediately after RewriteEngine On in your .htaccess file:
RewriteRule ^defaultsite/?$ http://yourdomain.com/ [R=302,L]
If it works for you, you can safely change 302 to 301 to make the redirect permanent and cacheable.
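For context, here is a sketch of how the relevant part of a stock WordPress .htaccess file would look with the rule added (yourdomain.com is a placeholder):
# BEGIN WordPress
<IfModule mod_rewrite.c>
# Redirect the leftover 1and1 placeholder path back to the site root
RewriteEngine On
RewriteRule ^defaultsite/?$ http://yourdomain.com/ [R=302,L]
RewriteBase /
RewriteRule ^index\.php$ - [L]
# Send everything that isn't a real file or directory to WordPress's front controller
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress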
I have also seen comments referring to an index.html file in the document root. If you see one, delete it - it could be that, internally, 1and1 maps defaultsite to index.html.
Also, it will help to clear your browser cache when testing. If you are using Chrome, you can open the Developer Tools (Ctrl+Shift+I), click the settings cog at the top right of the panel that opens, and check 'Disable cache (while DevTools is open)'.
I had a similar issue and was pulling out my hair trying to figure this out. 1&1 is hosting while Namecheap holds my domain. I was able to access my page without /defaultsite on Safari and mobile Chrome. But on desktop Chrome I was being redirected to /defaultsite.
To remedy this I cleared my cache, flushed my DNS cache, and cleared my browsing history. I'm not sure the latter two were necessary, but doing all three resolved the issue.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions say to list these in the robots.txt file and place it in the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which tells everyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want exposed to the public: the PayPal validation page for my software license payments. The page logic will not let a dud request through, but it still wastes bandwidth (for the PayPal connection as well as for validation on my server) and it logs a connection-attempt entry in the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in the .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something a crawler could simply guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
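As a rough illustration of number 1: the page can refuse to do any work unless the request carries a token you issued earlier to an authorized user. This is only a sketch, and the session key name is invented for the example:
<?php
// Sketch only: refuse to run the page logic unless the visitor's session
// carries a token issued earlier in the authorized flow.
session_start();

if (empty($_SESSION['checkout_token'])) {
    http_response_code(403);   // PHP 5.4+; sends a 403 Forbidden status
    exit('Forbidden');
}

// ...the real page logic runs only for authorized visitors...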
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but only put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
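If you go that route, the blacklisting itself can be done from .htaccess. A sketch using Apache 2.2-style access directives, where "BadBot" and 203.0.113.42 stand in for whatever your logs reveal:
# Sketch only: refuse requests from a client spotted via the honeypot.
# "BadBot" and 203.0.113.42 are placeholders taken from your own logs.
SetEnvIfNoCase User-Agent "BadBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 203.0.113.42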
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows it to ignore the previously respected robots.txt file, on top of the other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Can .htaccess be configured to retain the same address on different pages?

I'm configuring desktop and mobile versions of my site and was looking to use JS to test for browser dimensions and then load the relevant version. However, the problem is that if someone shares a link from the mobile version and sends it to a desktop user, the check is circumvented. Is there a way to configure .htaccess (or some other method) to have the address bar show 'mysite.com' even though I would be loading 'mysite.com/mobile.htm'? I know I can always use media queries, but those have the downside of loading unused assets, so this method would be a lot better.
Use a rewrite instead of a redirect. With a redirect, the browser is instructed to go to another address. With a URL rewrite, the server just responds with the contents of a different URL.
For just this one page it will be simple, but it could get more complicated depending on your site.
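A rough .htaccess sketch of that idea, keyed off the User-Agent header (the server can't see browser dimensions, so the pattern and the /mobile.htm path are placeholders to adjust):
# Sketch only: internally serve mobile.htm for the home page when the
# User-Agent looks like a phone; the address bar keeps showing mysite.com.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "iphone|android|mobile" [NC]
RewriteRule ^$ /mobile.htm [L]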
Another way is to include a little JS in every page to make sure you are on the right one for the device and redirect to the other if not. It would help if there was some pattern to easily determine the corresponding page.

Magento .htaccess Removal of index.php from Store Urls

I do not want the URLs on my store to contain index.php. All my product pages can currently be reached both with and without index.php in the path, which is not good, and I want to get rid of index.php from the URLs.
For example, both of these urls display the same content, but I only want users to ever be directed to the second URL:
http://www.pudu.com.au/index.php/outback-mens-denim-cargo-shorts.html
&
http://www.pudu.com.au/outback-mens-denim-cargo-shorts.html
Likewise for content at the top level of the site:
http://www.pudu.com.au/index.php
&
http://www.pudu.com.au/
Magento adds the system endpoint (index.php) to the URL path based on a configuration setting. In the admin, navigate to System > Configuration. In the Web section, open the Search Engines Optimization group and set Use Web Server Rewrites to "Yes". You'll need to flush the configuration cache and reindex URL rewrites afterwards.
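Enabling rewrites stops Magento from generating index.php URLs, but old /index.php/... links will still resolve. If you also want those requests 301-redirected to the clean versions, a rule along these lines is one option (a sketch, placed near the top of Magento's .htaccess after RewriteEngine On):
# Sketch only: permanently redirect requests that arrived as /index.php/<path>
# to the same <path> without the index.php prefix.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php/ [NC]
RewriteRule ^index\.php/(.*)$ /$1 [R=301,L]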

How does Concrete5 arrange its absolute paths?

I've been asked to figure out how the Concrete5 system works for an employer, and I can't figure something out.
I have Concrete5 installed to a directory on the server called /realprofessionals. When the Concrete5 system makes new pages, it gives them their own absolute paths, for instance:
http://www.wmcpartners.com/realprofessionals/footer
However, it hasn't actually made a folder in the /realprofessionals directory called footer. So how does that work? How can http://www.wmcpartners.com/realprofessionals/footer be a working link?
Short answer: All page requests are actually going through the one and only index.php file. Page content is stored in the database, not in files on the server.
Long answer:
Concrete5 (and most PHP-based CMSs, for that matter) works like this: all requests are routed through the index.php file. This routing is enforced with some mod_rewrite rules in the .htaccess file. The rules say "for any request, don't actually go to that page, but instead go to index.php and pass the rest of the requested path as $_GET parameters". Then in the index.php code (or some other code that is included by the index.php file), the requested page is determined based on the path that was put into the $_GET parameters by Apache (as per the mod_rewrite rule in .htaccess), and the appropriate content is retrieved from the database.
Storing content in the database as opposed to files on the server has several advantages. For example, you can re-use the same html template -- header, footer, sidebar -- on every page, and if you change the template it will automatically be reflected on all pages it's used on. Also, it makes it easier to shuffle pages around and to give them whatever URL you want (e.g. no ".php" extension at the end, or /2010/11/date/based/paths/for/blog/posts).
The disadvantage of course is that every request requires many database queries, but for most sites (those without zillions of page views), the trade-off is well worth it (and various types of caching can help reduce the performance hit).
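The mod_rewrite rules in question typically look something like this (a simplified sketch of the general pattern, not Concrete5's exact shipped .htaccess; "path" is just a placeholder parameter name):
# Simplified front-controller sketch: anything that isn't a real file or
# directory is handed to index.php with the original path in the query string.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?path=$1 [QSA,L]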
Jordan's answer is excellent. I would add that you probably don't see index.php in the URL because you've enabled pretty URLs (type 'pretty' in concrete5's search box to check).
Anyhow, the best way to programmatically add links to internal pages is:
<a href="<?=$this->url('page-name');?>">
page name
</a>
It works both on localhost and online, with or without pretty URLs.
(For the page-name go to dashboard/full sitemap/page-name/properties/page paths and location.)
