robots.txt: allow only certain files and folders and disallow everything else - .htaccess

I want robots.txt to allow only index.php and the images folder and disallow all other folders. Is this possible?
This is my code:
User-agent: *
Allow: /index.php
Allow: /images
Disallow: /
Secondly, is it possible to do the same job with htaccess?

First, be aware that the "Allow" option is actually a non-standard extension and is not supported by all crawlers. See the Wikipedia page (in the "Nonstandard extensions" section) and the robotstxt.org page.
The latter notes: "This is currently a bit awkward, as there is no 'Allow' field. The easy way is to put all files to be disallowed into a separate directory, say 'stuff', and leave the one file in the level above this directory."
Some major crawlers do support it, but frustratingly they handle it in different ways. For example, Google prioritises Allow statements by the length of the matching path, whereas Bing prefers you to simply put the Allow statements first. The example you've given above will work in both cases, though.
Bear in mind that crawlers which do not support it will simply ignore it, and will therefore only see your "Disallow" rule, effectively stopping them from indexing your entire site! You have to decide whether the extra work of moving files around (or writing a long list of Disallow rules for all your subdirectories) is really worth the bonus of getting indexed by the lesser crawlers. Probably not.
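For illustration, that Disallow-only alternative would look something like the sketch below; the directory names are placeholders for whatever folders your site actually contains:
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /admin/
Disallow: /includes/
# index.php and /images/ are simply not listed, so they stay crawlable
Every subdirectory you forget to list stays open to crawling, which is why this approach is the extra work mentioned above.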
Regarding .htaccess, you can't really do anything useful with it here. You'd have to match the user agent against a large list of known bots, and you'd just end up missing some, or worse, blocking real users.

Yes, that code is correct. The robots.txt file is read from top to bottom, so as long as the Disallow is at the bottom you won't run into problems: the crawler matches the first applicable rule, and if the Disallow were at the top it would never reach the Allow statements.
Edit/Sidenote:
This only applies to "good" robots (Googlebot, Bingbot etc.) which follow the standard. Plenty of other robots either misinterpret the robots.txt file or ignore it completely.

Related

Does Options All -Indexes in .htaccess prevent search engines from indexing the files in a folder?

At the risk of getting hit for a duplicate question, I typed my question in the title and read all of the questions/answers returned but I am still a bit confused.
I want to run Options All -Indexes in htaccess but I want to make sure that the search engines can index the images in the image gallery folders.
This question appeared to answer the question...
Keep Options All -Indexes but allow access to a specific folder
But then one of the answers stated that
But if they go to that directory they won't see file listing
Does this mean the files will not be able to be indexed because there is no index file, or does it mean the files can be indexed but the search engine simply will not see an index of them?
Thanks in advance,
Pete
If you block access to the directory listing, a search engine crawler cannot see the image files to index them, unless they are linked to from some accessible page that the crawler can reach.
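To sketch the approach from the linked question ("Keep Options All -Indexes but allow access to a specific folder"): turn listings off site-wide, then either re-enable them for the gallery folder or simply link the images from a normal gallery page. The paths below are hypothetical, and re-enabling Options in a per-directory .htaccess assumes the server's AllowOverride setting permits it:
# /.htaccess - disable directory listings everywhere
Options All -Indexes
# /images/gallery/.htaccess - optionally bring the listing back for this folder only
Options +Indexes
Either way, the images remain indexable as long as some crawlable page links to them, as noted above.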

Setting environment specific htaccess rules

So I usually want to set .htaccess rules slightly differently depending on which server the site is on, e.g. live or development.
The ErrorDocument usually needs to be different, as well as some of the AddType and SetHandler bits.
Is there any way I can make this a bit more dynamic and get it to auto detect based on the URL, then set a variable and have if conditionals further down in the htaccess?
I want to do this entirely from URL detection instead of setting parameters with Apache, please :)
No, there isn't any way to set those things via URL detection. You can't put normal if-conditionals around the things you want (AddType, SetHandler and ErrorDocument).
You could use environment variables and mod_rewrite, but I don't think you'll like the end result. You'd have to do something like this using the env|E=[!]VAR[:VAL] flag syntax:
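A minimal sketch of that idea, assuming the development copy answers on a hostname like dev.example.com (the hostname and variable name are placeholders):
RewriteEngine On
# Flag the request as "development" when the Host header matches the dev domain
RewriteCond %{HTTP_HOST} ^dev\.example\.com$ [NC]
RewriteRule ^ - [E=SITE_ENV:development]
Later RewriteCond lines can then test %{ENV:SITE_ENV}, but, as noted above, ErrorDocument, AddType and SetHandler cannot be made conditional on that variable from within .htaccess.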
If you were in the httpd.conf or vhost file, you might be able to separate your different setups using <Directory> sections, as sketched below. But <Directory> is not allowed in .htaccess.
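Roughly, that server-config approach might look like this; the paths, error pages and handler mappings are purely illustrative:
<Directory "/var/www/dev">
    # development tree: verbose error page, extra PHP extension
    ErrorDocument 404 /dev-404.php
    AddType application/x-httpd-php .php .phtml
</Directory>
<Directory "/var/www/live">
    # production tree: normal error page
    ErrorDocument 404 /404.php
    AddType application/x-httpd-php .php
</Directory>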
Also, I wouldn't do this in a production environment anyway, since something could go wrong, and the detection adds overhead that isn't needed. Perhaps you should look into a build script you run to create/deploy your different setups for development/production depending on hostname and other factors.

Customize default directory display using .htaccess

I want to customize the default directory listing, including the header and footer of the page. I tried searching for this, but most of the information I found was about disabling or enabling access to directories. What I want is to change the font, include some images in the header (i.e. change the default layout of the page, and also add a logout option) while the directories are being displayed.
Any help would be highly appreciated!
I think this post might be helpful:
http://perishablepress.com/better-default-directory-views-with-htaccess/
Although, you mention above that you need to add a logout option as well, which means there will need to be some backend coding. What programming language are you using? It would be much easier to use the backend language (like PHP, .Net, Java, etc.) to grab the contents of the files/directories, so that you have full control. .htaccess files are nice, but what you're trying to do seems to involve more than .htaccess alone can handle.
Either way, the post above should help you "change the font, include some images in the header and change the layout of the page", just not the login/logout part.
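For reference, Apache's built-in directory listings are generated by mod_autoindex, and its directives can be set from .htaccess. A minimal sketch (the header/footer file names are placeholders for files you would create yourself):
Options +Indexes
IndexOptions FancyIndexing HTMLTable SuppressDescription
HeaderName /listing-header.html
ReadmeName /listing-footer.html
IndexIgnore .htaccess *.bak
HeaderName and ReadmeName inject your own HTML above and below the generated file list, which is where the custom fonts and images would go; the logout link would still need backend code as described above.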

How can I find files that aren't needed on my site so I can delete them?

I'm developing a website, and after testing different ways to do things, I know that I have many files on my site that are not being used, including HTML/PHP files, images, stylesheets, and external scripts. Is there some program I can use or something so I can find all of the files that I don't need so I can delete them?
I need to find all files that are safe to delete: files that no longer have anything to do with the site, and whose deletion won't have any effect on how the site works.
I've tried finding orphaned files in Dreamweaver, but it lists a lot of files that I do actually need.
Here's one idea: Crawl the site and create a list of every file you can find, then check anything that's not on that list. Wikipedia has a list of crawlers including some open source ones.
Xenu's Link Sleuth is the easiest way I've found.
http://home.snafu.de/tilman/xenulink.html
After you do the scan you have the option to put in your FTP info. If you do so, it will also generate a list of files that are not accessible (orphans).
How would you define "unnecessary"? That's something you need to be sure of before beginning this. I guess one way to garbage-collect your site is to delete files that are not referenced by any other files.
The crawler idea from @Brendan, to get a list of all files that are actually used, is very nice.
Then you can start deleting files from your website, and after that use a program such as Xenu or LinkTiger (or whichever one you prefer) to find any broken links on your website.
You can connect with an FTP application and delete files manually. This is the safest way, because scripts and programs don't know what is needed and what is not...
This did not exist at the time this question was asked, but there is a Python script called weborphans designed for this purpose.
Here's a blog entry by the author with some more info: Finding orphaned files on websites

.htaccess or other URL Case Sensitive

My server is case-sensitive, and I'd like to make it case-insensitive.
An example of what I mean:
Let's say I upload Fruit.php.
Then going to this file won't work:
www.website.com/fruit.php
but this one will:
www.website.com/Fruit.php
Is there a way so that both Fruit.php and fruit.php will work? Also with directories, i.e.:
/Script/script.php
/script/Script.php
You need to use the mod_speling (sic) Apache module:
http://httpd.apache.org/docs/1.3/mod/mod_speling.html
In .htaccess
<IfModule mod_speling.c>
# CheckSpelling enables the module; CheckCaseOnly restricts its corrections to case differences only
CheckCaseOnly On
CheckSpelling On
</IfModule>
The CheckSpelling directive makes Apache perform a more involved effort to find a match, e.g. correcting common spelling mistakes.
Case sensitivity depends on the file system, not Apache. There is a partial solution, however. mod_rewrite can coerce everything to lowercase (or uppercase) like so:
# Note: RewriteMap can only be declared in the server or virtual-host config, not in .htaccess
RewriteMap tolowercase int:tolower
RewriteRule ^(.*)$ ${tolowercase:$1}
Reference: http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#rewritemap
Unfortunately, this only works if all your files are named in lowercase, whereas you specified mixed-case filenames (Fruit.php). Are you comfortable renaming all the files in your project to lowercase?
UNIX servers are case-sensitive: they distinguish between upper-case and lower-case letters in file and folder names. So if you move your website from a Windows to a UNIX server (when you change web host, for instance), you risk getting a certain number of "Page not found" errors (404 errors), because directories and other websites linking to yours sometimes get the case wrong (typically writing the first letter of folder names in upper-case, etc.). This JavaScript-based custom 404 error page solves the problem by converting URLs into lowercase.
You can get the script from http://www.forbrugerportalen.dk/sider/404casescript.js
Happy coding!
