What is called first - robots.txt or mod_rewrite in htaccess

I need some help. I'm not sure in which order mod_rewrite and robots.txt are applied to a request.
Some URLs are covered by a rewrite rule:
/index.php?id=123 to /home
Other URLs don't have a rewrite:
/index.php?id=444
I added this entry to my robots.txt:
User-agent: *
Disallow: /index.php?id
Will the page at /home be indexed by search engines?

The robots.txt file is interpreted by the client (the spider), which knows nothing about the rewrites on your server. Spiders will therefore not fetch URLs that match a pattern in robots.txt (such as /index.php?id=444), but they will fetch the same content if they find it through /home, so /home can be indexed.

Related

Can I keep robots.txt in a contextpath and give a 301 redirect?

My website uses a context path (e.g. www.example.com/abc). The robots.txt is available at www.example.com/abc/robots.txt, and I have set up a 301 redirect in the web server from www.example.com/robots.txt to www.example.com/abc/robots.txt.
My question is whether search engines will be able to read the robots.txt file, given that it sits behind a 301 redirect.

I found that the search engines honor the 301 redirect and read the file from the subfolder.

Robots.txt should be on root level
https://example.com/robots.txt - Correct
https://blog.example.com/robots.txt - Correct
https://example.com/abc/robots.txt - Not Correct
https://blog.example.com/abc/robots.txt - Not Correct
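Crawlers only request robots.txt from the root of each host, so a file that exists only in a subdirectory is never found: the root URL returns 404, and a 404 is treated as if there were no robots.txt at all (nothing is disallowed). A 301 from the root URL is a different case: Google's documentation says it follows a limited number of redirect hops for robots.txt, which matches the behavior reported above, though other crawlers may not do the same.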

Access file robots.txt from subdomains on main domain?

I use CodeIgniter, and my robots.txt file is located in the root, but it can only be accessed from the main domain. Search robots also try to access it from the subdomains (I use them for locales):
Example:
my.com/robots.txt - OK
en.my.com/robots.txt - FAIL
How can I redirect from xx.my.com/robots.txt to my.com/robots.txt?
Thanks.

Try using RedirectMatch, with the pattern anchored and the dot escaped so that only /robots.txt itself matches:
RedirectMatch 301 ^/robots\.txt$ http://my.com/robots.txt
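If the subdomains and the main domain are served from the same document root and share one .htaccess (an assumption; the question does not say), an unconditional redirect would also match my.com/robots.txt and redirect it to itself. A minimal sketch that only redirects requests arriving on other hosts, using mod_rewrite with the host name my.com taken from the question:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Skip the main domain so my.com/robots.txt is served normally (no redirect loop).
RewriteCond %{HTTP_HOST} !^my\.com$ [NC]
# In .htaccess context the pattern is matched without the leading slash.
RewriteRule ^robots\.txt$ http://my.com/robots.txt [R=301,L]
</IfModule>
```

These lines would need to come before any front-controller rewrite rules in the same file.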

Htaccess to use the hosting for live testing

I would like to use my hosting for live testing, but I want to protect access and prevent search engine indexing.
For example, this is the server directory structure within public_html:
_private
_bin
_cnf
_log
_ ... (more default directories hosting)
testpublic
css
images
index.html
I want index.html to be visible to everyone, and all other directories (except "testpublic") to be hidden, access-protected, and not indexed by search engines.
I would like the "testpublic" directory to be public but not indexed by search engines; I'm not sure if this is possible.
As I understand it, I need two .htaccess files:
a general one in "public_html" and a specific one for "testpublic".
I think the general .htaccess (public_html) should be something like:
AuthUserFile /home/folder../.htpasswd
AuthName "test!"
AuthType Basic
require user admin123
<FilesMatch "index.html">
Satisfy Any
</FilesMatch>
Can anyone help me create the files with the appropriate properties? Thank you!

You can use a robots.txt file in your root folder. All standards-abiding robots will obey this file and not index your files and folders.
Example Robots.txt that tells all (*) crawlers to move on and index nothing.
User-agent: *
Disallow: /
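You could also use .htaccess files to fine-tune what your server (assuming Apache) serves and which files show up in auto-generated directory listings. In that case you would add
IndexIgnore *
to your .htaccess file to hide every file from those listings. Note that this only affects directory listings, not search engine indexing.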
Updated (Credit to https://stackoverflow.com/users/1714715/samuel-cook):
If you want to block a specific bot/crawler and know its User-Agent string, you can do so in your .htaccess:
<IfModule mod_rewrite.c>
RewriteEngine on
# Answer 403 Forbidden to any request whose User-Agent contains "Googlebot"
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule ^.* - [F,L]
</IfModule>
Hope this helps.
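For the "public but not indexed" part of the question, one option not covered above is an X-Robots-Tag response header. A minimal sketch, assuming mod_headers is enabled and that this goes in a separate .htaccess inside testpublic (both are assumptions, not something the question confirms):

```apache
# testpublic/.htaccess (hypothetical placement)
<IfModule mod_headers.c>
# Ask compliant crawlers not to index or follow anything served from this directory.
Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
```

Crawlers only see this header on pages they are allowed to fetch, so it would not be combined with a robots.txt Disallow for the same paths.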

Which .htaccess file should I be using for 301 redirects?

This is one of those super-simple questions that I can't seem to google an answer for, so apologies in advance.
When I ftp into my (shared) server, I have a file structure like this:
Root (/)
/public_html
/newdomain.com
I had an old website that lived in /public_html, it had heaps of content and excellent SEO. We changed our name and our domain (which lives in /newdomain.com, a folder inside /public_html), and set 301 redirects from all the old content to the new website.
I tried doing this myself, but it didn't work at all, so I got my host's techsupport to do it for me. There are several .htaccess files on my server though, and I don't know which ones are actually effective and which aren't.
Root has its own .htaccess file
public_html has its own .htaccess file
/newdomain.com DOESN'T have its own .htaccess file
Redirection 1 (currently is in both root and public_html's .htaccess files, and works)
I want to redirect http://olddomain.com/whatever -> http://newdomain.com/whatever (I've currently got each individual page doing its own separate 301 versus a single rule doing this). Achieved with Redirect 301 /article-name-here/ http://www.newsite.com/article-name-here/
Redirection 2 (currently is in both root and public_html's .htaccess files, and doesn't work).
I also want to do some internal redirections of http://newdomain.com/oldpage.html -> http://newdomain.com/newpage.html. I've tried adding the redirect to public_html's .htaccess file like so:
Redirect 301 http://newsite.com/badpage.html http://newsite.com/goodpage.html
But it's not working. Do I need to set up a new .htaccess in the newsite.com folder on my server? Or am I just completely missing the mark here?

Redirection 1
To redirect everything, just remove the article name:
Redirect 301 / http://www.newsite.com/
Or if you don't want to redirect the root (i.e. requests for /), then:
RedirectMatch 301 ^/(.+)$ http://www.newsite.com/$1
Redirection 2
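Redirect matches on the URL path only, so its first argument must start with / rather than with a scheme and host name such as http://newsite.com. If the /newdomain.com directory is the document root for http://newdomain.com/, then you'll need to create a new .htaccess file there and include: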
Redirect 301 /badpage.html /goodpage.html
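Because the new site's folder sits inside public_html, a blanket Redirect in public_html's .htaccess could also be picked up by requests for the new domain. A minimal sketch of a host-conditioned alternative for Redirection 1, assuming mod_rewrite is available and using the olddomain.com / newdomain.com placeholder names from the question:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Only act on requests that arrived on the old host name.
RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com$ [NC]
# Send every path to the same path on the new domain.
RewriteRule ^(.*)$ http://newdomain.com/$1 [R=301,L]
</IfModule>
```

This would replace the per-page Redirect lines in public_html's .htaccess rather than sit alongside them.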

can i use robots.txt while handling my site with htaccess

I am using .htaccess on my site so that every request is redirected to the index page in my root directory. No other file on my site can be accessed, because my .htaccess restricts it. My question is: when I use a robots.txt file, will search engines be able to reach it on my domain? Or must I modify my .htaccess file to allow search engines to read robots.txt? If so, please help me find the specific .htaccess code for that.

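I suppose you're using some sort of rewriting. You can exclude files from being processed by mod_rewrite with the following rule, placed before your other rewrite rules:
# If the request maps to an existing file, leave it untouched and stop rewriting.
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^ - [L]
With this rule, any file that exists in the web root and is requested directly is served as-is and is not processed any further. To exempt only robots.txt, drop the RewriteCond and use ^robots\.txt$ as the RewriteRule pattern, and you should be fine.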
Best wishes,
Fabian
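A minimal sketch of the robots.txt-only exemption, assuming the catch-all rule lives in the same .htaccess file in the document root (the question does not show that rule, so it appears here only as a placeholder comment):

```apache
RewriteEngine On

# Serve robots.txt untouched: "-" means no substitution, [L] stops rewriting.
RewriteRule ^robots\.txt$ - [L]

# ... the existing catch-all rule that sends everything else to the index page goes here ...
```

The exemption has to appear before the catch-all, since rules are evaluated in order.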

How about checking whether you can access robots.txt yourself in a web browser? If you can, then the search engines can, and vice versa (unless you're using IP- or browser-specific redirects or similar). That is, try http://yoursite.com/robots.txt
