Webmaster Tools Crawler 403 errors

Google Webmaster Tools is reporting 403 errors for some folders on the website's server, for example:
http://www.philaletheians.co.uk/Study%20notes/
The folder isn't forbidden, so I don't understand why Google's crawler would be getting 403 errors for it.
Why is the Google crawler trying to browse the actual folders rather than going straight to the files inside them? Is this something to do with robots.txt?

Make sure there is actually a page or document to serve when someone requests that URL. I've browsed through your site and could not find a link that points to http://www.philaletheians.co.uk/Study%20notes/
It also seems that all the study notes live inside this "Study%20notes" directory, so the directory URL itself will not work anyway. Check the Google Webmaster Tools report to find where this broken link originates and fix it.

Have you set the default document correctly on your web server? In Apache this is the DirectoryIndex setting (which defaults to index.html). In general it is also better to strip spaces and the like from your traversable directory names (the %20 you are seeing between Study and notes is a URL-encoded space character), so as to keep your URLs clean for your visitors and for search engine bots.
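If the 403s only appear for bare directory URLs, one common cause (assuming an Apache server, as the DirectoryIndex advice above presumes) is that the folder has no index document and automatic directory listings are disabled, so Apache refuses the request. A minimal .htaccess sketch; the file names are examples:
# Serve one of these files when a bare directory URL like /Study%20notes/ is requested
DirectoryIndex index.html index.htm default.htm
# Keep automatic directory listings off; with no index file present,
# a request for the folder itself then returns 403 Forbidden
Options -Indexes
Note that Options in .htaccess only takes effect if the server's AllowOverride setting permits it.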

Related

Can search engines read robots.txt if its read access is restricted?

I have added a robots.txt file with some lines that restrict some folders. I also added a rule in my .htaccess file denying everyone access to that robots.txt file. Can search engines still read the contents of that file?
This file should be freely readable. Search engines are like visitors on your website: if a visitor can't see this file, then a search engine will not be able to see it either.
There's absolutely no reason to try to hide this file.
Web crawlers need to be able to HTTP GET your robots.txt, or they will be unable to parse the file and respect your configuration.
The answer is no! The simplest and safest thing, though, is still to test it:
https://support.google.com/webmasters/answer/6062598?hl=en
The robots.txt Tester tool shows you whether your robots.txt file blocks Google web crawlers from specific URLs on your site. For example, you can use this tool to test whether the Googlebot-Image crawler can crawl the URL of an image you wish to block from Google Image Search.
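To make the .htaccess side of this concrete, here is a hedged sketch using the same Apache 2.2-style directives that appear elsewhere on this page. The commented-out block is the kind of restriction described in the question and would make robots.txt return 403 to crawlers; the active block keeps the file readable:
# Do NOT do this - it hides robots.txt from crawlers as well as visitors:
# <Files "robots.txt">
#     Order allow,deny
#     Deny from all
# </Files>
# robots.txt must stay world-readable for search engines to honor your rules:
<Files "robots.txt">
Order allow,deny
Allow from all
</Files>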

Block or redirect website page URLs using .htaccess

I am having some issues with spam links pointing at my site that now return a 404 error.
My site was hacked with a secret spam links folder on public_html that redirected users to pornographic sites, those links were plastered across the internet. I have since remedied the malware issue, but have several hundred visitors hitting a 404 page because these links no longer exist, messing up all my analytics accounts, using bandwidth, etc.
I have searched for a way to block anyone who tries to access these URL paths (so that they never hit my website), but I cannot possibly redirect every single link individually (there were over 2,000); I need a wildcard or something similar. My search led me here: Block Spam Referrer Traffic, and it is not quite the solution I need.
The searches go to pages like this: www.mywebsite.com/spampage/morespam/ (which have been deleted and are now 404 errors)
There are several iterations of the /spampage/ and /morespam/ URLs.
The referrer is generally a Google search, so I can't block the referrer using .htaccess. I'd like to somehow block www.mywebsite.com/spampage/*/ and all iterations.
Apologies, I am by no means a programmer. I do appreciate any help that can be offered.
Update #1:
It seems the best way may be to block these links/directories using the robots.txt file. I have done so and will report back if I have success!
Update #2:
Reporting back. I am new to all this, so I was going about the solution wrong in my original question. Essentially, I found that I needed all of the links de-indexed, as they were generating all the traffic by being indexed by Google. I was able to request de-indexing of the directories in question manually through the Google Webmaster Tools account. One requirement for de-indexing was to have the robots.txt on the site block the directories in question from being crawled. Once I did that, I submitted the request to remove the directory from the Google index. Those pages were taken off in about 3 hours by Google (thanks, Google!), so it was pretty quick once I found out the proper way to go about it. No .htaccess editing needed. Once the pages were no longer indexed, traffic went back down to normal levels and my keywords, etc., will be back to normal.
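For reference, the pattern-based blocking the question originally asked about can be done with a line or two of .htaccess. This is a hedged sketch only: /spampage/ is the placeholder directory name from the question, and as Update #2 explains, the robots.txt plus URL-removal route made it unnecessary in the end.
# Answer everything under the old spam directory with 410 Gone,
# without listing the 2000+ individual URLs; repeat per spam directory
RedirectMatch gone ^/spampage(/.*)?$
A 410 tells crawlers the pages are intentionally gone, so they tend to drop them from the index a little faster than plain 404s.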

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try and prevent unauthorized access, by completely removing access for all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from a business perspective, and it just adds one more way for an attacker to find the website in the first place and try to hack it.
I know in the robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt or is it better done by .htaccess or other?
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.
If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
Header set X-Robots-Tag noindex,nofollow
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
<meta name="robots" content="noindex,nofollow" />
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.
This is best handled with a robots.txt file, at least for the bots that respect it.
To block the whole site add this to robots.txt in the root directory of your site:
User-agent: *
Disallow: /
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.
Below are .htaccess rules that restrict everyone except people coming from your company IP:
Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.
That would mean that anyone (Google, a bot, or a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
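If Basic Authentication is enough, it can be configured entirely in .htaccess. A minimal sketch; the realm name and the path to the password file are placeholders you would need to adapt:
# Ask for a username/password before serving anything under this directory
AuthType Basic
AuthName "Company members only"
# Password file created with the htpasswd utility (placeholder path)
AuthUserFile /path/to/.htpasswd
Require valid-user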
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website via robots.txt. Below is an example:
User-agent: *
Disallow: /example-page/
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the HTML meta tag below to the specific pages you want disallowed from search engines:
<meta name="robots" content="noindex, nofollow" />

404 handler and dynamic pages that really don't exist... bad for SEO?

We have an IIS 404 ASP.NET handler that renders pages when an HTML page is not found. It uses the page's URL to query our databases and builds rich, relevant content on the fly. From what I can tell from the IIS logs and from analyzing the pages with browser tools, there is NO indication that the page does not actually exist and was dynamically generated.
In these cases, is IIS actually sending a 404 to the client? Is there a redirect of any kind actually happening? Will search engines punish me for this?
It's been 2 months and Google has indexed everything, but Bing and Yahoo have not indexed anything dynamic despite my submitting various directory pages, sitemaps, and feeds with all my links. My home page is indexed on all search engines and has all my links. When I search very unique keywords in those links, I can see that Bing and Yahoo do see them in my home page links, but only there.
Is there anything I can run or check to make sure my dynamic pages are not somehow viewed as bad by search engines? Any way to check whether a 404 (whatever a 404 actually is to a client besides just another page) is returned to crawlers?
Many thanks.
Is there anything I can run or check to make sure my dynamic pages are not somehow viewed as bad by search engines?
Dynamic pages are just fine. Most of the content on the Internet is dynamically produced. The search engines don't care whether content is dynamic and, in fact, they usually do not know it is dynamic, since all they see is the URL and the HTML that the URL produces.
Any way to check if a 404 (whatever a 404 actually is to a client besides just another page) is returned to crawlers?
Use a tool like Firebug or the built-in developer tools in Chrome to view your HTTP headers. Crawlers see the same headers a browser would see, so that is an easy way to tell what headers your pages are sending out.

How to move pages around and rename them while not breaking incoming links from external sites that still use the poorly formed URLs

Update:
Here is the situation:
I'm working on a website that has no physical folder structure. Nothing had been planned or controlled and there were about 4 consecutive webmasters.
Here is an example of an especially ugly directory:
\new\new\pasite-new.asp
Most pages are stored in a folder with the same name as the file, for maximum redundancy.
\New\10cap\pasite-10cap.asp
\QL\Address\PAsite-Address.asp
Each of these "page directories" (I don't know what else to call them) has an include folder; the include folder contains the same *.inc files in every case, just copied about 162 times, once per page directory. The include folder was duplicated so that <!--#include file="urlstring"--> would work, due to a lack of understanding of relative paths, the #include virtual directive, and Server.Execute().
Here are some of my limitations:
The site is written in ASP classic
The server is Windows Server 2003 R2 SP2, IIS 6 (according to my resource)
I have no access to the IIS server
I would have to go through a process to add any modules or features to IIS
What changes can I make that would allow me to move pages around and rename them while not breaking incoming links from external sites that still use the poorly formed URLs?
To make my question more specific:
How can I move the file 10cap.asp from \new\10cap\ to a better location like \, rename it to something like saveourhomescap.asp, not break any incoming links, and finally not have to leave a dummy 10cap.asp page in the original location that redirects to the new page?
Wow, that's a lot of limitations to deal with.
Can you set up a custom error page? If so, you can add some code to it that redirects users to the new page. For example, you could create a custom 404 page, grab the query string variable in it, and based on that send the user to the correct "new" page. That would allow you to delete all of the old pages.
Here is a pretty good article on this method: URL Rewriting for Classic ASP
Well, you have a lot of limitations, and having no access to the IIS server especially hurts. An ISAPI module for URL rewriting is not an option here (it requires access to IIS), and equally a custom 404 page where you could read the originally requested URL and forward with an HTTP 301 won't work (that also requires IIS configuration).
I would actually recommend going through the process and having them install:
An ISAPI URL rewriting module
or if that doesn't work (for any reason):
Have them point your site's HTTP 404 handling to a custom 404.asp, read the originally requested URL from the query string, and redirect with an HTTP 301 (Moved Permanently) to the new location.
If none of this is an option for you, I can think about another possibility. I haven't actually tried that so I'm not 100% sure if it will work, but in theory it sounds good ;)
In your global.asa you could do a Response.Redirect in the Session_OnStart event, or change the response status to an HTTP 301. This will only work for new sessions and won't fix real 404 errors. Sorry for the pseudo-code, but it's been a while since I worked with classic ASP; I think you'll get what I mean ;)
Sub Session_OnStart
    ' Select Case switch mapping old URLs to their new locations
    Select Case LCase(Request.ServerVariables("SCRIPT_NAME"))
        Case "/new/10cap/pasite-10cap.asp"
            ' Response.Redirect "newlocation.asp" would also work, but sends a 302;
            ' a 301 Moved Permanently is better, so search engines update their index
            Response.Status = "301 Moved Permanently"
            Response.AddHeader "Location", "http://company.com/newlocation.asp"
            Response.End
    End Select
End Sub
Hope that helps.
I recommend using URL Rewrite for that; see the following blog post about it, in particular the "Site Reorganization" section:
http://blogs.msdn.com/b/carlosag/archive/2008/09/02/iis7urlrewriteseo.aspx
For more info about URL Rewrite see: http://www.iis.net/download/URLRewrite
You can try ISAPI_Rewrite, since it's classic ASP + IIS 6:
http://www.isapirewrite.com/
They have a lite version which is free, probably good enough for your use.
URL rewriting will only work if you can install a DLL on the server.
One of these articles should help:
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=qRR&q=url+rewrite+classic+asp&btnG=Search&aq=f&oq=&aqi=g-m1
Basically, you have to point 404 errors to an error page which parses the incoming query string / POST info and redirects the user to the correct location with the incoming parameters added.
Variations on that theme can be found in the examples from Google.

Resources