Why does Google find a page excluded by robots.txt? - .htaccess

I'm using robots.txt to exclude some pages from spiders:
User-agent: *
Disallow: /track.php
When I search for something related to this page, Google says: "A description for this result is not available because of this site's robots.txt – learn more."
That means the robots.txt is working, but then why is the link to the page still found by the spider? I'd like to have no link to the 'track.php' page at all. How should I set up the robots.txt (or something like .htaccess, and so on)?

Here's what happened:
Googlebot saw, on some other page, a link to track.php. Let's call that page "source.html".
Googlebot tried to visit your track.php file.
Your robots.txt told Googlebot not to read the file.
So Google knows that source.html links to track.php, but it doesn't know what track.php contains. You didn't tell Google not to index track.php; you told Googlebot not to read and index the data inside track.php.
As Google's documentation says:
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
There's not a lot you can do about this. For your own pages, you can use the X-Robots-Tag header or a noindex meta tag as described in that documentation. That will prevent Googlebot from indexing the URL if it finds a link in your pages. But if some page that you don't control links to that track.php file, then Google is quite likely to index it.
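For a PHP endpoint like track.php, the practical option is usually the X-Robots-Tag response header, since the script may not output an HTML head for a meta tag. A minimal .htaccess sketch, assuming Apache with mod_headers enabled:

<Files "track.php">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>

Note that for Googlebot to see this header it must be allowed to fetch the page, so the Disallow: /track.php rule would have to come out of robots.txt; a crawler that is blocked never requests the URL and never sees the directive.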

Related

Effect of robots.txt

I understand that naming a file to disallow in robots.txt will stop well behaved crawlers from scanning that file's content, but does it (also) stop the file being listed as a search result?
No. Neither Google nor Bing will stop indexing the file just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this not by definition implies that a page that is not crawled also will not be indexed. To see how to prevent a page from being indexed see this topic.
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
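In practice, keeping a page out of the results (rather than just uncrawled) means letting the crawler fetch it and serving a robots noindex directive on the page itself. A minimal sketch for an HTML page, placed inside its <head> element:

<meta name="robots" content="noindex">

Both Google and Bing honor this tag, provided robots.txt does not block them from crawling the page and seeing it.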

How to stop indexing of links with included subfolders

The real links on my website should be indexed in Google as (for example):
www.mywebsite.com/title,id,sometext,sometext
Unexpectedly, Google Search is indexing my website with subfolders, which should not occur, for example:
www.mywebsite.com/include/title,id,sometext,sometext
www.mywebsite.com/img/min/title,id,sometext,sometext
and so on
How can I stop these from being indexed? What do I have to change in .htaccess or robots.txt? Help me, thanks.
You need to update your robots.txt to prevent bots from browsing those pages and you should set a noindex on these pages to remove them from rankings. You may also want to explore canonical links if the same page can be served from multiple URLs.
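As a sketch of that advice, with the paths taken from the question (adjust them to the real directory names): blocking the duplicate subfolder paths in robots.txt looks like this:

User-agent: *
Disallow: /include/
Disallow: /img/min/

and the canonical hint, placed in the <head> of each duplicate page, points at the preferred URL:

<link rel="canonical" href="http://www.mywebsite.com/title,id,sometext,sometext">

One caveat: a page blocked by robots.txt can never show its noindex tag or canonical link to a crawler, so it is worth choosing one mechanism per URL rather than stacking all three.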

Interaction between robots.txt and meta robots tags

There are other questions here on SO about what happens if you have both robots.txt and meta robots tags, and I thought I understood what was happening until I came across this page on the Google Webmasters site: https://support.google.com/webmasters/answer/93710
Here's what it says:
Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it.
This is saying that if another site links to my page then my page will be indexed even if I have that page blocked by a robots.txt.
The implication is that the only way to stop my page from being indexed is to allow it in robots.txt and use a meta robots tag to stop it from being indexed. This seems to completely defeat the purpose of robots.txt.
Disallow in robots.txt is for preventing crawling (= a bot visits your page), not for preventing indexing (= the link to your page, possibly with metadata, gets added to a database).
If you block crawling of a page in robots.txt, you convey that bots should not visit the page (e.g., because there’s nothing interesting to see, or because it would waste your resources), and not that the URL to that page should be considered a secret.
The original specification of robots.txt doesn’t define a way to prevent indexing. Google seems to support a Noindex field in robots.txt, but just as an "experimental feature" that’s not documented yet.
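A sketch of the two combinations, using a hypothetical /secret.html page. The combination that does not work is blocking and tagging at the same time:

robots.txt:
User-agent: *
Disallow: /secret.html

/secret.html:
<meta name="robots" content="noindex">

Because the Disallow rule stops compliant crawlers from ever fetching /secret.html, the noindex tag is never seen, and the bare URL can still be indexed from external links. Removing the Disallow rule so the page stays crawlable, and keeping the noindex tag (or an equivalent X-Robots-Tag header), is the combination that actually keeps the page out of the index - which is exactly the conclusion drawn above: the two mechanisms serve different purposes rather than one defeating the other.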

site: Google search operator does not show all results

If I go to this url
http://sppp.rajasthan.gov.in/robots.txt
I get
User-Agent: *
Disallow:
Allow: /
That means crawlers are allowed to fully access the website and index everything, so why does site:sppp.rajasthan.gov.in on Google search show me only a few pages, when the site contains lots of documents, including PDF files?
There could be a lot of reasons for that.
You don't need a robots.txt for blanket allowing crawling. Everything is allowed by default.
http://www.robotstxt.org/robotstxt.html doesn't allow blank Disallow lines:
Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Check Google Webmaster Tools to see if some pages have been disallowed for crawling.
Submit a sitemap to Google.
Use "Fetch as Google" to see if Google can even see the site properly.
Try manually submitting a link through the Fetch as Google interface.
Looking closer at it:
Google doesn't know how to navigate some of the links on the site. Specifically, on http://sppp.rajasthan.gov.in/bidlist.php the bottom navigation uses onclick JavaScript that gets loaded dynamically and doesn't change the URL, so Google couldn't link to page 2 even if it wanted to.
On the bidlist you can click into a bid list detailing the tender. These don't have public URLs. Google has no way of linking into them.
The PDFs I looked at were image scans in Sanskrit put into PDF documents. While Google does OCR PDF documents (http://googlewebmastercentral.blogspot.sg/2011/09/pdfs-in-google-search-results.html), it's possible they can't do it with Sanskrit. You'd be more likely to find them if they contained proper text as opposed to images.
My original points remain though. Google should be able to find http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf which is on the http://sppp.rajasthan.gov.in/actrulesprocedures.php page. If you have a question about why a specific page might be missing, I'll try to answer it.
But basically the website does some bizarre, non-standard things, and this is exactly what you need a sitemap for. Contrary to popular belief, sitemaps are not for SEO; they're for when Google can't locate your pages.
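Since the recommendation is a sitemap, here is a minimal sitemap.xml sketch using two URLs mentioned above (the real file would list every page and document the site wants discovered), uploaded to the site root and submitted through Google Webmaster Tools:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://sppp.rajasthan.gov.in/bidlist.php</loc>
  </url>
  <url>
    <loc>http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf</loc>
  </url>
</urlset>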

404 handler and dynamic pages that really don't exist... bad for SEO?

We have an IIS 404 ASP.NET handler that renders pages when an HTML page is not found. It uses the page's URL to query our databases and builds rich, relevant content on the fly. From what I can tell from the IIS logs and from analyzing the pages with browser tools, there is NO indication that the page does not actually exist and was dynamically generated.
In these cases is IIS actually sending a 404 to the client? Is there a redirect of any kind actually happening? Will Search engines punish me for this?
It's been 2 months and Google has indexed everything, but Bing and Yahoo have not indexed anything dynamic despite my submitting various directory pages, sitemaps and feeds with all my links. My home page is indexed on all search engines and has all my links. When I search for very unique keywords in those links, I can see that Bing and Yahoo do see them in my home page links - but only there.
Is there anything I can run or check to make sure my dynamic pages are not somehow viewed as bad by Search engines? Any way to check if a 404 (whatever a 404 actually is to a client besides just another page) is returned to crawlers?
Many Thanks.
Is there anything I can run or check to make sure my dynamic pages are not somehow viewed as bad by search engines?
Dynamic pages are just fine. Most of the content on the Internet is dynamically produced. The search engines don't care if content is dynamic and, in fact, they usually do not know content is dynamic, as all they see is the URL and the HTML that is produced by that URL.
Any way to check if a 404 (whatever a 404 actually is to a client besides just another page) is returned to crawlers?
Use a tool like Firebug or the built-in developer tools in Chrome to view your HTTP headers. Crawlers see the same headers a browser would see, so that is an easy way to tell what headers your pages are sending out.
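If you prefer the command line, a HEAD request shows the same thing; a quick sketch against a made-up missing URL on your own site:

curl -I http://www.example.com/some-page-that-does-not-exist.html

The first line of the response (for example HTTP/1.1 200 OK versus HTTP/1.1 404 Not Found) is the status code crawlers receive, which tells you whether your handler is serving the generated pages as normal pages or as errors.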

Resources