Robots meta and robots.txt

I'm using a conditional statement in my PHP header to exclude some files from being followed by robots.
However, I temporarily have to block some of these pages because my website is underperforming. At this stage I've used robots.txt to exclude them, but they still have the meta robots index, nofollow tag.
Would that contradiction be seen as a bad thing by Google?

If you are blocking the pages in robots.txt, then any crawler that obeys robots.txt will never load the page, and will therefore never see the robots meta tags. The meta tags are effectively ignored.
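For reference, a minimal sketch of the kind of conditional header the question describes (the page list and file names here are assumptions, not the asker's actual code):

<?php
// Hypothetical list of pages that should carry a robots meta tag.
$blocked_pages = array('/underperforming-page.php', '/old-campaign.php');

// Emit the tag only when the current request is for one of those pages.
if (in_array($_SERVER['PHP_SELF'], $blocked_pages)) {
    echo '<meta name="robots" content="noindex, nofollow">' . "\n";
}
?>

But as soon as those same URLs are disallowed in robots.txt, Googlebot stops fetching the HTML, so whatever this snippet prints is never read; the robots.txt rule wins simply because the meta tag is never seen.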

Related

Sharethis different results for domain with and without www

I put ShareThis on my site, and if I go to andrewwelch.info without the www, the share counts are different from when I go to www.andrewwelch.info. How can I make sure this doesn't happen?
ShareThis is rendered inside an IFRAME, and will use the parent frame's URL to determine the page someone is sharing.
You can add span tags with a st_url attribute to specify a canonical URL to use for a given page. An example is:
<span class="st_sharethis" st_url="http://sharethis.com" st_title="Sharing is great!"></span>
See the ShareThis documentation for more details.
As a side note: To improve your search engine rankings you should ensure your site doesn't present two different versions of each page. Search engines may reduce the relevancy of your site in results if this is the case. For example, the content of the following pages (and every other page on your site) are the same:
http://andrewwelch.info/
http://www.andrewwelch.info/
You need to fix this by choosing whether you want the "www" or not, then using one of the following methods:
Use a rel="canonical" link element to tell search engines which version of the page is the one you want indexed.
Respond to requests for the "www" or "non-www" hostname with a 301 redirect to the other.
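For example, the redirect option could be handled in Apache's .htaccess roughly like this (a sketch assuming Apache with mod_rewrite and the non-www hostname as the unwanted one; swap the hostnames to prefer the other):

RewriteEngine On
# Permanently (301) redirect the bare domain to the www hostname.
RewriteCond %{HTTP_HOST} ^andrewwelch\.info$ [NC]
RewriteRule ^(.*)$ http://www.andrewwelch.info/$1 [R=301,L]

The canonical hint, by contrast, is a single link element in the head of every page:

<link rel="canonical" href="http://www.andrewwelch.info/">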

Made changes to robots.txt but search engines still say description not available

Most of the questions I see are trying to hide the site from being indexed by search engines. For myself, I'm attempting the opposite.
For the robots.txt file, I've put the following:
# robots.txt
User-agent: *
Allow: /
# End robots.txt file
To me, this means that the search engines are allowed to search the directory. However, when I test it out, the result still displays the website as "A description for this result is not available because of this site's robots.txt", but when I click on the link, it displays the code above.
I'm guessing it's because it takes a while for Google and Bing to catch up? Or am I doing something wrong?
If it's because they haven't caught up to the changes made yet (these changes were made yesterday afternoon), then does anyone have a rough estimate to when the changes will be reflected?
Yeah, it takes some time until search engines crawl your pages, or rather your robots.txt, again. There can be no serious estimate, as it depends on too many factors. Some search engines offer a feature in their webmaster tools to request a recrawl of specific pages, but there is no guarantee that this happens quickly.
Note that your robots.txt is equivalent to:
# robots.txt
User-agent: *
Disallow:
# End robots.txt file
(Many parsers know/understand Allow, but it is not part of the original robots.txt specification.)
And this robots.txt is equivalent to no robots.txt at all (or an empty robots.txt), because Disallow: (= allowing all URLs) is the default.

robots.txt disallow subdirectory without showing its name to robots

I'm stuck on a problem with robots.txt.
I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:
User-agent: *
Disallow: /forbidden/
However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all: there is nothing linking to it on the site, and I want it to be completely hidden from everybody except those who know it's there in the first place.
Is there a way to accomplish this? My first thought was to place a robots.txt in the subdirectory itself, but this will have no effect. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it in robots.txt or not listing or linking to it at all?
Even if you don’t link to it, crawlers may find the URLs anyhow:
someone else could link to it
some browser toolbars fetch all visited URLs and send them to search engines
your URLs could appear in (public) Referer logs of linked pages
etc.
So you should block them. There are two variants (if you don’t want to use access control):
robots.txt
meta-robots
(both variants only work for polite bots, of course)
You could use robots.txt without using the full folder name:
User-agent: *
Disallow: /fo
This would block all URLs whose path starts with /fo. Of course, you would have to find a string that doesn't match other URLs you still want indexed.
However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling of the page content; using/adding/linking the URL itself is not forbidden.
With the meta-robots, however, you can even forbid indexing the URL. Add this element to the head of the pages you want to block:
<meta name="robots" content="noindex">
For files other than HTML there is the HTTP header X-Robots-Tag.
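On Apache, for instance, that header could be sent for PDFs with something like this (assuming mod_headers is enabled; the file pattern is only an illustration):

# PDFs can't carry a meta tag, so send the directive as an HTTP header instead.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>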
You're better off not listing it in robots.txt at all. That file is purely advisory; well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then no robot will find it anyway, except one that carries out the equivalent of a dictionary attack, and that can be addressed by fail2ban or some similar log trawler. That being the case, including the directory in robots.txt will at best have no additional benefit, and at worst clue an attacker in to the existence of something he might otherwise not have found.
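If you do go the log-trawler route, a rough fail2ban sketch could look like the following (the filter name, log path, and thresholds are assumptions, and the regex would need tuning to your access-log format); it bans clients that rack up repeated 404s, which is the signature of a dictionary-style probe:

# /etc/fail2ban/filter.d/dir-probe.conf (hypothetical filter)
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD) \S+ HTTP/\S+" 404
ignoreregex =

# Addition to /etc/fail2ban/jail.local
[dir-probe]
enabled  = true
port     = http,https
filter   = dir-probe
logpath  = /var/log/apache2/access.log
maxretry = 10
findtime = 60
bantime  = 3600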

Robots.txt in the root directory, will that override the Meta tag or will the Meta tag override the robots.txt file?

I do not want any search engines to index my website, so I put robots.txt in the root directory; will that override the meta tag, or will the meta tag override the robots.txt file?
The reason for asking this question is that some pages may have a meta tag telling robots to index, follow. However, I have moved the site to a subdomain which I am still tweaking before it goes live to replace the old site, and I do not want to have to remove all of the meta tags telling robots to index, follow and then put them back when the site is ready. So I'm thinking that robots.txt is the quickest and easiest option, and it does not alter the site other than telling robots not to index or follow, if that's what I put in the text file.
Well, if the robots.txt disallows crawling the directory that contains the document, then presumably they won't ever get to the document, so there's no issue.
If there are "follow" attributes in an HTML link, the robot will queue those URLs for crawling, but then when it actually tries to crawl it will see the block in robots.txt and not crawl.
The short answer is: robots.txt will prevent a well-behaved crawler from following a link, regardless of where it got that link or what attributes were associated with that link when it was found.
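So for the staging subdomain described in the question, a robots.txt in that subdomain's root that blocks everything is enough while you tweak the site, and the existing index, follow meta tags can stay untouched:

User-agent: *
Disallow: /

When the site is ready to replace the old one, remove (or empty) that file and crawling resumes; nothing in the pages themselves has to change.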

Add .html when rewriting URL in htaccess?

I'm in the process of rewriting all the URLs on my site that end with .php and/or are dynamic, so that they're static and more search-engine friendly.
I'm trying to decide if I should rewrite file names as simple strings of words, or if I should add .html to the end of everything. For example, is it better to have a URL like
www.example.com/view-profiles
or
www.example.com/view-profiles.html
???
Does anyone know if the search engines favor doing it one way or another? I've looked all over Stack Overflow (and several other resources) but can't find an answer to this specific question.
Thanks!
SEO-optimized URLs should follow this logic (listed in order of priority):
unique (1 URL == 1 resource)
permanent (they do not change)
manageable (1 logic per site section, no complicated exceptions)
easily scalable logic
short
with a targeted keyword phrase
Based on this, www.example.com/view-profiles would be the better choice.
That said:
Google has something I call "DUST crawling prevention" (see the paper "Do Not Crawl in the DUST" from this Google researcher: http://research.google.com/pubs/author6593.html), so when Google discovers a URL it must decide whether that specific page is worth crawling.
Google gives URLs ending in .html a "bonus" credit of trust, along the lines of "this is an HTML page, I probably want to crawl it".
That said: if your site mostly consists of HTML pages that have actual textual content, this "bonus" is not needed.
I personally only add .html to HTML sitemap pages that consist solely of long lists, and only if I have a few million of them, as I have seen a slightly better crawl rate on those pages. For all other pages I strictly keep to the URL logic mentioned above.
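For the .htaccess side of the question, a minimal mod_rewrite sketch for serving the extensionless URL from the existing PHP script might look like this (assuming Apache with mod_rewrite; the file name is just an example):

RewriteEngine On
# Serve /view-profiles from view-profiles.php without exposing the extension.
RewriteRule ^view-profiles$ view-profiles.php [L]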
br
franz, austria, vienna
P.S.: please see https://webmasters.stackexchange.com/ for SEO questions that are not programming-related.
