Robots.txt in the root directory, will that override the Meta tag or will the Meta tag override the robots.txt file? - meta-tags

I do not want search engines to index my website, so I put a robots.txt file in the root directory. Will that override the Meta tag, or will the Meta tag override the robots.txt file?
The reason for asking is that some pages may have a Meta tag telling robots to "index, follow". I have moved the site to a sub-domain, where I am still tweaking it before it goes live to replace the old site. I do not want to remove all of the "index, follow" Meta tags now and then put them back when the site is ready. So I think robots.txt is the quickest and easiest option: it does not alter the site, and it simply tells robots not to index or follow, if that is what I put in the text file.

Well, if the robots.txt disallows crawling the directory that contains the document, then presumably they won't ever get to the document, so there's no issue.
If there are "follow" attributes in an HTML link, the robot will queue those URLs for crawling, but then when it actually tries to crawl it will see the block in robots.txt and not crawl.
The short answer is: robots.txt will prevent a well-behaved crawler from following a link, regardless of where it got that link or what attributes were associated with that link when it was found.
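To illustrate the queue-then-check behaviour described above, here is a minimal sketch using Python's standard urllib.robotparser. The staging.example.com host, the ExampleBot user agent and the crawl() helper are assumptions made up for the example, and real crawlers differ in detail.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the staging sub-domain: block everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def crawl(discovered_urls, robots_txt, user_agent="ExampleBot"):
    """Return (fetched, skipped) for a crawler that honours robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    queue = deque(discovered_urls)  # links get queued wherever they were found
    fetched, skipped = [], []
    while queue:
        url = queue.popleft()
        # The robots.txt check happens at fetch time, so any "index, follow"
        # hint on the page itself is never read for a blocked URL.
        if rules.can_fetch(user_agent, url):
            fetched.append(url)  # a real crawler would download the page here
        else:
            skipped.append(url)
    return fetched, skipped

fetched, skipped = crawl(
    ["http://staging.example.com/", "http://staging.example.com/about.html"],
    ROBOTS_TXT,
)
print("fetched:", fetched)
print("skipped:", skipped)
```

With "Disallow: /" in place, both queued URLs end up in the skipped list, whatever Meta tags the pages contain.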

Related

Web URL without any kind of subdirectories

This is my first time seeing something like this.
Does anyone know the name or type of website that never shows any subdirectories in the browser's URL bar, so that the URL always stays as the plain domain name? How is it made, and how can it be avoided? I need to send an API call to one of those subdirectories.
Example:
I have a website; let's call it example.net. It has a UI with a home page, which should look like this in a browser: example.net/home. It also has a /shipment option inside the UI, so the URL should look like this:
example.net/shipment. Inside that there is one more subdirectory, for example /report, and if I select it, the URL should look like this: example.net/shipment/report (something like this).
The subdirectory opens, but the URL shown in the browser stays as just example.net all the time.
For some reason, whichever subdirectory I go to on the website, the browser URL remains example.net without any subdirectory appearing in it.
It is an internal website, so I cannot post examples of it from work here.
Does anyone know what this kind of setup is called?
How can it be avoided, since I need to send an API request to one of the subdirectories?
I am not a developer and I am new to IT, so I am not sure what this is called or how it works.
If you are on example.net/shipment and you want to link to a subdirectory, the link needs to include that subdirectory. You have two possibilities:
Root relative links: <a href=/shipment/report>
Absolute links: <a href=https://example.net/shipment/report>
If your shipment directory has a trailing slash (example.net/shipment/), you have a third possibility. (Note that this only works with a shipment URL that is different from the one you specified in your question.)
Document relative links: <a href=report>
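To see how these three kinds of link resolve, here is a small sketch using Python's standard urllib.parse.urljoin, which follows the same resolution rules a browser applies to an href. The variable names are made up for the illustration.

```python
from urllib.parse import urljoin

page = "https://example.net/shipment"         # current page, no trailing slash
page_slash = "https://example.net/shipment/"  # current page, trailing slash

# Root relative and absolute links resolve the same way from either page.
root_relative = urljoin(page, "/shipment/report")
absolute = urljoin(page, "https://example.net/shipment/report")

# A document relative link depends on the trailing slash of the current page.
doc_relative_no_slash = urljoin(page, "report")
doc_relative_slash = urljoin(page_slash, "report")

print(root_relative)          # https://example.net/shipment/report
print(absolute)               # https://example.net/shipment/report
print(doc_relative_no_slash)  # https://example.net/report
print(doc_relative_slash)     # https://example.net/shipment/report
```

Without the trailing slash, "report" replaces the last path segment ("shipment"), which is why the document relative form only works from example.net/shipment/.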
As far as I know, there is no name for websites that don't have subdirectories. Websites are often set up like this to make the URLs easy to type and remember, which helps with SEO.

Robots meta and robots.txt

I'm using a conditional statement in my PHP header to exclude some files from being followed by robots.
However, I temporarily have to block some of these pages because my website is underperforming. At this stage I've used robots.txt to exclude them, but they still have the robots meta tag set to "index, nofollow".
Would Google treat that contradiction as a bad sign?
If you are blocking the pages in robots.txt, then any crawler that obeys robots.txt will never load the page, and will therefore never see the robots meta tags. The meta tags are effectively ignored.
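A rough sketch of why the meta tag never gets read, using only Python's standard library. The /private/ path, the ExampleBot user agent and the visit() helper are assumptions for the example, not a description of how Google's crawler is built.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> if the page has one."""
    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives = attrs.get("content")

def visit(url, html, robots_txt, user_agent="ExampleBot"):
    """Return the robots meta directives, or None if the page is never loaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return None  # blocked: the HTML is never downloaded or parsed
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

ROBOTS_TXT = "User-agent: *\nDisallow: /private/"
PAGE = '<html><head><meta name="robots" content="index, nofollow"></head></html>'

blocked = visit("http://example.com/private/page.php", PAGE, ROBOTS_TXT)
allowed = visit("http://example.com/public/page.php", PAGE, ROBOTS_TXT)
print(blocked)  # None
print(allowed)  # index, nofollow
```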

Sharethis different results for domain with and without www

I put sharethis on my site, and if I go to the site andrewwelch.info without the www, then the shares are different from if I go to www.andrewwelch.info. How can I make sure that this doesn't happen?
ShareThis is rendered inside an IFRAME, and will use the parent frame's URL to determine the page someone is sharing.
You can add span tags with a st_url attribute to specify a canonical URL to use for a given page. An example is:
<span class="st_sharethis" st_url="http://sharethis.com" st_title="Sharing is great!"></span>
As a side note: to improve your search engine rankings, you should ensure your site doesn't present two different versions of each page. Search engines may reduce the relevancy of your site in results if this is the case. For example, the content of the following pages (and every other page on your site) is the same:
http://andrewwelch.info/
http://www.andrewwelch.info/
You need to fix this by choosing whether you want the "www" or not, then using one of the following methods:
Use a "canonical" link element (<link rel="canonical">) to tell search engines which page is the one you want indexed.
Respond to requests for the "www" or "non-www" hostname with a 301 redirect to the other.
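As an illustration of the redirect option, here is a minimal sketch of the decision a server would make. It assumes the "www" hostname was chosen as canonical; the canonical_redirect() helper is hypothetical, and in practice this is usually done in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: the "www" version is the one to keep.
CANONICAL_HOST = "www.andrewwelch.info"

def canonical_redirect(url, canonical_host=CANONICAL_HOST):
    """Return (301, target_url) if the host is not canonical, else None."""
    parts = urlsplit(url)
    if parts.hostname == canonical_host:
        return None  # already on the canonical hostname, serve the page
    # Keep scheme, path, query and fragment; swap only the host.
    # (Ports are ignored here to keep the sketch short.)
    target = urlunsplit(
        (parts.scheme, canonical_host, parts.path, parts.query, parts.fragment)
    )
    return 301, target

print(canonical_redirect("http://andrewwelch.info/"))
print(canonical_redirect("http://www.andrewwelch.info/"))
```

Once every non-canonical request is answered this way, both ShareThis and search engines only ever see one URL per page.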

robots.txt disallow subdirectory without showing its name to robots

I'm stuck on a problem with robots.txt.
I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:
User-agent: *
Disallow: /forbidden/
However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all - there is nothing linking to it on the page, and I want it to be completely hidden from everybody except those who know it's there in the first place.
Is there a way to accomplish this? My first thought was to place a robots.txt on the subdirectory itself, but this will have no effect. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it on the robots.txt or not listing or linking to it at all?
Even if you don’t link to it, crawlers may find the URLs anyhow:
someone else could link to it
some browser toolbars fetch all visited URLs and send them to search engines
your URLs could appear in (public) Referer logs of linked pages
etc.
So you should block them. There are two variants (if you don’t want to use access control):
robots.txt
meta-robots
(both variants only work for polite bots, of course)
You could use robots.txt without using the full folder name:
User-agent: *
Disallow: /fo
This would block all URLs starting with fo. Of course you would have to find a string that doesn’t match with other URLs you still want to be indexed.
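You can check which paths such a prefix rule catches with Python's standard urllib.robotparser, which applies Disallow rules as simple path prefixes. The paths and the ExampleBot user agent below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /fo"])

# True means the path is blocked for a crawler that obeys robots.txt.
blocked = {
    path: not rules.can_fetch("ExampleBot", "http://example.com" + path)
    for path in ["/forbidden/", "/forbidden/secret.html", "/forum/", "/about/"]
}

for path, is_blocked in blocked.items():
    print(path, "blocked" if is_blocked else "allowed")
```

Note that /forum/ is blocked as well, which is the kind of unwanted match to look out for when choosing the prefix.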
However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling/indexing of the page content, but using/adding/linking the URL is not forbidden.
With the meta-robots, however, you can even forbid indexing the URL. Add this element to the head of the pages you want to block:
<meta name="robots" content="noindex">
For files other than HTML there is the HTTP header X-Robots-Tag.
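As a rough sketch of where that header would be added, the function below decides which extra response headers to send for a given path. The robots_headers() helper and the /forbidden/ prefix are assumptions for the illustration; in practice the header is usually set in the web server configuration.

```python
def robots_headers(path, blocked_prefix="/forbidden/"):
    """Return extra response headers for a requested path (hypothetical helper)."""
    if not path.startswith(blocked_prefix):
        return {}  # nothing to hide outside the blocked directory
    if path.lower().endswith((".html", ".htm")):
        return {}  # HTML pages can carry the meta element instead
    # Non-HTML files (PDFs, images, ...) have no <head>, so use the header.
    return {"X-Robots-Tag": "noindex"}

print(robots_headers("/forbidden/report.pdf"))
print(robots_headers("/forbidden/index.html"))
print(robots_headers("/public/report.pdf"))
```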
You're better off not listing it in robots.txt at all. That file is purely advisory; well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then no robot will find it in any case save one which carries out the equivalent of a dictionary attack, which can be addressed by fail2ban or some similar log trawler; this being the case, including the directory in robots.txt will at best have no additional benefit, and at worst clue in an attacker to the existence of something he might otherwise not have found.

how to make nutch crawler crawl

I have a question about Nutch.
While following the wiki, I was asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
I was also asked to create a url folder and a list of URLs.
Do I need to put all the links in both crawl-urlfilter.txt and the list of URLs?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The url folder gives the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, you will want to make sure they have a positive match with the filter; otherwise it will crawl the entire web. This may mean you have to put the list of sites in the filter.
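To get a feel for which URLs the filter line from the question lets through, you can try the same regular expression in Python. This sketch assumes the leading "+" only marks the line as an accept rule and that the pattern is searched from the start of the URL. Nutch itself uses Java regular expressions, which behave the same way for this particular pattern.

```python
import re

# The pattern from crawl-urlfilter.txt, without the leading "+".
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")

def accepted(url):
    """Return True if the URL has a positive match with the filter pattern."""
    return ACCEPT.search(url) is not None

for url in [
    "http://nutch.apache.org/",
    "http://apache.org/",
    "https://apache.org/",
    "http://www.example.com/",
]:
    print(url, "accepted" if accepted(url) else "rejected")
```

Seed URLs that are rejected here would need their own accept line in the filter before the crawler could fetch them.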
