robots.txt disallow subdirectory without showing its name to robots - security

I'm stuck on a problem with robots.txt.
I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:
User-agent: *
Disallow: /forbidden/
However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all. Nothing on the site links to it, and I want it to be completely hidden from everybody except those who already know it's there.
Is there a way to accomplish this? My first thought was to place a robots.txt in the subdirectory itself, but that has no effect: robots.txt is only honored at the site root. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it in robots.txt or not listing or linking to it at all?

Even if you don’t link to it, crawlers may find the URLs anyhow:
someone else could link to it
some browser toolbars fetch all visited URLs and send them to search engines
your URLs could appear in (public) Referer logs of linked pages
etc.
So you should block them. There are two variants (if you don’t want to use access control):
robots.txt
meta-robots
(both variants only work for polite bots, of course)
You could use robots.txt without using the full folder name:
User-agent: *
Disallow: /fo
This would block all URLs whose path starts with /fo. Of course you would have to find a string that doesn't also match other URLs you still want to be indexed.
However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling the page content; using, adding, or linking the URL itself is not forbidden.
With the meta-robots, however, you can even forbid indexing the URL. Add this element to the head of the pages you want to block:
<meta name="robots" content="noindex">
For files other than HTML there is the HTTP header X-Robots-Tag.
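For example, on Apache with mod_headers enabled you could send that header for all PDFs from .htaccess (a minimal sketch; the file pattern is an assumption):
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>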

You're better off not listing it in robots.txt at all. That file is purely advisory; well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then no robot will find it in any case, save one which carries out the equivalent of a dictionary attack; that can be addressed by fail2ban or some similar log trawler. This being the case, including the directory in robots.txt will at best have no additional benefit, and at worst clue in an attacker to the existence of something he might otherwise not have found.
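As a sketch of that log-trawling approach, a fail2ban jail that bans hosts generating bursts of 404s might look like this (assuming Apache access logs in the default location; the jail and filter names are hypothetical):
# /etc/fail2ban/jail.local
[apache-404-probe]
enabled  = true
port     = http,https
filter   = apache-404-probe
logpath  = /var/log/apache2/access.log
maxretry = 10
findtime = 60
bantime  = 3600

# /etc/fail2ban/filter.d/apache-404-probe.conf
[Definition]
# match common/combined log lines whose request drew a 404
failregex = ^<HOST> .* "(GET|POST|HEAD) [^"]*" 404
Tune maxretry and findtime so that ordinary visitors who merely hit a few broken links don't get banned.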

Related

Web URL without any kind of subdirectories

This is the first time I have seen something like this.
Does anyone know the name/kind/type of a website that never shows any subdirectories in the web URL, so the address bar always stays a plain domain name? How is it made, and how can it be worked around, since I need to send an API call to one of those subdirectories?
Example:
I have a website, let's call it example.net. It has a UI with a home page, which should look like this in a browser: example.net/home. It also has a /shipment option inside the UI, so the URL should look like example.net/shipment, and one more subdirectory inside that, for example /report; if I select it, the URL should look like example.net/shipment/report (something like that).
But no matter which subdirectory I open on the website, the URL in the browser remains just example.net the whole time, without any path after the domain.
It is an internal website, so I can not post examples of it from work here.
Does anyone know what this kind of setup is called?
How can it be worked around, since I need to send an API request to one of the subdirectories?
I am not a developer, and I am new to IT, so I am not really sure what this is called or how it works.
If you are on example.net/shipment and you want to link to a subdirectory, the link needs to include that subdirectory. You have two possibilities:
Root relative links: <a href="/shipment/report">
Absolute links: <a href="https://example.net/shipment/report">
If your shipment URL has a trailing slash (example.net/shipment/), you have a third possibility. (Note this only works with a shipment URL that is different from the one you specified in your question.)
Document relative links: <a href="report">
From example.net/shipment/, that relative link resolves to example.net/shipment/report.
There is no name that I know of for websites that don't show subdirectories. Websites are often set up like this to make the URLs easy to type and remember, which helps with SEO.

Robots meta and robots.txt

I'm using a conditional statement in my PHP header to exclude some files from being followed by robots.
However, I temporarily have to block some of these pages because my website is underperforming. At this stage I've used robots.txt to exclude them, but those pages still carry the robots meta tag (noindex/nofollow).
Would that contradiction be seen as bad by Google?
If you are blocking the pages in robots.txt, then any crawler that obeys robots.txt will never load the page, and will therefore never see the robots meta tags. The meta tags are effectively ignored.
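For reference, the kind of conditional the question describes might look like this in a PHP header include (a minimal sketch; the page list and tag contents are assumptions):
<?php
// Hypothetical sketch: emit a robots meta tag only for pages
// that should be kept out of the index.
$blockedPages = array('/under-performing.php', '/old-report.php');
if (in_array($_SERVER['REQUEST_URI'], $blockedPages, true)) {
    echo '<meta name="robots" content="noindex, nofollow">';
}
?>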

Made changes to robots.txt but search engines still say description not available

Most of the questions I see are trying to hide the site from being indexed by search engines. For myself, I'm attempting the opposite.
For the robots.txt file, I've put the following:
# robots.txt
User-agent: *
Allow: /
# End robots.txt file
To me, this means that search engines are allowed to crawl the site. However, when I test it out, the result still displays "A description for this result is not available because of this site's robots.txt", yet when I click the link it shows the code above.
I'm guessing it's because it takes a while for Google and Bing to catch up? Or am I doing something wrong?
If it's because they haven't caught up to the changes yet (these changes were made yesterday afternoon), does anyone have a rough estimate of when the changes will be reflected?
Yes, it takes some time until search engines crawl your pages, or your robots.txt, again. There can be no serious estimate, as it depends on too many factors. Some search engines offer a service in their webmaster tools to recrawl specific pages, but there is no guarantee that this happens quickly.
Note that your robots.txt is equivalent to:
# robots.txt
User-agent: *
Disallow:
# End robots.txt file
(Many parsers know/understand Allow, but it is not part of the original robots.txt specification.)
And this robots.txt is equivalent to no robots.txt at all (or an empty robots.txt), because Disallow: (= allowing all URLs) is the default.

Robots.txt in the root directory, will that override the Meta tag or will the Meta tag override the robots.txt file?

I do not want any search engines to index my website, so I put a robots.txt in the root directory; will that override the meta tag, or will the meta tag override the robots.txt file?
The reason for asking is that some pages have a meta tag telling robots to index, follow. I have moved the site to a sub-domain which I am still tweaking before it goes live to replace the old site, and I do not want to have to remove every index, follow meta tag and then put them all back when the site is ready. So I'm thinking robots.txt is the quickest and easiest option, and it does not alter the site; it just tells robots not to crawl it, if that's what I put in the text file.
Well, if the robots.txt disallows crawling the directory that contains the document, then presumably they won't ever get to the document, so there's no issue.
If there are "follow" attributes in an HTML link, the robot will queue those URLs for crawling, but then when it actually tries to crawl it will see the block in robots.txt and not crawl.
The short answer is: robots.txt will prevent a well-behaved crawler from following a link, regardless of where it got that link or what attributes were associated with that link when it was found.
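For the staging sub-domain scenario in the question, the quickest block is a site-wide disallow in the sub-domain's own robots.txt:
User-agent: *
Disallow: /
Keep in mind that this prevents crawling only; as discussed above, a URL that is already known elsewhere can still appear in the index without a description.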

Add .html when rewriting URL in htaccess?

I'm in the process of rewriting all the URLs on my site that end with .php and/or have dynamic URLs so that they're static and more search engine friendly.
I'm trying to decide if I should rewrite file names as simple strings of words, or if I should add .html to the end of everything. For example, is it better to have a URL like
www.example.com/view-profiles
or
www.example.com/view-profiles.html
?
Does anyone know if the search engines favor doing it one way or another? I've looked all over Stack Overflow (and several other resources) but can't find an answer to this specific question.
Thanks!
SEO-optimized URLs should follow this logic (listed in order of priority):
unique (1 URL == 1 resource)
permanent (they do not change)
manageable (1 logic per site section, no complicated exceptions)
easily scalable logic
short
with a targeted keyword phrase
Based on this,
www.example.com/view-profiles
would be the better choice.
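If you go with the extensionless form, a minimal .htaccess sketch for mapping such URLs back to their underlying .php files could look like this (assuming Apache with mod_rewrite; the URL pattern is an assumption):
RewriteEngine On
# skip real files and directories, then map /view-profiles to view-profiles.php
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([a-z0-9-]+)/?$ $1.php [L]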
That said: Google has something I call "DUST crawling prevention" (see the paper "Do Not Crawl in the DUST" by this Googler: http://research.google.com/pubs/author6593.html), so when Google discovers a URL it must decide whether that specific page is worth crawling.
Google gives URLs ending in .html a "bonus" credit of trust: "this is an HTML page, I probably want to crawl it".
That said: if your site mostly consists of HTML pages that have actual textual content, this "bonus" is not needed.
I personally only add .html to HTML sitemap pages that consist solely of long lists, and only if I have a few million of them, as I have seen a slightly better crawl rate on those pages. For all other pages I strictly keep to the Franz-style URL logic mentioned above.
br
franz, austria, vienna
p.s.: please see https://webmasters.stackexchange.com/ for non-programming-related SEO questions
