What pages do I need to stop crawling with robots.txt? - security

I developed a website and put it online, but when I check the log files I see
Message : 'No route matched.' Stack Trace :
Core\Router->dispatch('robots.txt')
After a quick Google search I found that the robots.txt file is important to search engines (Google, Bing, etc.)
and that it helps stop some pages from being crawled. My questions are: what do I need to block with it (which pages), and how do I stop some specific routes?
For example, my administration routes always start with /ad-dash
Example : /ad-dash/administration/index
When I searched, I saw that some tutorials block the about, privacy, and terms pages.
My question is: which pages do I need to stop crawling with robots.txt?
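A minimal robots.txt sketch for this case, assuming the admin area really does live entirely under /ad-dash/ (note that robots.txt only asks well-behaved crawlers to stay away, and it publicly advertises the path, so it is no substitute for authentication on those routes):

    # Served at the site root, e.g. https://example.com/robots.txt
    User-agent: *
    Disallow: /ad-dash/

Serving a real file at /robots.txt also makes the 'No route matched' log entries go away, since crawlers request that URL on almost every site they visit.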

Related

Effect of robots.txt

I understand that naming a file to disallow in robots.txt will stop well-behaved crawlers from scanning that file's content, but does it (also) stop the file from being listed as a search result?
No. Neither Google nor Bing will necessarily keep a URL out of search results just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this does not by definition imply that a page that is not crawled will also not be indexed. To see how to prevent a page from being indexed, see this topic.
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
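In practice, the way to keep a page out of results is a noindex directive that crawlers are allowed to fetch (the page must not be disallowed in robots.txt, or the crawler will never see the directive). A minimal example in the page's head:

    <meta name="robots" content="noindex">

For non-HTML resources the equivalent is an X-Robots-Tag: noindex HTTP response header.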

Google Crawl Error > apple-app-site-association > No App, URL or attempt to link > Why?

I have:
a simple static website;
hosted on a shared server;
with SSL;
which I have recently redesigned.
Google tells me there were two URL crawl errors for my website:
apple-app-site-association;
.well-known/apple-app-site-association
For reference, here is the error report for the first (the second is the same):
Not found
URL:
https://mywebsite.com/apple-app-site-association
Error details
Last crawled: 5/5/16
First detected: 5/5/16
Googlebot couldn't crawl this URL because it points to a non-existent page. Generally, 404s don't harm your site's performance in search, but you can use them to help improve the user experience. Learn more
From looking around here, these appear to be related to associating an Apple app with a related website.
I have never tried to implement any sort of "apple app / site association" - at least not intentionally.
I can't for the life of me figure out where these links are coming from.
I will be removing these URLs but am concerned the error may arise again.
I have looked at several related questions here, but they seem to be from people trying to set up that verification - which I haven't - or from people asking why their server logs show requests to these URLs.
Can anyone shed any light on why this is happening?
The answer is that Googlebot now requests these URLs when it crawls your site, as part of Google's effort to map associations between sites and their related apps. See: https://firebase.google.com/docs/app-indexing/ios/app
It seems that Googlebot hasn't (at this time) been told not to report a crawl error if the URL/folder is not there.
Here is a link to an answer to a very similar (but slightly different) question that gives more detail if you are so inclined: https://stackoverflow.com/a/35812024/4806086
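For context, apple-app-site-association is a small JSON file that sites with an iOS app use to opt in to Apple's universal links, and it is what Googlebot is probing for. A rough sketch of its shape (the appID and paths here are hypothetical):

    {
      "applinks": {
        "apps": [],
        "details": [
          { "appID": "ABCDE12345.com.example.app", "paths": ["/products/*"] }
        ]
      }
    }

If you have no iOS app, letting these URLs return 404 is harmless, as Google's own error message above notes.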

How to discover the number of pages of an external website

I have to make an offer for a new website. It should be based on the number of pages in the existing site. There is no sitemap present.
Question: how can I get the total number of pages of an external website that doesn't belong to me?
Have you tried to reach a potential sitemap.xml file (http://www.yourwebsite.com/sitemap.xml)?
You can test page discovery with an online sitemap generator: https://www.xml-sitemaps.com/
You can also try a Google search like this: site:www.yourwebsite.com. You'll see all the indexed pages.
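If the site does expose a sitemap, counting its entries gives a quick page total. A minimal Node sketch (www.yourwebsite.com is a placeholder, and real sitemaps can be index files pointing at further sitemaps, which this does not follow):

    // Fetch a sitemap and count its <loc> entries (no dependencies).
    const https = require('https');

    https.get('https://www.yourwebsite.com/sitemap.xml', (res) => {
      let xml = '';
      res.on('data', (chunk) => { xml += chunk; });
      res.on('end', () => {
        const locs = xml.match(/<loc>/g) || [];
        console.log('URLs listed in sitemap: ' + locs.length);
      });
    }).on('error', (err) => console.error('No sitemap reachable: ' + err.message));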

Finding all pages on a domain with NodeJS

I'm trying to find all the pages on a domain with Node.
I was searching on Stack Overflow, but all I found is this thread for Ruby: Find all the web pages in a domain and its subdomains - I have the same question, but for Node.
I've also googled the question, but all I find are scrapers that do not find the links to scrape themselves. I also searched for things like "sitemap generator", "webpage robot", "automatic scraper", "getting all pages on domain with Node", but it didn't bring any results.
I have a scraper that needs an array of links it will be processing and for example I have a page www.example.com/products/ where I want to find all existing sub-pages, e.g. www.example.com/products/product1.html, www.example.com/products/product2.html etc.
Could you give me a hint how can I implement it in Node?
Have a look at Crawler (https://www.npmjs.org/package/crawler). You can use it to crawl the website and save the links.
Crawler is a web spider written with Node.js. It gives you the full power of jQuery on the server to parse a big number of pages as they are downloaded, asynchronously. Scraping should be simple and fun!
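A minimal sketch of that approach, assuming the crawler package's classic queue/callback API (the API has changed across major versions, so check the current docs; www.example.com is a placeholder):

    // Spider a site and collect every same-origin page it can reach.
    const Crawler = require('crawler');
    const { URL } = require('url');

    const start = 'http://www.example.com/products/';
    const origin = new URL(start).origin;
    const seen = new Set([start]);

    const c = new Crawler({
      maxConnections: 5,
      callback: (error, res, done) => {
        if (!error && res.$) {
          res.$('a[href]').each((i, el) => {
            let link;
            try {
              // Resolve relative hrefs against the page that was fetched.
              link = new URL(res.$(el).attr('href'), res.request.uri.href).href;
            } catch (e) {
              return; // skip malformed hrefs
            }
            if (link.startsWith(origin) && !seen.has(link)) {
              seen.add(link);
              c.queue(link); // only follow links on the same domain
            }
          });
        }
        done();
      }
    });

    c.on('drain', () => console.log([...seen])); // queue empty: print all pages found
    c.queue(start);

The drain event fires once the queue is empty, at which point seen holds every same-origin page that was reachable by following links from the start URL.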

If a page is not linked to the main website, can search engines find it?

I want to put a secret page in my website (www.mywebsite.com). The page URL is "www.mywebsite.com/mysecretpage".
If there is no clickable link to this secret page in the home page (www.mywebsite.com), can search engines still find it?
If you want to hide it from web crawlers: http://www.robotstxt.org/robotstxt.html
A web crawler collects links and looks them up. So if you're not linking to the page, and no one else is, it won't be found by any search engine.
But you can't be sure that someone looking for your page won't find it. If the data is meant to be secret, you should use a script of some kind to grant access only to those who should have it.
Here is a more useful link : http://www.seomoz.org/blog/12-ways-to-keep-your-content-hidden-from-the-search-engines
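A minimal sketch of that "grant access via a script" idea, assuming a Node/Express server (express, the route, and the user:pass credentials are illustrative assumptions, not something from the question):

    // Protect /mysecretpage with HTTP Basic auth.
    const express = require('express');
    const app = express();

    app.use('/mysecretpage', (req, res, next) => {
      const expected = 'Basic ' + Buffer.from('user:pass').toString('base64');
      if (req.headers.authorization !== expected) {
        res.set('WWW-Authenticate', 'Basic realm="secret"');
        return res.status(401).send('Authentication required');
      }
      next();
    });

    app.get('/mysecretpage', (req, res) => res.send('The secret page'));
    app.listen(3000);

Unlike an unlinked URL, this keeps the page private even if the address leaks.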
No. A web spider crawls by following links from pages it has already seen. If no page links to it, a search engine won't be able to find it.
