Ensure that Nutch has crawled all pages of a particular domain

I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?

This is not technically possible, since there is no limit on the number of different pages that can exist under the same domain; this is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of its URLs are crawled/indexed by Nutch. Since the sitemap is the site's own declaration of its URLs, you can use it as a guide for what needs to be crawled.
Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e. it will "schedule" those URLs to be crawled).
As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.
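As a rough way to run that sitemap check, you can dump the crawldb to plain text (for example with bin/nutch readdb crawl/crawldb -dump crawldb-dump; exact paths and options depend on your Nutch version) and compare it against the sitemap. The Python sketch below assumes those placeholder file names and the usual text dump format, in which each record line starts with the URL:

# Compare the URLs listed in a sitemap.xml against a plain-text crawldb dump.
# File names are placeholders; adjust them to your own crawl layout.
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(path):
    # Collect every <loc> entry from the sitemap.
    tree = ET.parse(path)
    return {loc.text.strip() for loc in tree.iter(NS + "loc")}

def crawled_urls(dump_path):
    # In a readdb text dump, each record line starts with the URL.
    urls = set()
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("http"):
                urls.add(line.split()[0])
    return urls

missing = sitemap_urls("sitemap.xml") - crawled_urls("crawldb-dump/part-r-00000")
for url in sorted(missing):
    print("missing from crawldb:", url)

Any URL printed here is in the sitemap but not yet known to the crawldb at all; to verify that a known URL was actually fetched you would additionally check its status field in the dump.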

Related

Effect of robots.txt

I understand that naming a file to disallow in robots.txt will stop well behaved crawlers from scanning that file's content, but does it (also) stop the file being listed as a search result?
No; neither Google nor Bing will stop indexing the file just because it appears in robots.txt:
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/search/docs/advanced/robots/intro
It is important to understand that this does not, by definition, imply that a page that is not crawled will also not be indexed. To see how to prevent a page from being indexed, see this topic:
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
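For reference, "blocking indexing with noindex" means either a robots meta tag in the page's <head> or the equivalent HTTP response header; note that the page must remain crawlable (not disallowed in robots.txt) for crawlers to see the directive:
<meta name="robots" content="noindex">
or, as a response header:
X-Robots-Tag: noindex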

How to discover amount of pages of an external website

I have to make an offer for a new website. It should be based on the number of pages in the existing site. There is no sitemap present.
Question: how can I get the total number of pages of an external website that does not belong to me?
Have you tried to reach a potential sitemap.xml file (http://www.yourwebsite.com/sitemap.xml)?
You can test page discovery with an online sitemap generator: https://www.xml-sitemaps.com/
You can also try a Google search like this: site:www.yourwebsite.com. You'll see all indexed pages.
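If there really is no sitemap, you can also get a rough lower bound by following internal links yourself. The Python sketch below is a naive breadth-first link counter; the start URL is a placeholder, and it ignores robots.txt, duplicate content behind query strings and JavaScript-generated links, so treat the result as an estimate only:

# Naive breadth-first counter of internal pages, starting from the home page.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkParser(HTMLParser):
    # Collects the href value of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def count_pages(start, limit=500):
    host = urlparse(start).netloc
    seen, queue = {start}, deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            # Only follow links that stay on the same host.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return len(seen)

print(count_pages("http://www.yourwebsite.com/"))

Combining this estimate with the site: search count mentioned above gives a more reliable figure for the offer.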

Block Bots from crawling one of my sites on a multistore multidomain prestashop

Hello, I have a multistore, multidomain PrestaShop installation with the main domain example.com, and I want to block all bots from crawling a subdomain site, subdomain.example.com, made for resellers where they can buy at lower prices, because its content duplicates the original site. I am not exactly sure how to do it. Usually, if I want to block bots for a site, I would use
User-agent: *
Disallow: /
But how do I use it without hurting the whole store? And is it possible to block the bots from the .htaccess too?
Regarding your first question:
If you don't want search engines to access the subdomain, using a robots.txt file ON the subdomain itself (subdomain.example.com/robots.txt) is the way to go. Don't put these rules in your regular domain's file (example.com/robots.txt) - see the robots.txt reference guide.
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexing of the subdomain and the main domain.
Regarding your second question:
I've found a SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
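As a hedged sketch of the .htaccess route (it assumes Apache's mod_rewrite is enabled on the subdomain, and the bot names are only examples; crawlers you don't list will still get through):
# .htaccess on subdomain.example.com: answer 403 Forbidden to the listed crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|duckduckbot) [NC]
RewriteRule .* - [F,L]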
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar URLs are actually one and the same. Sometimes you have products or content that is accessible under multiple URLs, or even on multiple websites. Using a canonical URL (an HTML link tag with attribute rel=canonical) these can exist without harming your rankings.
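For example, each reseller page on the subdomain could point at its counterpart on the main store (the URL below is a placeholder):
<link rel="canonical" href="https://example.com/some-product/" />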

How to properly split a site?

Suppose I have a new version of a website:
http://www.mywebsite.com
and I would like to keep the older site in a sub-directory and treat it separately:
http://www.mywebsite.com/old/
My new site has a link to the old one on the main page, but not vice-versa.
1) Should I create 2 sitemaps? One for the new and one for the old?
2) When my site gets crawled, how can I limit the path of the crawler? In other words, since the new site has a link to the old one, the crawler will reach the old site. If I do the following in my robots.txt:
User-agent: *
Disallow: /old/
I'm worried that it won't crawl the old site (using the 2nd sitemap) since it's blocked. Is that correct?
1) You could include all URLs in one file, or you could create separate files. One could understand a sitemap as "per (web) site", e.g. see http://www.sitemaps.org/:
In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL
Since you now have two sites, you may create two sitemaps. But again, I don't think that it is strictly defined that way.
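If you do create two files, the sitemap protocol also lets you reference both from one sitemap index, roughly like this (file names are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.mywebsite.com/sitemap-new.xml</loc></sitemap>
  <sitemap><loc>http://www.mywebsite.com/sitemap-old.xml</loc></sitemap>
</sitemapindex>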
2) Well, if you block the URLs in robots.txt, these URLs won't be visited by conforming bots. That doesn't mean these URLs will never be indexed by search engines, but the pages (= their content) will not be.

Index the whole intranet with Nutch

I use Nutch and I would like to index an intranet, but how can I make sure everything on the intranet will be indexed?
Thanks.
If you know all the URLs of the intranet, then write a seed list (or an equivalent page that links to all of them) and point the crawler to it.
If you don't, then you can never be sure that you have crawled all the URLs, because you cannot verify it after the crawl.
In the latter case, your best chance is to do the crawl at the maximum depth.
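As a minimal sketch of how to keep such a deep crawl focused on the intranet, Nutch's conf/regex-urlfilter.txt can be restricted to the intranet host (the hostname below is a placeholder; the first matching rule wins):
# Accept only URLs on the intranet host, reject everything else.
+^https?://intranet\.example\.com/
-.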
Regards
