Nutch and crawling millions of websites

Can we use Nutch 1.10 to crawl millions of websites over several rounds?
I don't really understand the database created when I launch Nutch 1.10. Is it enough to crawl the important data from the sites?
I have a file with a list of URLs that takes 2 gigabytes.

Yes, you can. This is essentially what Nutch was built for. However, crawling millions of websites takes time and space, and in order to do it you need to set up the environment correctly.
In Nutch 1.x the "crawling database" (CrawlDb), i.e. which URLs have been visited, what the URL frontier is (the next URLs to visit), etc., is persisted to the Hadoop filesystem. This is the place where you first inject your list of URLs.
In addition, to view the indexed data you can use Solr (or Elasticsearch).
I recommend first going through the Nutch 1.x tutorial with a short list of URLs and getting to know how to use Nutch and its plugins.
After that, set up a Hadoop cluster using the tutorials from the Hadoop site, and crawl away!
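For orientation, here is a rough sketch of one crawl round using the lower-level Nutch 1.x commands; the paths, segment name pattern, and -topN value are just examples, and the bundled bin/crawl script wraps the same steps for you:

    # Inject the seed URLs into the crawl database (CrawlDb).
    bin/nutch inject crawl/crawldb urls/

    # One round: generate a fetch list, fetch it, parse, and update the CrawlDb.
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    s=$(ls -d crawl/segments/2* | tail -1)   # the segment just generated
    bin/nutch fetch "$s"
    bin/nutch parse "$s"
    bin/nutch updatedb crawl/crawldb "$s"

    # Repeat the round as many times as needed, then check progress.
    bin/nutch readdb crawl/crawldb -stats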

Related

Seed URL for Apache Nutch Web Crawling

Apache Nutch recommends http://rdf.dmoz.org/rdf/content.rdf.u8.gz as a source of seed URLs for web crawling. However, they have shut down the website. Are there any alternative seed URLs for web crawling?
I would recommend taking a look at http://commoncrawl.org. I think they offer a really comprehensive dataset.

why does nutch always create the linkdb, even though it's not needed for content fetching?

I am reading through the chapter on Nutch in Hadoop: The Definitive Guide. I understand the concept of ranking a page using inverted links. However, I don't see that playing a role when you just want to crawl a few sites. Since creation of the LinkDb is a MapReduce job, it's bound to take up a lot of computing resources. I am just wondering why the LinkDb is always generated, when most Nutch use cases just involve getting web content for designated URLs.
That is because Nutch uses the page rank (which is calculated using link information) to prioritize crawling. For instance, a link with a high page rank will be crawled before one with a low page rank.
Nutch was designed to be used as a large-scale web crawler, so calculating page rank and scoring web pages with it was, and still is, an important component. If you are only crawling a few sites, then you should probably use Scrapy (a Python framework) instead.
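For what it's worth, in the step-by-step Nutch 1.x workflow the LinkDb is produced by a separate invertlinks job, so you can run it only when you actually need the link information, or inspect what it contains; the paths here are just examples:

    # Build the LinkDb by inverting the outlinks found in fetched segments.
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    # Dump the inverted links (source URLs and anchor texts) to plain files.
    bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump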
I hope that answers your question.

Apache Nutch: crawl only new pages for semantics analysis

I plan to configure Nutch 2.2.x so that, after an initial crawl of a list of sites, I can launch the crawler daily and get the HTML or plain text of only the pages that appeared on those sites that day. Number of sites: hundreds.
Please note that I'm not interested in updated pages, only new ones. I also need new pages only starting from a given date; let's suppose it is the date of the initial crawl.
Reading the documentation and searching the web, I have the following questions, which I can't find answered anywhere else:
Which storage backend should I use with Nutch for this task? I need a page's text only once and never return to it. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried HBase, but it seems I have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource utilization?
Can I perform my task without an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch 2.2+ be launched without Solr, and does it need specific options to run that way? The tutorials don't clearly answer this: everybody needs Solr, except me.
If I'd like to add a site to the crawl list, how should I do it? Suppose I'm already crawling a list of sites and want to add a site to monitor from now on. I would need to crawl the new site, skipping page content, just to add it to the WebDB, and then run the daily crawl as usual. In Nutch 1.x it may be possible to perform separate crawls and then merge them. What does that look like in Nutch 2.x?
Can this task be done without custom plugins, and can it be done with Nutch at all? Perhaps I could write a custom plugin which somehow detects whether a page has already been indexed or is new, and if new, writes the content to XML, a database, etc. Should I write such a plugin at all, or is there a less painful way to solve the task? And what might the plugin's algorithm look like, if there is no way around it?
P.S. There are a lot of Nutch questions/answers/tutorials, and I honestly searched the web for two weeks, but haven't found answers to the questions above.
I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems there are command-line tools that can show the data fetched using the WebDB. I'm new to Nutch, but I just follow this documentation. Check it out.
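If you are on Nutch 1.x, the kind of commands that documentation refers to look roughly like this (the segment name and output paths are illustrative, and the 2.x equivalents differ):

    # Print statistics about what is in the crawl database.
    bin/nutch readdb crawl/crawldb -stats

    # Dump the fetched/parsed data of one segment to plain files, no Solr needed.
    bin/nutch readseg -dump crawl/segments/20150601121500 segment_dump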

Downloading all web hosts on the Internet

I'm working on a project where we use a distributed crawler to find and download hosts that have web content on them. We have a few million hosts at this point, but we're realizing it's not the least expensive thing in the world. Crawling takes time and computing power, etc. So instead of doing this ourselves, we're looking into whether we can leverage an outside service to get URLs.
My question is: are there services out there that provide massive lists of web hosts and/or massive lists of constantly updated URLs (which we could then parse to get hosts)? Stuff I've already looked into:
1) Search engine APIs - typically all of these search engine APIs will (understandably) not just let you download their entire index.
2) DMOZ and the Alexa top 1 million - these don't have nearly enough sites for what we are looking to do, though they're a good start for seed lists.
Anyone have any leads? How would you solve the problem?
Maybe CommonCrawl helps. http://commoncrawl.org/
Common Crawl is a huge open database of crawled websites.
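As a rough sketch of how you might pull URL lists out of Common Crawl via its public URL index, assuming its CDX-style query API; the collection ID below is a placeholder, so check the index server for the collections that actually exist:

    # List the available crawl collections.
    curl 'https://index.commoncrawl.org/collinfo.json'

    # Query one collection's index for captures under a domain (JSON lines),
    # then parse out the URL field to extract hosts.
    curl 'https://index.commoncrawl.org/CC-MAIN-2015-18-index?url=example.com&matchType=domain&output=json'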

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits to do that versus the crawler doing their job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping to find enough benefits to justify developing that feature. At the moment our system does not provide sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages so the crawlers know how frequently they change, which ones are more important to keep updated, etc. so they don't waste their time crawling pages that haven't changed, missing ones that do, or indexing pages you don't care much about (and missing pages that you do).
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand URLs), those will work great.
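For illustration, a minimal sitemap.xml carrying the priority and change-frequency hints discussed here might look like this (the URLs, date, and values are just examples):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2010-01-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
      <url>
        <loc>http://www.example.com/archive/old-page.html</loc>
        <changefreq>yearly</changefreq>
        <priority>0.2</priority>
      </url>
    </urlset>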
Well, as that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to extract purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to be updated. This is information that really can't be found out just by scanning the pages themselves.
Crawlers have a limit to how many pages of your site they scan, and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs pointing to the same page, or pages that take many steps to reach, the crawler will stop before it gets to the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
