Seed URL for Apache Nutch Web Crawling

Apache Nutch recommends http://rdf.dmoz.org/rdf/content.rdf.u8.gz as a source of seed URLs for web crawling. However, that site has been shut down. Are there any alternative sources of seed URLs for web crawling?

I would recommend taking a look at http://commoncrawl.org. I think they offer a really comprehensive dataset.

Related

Nutch and crawling millions of websites

Can we use Nutch 1.10 to crawl millions of websites over several rounds?
I don't really understand the database created when I launch Nutch 1.10. Is it enough to crawl the important data from each site?
I have a file with a list of URLs that takes up 2 gigabytes.
Yes, you can. This is essentially what Nutch was built for. However, crawling millions of websites takes time and space, and in order to do it you need to set up the environment correctly.
In Nutch 1.x the "crawl database" (crawldb), which records which URLs have been visited, the URL frontier (the next URLs to visit), etc., is persisted to the Hadoop filesystem. This is where you first inject your list of URLs.
In addition, to view the indexed data, you can use Solr (or Elasticsearch).
I recommend first going through the Nutch 1.x tutorial with a short list of URLs and getting to know how to use Nutch and its plugins.
After that, set up a Hadoop cluster using the tutorials from the Hadoop site, and crawl away!
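For orientation, here is a rough sketch of one manual crawl round with the Nutch 1.x command line, assuming the usual layout from the tutorial (a urls/ directory holding your seed file and a crawl/ directory for the data); exact options vary between releases, so check the usage output of bin/nutch on your version:

    # Inject the seed URLs into the crawl database (crawldb)
    bin/nutch inject crawl/crawldb urls

    # Generate a fetch list (segment) for this round; -topN caps its size
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    SEGMENT=$(ls -d crawl/segments/* | tail -1)

    # Fetch and parse the pages in that segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT

    # Fold newly discovered links back into the crawldb for the next round
    bin/nutch updatedb crawl/crawldb $SEGMENT

Repeat the generate/fetch/parse/updatedb cycle for as many rounds as you need; when you want to search the content, invert the links and index the segments into Solr (or Elasticsearch).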

Why does Nutch always create the linkdb, even though it's not needed for content fetching?

I am reading the chapter on Nutch in Hadoop: The Definitive Guide. I understand the concept of ranking a page using inverted links. However, I don't see that playing a role when you just want to crawl a few sites. Since creation of the linkdb is a MapReduce job, it's bound to take up a lot of computing resources. I am just wondering why the linkdb is always generated, when most Nutch use cases are just getting web content for designated URLs.
That is because Nutch uses the page rank (which is calculated from link information) to prioritize crawling. For instance, a link with a high page rank will be crawled before one with a low page rank.
Nutch was designed to be used as a large-scale web crawler, therefore calculating page rank and scoring web pages with it was and still is an important component. If you are only crawling a few sites, then you should probably use Scrapy (a Python library) instead.
I hope that answers your question.
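One practical note: if you drive the steps yourself instead of using the bundled crawl script, the linkdb only gets built when you run the invertlinks step, so for pure content fetching you can leave it out. A minimal sketch for Nutch 1.x, assuming the usual crawl/ layout and that you don't need anchor text or link-based scoring at indexing time:

    # $SEGMENT is the segment produced by the current generate step
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT

    # Skipped: link inversion, which is the step that actually creates the linkdb
    # bin/nutch invertlinks crawl/linkdb -dir crawl/segments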

When transferring a website from one server to another server/Heroku, are your Google ranking and indexed pages affected?

I have a Django application that I'm hosting on a DigitalOcean server. I have quite a bit of content built up on there which Google has indexed. Now, if I transfer that Django app to Heroku and then point the domain's DNS to Heroku, will that preserve all the indexing done by search engines? I believe the answer is yes, but I just want to be sure I'm not missing anything.
Correct; Google isn't aware of what your technology stack looks like, only of the HTML output for each page.
Make sure that the transfer doesn't result in a slower page load time or broken links, though.
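As a quick sanity check after pointing DNS at the new host, you could fetch a handful of previously indexed URLs and confirm they still return 200; a small sketch, where indexed_urls.txt is a placeholder file containing one URL per line:

    # Report the HTTP status code for each URL in the (hypothetical) indexed_urls.txt
    while read -r url; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        echo "$code $url"
    done < indexed_urls.txt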

Apache Nutch: crawl only new pages for semantic analysis

I plan to tune Nutch 2.2.x in such a way that, after an initial crawl of a list of sites, I launch the crawler daily and get the HTML or plain text of only the new pages that appeared on those sites that day. Number of sites: hundreds.
Please note that I'm not interested in updated pages, only new ones. I also need new pages only starting from a given date; let's suppose it is the date of the initial crawl.
Reading the documentation and searching the web, I have the following questions, which I can't find answered anywhere else:
Which backend should I use with Nutch for this task? I need a page's text only once; then I never return to it. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried using HBase, but it seems I have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource usage?
Can I perform this task without an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch 2.2+ be launched without Solr, and does it need specific options to run that way? The tutorials don't clearly explain this: everybody needs Solr, except me.
If I'd like to add a site to the crawl list, how should I best do it? Suppose I'm already crawling a list of sites and want to add a new site to monitor from now on. So I need to crawl the new site, skipping the page content, to add it to the WebDB, and then run the daily crawl as usual. For Nutch 1.x it may be possible to perform separate crawls and then merge them. What would that look like for Nutch 2.x?
Can this task be performed without custom plugins, and can it be performed with Nutch at all? I could probably write a custom plugin which somehow detects whether a page has already been indexed or is new, and then puts the content into XML, a database, etc. Should I write that plugin at all, or is there a way to solve the task with less blood? And what might the plugin's algorithm look like, if there is no way to live without it?
P.S. There are a lot of Nutch questions/answers/tutorials, and I have honestly searched the web for 2 weeks, but haven't found answers to the questions above.
I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems there are command-line tools that can show the data fetched, using the WebDB. I'm new to Nutch too, but I just followed this documentation. Check it out.
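To make the "no Solr" part concrete: in Nutch 2.x the fetched pages live in the Gora-backed web table (HBase in your case), and the readdb tool can dump them to local files, so you never have to run an indexing step. A rough sketch of one daily cycle; the exact arguments differ between 2.x releases, so treat the flags below as assumptions and check the usage output of each bin/nutch command on your version:

    # One crawl cycle with no indexing step at all
    bin/nutch inject urls
    bin/nutch generate -topN 1000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb

    # Dump the stored pages (parsed text only) to a local directory for your own processing
    bin/nutch readdb -dump dump_out -text

The dump output carries per-page metadata alongside the text, which you could filter in your own post-processing to approximate "only pages that appeared after date X".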

Downloading all web hosts on the Internet

I'm working on a project where we use a distributed crawler to crawl and download hosts found with web content on them. We have a few million hosts at this point, but we're realizing it's not the least expensive thing in the world. Crawling takes time and computing power, etc. So instead of doing this ourselves, we're looking into whether we can leverage an outside service to get URLs.
My question is: are there services out there that provide massive lists of web hosts and/or just massive lists of constantly updated URLs (which we can then parse to get hosts)? Stuff I've already looked into:
1) Search engine APIs - typically all of these search engine APIs will (understandably) not just let you download their entire index.
2) DMOZ and the Alexa top 1 million - These don't have nearly enough sites for what we are looking to do, though they're a good start for seed lists.
Anyone have any leads? How would you solve the problem?
Maybe Common Crawl helps: http://commoncrawl.org/
Common Crawl is a huge open repository of crawled web data.
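For example, each Common Crawl collection publishes plain-text listings of its archive files, and there is a URL index you can query per domain, so you can extract hosts and URLs without crawling anything yourself. A small sketch; the collection name CC-MAIN-2018-05 is only an example, so substitute a current one from the Common Crawl site:

    # List the WARC files of one crawl collection (collection name is an example)
    curl -s https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-05/warc.paths.gz | gunzip | head

    # Query the URL index for everything captured under a given domain
    curl -s "https://index.commoncrawl.org/CC-MAIN-2018-05-index?url=*.example.com&output=json" | head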
