Apache Nutch: crawl only new pages for semantic analysis

I plan to tune Nutch 2.2.x so that, after an initial crawl of a list of sites, I can launch the crawler daily and get the HTML or plain text of only the pages that appeared on those sites that day. The number of sites is in the hundreds.
Please note that I'm not interested in updated pages, only new ones. I also need new pages only starting from a given date; let's suppose it is the date of the initial crawl.
Reading the documentation and searching the Web, I came up with the following questions, which I couldn't find answered anywhere else:
1. Which storage backend should I use with Nutch for my task? I need a page's text only once and never return to it afterwards. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried HBase, but it seems I would have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource utilization?
2. Can I perform my task without an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch 2.2+ be launched without Solr, and does it need specific options to run that way? The tutorials don't clearly answer this question: everybody needs Solr, except me.
3. If I want to add a site to the crawl list, how should I do it? Suppose I'm already crawling a list of sites and want to add a new site so it is monitored from now on. I would need to crawl the new site while skipping the page content, just to add its pages to the WebDB, and then run the daily crawl as usual. With Nutch 1.x it was possible to perform separate crawls and then merge them. What would this look like in Nutch 2.x?
4. Can this task be done without custom plugins, and can it be done with Nutch at all? I could probably write a custom plugin that somehow detects whether a page has already been indexed or is new, and if it is new, writes the content to XML, a database, etc. Do I need to write such a plugin at all, or is there a less painful way to solve the task? And if there is no way around a plugin, what might its algorithm look like?
P.S. There are a lot of Nutch questions, answers, and tutorials, and I honestly searched the Web for two weeks, but I haven't found answers to the questions above.
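To illustrate the "only new pages" idea from the last point of the question, a rough conceptual sketch in Python might look like the following. This is not Nutch code and is not tied to any Nutch API; the file names and the (url, text) input format are made up for illustration. The idea is simply to keep a persistent record of URLs whose content has already been exported, and after each daily crawl to write out only pages never seen before.

```python
# Conceptual sketch only (not Nutch code): keep a persistent record of URLs
# already exported, and after each daily crawl emit the text of unseen URLs only.
import json
import os

SEEN_FILE = "seen_urls.json"  # hypothetical local state file

def load_seen():
    """Load the set of URLs whose content has already been exported."""
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def export_new_pages(crawled_pages):
    """crawled_pages: iterable of (url, text) pairs from today's crawl.

    Writes out only pages whose URL has never been seen before, then
    records them so they are skipped on subsequent runs.
    """
    seen = load_seen()
    for url, text in crawled_pages:
        if url in seen:
            continue  # updated or re-fetched page: ignore, we only want new ones
        with open("new_pages.txt", "a", encoding="utf-8") as out:
            out.write(f"{url}\n{text}\n\n")
        seen.add(url)
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)
```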

I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems there are command-line tools that can show the data fetched via the WebDB. I'm new to Nutch myself, but I just follow this documentation. Check it out.

Related

Nutch and crawling millions of websites

Can we use Nutch 1.10 to crawl millions of websites over several rounds?
I don't really understand the database created when I launch Nutch 1.10. Is it enough to crawl the important data from the sites?
I have a file with a list of URLs that takes up 2 gigabytes.
Yes, you can. This is essentially what Nutch was built for. However, crawling millions of websites takes time and space, and in order to do it you need to set up the environment correctly.
In Nutch 1.x the "crawl database" (which URLs have been visited, what the URL frontier is, i.e. the next URLs to visit, etc.) is persisted to the Hadoop filesystem. This is where you first inject your list of URLs.
In addition, to view the indexed data, you can use Solr (or Elasticsearch).
I recommend first going through the Nutch 1.x tutorial with a short list of URLs and getting to know how to use Nutch and its plugins.
After that, set up a Hadoop cluster using the tutorials on the Hadoop site, and crawl away!
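For reference, the classic Nutch 1.x cycle is inject once, then repeat generate / fetch / parse / updatedb for as many rounds as you need. Below is a minimal Python sketch that drives bin/nutch through those steps; the NUTCH_HOME, crawl directory and seed directory paths are assumptions to adjust for your installation, and Nutch's own crawl helper script does the same thing more robustly.

```python
# A minimal sketch (not the official crawl script) of the Nutch 1.x crawl cycle
# driven from Python. Paths and round counts are assumptions; adjust them.
import os
import subprocess

NUTCH = os.path.join(os.environ.get("NUTCH_HOME", "."), "bin", "nutch")
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"
SEEDS = "urls"  # directory containing your seed list, e.g. urls/seed.txt

def nutch(*args):
    """Run a single bin/nutch sub-command and fail loudly if it errors."""
    subprocess.run([NUTCH, *args], check=True)

# 1. Inject the seed URLs into the crawl database.
nutch("inject", CRAWLDB, SEEDS)

# 2. Repeat generate / fetch / parse / updatedb for a few rounds.
for round_no in range(3):
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", "1000")
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])  # newest segment
    nutch("fetch", segment)
    nutch("parse", segment)
    nutch("updatedb", CRAWLDB, segment)
```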

Why does Nutch always create the linkdb, even though it's not needed for content fetching?

I am reading through the chapter on Nutch in Hadoop: The Definitive Guide. I understand the concept of ranking a page using inverted links. However, I don't see that playing a role when you just want to crawl a few sites. Since creation of the linkdb is a MapReduce job, it's bound to take up a lot of computing resources. I am just wondering why the linkdb is always generated when most Nutch use cases just involve getting web content for designated URLs.
That is because Nutch uses the page rank (which is calculated using link information) to prioritize crawling. For instance, a link with a high page rank will be crawled before one with a low page rank.
Nutch was designed to be used as a large-scale web crawler, so calculating page rank and scoring web pages with it was, and still is, an important component. If you are only crawling a few sites, then you should probably use Scrapy (a Python library) instead.
I hope that answers your question.
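To make the prioritization idea concrete, here is a toy Python sketch of a fetch frontier where URLs with more known inlinks get fetched first. This is only an illustration of link-based scoring, not Nutch's actual scoring implementation, and the example URLs are made up.

```python
# Toy illustration (not Nutch's actual scoring) of why link information matters:
# URLs that more known pages link to are fetched first.
import heapq
from collections import defaultdict

inlink_counts = defaultdict(int)   # url -> number of discovered inlinks
frontier = []                      # max-heap via negated score

def add_link(from_url, to_url):
    """Record a newly discovered link and (re)queue the target with its new score."""
    inlink_counts[to_url] += 1
    heapq.heappush(frontier, (-inlink_counts[to_url], to_url))

def next_url_to_fetch():
    """Pop the highest-scored URL; stale heap entries are skipped."""
    while frontier:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score == inlink_counts[url]:   # entry is up to date
            return url
    return None

add_link("http://a.example/", "http://popular.example/")
add_link("http://b.example/", "http://popular.example/")
add_link("http://a.example/", "http://obscure.example/")
print(next_url_to_fetch())   # http://popular.example/ (two inlinks beats one)
```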

Web crawling: Ruby, Python, Cassandra

I need to write a script that inserts one million records of usernames or email addresses, gathered by crawling the web, into a database.
The script can be in any language, like Python, Ruby, or PHP.
Please let me know if this is possible, and if so, please provide information on how I can build the script.
Thanks
You should also look at Apache Nutch and Apache Gora, which would do what you're looking for. Nutch does the actual crawling, while Gora stores the results in Cassandra, Hive or MySQL.
It's possible, though it may take some time depending on your machine's performance and your internet connection. You could use PHP's cURL library to automatically send Web requests and then easily parse the data using a library such as simpleHtmlDom, or using the native PHP DOM. But beware of running out of memory, and I highly recommend running the script from the shell rather than through a web browser. Also consider using the multi-cURL functions to speed up the process.
This is extremely easy and fast to implement, although multi-threading would give a huge performance boost in this scenario, so I suggest using one of the other languages you proposed. I know you could do this in Java easily using the Apache HttpClient library, and then manipulate the DOM and extract data using native XPath support, regexes, or one of the many third-party DOM implementations in Java.
I also strongly recommend checking out the Java library HtmlUnit, which could make your life much easier, though you may take a performance hit for it. A good multi-threading implementation would give a huge performance boost, but a bad one could make your program run worse.
Here are some resources for Python:
http://docs.python.org/library/httplib.html
http://www.boddie.org.uk/python/HTML.html
http://www.tutorialspoint.com/python/python_multithreading.htm
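Since Python was brought up, here is a rough sketch of the parallel-fetching idea (the PHP multi-cURL and Java HttpClient suggestions above) using only the standard library. It is an illustration, not production code, and the seed URL list is a placeholder.

```python
# A rough sketch of fetching many pages concurrently with a thread pool;
# I/O-bound downloads benefit from threads. The seed URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

seed_urls = [
    "https://example.com/",
    "https://example.org/",
]

def fetch(url, timeout=10):
    """Download one page; return (url, html) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read().decode("utf-8", errors="replace")
    except Exception:
        return url, None

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, seed_urls):
        print(url, "OK" if html else "FAILED")
```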
I would add a little on the crawling side.
You said "crawl the web", so the crawling direction (i.e., after fetching a page, deciding which link to visit next) becomes very important. But if you already have a list of web pages (called a seed URL list), then you simply need to download them and parse out the required data. If you just need to parse out email addresses, a regex is your best option: HTML has no tag for emails, so an HTML DOM parser won't help you there.
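A minimal Python sketch of that regex approach might look like this; the pattern is a deliberate simplification and won't match every valid address.

```python
# Minimal sketch of extracting e-mail addresses from downloaded HTML with a
# regex. The pattern is simplified and not fully RFC-compliant.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html):
    """Return the unique e-mail addresses found in a page's raw HTML."""
    return sorted(set(EMAIL_RE.findall(html)))

page = '<p>Contact us at <a href="mailto:info@example.com">info@example.com</a></p>'
print(extract_emails(page))   # ['info@example.com']
```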

How to find all the URLs/pages on mysite.com

I have a website that I now support and need to list all live pages/URLs.
Is there a crawler I can use to point at my homepage and have it list all the pages/URLs that it finds?
Then I can delete any that don't make their way into this listing, as they will be orphan pages/URLs that have never been cleaned up.
I am using DNN and want to kill unneeded pages.
Since you're using a database-driven CMS, you should be able to do this either via the DNN admin interface or by looking directly in the database. Far more reliable than a crawler.
Back in the old days I used wget for this exact purpose, using its recursive retrieval functionality. It might not be the most efficient way, but it was definitely effective. YMMV, of course, since some sites will return a lot more content than others.
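If you prefer scripting it, here is a small Python sketch in the same spirit as wget's recursive retrieval: it follows same-domain links from the homepage and prints every URL it reaches. It ignores robots.txt, JavaScript-generated links and rate limiting, so treat it as an illustration only; the start URL is a placeholder.

```python
# A small same-domain crawler that lists every URL reachable from the homepage.
# Illustration only: no robots.txt handling, no JavaScript, no politeness delays.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def list_site_urls(start_url, limit=500):
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link or non-HTML resource: skip it
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)

for page_url in list_site_urls("https://www.example.com/"):
    print(page_url)
```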

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits of doing that versus just letting the crawler do its job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping there are enough benefits to justify the development of that feature. At the moment our system does not generate sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages, so the crawlers know how frequently they change, which ones are more important to keep updated, etc., and don't waste their time crawling pages that haven't changed, missing ones that have, or indexing pages you don't care much about (and missing pages that you do).
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand URLs) those will work great.
Well, like that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to extrapolate purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to be updated. This is information that really can't be found out by just scanning the pages themselves.
Crawlers have a limit on how many pages of your site they scan, and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs for the same page, or pages that need many steps to get to, the crawler will stop before it comes to the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
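For reference, generating a sitemap in the sitemaps.org format quoted above is straightforward once you have the list of URLs. Here is a minimal Python sketch; the URLs and their metadata are made-up examples.

```python
# Minimal sketch of writing a sitemap.xml in the sitemaps.org 0.9 format.
# The page list and its metadata are placeholder examples.
from xml.etree import ElementTree as ET

pages = [
    {"loc": "https://www.example.com/", "lastmod": "2014-01-01",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "https://www.example.com/about", "lastmod": "2013-11-20",
     "changefreq": "monthly", "priority": "0.5"},
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    url_el = ET.SubElement(urlset, "url")
    for tag in ("loc", "lastmod", "changefreq", "priority"):
        ET.SubElement(url_el, tag).text = page[tag]

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```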
