Incremental crawling in Nutch

I'm new to Nutch and am doing a POC with Nutch 1.9. I am only trying to crawl my own site to set up a search on it. I find that the first crawl I do only crawls one page, the second crawls 40 pages, and the third 300. The increments shrink and it crawls around 400 pages overall. Does anyone know why it doesn't just do the full crawl of the website on the first run? I followed the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial) and am running the crawl script as per section 3.5.
I'm also finding that even with multiple runs it doesn't crawl the whole site anyway: GSA brings back over 900 pages for the same site, while Nutch brings back about 400.
Thanks kindly
Jason

To my knowledge:
Nutch crawls the links it already knows about, extracts the inlinks and outlinks from those pages, and adds the new links to its database for the next crawl round. That is why Nutch doesn't crawl all pages in a single run (the toy sketch after this answer illustrates the round-by-round expansion).
Incremental crawling means crawling only new or updated pages and leaving unmodified pages alone.
Nutch crawls only a limited number of pages because of your configuration settings; change them to crawl all pages. See here.
If you want to set up search for a single website, also take a look at Aperture. It will crawl the whole website in a single run, and it provides incremental support.
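To make that round-by-round behaviour concrete, here is a toy Python sketch. It is not Nutch code and the link graph is invented, but it shows why the first round fetches only the seed URL and later rounds progressively more pages:

```python
# Toy sketch (not Nutch code) of round-based crawling: each round fetches only
# the URLs already known to the "db", then adds their outlinks for the NEXT round.
# The link graph below is made up for illustration.
link_graph = {
    "home": ["about", "products"],
    "about": ["team"],
    "products": ["p1", "p2"],
    "team": [], "p1": [], "p2": [],
}

known = {"home"}      # the seed list
fetched = set()
for round_no in range(1, 4):
    to_fetch = known - fetched          # what the generator would select this round
    fetched |= to_fetch
    for page in to_fetch:               # "fetch" each page and collect its outlinks
        known.update(link_graph[page])  # "updatedb": new links become candidates next round
    print(f"round {round_no}: fetched {len(to_fetch)} page(s), db now knows {len(known)}")
```

Running it prints 1, then 2, then 3 pages fetched per round, which mirrors the growing counts described in the question.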

Why don't you use the Nutch mailing list? You'd get a larger audience and quicker answers from fellow Nutch users.
What value are you setting for the number of rounds when using the crawl script? Setting it to 1 means that you won't go further than the URLs in the seed list. Use a large value to crawl the whole site in a single call to the script.
The difference in the total number of URLs could be due to the max outlinks per page parameter, as Kumar suggested, but it could also be due to URL filtering.
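For reference, a minimal way to drive the tutorial's crawl script with a larger number of rounds, here wrapped in Python. The argument order shown matches the Nutch 1.9 usage (bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>) but has changed in later releases, and the paths and Solr URL are placeholders, so check the script's usage line first:

```python
# Sketch: invoke the Nutch 1.9 crawl script with enough rounds to exhaust the site.
# Seed dir, crawl dir and Solr URL are placeholders for your own setup.
import subprocess

ROUNDS = 20  # large enough that the frontier runs dry before the limit is hit
subprocess.run(
    ["bin/crawl", "urls", "crawl", "http://localhost:8983/solr/", str(ROUNDS)],
    check=True,
)
```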

Related

How do you crawl external links on a found page?

I used the example on installing Nutch from their wiki and was able to crawl multiple pages pulled from dmoz easily. But is there a configuration that makes it crawl the external links it finds on a page, or write those external links to a file to be crawled next?
What is the best way to follow links on a page so that those pages get indexed as well with Nutch? If I were executing bin/nutch via Python, could I get back all the external links it found and create a new crawl list to run again? What would you do?
First, make sure that the parameter 'db.ignore.external.links' is set to false. Also, in the file 'regex-urlfilter.txt', add rules for the external links you wish to be crawled OR add +. as the last rule. The +. rule will make the crawler follow ALL links. If you use that last option, beware that you risk crawling all the Web!
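If, as the question suggests, you also want to pull the discovered links back out via Python and turn the external ones into a new seed list, something along these lines should work. It is only a sketch: the paths, the domain, and the exact text layout of the readdb dump (which varies between Nutch versions) are assumptions.

```python
# Sketch: dump the CrawlDb with `bin/nutch readdb <crawldb> -dump <dir>`, keep the
# URLs outside your own domain, and write them out as a new seed list.
# Paths, the domain and the dump's text layout are assumptions -- adjust for your install.
import subprocess
from pathlib import Path
from urllib.parse import urlparse

CRAWLDB = "crawl/crawldb"   # assumed CrawlDb location
DUMP_DIR = "crawldb_dump"   # where the plain-text dump goes
MY_DOMAIN = "example.com"   # anything outside this counts as "external"

subprocess.run(["bin/nutch", "readdb", CRAWLDB, "-dump", DUMP_DIR], check=True)

external = set()
for part in Path(DUMP_DIR).glob("part-*"):
    for line in part.read_text(errors="ignore").splitlines():
        token = line.split("\t", 1)[0].strip()
        if token.startswith("http") and MY_DOMAIN not in urlparse(token).netloc:
            external.add(token)

Path("urls_external").mkdir(exist_ok=True)
Path("urls_external/seed.txt").write_text("\n".join(sorted(external)) + "\n")
print(f"wrote {len(external)} external URLs to urls_external/seed.txt")
```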

Software for building a sitemap

I have to create a content inventory for a website that doesn't have a sitemap. I do not have access to modify the website, and the site is very large. How can I build a sitemap of that website without having to browse it entirely?
I tried Visio's sitemap builder, but it fails badly.
Let's say, for example, I want to create a sitemap of Stackoverflow.
Do you know of any software to build it?
You would have to browse it entirely, searching every page for unique links within the site, and then put them in an index.
For each unique link you find within the site you then need to visit that page and search for more unique links.
You could use a tool such as HtmlAgilityPack to fetch the pages and extract the links from them.
I have written an article which touches on the extracting links part of the problem:
http://runtingsproper.blogspot.com/2009/11/easily-extracting-links-from-snippet-of.html
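HtmlAgilityPack is a .NET library; as a rough illustration of the same crawl-and-extract approach using only the Python standard library, a breadth-first walk over one site might look like this (example.com is a placeholder, and politeness concerns such as delays and robots.txt are left out):

```python
# Minimal sketch: breadth-first walk of a single site, collecting unique internal
# URLs (capped at max_pages). Not production code -- no robots.txt, no delays.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                        # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)

if __name__ == "__main__":
    for page in crawl_site("http://example.com/"):
        print(page)
```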
I would register all your pages in a database and then just output them all on a page (PHP and SQL). Maybe indexing software could help you as well. First of all, make sure all your pages are linked up, and still submit the site to Google!
Just googled and found this one.
http://www.xml-sitemaps.com/
Looks pretty interesting!
There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate, rather than an HTML sitemap page or something else) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. By doing that you can get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time for a larger site and it will load the server during that time (it's like someone visiting all pages in your site). Crawling the site from time to time to determine if there are crawlability issues (such as endless calendars, content hidden through forms, etc) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
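As a hedged sketch of the "generate it straight from the source" route: if the site's URLs can be queried from the database that powers it, writing the Sitemap file takes only a few lines. The SQLite file, table, and column names below are invented placeholders for whatever actually backs your site.

```python
# Sketch: build sitemap.xml from URLs stored in the site's own database.
# "site.db" and the pages(url, updated_at) schema are assumptions.
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect("site.db")                         # assumed database file
rows = conn.execute("SELECT url, updated_at FROM pages")  # assumed schema

entries = []
for url, updated_at in rows:
    entries.append(
        "  <url>\n"
        f"    <loc>{escape(url)}</loc>\n"
        f"    <lastmod>{updated_at}</lastmod>\n"
        "  </url>"
    )

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    f.write("\n".join(entries))
    f.write("\n</urlset>\n")
```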

Add Site and Page Description to SharePoint Search Index

As part of a SharePoint solution, the functionality for users to create new web sites and publishing pages (programmatically) via a button click has been added. I need to ensure that the Description field for the newly created sites and pages is indexed by SharePoint Search. What is the best way to do this?
Please note, I am NOT interested in starting a new crawl. I just want to ensure that whenever the next scheduled crawl occurs, the contents of these fields will be searchable.
Thanks, MagicAndi
I'm guessing you mean how can you ensure the site is indexed immediately?
Generally, crawls are scheduled which means your new site will only be added to the search index after the next crawl is done. So if your incremental crawl happens every hour you may have to wait up to an hour for it to appear in the search index.
However, given that your new sites are being added programmatically, you could also programmatically start an incremental crawl if it is vital for them to start appearing in search results immediately. There are details on how to do this in this article.
Update:
The site title and description should be indexed automatically by the next crawl. If this isn't happening, then you don't have a Content Source that covers that site, so you need to create or update one to cover the new sites and make sure it has a crawl schedule. If the new sites are created in separate site collections, consider putting them on a Managed Path.
In our SharePoint system we have a terabyte of data with 100,000 site collections and probably 20 new site collections added every day. We only have one content source that points to the root of the site, and everything gets indexed automatically.
It sounds like you're missing a content source or a crawl schedule.
It turns out that the site description is included in the crawl by default. I tested the search default properties by creating a new site and assigning a unique text string to the description. After the next incremental crawl, I was able to search and find the unique string via the default SharePoint search.
I have not yet tested if the page description is included in the search scope by default, but I'm prepared to guess that it is. I will update my answer as soon as I get a chance to test this.

Sharepoint search of external RSS feeds

I want my SharePoint site to allow a user to search content in a known collection of RSS feeds. Conceptually, I can see a few ways to do this:
crawl the feeds at their source (Yikes!)
Pull the full articles into my sharepoint site, then let my crawler crawl it
Make use of an existing index (like google)
search the full articles, on demand, using something like a google utility (my preference)
So can I somehow, from my SharePoint site, allow a user to search the full articles from a couple dozen named RSS feeds?
thanks
Cary
I don't see why there is a problem with crawling the feeds at their source? That would seem to be reasonable.
It is fairly easy to create a content source to point at the feed and select the correct indexing schedule. If that does not work then you can try a more complicated approach.
Be aware that copying the content of another website to host on your own could have copyright implications (not to mention the risk that any inflammatory content would appear to be published on your own site).
--update--
Try reading the target site's robots.txt (if it even has one) to see whether it specifies a desired crawl frequency. Otherwise it depends on the depth of the site you would be crawling.
If you are crawling just the RSS feed XML, I suspect you could do that every hour without annoying anyone. If you reach into each article, though, you may want to limit that. It really depends a lot on any relationship you have with the target site and the type of site you are hitting.
Check out this article for a little more info on how SharePoint deals with robots.txt.
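As a small illustration of that robots.txt check using only the Python standard library (the feed host is a placeholder):

```python
# Check a feed publisher's robots.txt for permission and any requested crawl
# frequency before scheduling your own crawl. Host/paths are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser("http://example.com/robots.txt")
rp.read()

print("fetch allowed:", rp.can_fetch("*", "http://example.com/feed.xml"))
print("crawl-delay:", rp.crawl_delay("*"))       # None if not specified
print("request-rate:", rp.request_rate("*"))     # None if not specified
```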
(P.S. The target site did not put the articles on the web so that no one would read them.)
The out-of-the-box crawler will respect robots.txt, and there are provisions for crawler impact rules that will lessen the chance that SharePoint hammers the external site.

Google Custom Search not indexing Dynamic Pages

I am trying to use Google Custom Search to provide search capabilities to an informational site.
About the site:
Content is generated dynamically
URL Access to content is search engine friendly (i.e. site.com/Info/3/4/45)
Sitemap (based on the RSS feed) submitted and accepted by Webmaster Tools, which notes that no pages were indexed
Annotations successfully submitted based on the RSS feed
Problem:
There are no results for any keywords that appear on the pages that were submitted.
Questions:
Why is Google not indexing the submitted pages?
What could I be doing wrong?
Custom Search with basic settings is essentially the same thing as a standard search with site:your.website. Does the standard search give you the expected results?
Note that Google doesn't index pages immediately; it takes some time. Check whether your site is already indexed.
Yeah, it took about two weeks for Google to pick up all my pages after I submitted a sitemap, but you should see a few pages indexed after a couple of days.

Resources