I used the example on installing Nutch from their wiki, and I was able to crawl multiple pages pulled from DMOZ easily. But is there a configuration option that makes it crawl the external links it finds on a page, or write those external links to a file to be crawled next?
What is the best way to follow links on a page so those pages get indexed as well with Nutch? If I were executing bin/nutch via Python, could I get back all the external links it found and create a new crawl list to run again? What would you do?
First, make sure that the parameter 'db.ignore.external.links' is set to false. Also, in the file 'regex-urlfilter.txt', add rules for the external links you wish to be crawled OR add +. as the last rule. The +. rule will make the crawler follow ALL links. If you use that last option, beware that you risk crawling all the Web!
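For reference, those two changes look roughly like this (file locations as in a standard Nutch install; adjust to yours). In conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: let Nutch follow links that leave the seed domains -->
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>
```

And as the last rule in conf/regex-urlfilter.txt:

```
# accept anything else - beware, this lets the crawl wander onto the whole Web
+.
```

With external links no longer ignored, the outlinks Nutch discovers are added to the crawldb for the next round automatically, so you shouldn't need a separate Python step to build a new crawl list yourself.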
I'm new to Nutch and am doing a POC with Nutch 1.9. I am only trying to crawl my own site to set up a search on it. I find that the first crawl I do only crawls one page, the second crawls 40 pages, the third 300; the increments shrink and it crawls around 400 pages overall. Does anyone know why it doesn't just do the full crawl of the website on the first run? I used the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial) and am running using the script as per section 3.5.
I'm also finding that with multiple runs it doesn't crawl the whole site anyway - GSA brings back over 900 pages for the same site, while Nutch brings back 400.
Thanks kindly
Jason
To my knowledge, Nutch crawls the known links, extracts the inlinks and outlinks from the pages it fetches, and adds those links to the db for the next crawl. That is why Nutch doesn't crawl all pages in a single run.
Incremental crawling means crawling only new or updated pages and leaving unmodified pages alone.
Nutch crawls only a limited number of pages because of your configuration settings; change them to crawl all pages (a sketch of the relevant setting is below). See here
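The setting usually behind this is db.max.outlinks.per.page: by default Nutch keeps only the first 100 outlinks of each page, so a large site surfaces gradually over several rounds. A hedged sketch of raising it in conf/nutch-site.xml (check the description in nutch-default.xml for your version):

```xml
<!-- conf/nutch-site.xml: keep all outlinks instead of only the first 100 per page.
     A negative value is treated as "no limit" in the versions I've seen; verify
     against nutch-default.xml for your release. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```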
If you want to build search for a single website, take a look at Aperture. It will crawl the whole website in a single run, and it provides incremental support.
Why don't you use the Nutch mailing list? You'd get a larger audience and quicker answers from fellow Nutch users.
What value are you setting for the number of rounds when using the crawl script? Setting it to 1 means that you won't go further than the URLs in the seed list. Use a large value to crawl the whole site in a single call to the script.
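As a rough illustration (the exact arguments of bin/crawl vary between Nutch releases, so run the script without arguments to see the usage line for your 1.9 install), a call with a higher number of rounds looks something like:

```
# seed dir, crawl dir, Solr URL (1.x tutorial style), number of rounds
bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 10
```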
The difference in the total number of URLs could be the max outlinks per page param, as Kumar suggested, but it could also be due to the URL filtering.
I have some .txt log files where I record some important activity on my site.
These files ARE NOT referenced from any link within my site, so I'm the only one who knows the URLs
(they contain the current date in the filename, so I have one for each day).
Question: will Google index these kinds of files?
I think Google only indexes pages whose URLs are linked from the site.
Can you confirm my assumption? I just don't want others to find the links through Google, etc. :)
In theory it shouldn't. If the files aren't linked from anywhere, Google shouldn't be able to find them. However, I'm not sure whether URLs can make their way into the index by virtue of having the Google Toolbar installed; I've definitely had some unexpected stuff turn up in search engines. The only safe way would be to password-protect the folder.
Google cannot index pages that it doesn't know exist, so it won't index these unless someone submits the URLs to Google or places them on some website.
If you want to be sure, just disallow indexing for the files (in /robots.txt).
Best practice is to use robots.txt to prevent the Google crawler from indexing files you don't want to show up.
This description from Google Webmaster Tools is very helpful and leads you through the process of creating such a file:
https://support.google.com/webmasters/answer/6062608
Edit: As pointed out in the comments, there is no guarantee that robots.txt is respected, so password-protecting the folders is also a good idea.
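A minimal sketch, assuming the log files sit under a hypothetical /logs/ path (adjust to wherever yours actually live):

```
# robots.txt at the site root
User-agent: *
Disallow: /logs/
```

Remember this only asks well-behaved crawlers to stay away; it isn't access control, which is why the password-protection advice still stands.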
I have to create a content inventory for a website that doesn't have a sitemap. I do not have access to modify the website, and the site is very large. How can I build a sitemap of that website without having to browse it entirely?
I tried Visio's sitemap builder, but it fails badly.
Let's say, for example, I want to create a sitemap of Stack Overflow.
Do you know of any software to build it?
You would have to browse it entirely to search every page for unique links within the site and then put them in an index.
Also, for each unique link you find within the site, you then need to visit that page and search for more unique links.
You could use a tool such as HtmlAgilityPack to easily grab URLs and extract the links from them.
I have written an article which touches on the extracting links part of the problem:
http://runtingsproper.blogspot.com/2009/11/easily-extracting-links-from-snippet-of.html
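The article above covers the HtmlAgilityPack (.NET) side of extracting links. As a rough illustration of the overall loop in Python using only the standard library, a minimal breadth-first crawler that collects same-site links might look like this (the start URL and page limit are placeholders, not part of the original answer):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.request import urlopen

START_URL = "https://www.example.com/"  # placeholder: the site you want to inventory
MAX_PAGES = 500                         # safety limit for a first pass

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages):
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that time out, error out, or have bad URLs
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))  # resolve relative links, drop #fragments
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)

if __name__ == "__main__":
    for page in crawl(START_URL, MAX_PAGES):
        print(page)
```

A real run should also respect robots.txt and throttle its requests, which is one reason the dedicated generators mentioned in the other answers may be the easier route.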
I would register all your pages in a database and then just output them all on a page (PHP + SQL). Maybe indexing software could even help you! First of all, make sure all your pages are linked up, and still submit it to Google!
Just googled and found this one.
http://www.xml-sitemaps.com/
Looks pretty interesting!
There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate -- not an HTML sitemap page or something else?) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. By doing that you can get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time for a larger site and it will load the server during that time (it's like someone visiting all pages in your site). Crawling the site from time to time to determine if there are crawlability issues (such as endless calendars, content hidden through forms, etc) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
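If you do have database access, a minimal sketch of the "generate it from the source" approach could look like the following Python script. The table and column names (pages, slug, updated_at) are made up for illustration; substitute whatever your CMS actually stores:

```python
import sqlite3
from xml.sax.saxutils import escape

DB_PATH = "site.db"                   # placeholder: the database behind the site
BASE_URL = "https://www.example.com"  # placeholder: your site's base URL

def build_sitemap(db_path, base_url):
    # Hypothetical schema: a "pages" table with "slug" and "updated_at" columns.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT slug, updated_at FROM pages").fetchall()
    conn.close()

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for slug, updated_at in rows:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(base_url + '/' + slug)}</loc>")
        if updated_at:
            lines.append(f"    <lastmod>{escape(str(updated_at))}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

if __name__ == "__main__":
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(build_sitemap(DB_PATH, BASE_URL))
```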
I am trying to find a search feature that searches all content, including articles, links, posts, etc., in Joomla. Where is it located?
I am talking about the search feature in the administration page, not the home page. I want to be able to figure out where the content is coming from and where it is located.
I haven't seen any search capabilities directly in the administration console for Joomla.
The standard search extension you add to the actual site should give you this information though.
I am using Microsoft SharePoint Search (MOSS) to search all pages on a website.
My problem is that when you search for a word that appears in the header, footer, menu, or tag cloud section of the website, that word appears on every page, so the search server brings back every page on the website as results for that search term.
Ideally I want to tell the search server to ignore certain HTML sections in its search index.
This website seems to describe my problem, and one suggestion there is: "why not hide those sections of your website if the User Agent is the search server?"
The problem with that approach is that most of the sections I would hide contain links to other pages (menus and tag clouds), so the crawler would hit a dead end and wouldn't crawl very far.
Anyone got any suggestions on how to solve this problem?
I'm not sure if I'm reading this correctly. You DON'T want Search to include parts of your site in the index, but you DO want it to go into those sections and follow any links in them?
I think the best way is indeed to exclude those sections based on the user agent (i.e. put them in a user control and don't render the section if the user agent is the MS Search crawler).
Seeing as these sections would be the same on every page, it's okay to exclude them when the search crawler comes by.
Just create ONE page (i.e. a sitemap :-D) that does include all the links a normal user would see in the footer / header / etc. The crawler could then use that page to follow links deeper into your site. This would be a performance boost as well, seeing as the crawler only encounters the links once instead of on every page.