How to omit JavaScript and comments using Nutch crawl?

I am a newbie at this, trying to use Nutch 1.2 to fetch a site. I'm using only a Linux console to work with Nutch, as I don't need anything else. My command looks like this:
bin/nutch crawl urls -dir crawled -depth 3
where the urls folder is where I have my links, and I do get the results in the crawled folder.
And when I want to see the results, I type: bin/nutch readseg -dump crawled/segments/20110401113805 /home/nutch/dumpfiles
This works fine, but I get a lot of broken links.
Now, I do not want Nutch to follow JavaScript links, only regular links, could anyone give me a hint/help on how to do that?
I've tried editing conf/crawl-urlfilter.txt with no results. I might have got the patterns wrong!
Any help appreciated!

Beware: there are two different filter files, one for the one-stop crawl command and one for the step-by-step commands.
For the rest, just build a regex that matches the URLs you want to skip, put a minus in front of it, and you should be done.
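For example, to skip javascript: pseudo-links and .js files, rules roughly like the following could go near the top of conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt for the step-by-step commands). This is only a sketch and the patterns may need adjusting for your site:
# skip javascript: pseudo-URLs
-^javascript:
# skip URLs ending in .js
-\.js$
The filter applies the first rule that matches, so the minus rules need to come before any broad plus rule that would otherwise accept the URL.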

Related

How can I get the tags of Stack Overflow questions for a Solr index?

Recently I used nutch-1.11 and solr-4.10.4 to set up a crawler. I can crawl data with sequential Nutch commands, but my problem now is how to fetch specific data, such as the tags of Stack Overflow questions, so that I can use that data for Solr indexing for my own purposes. I tried configuring and modifying local/conf/nutch-site but it doesn't work for me. I'm new to Nutch!
Nutch fetches URLs, so what you could do is point it at a page that contains all the links to the questions with that tag.
For example,
https://stackoverflow.com/questions/tagged/nutch?sort=newest contains links to all questions tagged nutch. Crawling 2 or more rounds will then make Nutch fetch all outlinks from this page.
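As a rough sketch (paths and the number of rounds are placeholders): put the tag page in the seed list, then run the crawl for at least 2 rounds so the question pages linked from it get fetched too:
# urls/seed.txt
https://stackoverflow.com/questions/tagged/nutch?sort=newest
# then run 2+ rounds, either with your usual sequential commands or the crawl script
# (check the script's usage message for the exact arguments in your Nutch version)
bin/crawl urls/ crawlDir/ 2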

Crawling different sites with nutch 1.8

I am using Nutch 1.8 to crawl information from sites that have different patterns for the same field. I wrote a plugin for each of these sites, but when I start Nutch, only the first plugin matches, and it matches all the sites; the others behave as if they do not exist.
Is there a way to skip to the next plugin if the first one does not match the site, and keep checking until the right plugin for the site is found?
It is not clear why you are getting this. Are you writing an HTMLParseFilter? What you could do is exit the parse method if the current document's URL does not match a given pattern, or alternatively pass some metadata in from the seeds which you could use to determine which HTMLParseFilter implementation to apply (see the sketch below).
BTW you'd get a more relevant audience by posting on the Nutch user list (see http://nutch.apache.org/mailing_lists.html)
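If you go the seed-metadata route, the injector accepts tab-separated key=value pairs after each URL in the seed file. A rough sketch (the "parser" key is made up for illustration, and how it reaches your filter depends on your setup; the urlmeta plugin is one way to propagate such metadata):
# urls/seed.txt -- metadata as tab-separated key=value pairs after the URL
http://site-a.example/	parser=siteA
http://site-b.example/	parser=siteB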

Add metadata to Crawldb dump

I'm starting with Nutch (trunk version) and I'm going around in circles in the code without finding something that seems like it should be obvious.
I want to extract the resource path of every URL crawled (e.g. https://stackoverflow.com/questions/ask ===> /questions/ask), expecting two results:
1. Post the information as an additional field to a Solr instance. I have solved this problem by writing an IndexingFilter plugin, and it works perfectly.
2. Dump this information as metadata when the following command is run: bin/nutch readdb -dump crawldb
This second point is where I'm stuck. Reading the documentation and other examples, it seems I have to use the CrawlDatum, but I don't know which class I have to modify in order to show this information when a dump is made.
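For reference, readdb expects the crawldb path first and the output directory after -dump, so the full invocation would look something like this (paths are placeholders):
bin/nutch readdb crawled/crawldb -dump /home/nutch/crawldb-dump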
Maybe someone knows where to make the change in order to achieve this?
Any help would be appreciated!

TYPO3 - Indexed Search and how to index extension

I use indexed_search and RealUrl, and I need it to show the whole URL in the search result.
Right now it is only showing that part of the url which is related to pages and not the part that is related to my extension.
Now it shows: domain.dk/products/
But it should show: domain.dk/products/product/product-title
I don't know whether it is in the RealUrl configuration or in Indexed Search that I should make some changes.
There are some pretty good explanations on the web showing how to index database/extension records with the crawler extension. Try this one as a start; it shows everything step by step and with screenshots, so I guess it should be useful.
If this is not enough, there are ready-to-use examples for tt_news and other extensions in the crawler documentation.

Solr and web site indexing to create a site search

I was trying to build a 'site search' on a simple http site.
I have a site, let's call it www.mycompany.com, that is pure HTML.
Is there an easy way to use solr to index the entire site to build a full text search using solr as the engine?
I googled for a bit and could not find anything specific of the type:
Do A
Do B
...
profit!
Let me also know if I am a bit off about what Solr is for :P
Thanks in advance.
Solr is only for indexing and searching text; it does not have a crawler, since that is outside the project's scope.
However, take a look at Nutch, which is a crawler and not too hard to set up initially.
Nutch and Solr can be integrated if you need some Solr-specific feature to search the index.
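As a rough sketch of that integration (core name, port, and paths are placeholders, and the exact flags differ a bit between Nutch versions), a finished Nutch 1.x crawl can be pushed into Solr with the solrindex step:
bin/nutch solrindex http://localhost:8983/solr/corename crawled/crawldb -linkdb crawled/linkdb crawled/segments/*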
$ bin/solr create -c corename
$ bin/post -c corename https://siteurl.com -recursive 2 -delay 1
This would do a basic index of the site, but it would not be the best. If you want simple, then there it is. It can be done.
I think this only works on Solr 5+.
Two other options you might want to look at are Crawl Anywhere and Heritrix.
