Crawling issue in Apache Nutch

I am working on a web project. My purpose is to take a web page and decide whether it is a blog or not. To do this I need a crawler, and I use Apache Nutch. I followed all of the steps on the Apache Nutch tutorial page, but it failed. For the bin/nutch crawl urls -dir crawl -depth 3 -topN 50 command my result is:
solrUrl is not set, indexing will be skipped...
2014-04-25 15:29:11.324 java[4405:1003] Unable to load realm info from SCDynamicStore
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2014-04-25 15:29:11
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-04-25 15:29:13, elapsed: 00:00:02
Generator: starting at 2014-04-25 15:29:13
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
My urls/seed.txt file is:
http://yahoo.com/
My regex-urlfilter.txt file is:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.)*yahoo.com/
My nutch-site.xml file is:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Baris Spider</value>
<description>This crawler is used to fetch text documents from
web pages that will be used as a corpus for Part-of-speech-tagging
</description>
</property>
</configuration>
What is wrong?

bin/nutch crawl urls -dir crawl -depth 3 is deprecated; use bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds> instead,
for example bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/your_core
As I read your log, you haven't provided a URL to Solr (Nutch needs it to send the crawled documents to). So you have to download and launch Solr before using these commands. Follow this Apache Solr link to read how to set up and install Solr.
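For reference, a rough sketch of what that looks like with a Nutch 1.x / Solr 4.x setup from that period (the Solr path, the core name your_core, and the number of rounds are placeholders, not taken from your environment):
# start Solr (Solr 4.x example setup)
cd apache-solr-4.x/example
java -jar start.jar
# then, from the Nutch runtime directory, run the crawl script:
bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/your_core 2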

Related

Use .htaccess to complete document extension

On my last server I had a .htaccess file that completed the URLs without redirecting. I didn't quite understand how it was configured, but I am sure it was done in the .htaccess file.
All the documentation I have read so far only covers replacing a single file extension (.htm gets redirected to .html).
Let's say I have the following files on my server:
0: 001.jpg
1: 002.jpg
2: 002.png
3: staff.php
when I request example.com/001, file 0 gets displayed and no extension is shown in my search bar
when I request example.com/002, I get error code 404, since there are 2 files starting with "002"
when I request example.com/002.j, I get image 1
when I request example.com/staff, I get the output of staff.php
How can I let .htaccess do that?
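One common way to get this kind of extension-less resolution in Apache is MultiViews content negotiation rather than rewrite rules. Whether it reproduces every detail above (such as the partial "002.j" match) depends on the server setup, so treat this only as a sketch:
# .htaccess sketch: with MultiViews, a request for /001 can be served
# from an existing file such as 001.jpg (requires mod_negotiation)
Options +MultiViews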

Using Scrapy to crawl all pages that are linked to a website with any depth we want

I would like to know whether it is possible to crawl all pages and links on a website to any depth, even if the top URL changes after following a few links. Here is an example:
Top URL: www.topURL.com
has 3 links: www.topURL.com/link1, www.topURL.com/link2 and www.topURL.com/link3
Then if we click on www.topURL.com/link1 it takes us to a page that itself has
2 links on it: www.topURL.com/link4 and www.topURL.com/link5
but if we click on www.topURL.com/link4 it takes us to a page that has the following 2 links: www.anotherURL.com/link1 and www.thirdURL.com/link1
Can Scrapy, or any Python crawler/spider, start from www.topURL.com, then follow links and end up on www.thirdURL.com/link1?
Is there a limit on how deep it can go?
Is there any code example to show me how to do it?
Thanks for the help.
Take a look at Scrapy's CrawlSpider class:
CrawlSpider is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
To accomplish your goal you'd simply have to set very basic rules:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract and follow all links, passing each response to parse_item
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('crawling {}'.format(response.url))
The crawler above will crawl every URL on the website that matches allowed_domains and call back to parse_item.
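If you save the spider as a standalone file you can try it without creating a full project; the filename follow_all.py below is just a placeholder:
scrapy runspider follow_all.py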
Note that by default LinkExtractor ignores links to media files (pdf, mp4, etc.).
To expand on the depth question: Scrapy does have a depth restriction setting, but it defaults to 0 (i.e. infinite depth):
https://doc.scrapy.org/en/0.9/topics/settings.html#depth-limit
# settings.py
DEPTH_LIMIT = 0
Also, Scrapy crawls depth-first by default; if you want faster, broader coverage, breadth-first order might help: https://doc.scrapy.org/en/0.9/topics/settings.html#depth-limit
# settings.py
SCHEDULER_ORDER = 'BFO'
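Note that SCHEDULER_ORDER comes from very old Scrapy releases; in current versions breadth-first crawling is configured through the queue settings instead, roughly like this (a sketch based on the current Scrapy FAQ):
# settings.py (newer Scrapy versions)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'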

Nutch fetches already fetched URLs

I am trying to crawl a website using Nutch. I use these commands:
inject, for injecting URLs into the DB
a loop of generate/fetch/parse/updatedb
I noticed that Nutch fetches already fetched URLs on each loop iteration.
Configuration I have made:
added a filter to regex-urlfilter.txt
added config to nutch-site.xml:
http.agent.name set to MyNutchSpider
http.robots.agents set to MyNutchSpider
file.content.limit set to -1
http.content.limit set to -1
ftp.content.limit set to -1
fetcher.server.delay set to 1.0
fetcher.threads.fetch set to 1
parser.character.encoding.default
plugin.includes: added protocol-httpclient
storage.data.store.class set to use a custom storage class
I use these commands:
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
I have tried Nutch 2.2.1 with MySQL and 2.3 with MongoDB. The result is the same: already fetched URLs are re-fetched on each crawl loop iteration.
What should I do to fetch all not-yet-crawled URLs?
This is an open issue for Nutch 2.X. I faced it this weekend too.
The fix is scheduled for release 2.3.1: https://issues.apache.org/jira/browse/NUTCH-1922.

Nutch not working on linux environment

I've installed Nutch on a Linux system. When I go into the 'bin' directory and run ./nutch, it shows the following:
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
index run the plugin-based indexer on parsed segments and linkdb
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
clean remove HTTP 301 and 404 documents from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
The above output shows that Nutch is correctly installed on this system. But when I then run ./nutch crawlresult urls -dir crawl -depth 3 I get the following output:
./nutch: line 272: /usr/java/jdk1.7.0/bin/java: Success
whereas I was expecting Nutch to begin crawling and show the logs. Please tell me what is wrong.
You can't run the command "crawlresult"; you have to use a command from the nutch script's command list. If you want to crawl with Nutch, follow this tutorial.
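For example, with the release shown above you could either call the deprecated one-step crawl command or, preferably, the crawl script; the Solr URL, paths, and round count here are placeholders:
bin/nutch crawl urls -dir crawl -depth 3
# or, preferably, the crawl script:
bin/crawl urls/ crawl/ http://localhost:8983/solr/ 3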

Nutch: How to separate the URL results from depth 1 and the results from depth 2

After I run this command in nutch:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get a list of URLs, say 50 URLs, but does anyone know how to separate the URLs by depth?
So that I get a result like:
URL from depth 1 = 5 urls
url
url
url
......
URL from depth 2 = 15 urls
url
url
url
......
Something like that. Has anyone already solved this problem?
Is there a function in Nutch to do this?
Any help will be appreciated.
There is no built-in function in Nutch to do this. But a simple hack is to run the Nutch command with depth 1, copy the web table, and then run it again with depth 1. That way you will have two versions of the Nutch web table, one corresponding to each round.
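Alternatively, if you are on Nutch 1.x (as the bin/nutch crawl command in the question suggests), each generate/fetch round writes its own segment directory, so a rough way to see which URLs belong to which depth is to dump each segment separately; the timestamped directory name below is a placeholder:
ls crawl/segments/
# one segment directory is created per round; dump just its URL entries
bin/nutch readseg -dump crawl/segments/20140425152913 depth1_dump -nocontent -noparse -noparsedata -noparsetext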
