Nutch fetches already fetched URLs

I am trying to crawl a website using Nutch. I use these commands:
inject to inject URLs into the DB
a loop of generate/fetch/parse/updatedb
I noticed that Nutch fetches already fetched URLs on each loop iteration.
The configuration I have made:
added a filter to regex-urlfilter.txt
added these settings to nutch-site.xml:
http.agent.name set to MyNutchSpider
http.robots.agents set to MyNutchSpider
file.content.limit set to -1
http.content.limit set to -1
ftp.content.limit set to -1
fetcher.server.delay set to 1.0
fetcher.threads.fetch set to 1
parser.character.encoding.default
plugin.includes modified to add protocol-httpclient
storage.data.store.class set to use a custom storage class
The commands I run in each loop iteration are:
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
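Put together, the loop looks roughly like this (a sketch; running three rounds is arbitrary):
# repeat the generate/fetch/parse/updatedb cycle a fixed number of times
for i in 1 2 3; do
  bin/nutch generate -topN 10
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb -all
done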
I have tried Nutch 2.2.1 with MySQL and 2.3 with MongoDB. The result is the same: already fetched URLs are re-fetched on each crawl loop iteration.
What should I do to fetch only the URLs that have not been crawled yet?

This is an open issue for Nutch 2.X. I faced it this weekend too.
The fix is scheduled for release 2.3.1: https://issues.apache.org/jira/browse/NUTCH-1922.

Related

Curl Get Swift Storage files saved by Spark

As is known, Apache Spark saves files in parts, i.e. foo.csv/part-r-00000, etc.
I save the files on Swift object storage, and now I want to get them using the OpenStack Swift API, but when I do a curl on foo.csv I get a zero-size file.
How do I download the contents of the file?
You can take any REST client and list the content of the object store. Don't do a curl on 'foo.csv', since it's a zero-size object. You need to list the container with the prefix 'foo.csv'; this will return all the parts.
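For example, with curl (a sketch; the token, endpoint, account, and container names are placeholders):
# list all part objects written under foo.csv/ in the container
curl -H "X-Auth-Token: $TOKEN" \
  "https://swift.example.org/v1/AUTH_account/container?prefix=foo.csv/"
# then download each listed part individually
curl -H "X-Auth-Token: $TOKEN" \
  "https://swift.example.org/v1/AUTH_account/container/foo.csv/part-r-00000" -o part-r-00000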
Alternatively, you can use Apache Spark and read foo.csv (Spark will automatically list and return all the parts).

Crawling issue in Apache Nutch

I am working on a web project. My purpose is to take a web page and decide whether it is a blog or not. To do this I have to use a crawler, and I use Apache Nutch. I executed all of the steps on the Apache Nutch tutorial page, but I have failed. For the bin/nutch crawl urls -dir crawl -depth 3 -topN 50 command, my result is:
solrUrl is not set, indexing will be skipped...
2014-04-25 15:29:11.324 java[4405:1003] Unable to load realm info from SCDynamicStore
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2014-04-25 15:29:11
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-04-25 15:29:13, elapsed: 00:00:02
Generator: starting at 2014-04-25 15:29:13
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
My urls/seed.txt file is:
http://yahoo.com/
My regex-urlfilter.txt file is:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.)*yahoo.com/
My nutch-site.xml file is:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Baris Spider</value>
    <description>This crawler is used to fetch text documents from
      web pages that will be used as a corpus for Part-of-speech-tagging
    </description>
  </property>
</configuration>
What is wrong?
bin/nutch crawl urls -dir crawl -depth 3 is deprecated; use this instead: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
For example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/your_core
As I read your log, you haven't provided a URL to Solr (Nutch needs it to send the crawled documents). So you have to download and launch Solr before using these commands. Follow the Apache Solr documentation to read how to set up and install Solr.
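For example, with a recent Solr release (a sketch; the core name "nutch" and the round count are placeholders, and older Solr 4.x releases are started with java -jar start.jar from the example/ directory instead):
bin/solr start
bin/solr create -c nutch
bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/nutch 3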

Nutch not working on linux environment

I've installed Nutch on a Linux system. When I go into the 'bin' directory and run ./nutch, it shows the following:
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
index run the plugin-based indexer on parsed segments and linkdb
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
clean remove HTTP 301 and 404 documents from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
The above output shows that Nutch is correctly installed on this system. But when I then run ./nutch crawlresult urls -dir crawl -depth 3, I get the following output:
./nutch: line 272: /usr/java/jdk1.7.0/bin/java: Success
whereas I was expecting Nutch to begin crawling and show the logs. Please tell me what is wrong.
You can't run the command "crawlresult". You have to use one of the commands from the nutch script's command list. To crawl with Nutch, follow the tutorial.
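For example, with the deprecated one-step crawler the corrected command would be:
./nutch crawl urls -dir crawl -depth 3
or, preferably, with the crawl script (the Solr URL and number of rounds here are placeholders):
bin/crawl urls/ crawl/ http://localhost:8983/solr/ 3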

Nutch: How to separate URL results from depth 1 and depth 2

After I run this command in nutch:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get a list of URLs, say 50 URLs, but does anyone know how to separate all the URLs by depth?
So I will get the result:
URL from depth 1 = 5 urls
url
url
url
......
URL from depth 2 = 15 urls
url
url
url
......
Something like that. Has anyone already solved this problem?
Is there a function in Nutch to solve it?
Any help would be appreciated.
There is no built-in function in Nutch to do this. But a simple hack would be to run the Nutch command with depth 1, copy the web table, and then run it again with depth 1. That way you will have two versions of the Nutch web table, one corresponding to each round.
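A rough sketch of that idea with the 1.x-style commands from the question (directory names are placeholders, and readdb -dump is used here instead of copying the table):
# first round: depth 1, then dump the crawl db
bin/nutch crawl urls -dir crawl -depth 1 -topN 5
bin/nutch readdb crawl/crawldb -dump dump_round1
# second round against the same crawl directory, then dump again
bin/nutch crawl urls -dir crawl -depth 1 -topN 5
bin/nutch readdb crawl/crawldb -dump dump_round2
# URLs present in dump_round2 but not in dump_round1 were discovered at depth 2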

how to verify links in a PDF file

I have a PDF file and I want to verify whether the links in it are proper. Proper in the sense that all specified URLs are linked to web pages and nothing is broken. I am looking for a simple utility or script which can do this easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
The remaining broken links, and the page numbers on which they appear, are logged in brokenlinks.txt
I have no idea whether something like that exists, so I googled and also searched on Stack Overflow, but did not find anything useful yet. So I would like to know if anyone has any idea about it.
Update: edited to make the question clearer.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the Linux command-line utility 'pdftotext'; you can find its man page here:
pdftotext man page
The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you could process the PDF file through pdftotext:
pdftotext file.pdf file.txt
Once processed, you could write a simple Perl script that searches the resulting text file for http URLs and retrieves them using LWP::Simple. LWP::Simple->get('http://...') will allow you to validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more non-space characters", assuming the URLs are properly URL-encoded.
There are two lines of enquiry with your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.
Or, if you want to verify that a website exists, then I would recommend Python + Requests, as you could script checks to see whether websites exist and don't return error codes.
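If you go the existence-check route, the same idea with curl instead of Python + Requests looks roughly like this (a sketch; the URL is a placeholder and brokenlinks.txt follows the naming in the question):
# probe one URL and record it if the HTTP status indicates a failure
url="http://www.example.com/"
code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
if [ "$code" -ge 400 ] || [ "$code" -eq 0 ]; then
  echo "$url" >> brokenlinks.txt
fi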
It's a task I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to be processed automatically.
Collect links by:
enumerating links using the API, or dumping the PDF as text and linkifying the result, or saving it as HTML with PDFMiner.
Make requests to check them:
there is a plethora of options, depending on your needs.
The advice in https://stackoverflow.com/a/42178474/1587329 was the inspiration to write this simple tool (see gist):
'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys

import PyPDF2


# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename, 'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'
    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if pageObject.has_key(key):
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if u[ank].has_key(uri):
                    yield u[ank][uri]


def check_http_url(url):
    urllib.urlopen(url)


if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)
Save it as filename.py and run it as python filename.py pdfname.pdf (note that the script is Python 2).
