My goal is to find out how many URLs in an HTML page are invalid (404, 500, HostNotFound). Is there a configuration change in Nutch through which the crawler follows broken links and indexes them in Solr?
Once all the broken and valid links are indexed in Solr, I can simply check which URLs are invalid and remove them from my HTML page.
Any help will be highly appreciated.
Thanks in advance.
You don't need to index to Solr to find broken links.
Do the following:
bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
It will show the links that returned 404 as:
Status: 3 (db_gone)
Metadata: _pst_: notfound(14)
Go through the output file and you'll find all the broken links.
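If you only need a quick look, a simple grep over the dump works, since each record starts with a line containing the URL and the Status line follows right after it (the dump path matches the command above; the part file name may differ in your setup):
# print each db_gone status line together with the URL line right before it
grep -B1 'db_gone' myDump/part-*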
Example:
Put the following URLs in the URL file: http://www.wikipedia.com/somethingUnreal and http://en.wikipedia.org/wiki/NocontentPage
Run the crawl command: bin/nutch crawl urls.txt -depth 1
Run the readdb command: bin/nutch readdb crawl-20140214115539/crawldb/ -dump mydump
Open the output file "part-xxxxx" with a text editor
Results:
http://en.wikipedia.org/wiki/NocontentPage Version: 7
Status: 1 (db_unfetched)
...
Metadata: _pst_: exception(16), lastModified=0: Http code=503, url=http://en.wikipedia.org/wiki/NocontentPage
http://www.wikipedia.com/somethingUnreal Version: 7
Status: 5 (db_redir_perm)
...
Metadata: Content-Type: text/html_pst_: moved(12), lastModified=0: http://www.wikipedia.org/somethingUnreal
This command will give you a dump of just the broken links:
bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump -status db_gone
Remember to exclude URLs carrying the following tag in the dump, since those entries are only marked gone because robots.txt disallowed them, not because the pages are broken:
Metadata:
_pst_=robots_denied(18)
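Here is a small sketch that pulls just the broken URLs out of such a dump while skipping those robots_denied entries. It assumes the record layout shown above (a URL line containing "Version:", followed by the Status and Metadata lines); adjust the dump path to your own run:
# list db_gone URLs from a crawldb dump, skipping entries that are only blocked by robots.txt
awk '
  /Version: /      { if (gone) print url; url = $1; gone = 0 }   # start of a new record: flush the previous one
  /db_gone/        { gone = 1 }                                   # broken link
  /robots_denied/  { gone = 0 }                                   # denied by robots.txt, not actually broken
  END              { if (gone) print url }
' myDump/part-*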
I have a website script; it is 212 MB and in RAR format. I could not upload it via FileZilla FTP (it gave me a timeout error after some time), and I could not upload it from the cPanel file manager either, as it also kept showing an error. I then used a PHP script to upload it directly from a link, but now I cannot extract it because it is RAR, not ZIP. I converted the RAR into a ZIP and have it on Dropbox and Google Drive, but there is no direct link I can use to upload it via the PHP script. So, is there any way to extract the RAR file from cPanel, using a PHP script, or with some other tweak? I have been working on this for two hours and cannot find a way around it.
Create a PHP file and extract the .rar with it; the RarArchive class below comes from the PECL rar extension, so that needs to be available on the server. Use the following code:
// open the archive (requires the PECL rar extension)
$archive = RarArchive::open('archive.rar');
if ($archive === false) {
    die('Could not open archive.rar');
}
// extract every entry in the archive to the target directory
$entries = $archive->getEntries();
foreach ($entries as $entry) {
    $entry->extract('/extract/to/this/path');
}
$archive->close();
I asked Nutch to crawl a local file: http://localhost:8080/a.txt. I am running the HTTP server and I can see Nutch trying to access the file (and, before it, /robots.txt). I am using Cassandra as the backend.
However, I cannot see any data from the crawl. When I run
./bin/nutch readdb -dump data ..., I get the following output.
Can someone help me with a sane answer to this question? Where is the webpage data?
$ cat data/part-r-00000
http://localhost:8000/a.html key: localhost:http:8000/a.html
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1426811920382
prevFetchTime: 1424219908314
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
But when I run nutch from the terminal, it shows:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
etc etc.....
Please tell me what to do.
Hey Tejasp, I did what you told me: I changed NUTCH_HOME=/nutch/runtime/local/bin, and the Crawl.java file is there, but when I ran this:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
[Fatal Error] nutch-site.xml:6:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:405)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:585)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:290)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 10 more
It showed me this result. Now what?
I also checked the nutch-site.xml file; I have made the following edits in it:
<configuration>
<property>
<name>http.agent.name</name>
<value>PARAM_TEST</value><!-- Your crawler name here -->
</property>
</configuration>
Sir, I did as you told me. This time I compiled Nutch with 'ant clean runtime', and my environment is:
NUTCH_HOME=/nutch/runtime/deploy/bin
NUTCH_CONF_DIR=/nutch/runtime/local/conf
Now when I run the same command, it gives me this error:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
All I want is to create a search engine that can search for certain things on certain websites, for my final-year project.
It seems that in Nutch version 2.x the name of the Crawl class has changed to Crawler.
I'm using Hadoop to run Nutch, so I use the following command for crawling:
hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2
If you crawl using Nutch on its own, the nutch script should reference the new class name.
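For example, the bin/nutch script can also run a class by its fully qualified name, so in local mode something along these lines should work (a sketch, assuming a Nutch 2.x runtime/local build; the urls seed directory and the depth are just examples):
# run the renamed Crawler class directly through the bin/nutch script (local mode)
bin/nutch org.apache.nutch.crawl.Crawler urls -depth 2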
Regarding "but when I run nutch from the terminal it shows": this verifies that the NUTCH_HOME/bin/nutch script is present at the correct location.
Please export NUTCH_HOME and NUTCH_CONF_DIR
Which mode of Nutch are you trying to use?
Local mode: jobs run without Hadoop. You need to have the Nutch jar inside NUTCH_HOME/lib. It is named after the version you are using, e.g. for Nutch release 1.3 the jar name is nutch-1.3.jar.
Hadoop mode: jobs run on a Hadoop cluster. You need to have the Nutch job file inside NUTCH_HOME. It is named after the release version, e.g. nutch-1.3.job.
If you have the file corresponding to your mode, extract it and see whether the Crawl.class file is actually present inside it.
If Crawl.class is not present, obtain a fresh jar/job file by compiling the Nutch source.
EDIT:
Don't use ant jar. Use ant clean runtime instead. The output is generated inside the NUTCH_INSTALLATION_DIR/runtime/local directory. Run Nutch from there; that will be your NUTCH_HOME.
Export the required variables JAVA_HOME, NUTCH_HOME and NUTCH_CONF_DIR before running.
I have a feeling that the Crawl.class file is not present in the jar. Please extract the jar and check. FYI: the command to extract a jar file is jar -xvf <filename> (see the sketch after these steps).
If, after extracting it, you see that the class file isn't present in the jar, check whether the Nutch source code you downloaded has the Java file, i.e. nutch-1.x\src\java\org\apache\nutch\crawl\Crawl.java. If it isn't there, get it from the internet and rebuild the Nutch jar.
If, after extracting it, the jar does contain the class file and you still see the issue, then something is wrong with the environment. Try some other command, like inject, and look for errors in the hadoop.log file. Let me know what you see.
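A quick way to run these checks from the shell (a sketch; the paths and the nutch-1.3.jar name are examples for a 1.3 local-mode build, so adjust them to your version and install directory):
# point the environment at the local runtime produced by 'ant clean runtime'
export JAVA_HOME=/usr/lib/jvm/default-java          # example path
export NUTCH_HOME=/nutch/runtime/local
export NUTCH_CONF_DIR=$NUTCH_HOME/conf

# check that the Crawl class is actually inside the jar (jar -tf lists contents without extracting)
jar -tf $NUTCH_HOME/lib/nutch-1.3.jar | grep org/apache/nutch/crawl/Crawl.class

# then try the crawl again
$NUTCH_HOME/bin/nutch crawl urls -dir crawl -depth 2 -topN 10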
I am following this tutorial: http://wiki.apache.org/nutch/NutchTutorial
and http://www.nutchinstall.blogspot.com/
When I run the command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get this error:
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233259
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233337
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I am using Cygwin on Windows to run Nutch.
Check whether the parse_data folder is present at the path given in the error above. Make sure that the folders inside the crawl folder you passed to the command have been created and are available to be used.
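A quick way to check this from the Cygwin shell (a sketch; the crawl path follows the command above):
# flag any segment that is missing its parse_data directory
for seg in crawl/segments/*; do
  [ -d "$seg/parse_data" ] || echo "missing parse_data: $seg"
done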
I had a similar problem. I deleted the db and the directories. After that, it ran OK.
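In other words, something along these lines (assuming the crawl output lives in the crawl directory from the command above; note that this throws away everything crawled so far):
# remove the stale crawldb, linkdb and segments, then start a fresh crawl
rm -rf crawl
bin/nutch crawl urls -dir crawl -depth 3 -topN 5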
I crawl sites with Nutch 1.3, and I see this exception in my log when Nutch crawls my sites:
Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:247)
at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
)
How can I solve this? Please help.
According to the docs:
"MalformedURLException is thrown to indicate that a malformed URL has occurred. Either no legal protocol could be found in a specification string or the string could not be parsed."
The thing to note here is that this exception is not thrown when the server is down or when the path points to a missing file; it occurs only when the URL cannot be parsed.
The error indicates that there is no protocol, and also that the crawler does not see any URL at all:
Malformed URL: '' , skipping (java.net.MalformedURLException: no protocol:
Here is an interesting article that I came across; have a look: http://www.symphonious.net/2007/03/29/javaneturl-or-javaneturi/
What is the exact URL you are trying to parse?
After having set everything up with regex-urlfilter.txt and seed.txt, try this command:
./nutch plugin protocol-file org.apache.nutch.protocol.file.File file:\\\e:\\test.html
(assuming the file is located at e:\test.html, as in my example).
Before this, I always ran this:
./nutch plugin protocol-file org.apache.nutch.protocol.file.File \\\e:\test.html
and got this error, because the protocol file: was missing:
java.net.MalformedURLException: no protocol: \\e:\test.html
Malformed URL: ''
means that the URL was empty instead of being something like http://www.google.com.
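If the empty URL is coming from the seed list (an assumption; it could also come from parsed outlinks), one quick sanity check is to drop blank or protocol-less lines before injecting. The seed file path here is just an example:
# keep only seed lines that start with a scheme such as http:// or https://
grep -E '^[A-Za-z][A-Za-z0-9+.-]*://' urls/seed.txt > urls/seed.clean.txt
mv urls/seed.clean.txt urls/seed.txt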