I am using Nutch 1.4 locally on iOS to crawl a website, and nutch readseg -dump does not return any relevant information. What am I missing?
I am trying to extract 'category' as a new metadata field from the URL. I am using index-replace to extract a substring from the URL. I am able to run the code and index the documents in Google Cloud Search, but it is not capturing the category.
To debug this end to end, I'd like to verify that the correct value is extracted by Nutch into the category metadata. I verified that the regex is correct with a regex tester. I want to log the metadata values (url, category) to the log or stdout, but I do not see any pertinent information in hadoop.log, even at DEBUG level.
$bin/nutch readseg -dump TestCrawl/segments/* segmentAllContent
SegmentReader: dump segment: TestCrawl/segments/20190128171825
SegmentReader: done
logs/hadoop.log:
2019-01-29 11:40:02,275 INFO segment.SegmentReader - SegmentReader: dump segment: TestCrawl/segments/20190128171825
2019-01-29 11:40:02,463 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable.
log4j.properties
log4j.logger.org.apache.nutch=DEBUG
nutch-site.xml
<property>
<name>index.replace.regexp</name>
<value>
urlmatch=.*mycompany\.com\/([a-zA-Z0-9-]+)
url:category=$1
</value>
</property>
<property>
<name>urlmeta.tags</name>
<value>title,category</value>
<description>
test
</description>
</property>
<property>
<name>index.parse.md</name>
<value>*</value>
<description>test</description>
</property>
The readseg -dump command only writes everything contained in the segment as plain text to the output directory segmentAllContent. It does not run the indexer and consequently does not call the plugin index-replace. You may use the command bin/nutch indexchecker to check whether the plugin is configured properly.
Please note that the plugin index-replace is not available in Nutch 1.4, it has been added with Nutch 1.11.
Example of how to use the indexchecker to check the index-replace plugin:
% bin/nutch indexchecker \
-Dplugin.includes='protocol-okhttp|parse-html|index-(basic|replace|static)' \
-Dindexingfilter.order='org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.staticfield.StaticFieldIndexer org.apache.nutch.indexer.replace.ReplaceIndexer' \
-Dindex.static='category:unknown' \
-Dindex.replace.regexp=$'hostmatch=localhost\ncategory=/.+/intranet/' \
http://localhost/
...
host : localhost
id : http://localhost/
title : Apache2 Ubuntu Default Page: It works
category : intranet
url : http://localhost/
...
The plugin index-static is configured to add a field "category" with the value "unknown".
The plugin index-replace changes the value to "intranet" if the hostname is "localhost" (the $'...' notation expands \n).
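You can point the same check at the configuration from the question. A minimal sketch, assuming nutch-site.xml already contains the index.replace.regexp property shown above and that plugin.includes lists a protocol plugin, parse-html, index-basic and index-replace; the URL is a placeholder for a real page on the site:
% bin/nutch indexchecker \
-Dplugin.includes='protocol-http|parse-html|index-(basic|replace)' \
https://www.mycompany.com/some-category/some-page
If the replacement is configured correctly, the dumped fields should include a category entry holding the path segment captured by the regexp; if the field is missing, the problem is in the index.replace.regexp value rather than in the crawl or the Google Cloud Search indexing.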
Related
English is not my mother tongue; please excuse any errors on my part.
I am trying to run Nutch 1.12 on Cygwin on Windows, and I followed the Nutch Tutorial. But when I execute the command "bin/nutch inject crawl/crawldb urls", I get this in Cygwin:
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-03-10 19:29:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
I have tried the methods mentioned here and here, but neither worked for me.
How can I solve this problem and run Nutch on Windows? Thank you!
I had the same problem. I solved it by setting up Hadoop and placing winutils.exe in the %HADOOP_HOME%\bin folder. If you still get the same error, also put hadoop.dll in that folder.
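For reference, a minimal sketch of the Cygwin side of that setup; C:\hadoop is a placeholder for wherever you unpacked Hadoop and put winutils.exe:
export HADOOP_HOME='C:\hadoop'                          # its bin\ must contain winutils.exe (and hadoop.dll if needed)
export PATH="$PATH:$(cygpath -u "$HADOOP_HOME")/bin"    # make the binaries visible in the Cygwin shell too
bin/nutch inject crawl/crawldb urls                     # re-run the command that failed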
I can view a list of running jobs on YARN at this URI:
https://server1.company.com:8443/gateway/yarnui/yarn/apps/RUNNING
Further I can access job specific information by opening the TrackingUI:
https://server1.company.com:8443/gateway/yarnui/yarn/proxy/application_1481927689976_0178
However, when I do this, I only get the HTML document, none of the other required .js, .css and .png files:
GET https://server.company.com:8443/gateway/yarnui/yarn/proxy/application_1481927689976_0178
200 OK (text/html)
GET https://server.company.com:8443/proxy/application_1481927689976_0178/static/bootstrap.min.css
404 Not Found (text/html)
If I go directly to the server on which the job is running:
http://server2.company.com:8088/proxy/application_1481927689976_0178
Everything works fine:
GET http://server2.company.com:8088/proxy/application_1481927689976_0178
200 OK (text/html)
GET http://server2.company.com:8088/proxy/application_1481927689976_0178/static/bootstrap.min.css
200 OK (text/css)
Sounds like a YARN config issue, but I've set yarn.resourcemanager.webapp.address to the correct value:
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>server2.company.com:8088</value>
</property>
Any ideas why I can’t access these files?
IBM's support fix addresses this exact issue:
http://www-01.ibm.com/support/docview.wss?uid=swg21980169
Camel route example:
from("direct:loadCamelFTP").
to("grape:org.apache.camel/camel-ftp/2.15.2");
The Groovy documentation explains that the default repository is located in ~/.groovy/grape and can be changed using groovy -Dgrape.root=/repo/grape yourscript.groovy
What is the proper way to do this?
Is there a configuration option in Camel, or can I set the property in the WildFly 9.0 configuration?
Unfortunately, the Grape repository location is a JVM-level setting, so you have to configure it at the container level. For example, for Spring Boot:
java -Dgrape.root=/repo/grape -jar camel-app.jar
For Karaf/ServiceMix/Fuse that would be adding grape.root=/repo/grape to the KARAF_HOME/etc/system.properties file.
For WildFly that would be adding the following lines to your standalone.xml:
<system-properties>
<property name="grape.root" value="/root/grape"/>
</system-properties>
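As an alternative not covered in the answer above, the same JVM property can be appended to WildFly's bin/standalone.conf, which the startup script sources; a minimal sketch:
# bin/standalone.conf -- add the Grape property to the JVM options
JAVA_OPTS="$JAVA_OPTS -Dgrape.root=/repo/grape"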
I'm going through Hadoop: The Definitive Guide. In it the author describes how one can override the whoami mechanism by defining hadoop.job.ugi. Well, I'm not getting anywhere with it...
I'm using hduser@ubuntu (my box name).
I have created a localhost.conf to override the default conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>hadoop.job.ugi</name>
<value>user, supergroup</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
But when I run hadoop fs -conf localhost.conf -mkdir newDir followed by hadoop fs -conf localhost.conf -ls ., I see that the directory newDir is created by hduser and not by user.
I must be missing a setting....
Thanks in advance.
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
But when I run nutch from the terminal, it shows:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
etc etc.....
Please tell me what to do.
Hey Tejasp, I did what you told me: I changed NUTCH_HOME to /nutch/runtime/local/bin and the Crawl.java file is there, but when I ran this:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
[Fatal Error] nutch-site.xml:6:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:405)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:585)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:290)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 10 more
It showed me this result. Now what...?
Also, I checked the nutch-site.xml file; I have made the following edits in it:
<configuration>
<property>
<name>http.agent.name</name>
<value>PARAM_TEST</value><!-- Your crawler name here -->
</property>
</configuration>
Sir, I did as you told me. This time I compiled Nutch with 'ant clean runtime', and my environment is:
NUTCH_HOME=/nutch/runtime/deploy/bin
NUTCH_CONF_DIR=/nutch/runtime/local/conf
Now when I run the same command, it gives me this error:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
All I want is to create a search engine which can search for certain things on certain websites, for my final year project...
It seems that in Nutch version 2.x the name of the Crawl class has changed to Crawler.
I'm using Hadoop to run Nutch, so I use the following command for crawling:
hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2
If you crawl using Nutch on its own, the nutch script should reference the new class name.
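For completeness, a sketch of the equivalent local-mode (non-Hadoop) invocation; it assumes the bin/nutch script passes an unrecognised command through as a fully qualified class name, which is worth double-checking against your release:
bin/nutch org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2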
"but when i run nutch from terminal it show"
This verifies that the NUTCH_HOME/bin/nutch script is present at the correct location.
Please export NUTCH_HOME and NUTCH_CONF_DIR.
Which mode of Nutch are you trying to use?
Local mode: jobs run without Hadoop. You need to have the Nutch jar inside NUTCH_HOME/lib. It is named after the version that you are using, e.g. for Nutch release 1.3 the jar name is nutch-1.3.jar.
Hadoop mode: jobs run on a Hadoop cluster. You need to have the Nutch job file inside NUTCH_HOME. It is named after the release version, e.g. nutch-1.3.job.
If you happen to have these files (corresponding to the mode), then extract them and see if the Crawl.class file is indeed present inside.
If the Crawl.class file is not present, then obtain a new jar/job file by compiling the Nutch source.
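A quick way to check without unpacking anything is to list the archive contents; this sketch assumes the Nutch 1.3 file names used above and that NUTCH_HOME is already exported:
# local mode: the versioned jar under lib/
jar tf $NUTCH_HOME/lib/nutch-1.3.jar | grep Crawl.class
# hadoop mode: the .job file is itself a jar/zip archive
jar tf $NUTCH_HOME/nutch-1.3.job | grep Crawl.class
If the class is present, you should see a line like org/apache/nutch/crawl/Crawl.class in the output.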
EDIT:
Don't use ant jar. Use ant clean runtime instead. The output gets generated inside the NUTCH_INSTALLATION_DIR/runtime/local directory. Run Nutch from there; that will be your NUTCH_HOME.
Export the required variables JAVA_HOME, NUTCH_HOME and NUTCH_CONF_DIR before running.
I am getting the feeling that the Crawl.class file is not present in the jar. Please extract the jar and check it out. FYI: the command to extract a jar file is jar -xvf <filename>.
If, after extracting the jar, you see that the class file isn't present in it, then check whether the Nutch source code you downloaded has the Java file, i.e. nutch-1.x\src\java\org\apache\nutch\crawl\Crawl.java. If it is not there, get it from the internet and rebuild the Nutch jar.
If the jar file does have the class file and you still see the issue, then something is wrong with the environment. Try out some other command like inject, and look for errors in the hadoop.log file. Let me know what you see.
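Putting the points from the EDIT together, a sketch of the environment setup and a quick smoke test; the JDK path is a placeholder and the other paths follow the ones mentioned in this thread:
export JAVA_HOME=/usr/lib/jvm/default-java    # placeholder: point this at your JDK
export NUTCH_HOME=/nutch/runtime/local        # produced by 'ant clean runtime'
export NUTCH_CONF_DIR=$NUTCH_HOME/conf
cd $NUTCH_HOME
bin/nutch inject crawl/crawldb urls           # simpler command to exercise the setup
tail -n 50 logs/hadoop.log                    # check here for errors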