Nutch 1.12 on Cygwin on Windows

English is not my mother tongue; please excuse any errors on my part.
I am trying to run Nutch 1.12 on Cygwin on Windows, and I followed the Nutch Tutorial. But when I execute the command "bin/nutch inject crawl/crawldb urls", I get this in Cygwin:
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-03-10 19:29:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
I have tried the methods mentioned here and here, but neither worked for me.
How can I solve this problem and run Nutch on Windows? Thank you!

I had the same problem. I solved it by setting up Hadoop and placing winutils.exe in the %HADOOP_HOME%/bin folder. If you still get the same error, try placing hadoop.dll in the same folder.
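For reference, a minimal sketch of that setup from a Cygwin shell; the C:\hadoop location is an assumption, and the winutils.exe/hadoop.dll binaries must be built for the same Hadoop version your Nutch release bundles:
# Assumed Hadoop location on Windows; adjust to your install.
export HADOOP_HOME='C:\hadoop'
export PATH="$PATH:/cygdrive/c/hadoop/bin"
# Both native helpers belong directly in %HADOOP_HOME%\bin:
ls /cygdrive/c/hadoop/bin/winutils.exe /cygdrive/c/hadoop/bin/hadoop.dll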

Related

Which directory contains third party libraries for Spark

When we use spark-submit, which directory contains the third-party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar.
Note: I did try adding jars to
$SPARK_HOME/lib_managed/jars
but spark-submit still results in a ClassNotFoundException for classes included in the added library.
Hope these points will help you.
$SPARK_HOME/lib/ [contains the jar files]
$SPARK_HOME/bin/ [contains the launch scripts - spark-submit, spark-class, pyspark, compute-classpath.sh etc.]
spark-submit ---will call---> spark-class.
spark-class internally calls compute-classpath.sh before executing / launching the job.
compute-classpath.sh picks up the jars available in $SPARK_HOME/lib and adds them to the CLASSPATH.
(execute ./compute-classpath.sh // returns the jars in the lib dir)
So try these options:
option-1 - Placing user-specific jars in $SPARK_HOME/lib/ will work.
option-2 - Tweak compute-classpath.sh so that it picks up your jars from a user-specific jar directory, as sketched below.
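For option-2, a minimal sketch of the tweak, assuming your Spark version still ships compute-classpath.sh (later releases replaced it); USER_JAR_DIR is a hypothetical directory name:
# Appended to $SPARK_HOME/bin/compute-classpath.sh, before the script
# emits its final CLASSPATH value.
USER_JAR_DIR="/opt/user-jars"        # hypothetical user-specific jar directory
for jar in "$USER_JAR_DIR"/*.jar; do
  CLASSPATH="$CLASSPATH:$jar"        # append each user jar to the classpath
done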

Error grabbing Grapes ... unresolved dependency ... not found

UPDATE 8/6:
The beefed-up logging has shown me that there is an issue deleting the old jar from the cache, which leads to the fatal "not found" error. There are other threads similar to this, but only where someone is locking the file with their IDE. We are running a single Groovy script from Jenkins, and no one is logged into this box.
We ran Process Explorer right after the failure and there were no locks. Then I logged in with the user that Jenkins uses to run the script, and I got no error deleting the files.
Also it seems there was a fix in IVY 2.1 to not fail when the jar cannot be deleted, and I'm on Ivy 2.2 (Groovy 1.8.4). What gives?
Couldn't delete outdated artifact from cache: C:\Users\myUser\.groovy\grapes\com.a.b.c\x-y-z\jars\x-y-z-1.496.jar
then the false(?) error:
Caught: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Error grabbing Grapes -- [unresolved dependency: com.a.b.c#x-y-z;1.+: not found]
at smokeTestSuccess.<clinit>(smokeTestSuccess.groovy)
Interestingly enough, this happens every day the first time the script is run after 5am. I guess the cache gets invalidated through some default config at 5am? Is this some kind of clue??
Original post:
I am intermittently getting an error when running a number of different Groovy scripts which all share an identical @Grab declaration (file names changed to protect the innocent). First, the full Grab declaration:
@GrabResolver(name = 'libs.release', root = 'http://myserver:8081/artifactory/libs-release', m2compatible = 'true')
@Grapes([
    @Grab(group = 'com.a.b.c', module = 'x-y-z', version = '1.+', changing = true),
    @Grab('commons-lang:commons-lang:2.3'),
    @Grab('log4j:log4j:1.2.16'),
    @Grab('gpars:gpars:0.12'),
    @Grab('jsr166y:jsr166y:1.7.0'),
    @Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.6'),
    @Grab('org.apache.commons:commons-collections:3.2.1'),
    @Grab('org.apache.httpcomponents:httpclient:4.2.2'),
    @Grab('org.apache.httpcomponents:httpcore:4.2.3'),
    @Grab('org.cyberneko.html:nekohtml:1.9.17'),
    @Grab('xerces:xercesImpl:2.11.0'),
])
@GrabConfig(systemClassLoader = true)
Then the error:
Caught: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Error grabbing Grapes -- [unresolved dependency: com.a.b.c#x-y-z;1.+: not found]
Upon doing numerous internet searches, the cause always seems to be very simple, either one of these two basic problems:
1. Repository unreachable
2. Jar file doesn’t exist
However, in the artifactory logs, I've proven that the file is actually being downloaded:
*Artifactory did accept the request for download:
2014-07-17 07:58:19,938 [ACCEPTED DOWNLOAD] libs-release-local:com/a/b/c/x-y-z/1.477/x-y-z-1.477.jar for anonymous/165.226.40.155.
*Artifactory did deliver jar:
20140717075820|156|REQUEST|165.226.40.155|non_authenticated_user|GET|/libs-release/com/a/b/c/x-y-z/1.477/x-y-z-1.477.jar|HTTP/1.1|200|1276695
The scripts all work nearly 100% of the time if they are simply restarted. This all leads me to believe that the issue is the Grab timing out. Theoretically, the second time I run the script the file is in the cache and things happen faster, so it doesn't fail.
For the above real request, I can see about 20 seconds of elapsed time in the http log from request to download.
Questions:
Does my theory seem correct?
Is there a way to increase the amount of time that the script will wait for the @Grab to resolve?
Does putting a try / catch block around the @Grab statements seem like a good idea? Or will that just hide the real problem?
thanks in advance!!!!
I think I finally figured out the answer to my own question.
I believe there is some sort of bug within Groovy 1.8.4 (or Ivy 2.2), especially since this behavior mirrors an ancient documented Ivy bug with this exact error message scheme.
Upgrading to Groovy 2.3.6 (which includes Ivy 2.3) appears to solve the issue.
I also still have no idea why the jars cannot be deleted; nothing is locking them. I experimented with moving the grape cache to a less secure folder to rule out a permission issue, but this didn't help:
-Dgrape.root=D:\Temp\grapeCache
UPDATE 8/19:
Once we upgraded to Groovy 2.3.6, the error went away, but I then figured out that the jar was no longer being downloaded at all when using the "1.+" version. Something in defaultgrapeConfig.xml was causing an issue. Everything finally worked properly after (in addition to the Groovy upgrade) we overrode defaultgrapeConfig.xml with our own stripped-down file using this command-line JAVA_OPT:
-Dgrape.config=D:\Temp\myGrapeConfig.xml
which had these contents:
<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes">
    </chain>
  </resolvers>
</ivysettings>
ALSO:
For completeness (further steps):
In Jenkins GUI, update the job(s):
a. Update the drop down for each script: Execute Groovy Script > Groovy Version > Groovy-2.3.6
b. Update the JAVA_OPTS for each script (you have to click the ‘advanced’ button under the script to see JAVA_OPTS):
-Dgrape.config=D:\Software\SfGrapeConfig.xml
Optional logging switches: -Dgroovy.grape.report.downloads=true -Divy.message.logger.level=4
In the actual Groovy script itself, delete this option within the @GrabResolver annotation: , m2compatible = 'true'
If you get this or a similar error:
"could not find client or server jvm under [Whatever JAVA_HOME is], please check that it is a valid jdk / jre containing the desired type of jvm"
Delete groovy.exe & groovyw.exe from D:\Software\Groovy-2.3.6\bin (if the exes do not exist, the Jenkins Groovy plugin will use the bat file versions instead, and those handle the 32-bit / 64-bit problem better than the exes)

Nutch showing the following errors, what to do

npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
but when I run nutch from the terminal it shows
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
etc etc.....
please tell me what to do
Hey Tejasp, I did what you told me: I changed NUTCH_HOME=/nutch/runtime/local/bin, and the Crawl.java file is there, but when I did this
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
[Fatal Error] nutch-site.xml:6:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:405)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:585)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:290)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 10 more
It showed me this result; now what...?
Also, I checked the nutch-site.xml file; I have made the following edits in it:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>PARAM_TEST</value><!-- Your crawler name here -->
  </property>
</configuration>
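That SAXParseException usually means something precedes the <?xml ...?> declaration (a blank line, stray characters, or a byte-order mark), so the parser meets the processing instruction somewhere other than the very start of the file; check the first bytes of nutch-site.xml. For comparison, a minimal well-formed file (agent name taken from the question) looks like:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>PARAM_TEST</value>
  </property>
</configuration>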
Sir, I did as you told me. This time I compiled Nutch with 'ant clean runtime', and my environment is now
NUTCH_HOME=/nutch/runtime/deploy/bin
NUTCH_CONF_DIR=/nutch/runtime/local/conf
and now when I run the same command it gives me this error
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
All I want is to create a search engine which can search for certain things on certain websites, for my final year project....
It seems that in Nutch version 2.x the name of the Crawl class has changed to Crawler.
I'm using Hadoop to run Nutch, so I use the following command for crawling:
hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2
If you crawl using Nutch on its own, the nutch script should reference the new class name.
but when I run nutch from the terminal it shows
This verifies that the NUTCH_HOME/bin/nutch script is present at the correct location.
Please export NUTCH_HOME and NUTCH_CONF_DIR.
Which mode of Nutch are you trying to use?
local mode: jobs run without Hadoop. You need to have the Nutch jar inside NUTCH_HOME/lib. It's named after the version that you are using, e.g. for Nutch release 1.3 the jar name is nutch-1.3.jar.
hadoop mode: jobs run on a Hadoop cluster. You need to have the Nutch job file inside NUTCH_HOME. It's named after the release version, e.g. nutch-1.3.job
If you happen to have these files (corresponding to the mode), then extract the archive and see if the Crawl.class file is indeed present inside it, as shown below. If the Crawl.class file is not present, then obtain a new jar/job file by compiling the Nutch source.
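A quick way to run that check without fully extracting the archive (the jar name follows the local-mode pattern above):
jar -tf nutch-1.3.jar | grep org/apache/nutch/crawl/Crawl.class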
EDIT:
Don't use ant jar. Use ant clean runtime instead. The output gets generated inside the NUTCH_INSTALLATION_DIR/runtime/local directory. Run Nutch from there; that will be your NUTCH_HOME.
Export the required variables JAVA_HOME, NUTCH_HOME and NUTCH_CONF_DIR before running, for example as below.
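(The JAVA_HOME path here is an assumption; substitute your own JDK location. The Nutch paths follow the runtime/local layout just described.)
export JAVA_HOME=/usr/lib/jvm/java-6-sun      # assumed JDK path; use yours
export NUTCH_HOME=/nutch/runtime/local
export NUTCH_CONF_DIR=/nutch/runtime/local/conf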
I am getting a feeling that the Crawl.class file is not present in the jar. Please extract the jar and check it out. FYI: the command to extract a jar file is jar -xvf <filename>
If after #2 you see that the class file isn't present in the jar, then see if the Nutch source code that you downloaded has the Java file, i.e. nutch-1.x\src\java\org\apache\nutch\crawl\Crawl.java. If it is not present, get it from the internet and rebuild the Nutch jar.
If after #2 the jar file has the class file and you still see the issue, then something is wrong with the environment. Try out some other command like inject. Look for errors in the hadoop.log file. Let me know what you see.

Nutch path error

Following this tutorial http://wiki.apache.org/nutch/NutchTutorial
and http://www.nutchinstall.blogspot.com/
when I run the command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get this error:
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233259
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233337
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I am using Cygwin on Windows to run Nutch.
Check if you have the parse_data folder present at the paths given in the error above. Make sure that the folders inside the crawl folder that you gave in the command are created and available to be used.
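A quick way to check, from the directory where you ran the crawl (the segment paths come from the error above):
ls crawl/segments/*/parse_data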
I had a similar problem. I deleted the db and the directories. After that, it ran OK.

Disable automatic download for Groovy grapes

A sample script ss.groovy:
@Grab(group='org.codehaus.groovy.modules.http-builder',
      module='http-builder',
      version='0.5.0')
import groovyx.net.http.HTTPBuilder
println('done')
for some reason takes ~25 seconds to load when run with
groovy ss.groovy
and ~5 seconds when run with
groovy -Dgroovy.grape.autoDownload=false ss.groovy
as per this StackOverflow explanation. I tried doing manual initialization with
Grape.enableAutoDownload = false
Grape.grab(group:'org.codehaus.groovy.modules.http-builder',
module:'http-builder',
version:'0.5.0')
import groovyx.net.http.HTTPBuilder
println('done')
but this fails on import with:
/tmp/ss.groovy: 3: unable to resolve class groovyx.net.http.HTTPBuilder
@ line 3, column 1.
import groovyx.net.http.HTTPBuilder
^
Is there a contained way to either:
Make it not download the artifacts automatically (preferred, as it allows for solving other issues, e.g. an external site being down while an artifact already exists in the local cache)
Make it start up faster in any other way
By contained I mean that all additional instructions should be either within the script or, if no such option exists, an acceptable default (e.g. don't check the cached artifacts for updates - I would still, however, like to have automatic downloads globally) to be put in one of the Groovy config files (e.g. ~/.groovy/grapeConfig.xml or similar).
Update: The issue has been fixed: @GrabConfig(autoDownload=false) will be available in Groovy 2.2
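Once that is available, a minimal sketch of the intended usage (assumes Groovy 2.2+; coordinates taken from the sample script above):
// Resolve from the local grape cache only; do not hit the network.
@GrabConfig(autoDownload = false)
@Grab(group='org.codehaus.groovy.modules.http-builder',
      module='http-builder',
      version='0.5.0')
import groovyx.net.http.HTTPBuilder
println('done')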
Why not install a repository manager locally?
http://nexus.sonatype.org/
I use Nexus to proxy and cache all my third-party repositories. Groovy is then configured to retrieve from either its local cache or Nexus:
<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes">
      <filesystem name="cachedGrapes">
        <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
        <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision].[ext]"/>
      </filesystem>
      <!-- Local Nexus Repository -->
      <ibiblio name="nexus" root="http://localhost:8081/nexus/repositories/public" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
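If you save this as ~/.groovy/grapeConfig.xml, Grape should pick it up by default; otherwise you can point to it explicitly with -Dgrape.config=/path/to/grapeConfig.xml, as shown elsewhere on this page.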
This doesn't seem to be possible with the current (Groovy 1.8.1) implementation. I created an improvement ticket: http://jira.codehaus.org/browse/GROOVY-4943.
