English is not my mother tongue; please excuse any errors on my part.
I try to run nutch 1.12 on Cygwin on Windows, and I followed Nutch Turorial. But when I try execute the command line "bin/nutch inject crawl/crawldb urls", I get this in Cygwin:
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-03-10 19:29:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
I have tried the methods mentioned in here and here, but either worked for me.
How can I solve this problem and run Nutch on Windows? Thank you!
I had the same problem. Solved it by setting up Hadoop and including winutils.exe in %HADOOP_HOME%/bin folder. If there is still the same error try including hadoop.dll inside the mentioned folder.
Im using tika-app.jar with the version 1.12,to try to find the list of corrupted files that can't be opened in a specified folder.
the problem is when i tested inside windows it gives me in the log folder some exception that allow me to know what files that can't be opened like this :
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: 'folder\mi-am-CV.docx'
but the problem in linux is only i get a broad error in the log folder like this:
WARN org.apache.tika.batch.FileResourceConsumer - <parse_ex resourceId="test-corrupted-2.doc">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser#f6e9bd4
so i can't know specificaly what files that are really corrupted and can't be opened.
here's the shell command that i use for that in linux :
java -Dlog4j.debug -Dlog4j.configuration=file:log4j_driver.xml -cp "bin/*" org.apache.tika.cli.TikaCLI -JXX:-OmitStackTraceInFastThrow -JXmx5g -JDlog4j.configuration=file:log4j.xml -bc tika-batch-config-basic-test.xml -i /folder -o outxml -numConsumers 10
thanks.
When we use
spark-submit
which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar.
Note: I did try adding to
$SPARK_HOME/lib_managed/jars
But the spark-submit still results in a ClassNotFoundException for classes included in the added library.
Hope these points will help you.
$SPARK_HOME/lib/ [contains the jar files ]
$SPARK_HOME/bin/ [contains the launch scripts - Spark-Submit,Spark-Class,pySpark,compute-classpath.sh etc]
Spark-Submit ---will call ---> Spark-Class.
Spark-class internally calls compute-Classpath.sh before executing / launching the job.
compute-Classpath.sh will pick the jars availble in $SPARK_HOME/lib to CLASSPATH.
(execute ./compute-classpath.sh //returns jars in lib dir)
So try these options.
option-1 - Placing user-specific-jars in $SPARK_HOME/lib/ will works
option-2 - Tweak compute-classpath.sh so that it will be able to pic
your jars specified in a user specific jars dir
UPDATE 8/6:
The beefed up logging has shown me that there is an issue deleting the old jar from the cache, which leads to the fatal "not found" error. There are other threads similar to this, but only when someone is locking the file with their IDE. We are running a single groovy script from Jenkins, and no one is logged into this box.
We ran process explorer right after the failure and there were no locks. Then I login with the user that Jenkins is using to run the script, and I get no error deleting the files.
Also it seems there was a fix in IVY 2.1 to not fail when the jar cannot be deleted, and I'm on Ivy 2.2 (Groovy 1.8.4). What gives?
Couldn't delete outdated artifact from cache: C:\Users\myUser\.groovy\grapes\com.a.b.c\x-y-z\jars\x-y-z-1.496.jar
then the false(?) error:
Caught: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Error grabbing Grapes -- [unresolved dependency: com.a.b.c#x-y-z;1.+: not found]
at smokeTestSuccess.<clinit>(smokeTestSuccess.groovy)
Interestingly enough, this happens everyday the first time the script is run after 5am. I guess the cache gets invalidated through some default config at 5am? Is this some kind of clue??
Original post:
I am intermittently getting an error when running a number of different Groovy scripts which all share an identical #Grab declaration. (file names changed to protect the innocent). First the full Grab declaration:
#GrabResolver(name = 'libs.release', root = 'http://myserver:8081/artifactory/libs-release', m2compatible = 'true') #Grapes([
#Grab(group = 'com.a.b.c, module = 'x-y-z', version = '1.+', changing = true),
#Grab('commons-lang:commons-lang:2.3'),
#Grab('log4j:log4j:1.2.16'),
#Grab('gpars:gpars:0.12'),
#Grab('jsr166y:jsr166y:1.7.0'),
#Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.6'),
#Grab('org.apache.commons:commons-collections:3.2.1'),
#Grab('org.apache.httpcomponents:httpclient:4.2.2'),
#Grab('org.apache.httpcomponents:httpcore:4.2.3'),
#Grab('org.cyberneko.html:nekohtml:1.9.17'),
#Grab('xerces:xercesImpl:2.11.0'),
]) #GrabConfig(systemClassLoader = true)
Then the error:
Caught: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Error grabbing Grapes -- [unresolved dependency: com.a.b.c#x-y-z;1.+: not found]
Upon doing numerous internet searches, the cause always seems to be very simple, either one of these two basic problems:
1. Repository unreachable
2. Jar file doesn’t exist
However, in the artifactory logs, I've proven that the file is actually being downloaded:
*Artifactory did accept the request for download:
2014-07-17 07:58:19,938 [ACCEPTED DOWNLOAD] libs-release-local:com/a/b/c/x-y-z/1.477/x-y-z-1.477.jar for anonymous/165.226.40.155.
*Artifactory did deliver jar:
20140717075820|156|REQUEST|165.226.40.155|non_authenticated_user|GET|/libs-release/com/a/b/c/x-y-z/1.477/x-y-z-1.477.jar|HTTP/1.1|200|1276695
The scripts all work about 100% of the time if they are simply restarted. This all leads me to believe that the issue is the Grab timing out. Theoretically the second time I run the script, the file is in the cache, and things happen faster, thus it doesnt fail.
For the above real request, I can see about 20 seconds of elapsed time in the http log from request to download.
Questions:
Does my theory seem correct?
Is there a way to increase the amount of time that the script will wait for the #Grab to resolve?
Does putting a try / catch block around the #Grab statements seem like a good idea? Or will that just hide the real problem?
thanks in advance!!!!
I think I finally figured out the answer to my own question.
I believe there is some sort of bug within Groovy 1.8.4 (or Ivy 2.2), especially since this behavior does mirror an ancient documented Ivy bug with this exact error message scheme and behavior.
Upgrading to Groovy 2.3.6 (which includes Ivy 2.3) appears to solve the issue.
I also still have no idea why the jars cannot be deleted, nothing is locking them. I experimented with moving the grape cache to a less secure folder to rule out a permission issue, but this didn't help:
-Dgrape.root=D:\Temp\grapeCache
UPDATE 8/19:
Once we upgraded to Groovy 2.3.6, the error went away, but I then figured out that the jar was no longer being downloaded at all, when using the "1.+" resolver. Something in the defaultgrapeConfig.xml was causing an issue. Everything is finally working properly after (in addition to the Groovy upgrade) we overrode defaultgrapeConfig.xml with our own stripped down file using this command line JAVA_OPT:
-Dgrape.config=D:\Temp\myGrapeConfig.xml
which had these contents:
<ivysettings>
<settings defaultResolver="downloadGrapes"/>
<resolvers>
<chain name="downloadGrapes">
</chain>
</resolvers>
</ivysettings>
ALSO:
For completeness (further steps):
In Jenkins GUI, update the job(s):
a. Update the drop down for each script: Execute Groovy Script > Groovy Version > Groovy-2.3.6
b. Update the JAVA_OPTS for each script (have to click the ‘advanced’ button under the script to see JAVA_OPTS):
-Dgrape.config=D:\Software\SfGrapeConfig.xml
Optional logging switches: -Dgroovy.grape.report.downloads=true -Divy.message.logger.level=4
In the actual Groovy script itself, delete this option within the #GrabResolver annotation: , m2compatible = 'true'
If you get this or a similar error:
"could not find client or server jvm under [Whatever JAVE_HOME is], please check that it is a valid jdk / jre containing the desired type of jvm"
Delete groovy.exe & groovyw.exe from D:\Software\Groovy-2.3.6\bin (if the exe’s do not exist, the Jenkins groovy plugin will use the bat file versions of these, and they handle the 32-bit / 64-bit problem better than the exe’s)
following this tutorial http://wiki.apache.org/nutch/NutchTutorial
and http://www.nutchinstall.blogspot.com/
when i take the command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
i have this error
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233259
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233337
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
i using cygwin, windows to run nutch
Check if you have parse_data folder present as per the path given above in the error. Make sure that the folders inside the crawl folder that you have given in the command are created and available to be used.
I had similar problem.
I deleted the db and directories . After that, it ran ok.