Nutch path error - nutch

I am following this tutorial: http://wiki.apache.org/nutch/NutchTutorial
and http://www.nutchinstall.blogspot.com/
When I run the command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get this error:
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233259
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233337
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I am using Cygwin on Windows to run Nutch.

Check whether the parse_data folder actually exists at each path listed in the error. Also make sure that the folders inside the crawl folder you passed to the command (-dir crawl) have been created and are accessible.
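
A quick way to run that check over every segment is a small script. The following is a minimal Python sketch, not part of Nutch; it assumes the layout visible in the error message (crawl/segments/<timestamp>/parse_data), and the function name is made up for illustration.

```python
from pathlib import Path


def segments_missing_parse_data(crawl_dir):
    """Return the names of segment directories that have no parse_data subfolder."""
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        # No segments folder at all: nothing to report.
        return []
    return sorted(
        seg.name
        for seg in segments_root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


if __name__ == "__main__":
    # "crawl" is the value passed to -dir in the command from the question.
    for name in segments_missing_parse_data("crawl"):
        print("segment without parse_data:", name)
```

Run it from the directory that contains the crawl folder (runtime/local in the question). Any segment it prints would be one that LinkDb cannot read.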

I had a similar problem.
I deleted the db and the directories. After that, it ran OK.

Related

SoapUI specify alternate logdir as a property defined on the command line

I'm upgrading from SoapUI 5.4.0 to 5.7.0 and trying to put the log files in a specific directory. Note: The alternate error logs directory was working prior to the upgrade.
I have both of the following specified in my JAVA_OPTS for SoapUITestCaseRunner:
-Dsoapui.logroot="%SOAPUI_LOGSDIR%"
-Dsoapui.log4j.config="%SOAPUI_HOME%/soapui-log4j.xml"
In my soapui-log4j.xml I specify the error file as:
<RollingFile name="ERRORFILE"
fileName="${soapui.logroot}/soapui-errors.log"
filePattern="${soapui.logroot}/soapui-errors.log.%i"
append="true">
The error file then gets created without ${soapui.logroot} being resolved, e.g.:
$ find . -name "*errors*"
./${soapui.logroot}/soapui-errors.log
I also tried it as a lookup, but ended up with this:
ERROR Unable to create file ${sys:soapui.logroot}/soapui-errors.log java.io.IOException: The filename, directory name, or volume label syntax is incorrect
Am I missing anything? Any ideas for next steps?
I tried replacing
fileName="${soapui.logroot}/soapui-errors.log"
with
fileName="${sys:soapui.logroot}/soapui-errors.log"
and it worked for me.
I no longer see an unresolved '${soapui.logroot}' directory being created.
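
To find other attributes in the config that might need the same change, something like the following could help. It is only a rough Python sketch with a made-up function name: it lists ${...} placeholders that have no "prefix:" part. That is a heuristic, since in Log4j 2 a placeholder without a prefix can also legitimately refer to a property declared in the configuration's own Properties section.

```python
import re

# Matches ${...} and captures the text between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")


def placeholders_without_prefix(config_text):
    """Return placeholder names such as 'soapui.logroot' that carry no 'prefix:' part."""
    return [name for name in PLACEHOLDER.findall(config_text) if ":" not in name]


if __name__ == "__main__":
    # Sample attribute taken from the question.
    sample = 'fileName="${soapui.logroot}/soapui-errors.log"'
    print(placeholders_without_prefix(sample))
```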

Linux and apache tika issue

I'm using tika-app.jar, version 1.12, to find the files in a given folder that are corrupted and can't be opened.
When I test on Windows, the log folder contains exceptions that tell me which files can't be opened, like this:
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: 'folder\mi-am-CV.docx'
On Linux, however, I only get a generic error in the log folder, like this:
WARN org.apache.tika.batch.FileResourceConsumer - <parse_ex resourceId="test-corrupted-2.doc">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@f6e9bd4
So I can't tell specifically which files are really corrupted and can't be opened.
Here's the shell command I use on Linux:
java -Dlog4j.debug -Dlog4j.configuration=file:log4j_driver.xml -cp "bin/*" org.apache.tika.cli.TikaCLI -JXX:-OmitStackTraceInFastThrow -JXmx5g -JDlog4j.configuration=file:log4j.xml -bc tika-batch-config-basic-test.xml -i /folder -o outxml -numConsumers 10
thanks.

"Unrecognized option: --format=COBERTURAXML" in trying to convert JSCover report to cobertura xml

I'm trying to convert a JSCover report to Cobertura XML.
Based on what I've read, the command is as follows:
java -cp JSCover-all.jar jscover.report.Main --format=COBERTURAXML REPORT-DIR SRC-DIRECTORY
But I get an error:
"Error: Could not find or load main class jscover.report.Main"
This happens even if I give the fully qualified path of where JSCover-all.jar is located.
So I tried adding JSCover-all.jar to the classpath and ran the following command instead:
java -cp jscover.report.Main --format=COBERTURAXML target/local-storage-proxy target/local-storage-proxy/original-src
I no longer get the first error, but I'm now getting the following error:
Unrecognized option: --format=COBERTURAXML
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
I hope someone could help me with it. Many thanks!
The first attempt is the correct approach. The error means that JSCover-all.jar is not in the directory you are executing the command from. An absolute path to it is not needed; a relative one will do.
In the second approach, you passed 'jscover.report.Main' as the class path to the JVM, so '--format=COBERTURAXML' was treated as an option to the 'java' command itself.
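
To make the argument order explicit, here is a small Python sketch that assembles the command from the question. The function name and the jar location are assumptions for illustration; only the java arguments themselves come from the post.

```python
from pathlib import Path


def build_report_command(jar_path, report_dir, src_dir):
    """Build the argument list for the JSCover report conversion.

    The jar goes right after -cp, the main class comes next, and only then
    the program's own arguments.
    """
    return [
        "java",
        "-cp", str(jar_path),
        "jscover.report.Main",
        "--format=COBERTURAXML",
        str(report_dir),
        str(src_dir),
    ]


if __name__ == "__main__":
    jar = Path("JSCover-all.jar")  # assumed location: the current directory
    if not jar.is_file():
        print("JSCover-all.jar not found at", jar.resolve())
    else:
        print(" ".join(build_report_command(
            jar,
            "target/local-storage-proxy",
            "target/local-storage-proxy/original-src",
        )))
```

If the script reports that the jar is not found, adjust the path passed to -cp rather than dropping it.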

Solr/Tika dataimport temporary files permission exception

I'm trying to set up data import from files using Apache Tika and Solr. The shared docs folder is on an NFS-mounted share. Unfortunately, the data import fails: one file is processed and then I get an exception:
[http-8080-3] ERROR org.apache.solr.handler.dataimport.DocBuilder - Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 2
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
....
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Access denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1006)
at java.io.File.createTempFile(File.java:1989)
at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
at org.apache.tika.io.TikaInputStream.getFileChannel(TikaInputStream.java:564)
at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:373)
at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:140)
... 26 more
So it seems to be a permissions problem when writing temporary files. Unfortunately, I have no idea where exactly Tika tries to write those temporary files, so I can't check the permissions on NFS. I checked the permissions on the Tika home folder (core configuration) and on the docs folder and its subfolders: all are OK, including the problematic document.
I also tried changing the docs directory in my core config to another one (on the same NFS share), and everything works. So, do you have any idea how to track down my issue?
[EDIT]
I just noticed that it's not really a permission problem. Everything works for .docx and .pdf files, but it fails on .doc files. Do you have any ideas?
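
One way to narrow this down: the trace shows the .doc path going through POIFSContainerDetector into TikaInputStream.getFile and then java.io.File.createTempFile, and when no directory is supplied that Java call uses the JVM's java.io.tmpdir. The Python sketch below only probes whether a temporary file can be created in a given directory. Which directory the Solr JVM actually uses is not stated in the post, so the path has to be supplied by hand, and the function name is made up for illustration.

```python
import os
import tempfile


def can_create_temp_file(directory):
    """Try to create and then remove a temporary file in the given directory."""
    try:
        fd, path = tempfile.mkstemp(prefix="probe-", suffix=".tmp", dir=directory)
    except OSError:
        # Covers a missing directory as well as denied access.
        return False
    os.close(fd)
    os.remove(path)
    return True


if __name__ == "__main__":
    # Assumption: replace this with the directory the Solr JVM uses as java.io.tmpdir.
    candidate = tempfile.gettempdir()
    print(candidate, "writable:", can_create_temp_file(candidate))
```

The result is only meaningful when the script runs as the same user as the servlet container.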

Nutch showing following errors, what to do

npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
But when I run nutch from the terminal, it shows
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
etc etc.....
Please tell me what to do.
Hey Tejasp, I did what you told me: I set NUTCH_HOME=/nutch/runtime/local/bin, and the Crawl.java file is there, but when I ran this:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
[Fatal Error] nutch-site.xml:6:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:405)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:585)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:290)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 10 more
It showed me this result. Now what?
I also checked the nutch-site.xml file; I have made the following edits in it:
<configuration>
<property>
<name>http.agent.name</name>
<value>PARAM_TEST</value><!-- Your crawler name here -->
</property>
</configuration>
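
For reference, this parser message is what Xerces reports when an <?xml ...?> declaration appears anywhere other than the very first characters of the file, for example after a blank line or a comment, or a second time further down. The excerpt above does not show the top of the file, so the following is only a Python sketch (with made-up function names) for checking where the declaration sits in nutch-site.xml.

```python
def xml_declaration_offsets(text):
    """Return the character offsets of every '<?xml ' occurrence in text.

    The trailing space keeps '<?xml-stylesheet ...?>' from being counted.
    """
    offsets = []
    start = text.find("<?xml ")
    while start != -1:
        offsets.append(start)
        start = text.find("<?xml ", start + 1)
    return offsets


def declaration_is_misplaced(text):
    """True if an XML declaration appears anywhere except the very start."""
    body = text.lstrip("\ufeff")  # a leading byte-order mark is harmless
    return any(offset != 0 for offset in xml_declaration_offsets(body))


if __name__ == "__main__":
    sample = '\n<?xml version="1.0"?>\n<configuration/>'
    print(declaration_is_misplaced(sample))
```

To check the real file, read the contents of nutch-site.xml from NUTCH_CONF_DIR and pass them to declaration_is_misplaced.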
Sir, I did as you told me. This time I compiled Nutch with 'ant clean runtime', and my settings are:
NUTCH_HOME=/nutch/runtime/deploy/bin
NUTCH_CONF_DIR=/nutch/runtime/local/conf
Now when I run the same command, it gives me this error:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
All I want is to create a search engine that can search for certain things on certain websites, for my final-year project.
It seems that in Nutch version 2.x the name of the Crawl class has changed to Crawler.
I'm using Hadoop to run Nutch, so I use the following command for crawling:
hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2
If you crawl using Nutch on its own, the nutch script should reference the new class name.
"but when i run nutch from terminal it show"
This verifies that the NUTCH_HOME/bin/nutch script is present at the correct location.
Please export NUTCH_HOME and NUTCH_CONF_DIR.
Which mode of Nutch are you trying to use?
local mode: jobs run without Hadoop. You need to have the Nutch jar inside NUTCH_HOME/lib. It is named after the version you are using, e.g. for Nutch release 1.3 the jar is named nutch-1.3.jar.
hadoop mode: jobs run on a Hadoop cluster. You need to have the Nutch job file inside NUTCH_HOME. It is named after the release version, e.g. nutch-1.3.job.
If you have these files (the one corresponding to your mode), extract them and check whether the Crawl.class file is actually present inside.
If the Crawl.class file is not present, obtain a new jar/job file by compiling the Nutch source.
EDIT:
Don't use ant jar. Use ant clean runtime instead. The output is generated inside the NUTCH_INSTALLATION_DIR/runtime/local directory. Run Nutch from there. That will be your NUTCH_HOME.
Export the required variables JAVA_HOME, NUTCH_HOME and NUTCH_CONF_DIR before running.
I have a feeling that the Crawl.class file is not present in the jar. Please extract the jar and check. FYI: the command to extract a jar file is jar -xvf <filename>
If after #2 you see that the class file isn't present in the jar, check whether the Nutch source code that you downloaded has the Java file, i.e. nutch-1.x\src\java\org\apache\nutch\crawl\Crawl.java. If it is not present, get it from the internet and rebuild the Nutch jar.
If after #2 the jar file has the class file and you see the issue again, then something is wrong with the environment. Try some other command, such as inject. Look for errors in the hadoop.log file. Let me know what you see.
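
As an alternative to extracting the jar by hand, the check for Crawl.class can be scripted. This is a minimal Python sketch, not a Nutch tool; it relies only on a jar being a zip archive, and the function name and the example jar path are assumptions for illustration.

```python
import zipfile


def jar_contains_class(jar_path, class_name="org.apache.nutch.crawl.Crawl"):
    """Return True if the jar (a zip archive) holds the .class entry for class_name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    # Assumed path: the jar name from the answer, placed under NUTCH_HOME/lib.
    path = "lib/nutch-1.3.jar"
    try:
        print(path, "contains Crawl.class:", jar_contains_class(path))
    except FileNotFoundError:
        print("no such file:", path)
```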

Resources