Linux and Apache Tika issue

I'm using tika-app.jar version 1.12 to try to find the list of corrupted files that can't be opened in a specified folder.
The problem is that when I tested on Windows, the log folder contains exceptions that let me know which files can't be opened, like this:
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: 'folder\mi-am-CV.docx'
but on Linux I only get a broad error in the log folder, like this:
WARN org.apache.tika.batch.FileResourceConsumer - <parse_ex resourceId="test-corrupted-2.doc">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser#f6e9bd4
so I can't tell specifically which files are really corrupted and can't be opened.
Here's the shell command I use for that on Linux:
java -Dlog4j.debug -Dlog4j.configuration=file:log4j_driver.xml -cp "bin/*" org.apache.tika.cli.TikaCLI -JXX:-OmitStackTraceInFastThrow -JXmx5g -JDlog4j.configuration=file:log4j.xml -bc tika-batch-config-basic-test.xml -i /folder -o outxml -numConsumers 10
Thanks.
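For the case above, one fallback is to parse each file individually and record which ones fail, so the original exception (including any POI cause) is kept per input file. A minimal sketch, assuming the same bin/ classpath as the batch command and that TikaCLI exits with a non-zero status when a parse throws; the output names (perfile-logs, corrupted-files.txt) are just placeholders:
mkdir -p perfile-logs
for f in /folder/*; do
  # --text forces a full parse; the per-file stderr keeps the complete stack trace
  if ! java -cp "bin/*" org.apache.tika.cli.TikaCLI --text "$f" > /dev/null 2> "perfile-logs/$(basename "$f").err"; then
    echo "$f" >> corrupted-files.txt
  fi
done
This is slower than the batch runner, but it leaves no ambiguity about which input file produced which stack trace.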

Related

SoapUI specify alternate logdir as a property defined on the command line

I'm upgrading from SoapUI 5.4.0 to 5.7.0 and trying to put the log files in a specific directory. Note: The alternate error logs directory was working prior to the upgrade.
I have both of the following specified in my JAVA_OPTS for SoapUITestCaseRunner:
-Dsoapui.logroot="%SOAPUI_LOGSDIR%"
-Dsoapui.log4j.config="%SOAPUI_HOME%/soapui-log4j.xml"
In my soapui-log4j.xml I specify the error file as:
<RollingFile name="ERRORFILE"
fileName="${soapui.logroot}/soapui-errors.log"
filePattern="${soapui.logroot}/soapui-errors.log.%i"
append="true">
The error file then gets created without resolving ${soapui.logroot}, e.g.:
$ find . -name "*errors*"
./${soapui.logroot}/soapui-errors.log
I also tried it as a lookup but ended up with this:
ERROR Unable to create file ${sys:soapui.logroot}/soapui-errors.log java.io.IOException: The filename, directory name, or volume label syntax is incorrect
Am I missing anything? Any ideas for next steps?
I tried replacing
fileName="${soapui.logroot}/soapui-errors.log"
with
fileName="${sys:soapui.logroot}/soapui-errors.log"
and it worked for me.
I no longer see an unresolved '${soapui.logroot}' directory being created.
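For reference, the RollingFile appender from the question with the sys: lookup applied would look like this (the fix above mentions fileName; filePattern presumably wants the same prefix):
<RollingFile name="ERRORFILE"
             fileName="${sys:soapui.logroot}/soapui-errors.log"
             filePattern="${sys:soapui.logroot}/soapui-errors.log.%i"
             append="true">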

Writing file denied

I am getting an error writing a file that is driving me crazy.
I have a C# .NET 5 application running on Red Hat Linux.
I mounted a Windows shared folder using: sudo mount -t cifs -o username=MyDomainUsername,password=MyDomainUsernamePassword,domain=MyDomain,dir_mode=0777,file_mode=0777 //ipv4_from_destination/Reports /fileshare/Reports
Then I run the app, using just ./WebApi --urls=http://+:8060
The read/write test executes the following steps:
Create a text file.
Write to the text file.
Delete the text file.
Create a directory.
Create a text file inside that directory.
Write to the text file.
Delete the text file.
Delete the directory.
Now the problem:
The text file is created
The write operation fails.
Here is part of the log:
Creating file: /fileshare/Reports/test.616db7d1-07fb-4599-a0cf-749e6a8b34ec.tmp...Ok
Writing file: /fileshare/Reports/test.616db7d1-07fb-4599-a0cf-749e6a8b34ec.tmp...[16:22:20 ERR] ID:87988856-a765-4474-9ed9-2f04aef35771 PATH:/api/about ERROR:System.UnauthorizedAccessException:Access to the path '/fileshare/Reports/test.616db7d1-07fb-4599-a0cf-749e6a8b34ec.tmp' is denied. TRACE: at System.IO.FileStream.WriteNative(ReadOnlySpan`1 source)
at System.IO.FileStream.FlushWriteBuffer()
at System.IO.FileStream.FlushInternalBuffer()
at System.IO.FileStream.Flush(Boolean flushToDisk)
at System.IO.FileStream.Flush()
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.IO.StreamWriter.Flush()
at WebApi.Controllers.ApplicationController.TestFileSystem(String folder) in xxxxxxx\WebApi\Controllers\ApplicationController.cs:line 116
What I discovered so far:
I can create and delete the files and directories.
I cannot write to files.
Can someone give me a hint on this?
Solved using the CIFS mount option nobrl.
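The nobrl option tells the CIFS client not to send byte-range lock requests to the server; shares that reject those locks typically make .NET FileStream writes fail with "access denied" even though creating and deleting files works. The mount command from the question with the option appended would be:
sudo mount -t cifs -o username=MyDomainUsername,password=MyDomainUsernamePassword,domain=MyDomain,dir_mode=0777,file_mode=0777,nobrl //ipv4_from_destination/Reports /fileshare/Reports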

Sybase 16 startserver failed due to missing libsapcrypto.so

We've installed Sybase 16 Express on our Linux box, and it was able to start up right after the installation. When we recently tried restarting it with the startserver -f RUN_FILE command, it failed to find the libsapcrypto.so file.
~/sap/ASE-16_0/bin> ../sap/ASE-16_0/bin/dataserver: error while loading shared libraries: libsapcrypto.so: cannot open shared object file: No such file or directory
We searched for this file, and multiple matches appeared in the following paths:
./DM/OCS-16_0/lib3p/libsapcrypto.so
./DM/OCS-16_0/lib3p64/libsapcrypto.so
./DM/OCS-16_0/devlib3p64/libsapcrypto.so
./DM/OCS-16_0/devlib3p/libsapcrypto.so
./DM/REP-16_0/lib64/libsapcrypto.so
./DataAccess/ODBC/lib/libsapcrypto.so
./DataAccess64/ODBC/lib/libsapcrypto.so
./OCS-16_0/lib3p/libsapcrypto.so
./OCS-16_0/lib3p64/libsapcrypto.so
./OCS-16_0/devlib3p64/libsapcrypto.so
./OCS-16_0/devlib3p/libsapcrypto.so
Since this hasn't been answered yet, running this command worked for me:
. /opt/sap/SYBASE.sh
Note the leading dot (sourcing the script), which makes sure the environment variables are set in the current terminal session, as opposed to running it like this:
/opt/sap/SYBASE.sh
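Sourcing the script exports SYBASE, SYBASE_ASE, LD_LIBRARY_PATH and the related variables into the current shell, which is what lets the loader find libsapcrypto.so. A restart sketch, assuming the /opt/sap install location from the answer and the RUN file name from the question:
. /opt/sap/SYBASE.sh               # leading dot: source the script, don't just execute it
cd "$SYBASE/$SYBASE_ASE/install"   # startserver and the RUN file normally live here
./startserver -f RUN_FILE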

Solr/Tika dataimport temporary files permission exception

I'm trying to set up a data import from files using Apache Tika and Solr. The docs folder is on an NFS-mounted share. Unfortunately, I can't complete the dataimport: one file is processed and then this exception is thrown:
[http-8080-3] ERROR org.apache.solr.handler.dataimport.DocBuilder - Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 2
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
....
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Access denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1006)
at java.io.File.createTempFile(File.java:1989)
at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
at org.apache.tika.io.TikaInputStream.getFileChannel(TikaInputStream.java:564)
at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:373)
at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:140)
... 26 more
So it seems to be a permissions problem while writing temporary files. Unfortunately, I have no idea where exactly Tika tries to write those temporary files, so I can't check permissions on the NFS share. I checked permissions for the Tika home folder (core configuration) and for the docs folder and its subfolders - all OK, including the problematic document.
I also tried changing the docs directory in my core config to a different one (on the same NFS share) and everything was OK. So, do you have any idea how to track down my issue?
[EDIT]
I just noticed that it's not really a permissions problem. Everything works for .docx and .pdf files, but it fails on .doc files. Do you have any ideas?
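The stack trace above also hints at why only .doc files fail: for OLE2 detection, POIFSContainerDetector asks TikaInputStream for a real file, which spools the stream into a temporary file under java.io.tmpdir, a step the .docx and .pdf paths apparently never hit here. A sketch of pointing the JVM at a temp directory the Solr user can write to, assuming Solr runs under Tomcat (the http-8080 thread name suggests it) and using a hypothetical path:
export CATALINA_OPTS="$CATALINA_OPTS -Djava.io.tmpdir=/var/solr/tmp"   # any local directory writable by the Tomcat/Solr user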

Nutch path error

I am following this tutorial http://wiki.apache.org/nutch/NutchTutorial
and http://www.nutchinstall.blogspot.com/
When I run the command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get this error:
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233259
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301233337
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I am using Cygwin on Windows to run Nutch.
Check whether the parse_data folder is present at the path given in the error above. Make sure that the folders inside the crawl folder you specified in the command have been created and are available to be used.
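A quick way to list segments that are missing parse_data before the LinkDb step, a sketch assuming the crawl directory from the command above:
for seg in crawl/segments/*; do
  # every fully parsed segment should contain a parse_data directory
  [ -d "$seg/parse_data" ] || echo "missing parse_data: $seg"
done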
I had a similar problem.
I deleted the db and the directories. After that, it ran OK.
