Configure Apache Solr 3.6 with Tika 1.2 on Linux

I am using Solr 3.6 with Tika 1.2, but I can't upload PDF files.
First I installed Solr and uploaded some *.xml files from exampledocs.
I could search these files with this URL: http://localhost:8983/solr/select/?q=solr.
In the next step I installed Tika to upload PDF and DOC files, but it doesn't work.
The following content is in the "example/solr/conf/solrconfig.xml" file:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="tika.config">tika-data-config.xml</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
And in the file "example/solr/conf/tika-data-config.xml" I have this content:
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" transformer="TemplateTransformer" baseDir="/home/ubuntu-user/Documents" fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip" recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
</entity>
</entity>
</document>
</dataConfig>
If I run this line in the console:
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@test.pdf"
I get this output:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">183</int>
</lst>
</response>
But I can't search the content with Solr. If I browse to http://localhost:8983/solr/browse, I see a new entry but no content.
I also started the Solr and Tika servers:
java -jar start.jar
java -jar tika-server-1.2.jar
Can anyone help me?

You need to add the jars (or paths) for apache-solr-dataimporthandler-3.6, apache-solr-dataimporthandler-extras-3.6 and apache-solr-cell-3.6 in the dist folder, as well as the corresponding files in the contrib folder.
Then you can extract PDFs from Solr without starting a Tika server.
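
As a rough sketch of what that might look like in solrconfig.xml, assuming the stock Solr 3.6 example layout: the relative dist and contrib paths and the /dataimport handler name are assumptions, not taken from the question. The lib directives load the DataImportHandler jars (including the extras jar that provides TikaEntityProcessor), and tika-data-config.xml is registered as the handler's config:

```xml
<!-- Paths are relative to the core's instance directory and assume the
     stock example directory layout; adjust them to your installation. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<!-- TikaEntityProcessor also needs the Tika jars shipped with Solr Cell. -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

<!-- A DataImportHandler config is registered on its own handler,
     not through the tika.config parameter of /update/extract. -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler like this in place, an import would typically be started by requesting http://localhost:8983/solr/dataimport?command=full-import.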

Check the ExtractingRequestHandler, which helps you index rich documents.
You don't need to start a separate Tika server, because Solr can use the bundled libraries to extract the content from rich documents.
The required jars (Solr Cell and the Tika jars with their dependencies) are probably already referenced in the configuration:
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
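
For comparison, here is a minimal sketch of the /update/extract handler from the question with the tika.config entry removed. The tika.config parameter expects a Tika configuration file, so pointing it at a DataImportHandler config is probably not what was intended. That reading of the question is an assumption, not something this answer states.

```xml
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the extracted body to the "text" field, as in the question. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <!-- Fields unknown to the schema get this prefix. -->
    <str name="uprefix">ignored_</str>
    <!-- No tika.config entry: Tika then runs with its default configuration. -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
```

Request parameters such as uprefix=attr_ and fmap.content=attr_content in the curl call override these defaults for that request.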

I have now reinstalled Solr and can search PDFs with this URL:
http://localhost:8983/solr/select/?q=attr_content:st*
Some PDFs are fine, but for some PDFs I get this output:
<arr name="attr_content"><str> ((stdin)) � ���������
The attr_creation_date and attr_meta fields are OK. The producer was Ghostscript:
GPL Ghostscript 8.63

Related

Error while extracting zip file created from maven assembly plugin

I'm using maven assembly plugin to create a zip packaging resources from another maven module in the same project.
Parent_project
|_module1
  |_resources
    |_templates
      |_abc.xml
|_module2
  |_resources
    |_build-config.xml
Below is my build-config.xml file.
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
<id>bundle</id>
<formats>
<format>zip</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>${basedir}/../module1/src/main/resources/templates</directory>
<includes>
<include>*.xml</include>
</includes>
<outputDirectory>/testdir</outputDirectory>
</fileSet>
</fileSets>
</assembly>
I'm able to copy the resources to a sub-directory named testdir inside the zip file's root (I can see this by viewing the zip file without extracting it). But if I try to extract the zip, I get the error below.
There was an error while extracting the sample.zip. "sample/testdir/abc.xml": Not a directory.
I'm using Ubuntu 18 with Maven Assembly Plugin version 1.1.2.
Can someone please point out the issue here?

I experimented for a while and observed the following: extracting through the UI option causes the error, while extracting with the unzip ./myzip.zip -d . command succeeds.
I found a workaround, described below.
First, create an empty directory:
<fileSet> <!-- Create empty directory -->
<outputDirectory>./templates</outputDirectory>
<excludes>
<exclude>**/*</exclude>
</excludes>
</fileSet>
Then copy the resources to that directory:
<fileSet>
<directory>${basedir}/test</directory>
<includes>
<include>*.xml</include>
</includes>
<outputDirectory>./templates</outputDirectory>
</fileSet>
This method fixes the issue while extracting the zip. Cheers!
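
Put together, the two steps above could be sketched as one assembly descriptor. This combines the header from the question with the two fileSets from the workaround. Using the question's module1 templates path as the source directory (instead of ${basedir}/test) is an assumption made for illustration.

```xml
<assembly
    xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>bundle</id>
  <formats>
    <format>zip</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <!-- Step 1: emit the target directory on its own, with every file excluded. -->
    <fileSet>
      <outputDirectory>./templates</outputDirectory>
      <excludes>
        <exclude>**/*</exclude>
      </excludes>
    </fileSet>
    <!-- Step 2: copy the XML resources into that directory.
         The source path is taken from the question and is an assumption here. -->
    <fileSet>
      <directory>${basedir}/../module1/src/main/resources/templates</directory>
      <includes>
        <include>*.xml</include>
      </includes>
      <outputDirectory>./templates</outputDirectory>
    </fileSet>
  </fileSets>
</assembly>
```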

vqmod doesn't make changes in frontend files in opencart Version 2.0.0.0

Hi friends, I am working with OpenCart version 2.0.0.0 to build an e-commerce site. I downloaded some vQmod modules and integrated them on my localhost, and they work fine.
When I upload the same files to my live site, the vQmod changes for front-end files do not work. I have tested with my own vQmod files too: they work for admin files but not for front-end files.
<?xml version="1.0" encoding="utf-8"?>
<modification>
<id>test content</id>
<version>1.0.3</version>
<vqmver>2.2.1</vqmver>
<author>test</author>
<decription><![CDATA[
/*
This file is part test content
*/
]]>
</decription>
<file name="catelog/view/template/common/menu.tpl" error="log">
<operation error="log">
<search position="after"><![CDATA[
<li><?php echo $text_information; ?></li>
]]>
</search>
<add trim="true"><![CDATA[
<li><?php echo "jinna"; ?></li>
]]>
</add>
</operation>
</file>
</modification>
and this doesn't
<?xml version="1.0" encoding="utf-8"?>
<modification>
<id>test content</id>
<version>1.0.3</version>
<vqmver>2.2.1</vqmver>
<author>test</author>
<decription><![CDATA[
/*
This file is part test content
*/
]]>
</decription>
<file name="catalog/view/theme/*/template/common/header.tpl" error="log">
<operation error="log">
<search position="after"><![CDATA[
</header>
]]>
</search>
<add trim="true"><![CDATA[
<li><?php echo "jinna"; ?></li>
]]>
</add>
</operation>
</file>
</modification>

As you said in the comments, it seems vQmod is not installed on the server.
Install vQmod.
To check whether vQmod is installed on your server, open this URL:
http://your-domain/vqmod/install/
Also give 0777 permission to the following folder and file:
vqmod/vqcache folder
vqmod/mods.cache file

I want to add code to all module files using OCMOD

I want to add code to every module controller file on the admin side using OCMOD.
My code is:
<file path="admin/controller/module/*.php">
<operation>
<search trim="true"><![CDATA[
public function index() {
]]></search>
<add position="after" trim="true"><![CDATA[
$this->document->addScript('catalog/view/javascript/xxxx.js');
]]></add>
</operation>
</file>
But it doesn't work.

I have tried your code and it works fine. Please try the following.
Create the OCMOD XML file with the ".ocmod.xml" extension; you can then upload it using "Extension Installer" in the OpenCart admin panel.
Then clear and refresh the modification cache to update the system and make the extension work. You can do this with the top-right buttons on the Extension > Modification page in the admin panel.
Example OCMOD file with your code (file name: test.ocmod.xml):
<?xml version="1.0" encoding="utf-8"?>
<modification>
<code>mycode001</code>
<name>Modification Default</name>
<version>1.0</version>
<author>OpenCart</author>
<link>http://www.opencart.com</link>
<file path="admin/controller/module/*.php">
<operation>
<search trim="true">
<![CDATA[public function index() {]]>
</search>
<add position="after" trim="true">
<![CDATA[$this->document->addScript('catalog/view/javascript/xxxx.js');]]>
</add>
</operation>
</file>
</modification>

CruiseControl.NET and merging files. How to send log files?

<publishers>
<xmllogger>
<logDir>log</logDir>
</xmllogger>
<merge>
<files>
<file action="Copy" deleteAfterMerge="false">C:\_CCNET\Aso\Artifacts\msbuild-results.xml</file>
<file action="Copy" deleteAfterMerge="false">C:\_CCNET\Aso\Build\src\TestResult.xml</file>
</files>
</merge>
<email from="xx@xx.net" mailhost="mail.xx.net" mailhostUsername="xx" mailhostPassword="xx">
<users>
<user name="x" group="developers" address="xx@gmail.com"/>
</users>
<groups>
<group name="developers" />
</groups>
</email>
</publishers>
The xmllogger creates the log folder and saves a file in it with a generated name like log20111229001245.xml.
How can I merge this file into msbuild-results.xml and send it by email?

You don't, at least not the way you are approaching this. The standard way would be:
Replace your xmllogger configuration with a simple <xmllogger/> and put it after the merge publisher.
Unless the msbuild-results.xml and TestResult.xml files are very large, use action="Merge" instead of Copy for the merge publisher.
Set your email publisher to includeDetails="TRUE". Once the other files are merged into your build report, that tells CruiseControl.NET to output the full build report in the email.
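
Applied to the configuration from the question, the three steps above might look like the following sketch. The paths and mail settings are copied from the question. Dropping deleteAfterMerge is a simplification made here, not part of the advice.

```xml
<publishers>
  <!-- Merge the result files into the build log instead of copying them. -->
  <merge>
    <files>
      <file action="Merge">C:\_CCNET\Aso\Artifacts\msbuild-results.xml</file>
      <file action="Merge">C:\_CCNET\Aso\Build\src\TestResult.xml</file>
    </files>
  </merge>
  <!-- Default logger, placed after the merge so the written log includes the merged data. -->
  <xmllogger />
  <!-- includeDetails="TRUE" asks for the full build report in the mail body. -->
  <email from="xx@xx.net" mailhost="mail.xx.net" mailhostUsername="xx" mailhostPassword="xx" includeDetails="TRUE">
    <users>
      <user name="x" group="developers" address="xx@gmail.com" />
    </users>
    <groups>
      <group name="developers" />
    </groups>
  </email>
</publishers>
```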

Cruise control merging?

I successfully exported the compilation log from my IDE into a single XML file. To merge it, I referenced it in the <merge> section inside the publishers block of my ccnet.config file.
When I force a build, the output.xml file is generated correctly, but an error appears in the CCNet window saying that the file cannot be merged because it is currently used by another process.
Please see below:
[VSAT:ERROR] Publisher threw exception: ThoughtWorks.CruiseControl.Core.CruiseControlException: Unable to read the contents of the file: C:\ThreePartition\output.xml ---> System.IO.IOException: The process cannot access the file 'C:\ThreePartition\output.xml' because it is being used by another process.
Can you suggest any method by which merging can be done successfully?
I have pasted the whole ccnet.config file below.
<cruisecontrol>
<project name="VSAT">
<sourcecontrol type="filtered">
<sourceControlProvider type="filesystem">
<repositoryRoot>C:\ThreePartition</repositoryRoot>
<autoGetSource>true</autoGetSource>
<ignoreMissingRoot>false</ignoreMissingRoot>
</sourceControlProvider>
<exclusionFilters>
<pathFilter>
<pattern>C:\ThreePartition\wrSbc750gx_ThreePartition\**</pattern>
</pathFilter>
<pathFilter>
<pattern>C:\ThreePartition\*.txt</pattern>
</pathFilter>
<pathFilter>
<pattern>C:\ThreePartition\*.xml</pattern>
</pathFilter>
</exclusionFilters>
</sourcecontrol>
<triggers>
<intervalTrigger name="continuous" seconds="240"
buildCondition="IfModificationExists" />
</triggers>
<tasks>
<nant>
<executable>C:\Nant-0.85\bin\NAnt.exe</executable>
<buildFile>nant.build</buildFile>
</nant>
</tasks>
<publishers>
<merge>
<files>
<file>C:\ThreePartition\output.xml</file>
</files>
</merge>
<xmllogger logDir="C:\Program Files\CruiseControl.NET\server\DF2.0-CI\Logfiles" />
<email from="BuildAdmin@server.com"
mailhost="smtp.servermail.com" includeDetails="TRUE">
<users>
<user name="Maddy" group="buildmaster"
address="Mymail@server.com"/>
</users>
<groups>
<group name="buildmaster" notification="always"/>
<group name="developers" notification="change"/>
</groups>
</email>
</publishers>
</project>
</cruisecontrol>
I have repeated just the publishers section below for a better view:
<publishers>
<merge>
<files>
<file>C:\ThreePartition\output.xml</file>
</files>
</merge>
<email from="BuildAdmin@server.com" mailhost="smtp.server.com" includeDetails="TRUE">
<users>
<user name="Maddy" group="buildmaster" address="Maddy.@server.com"/>
</users>
<groups>
<group name="buildmaster" notification="always"/>
<group name="developers" notification="change"/>
</groups>
</email>
</publishers>

Some ideas:
Eliminate other obvious applications that would be writing to that file: other CCNet projects, other CCNet instances (e.g. are you maybe running the service and something from the command line?), or perhaps your source control.
If you're not attached to NAnt, try MSBuild and see if you get the same error. If all you're doing is compiling, you can pass the .sln or .csproj as a parameter directly to MSBuild.
Make sure you're on the latest version of CCNet. They regularly publish what I would consider fairly major bug fixes.

What it says: the question is which process has your XML file open while CC.NET is trying to merge it. Perhaps Process Explorer could be useful? It might also work if you copy the XML output to a separate file and merge that.
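
The copy-then-merge idea could be sketched roughly as follows. It assumes your CCNet version allows an exec task inside publishers, and output-copy.xml is a hypothetical file name chosen for illustration. If the other process holds an exclusive lock on output.xml, the copy may fail in the same way, so treat this as something to try rather than a guaranteed fix.

```xml
<publishers>
  <!-- Copy the report before merging; output-copy.xml is a hypothetical name. -->
  <exec>
    <executable>cmd.exe</executable>
    <buildArgs>/c copy /Y C:\ThreePartition\output.xml C:\ThreePartition\output-copy.xml</buildArgs>
  </exec>
  <!-- Merge the copy instead of the file that may still be open elsewhere. -->
  <merge>
    <files>
      <file>C:\ThreePartition\output-copy.xml</file>
    </files>
  </merge>
  <xmllogger />
</publishers>
```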
