How can I post, index and search for content within an odt file stored in my solr_home directory?
I have tried to follow the pages listed below and have added a body field to the schema:
Indexing text and html files
Simple Post Tool -Confluence
The resourcename field contains the file location, but the content field is blank.
I am still not able to search the file contents, even though Solr reports that the file is indexed and the changes are committed.
Is there any end-to-end documentation for such a requirement?
I am using Solr with Tomcat on a Linux machine.
I'm a newbie at Solr and might be missing details not mentioned in the above pages.
Use Apache Tika to extract the content and send it to Solr:
Tika tika = new Tika();
InputStream fileInputStream = new FileInputStream("d:\\fileName.odt");
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "fileName.odt");
String content = tika.parseToString(fileInputStream, metadata);
Alternatively, you can use the ExtractingRequestHandler, which runs Tika inside Solr.
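To make the ExtractingRequestHandler route a bit more concrete, here is a minimal sketch of posting a file to Solr's /update/extract endpoint from Java 11+. The parameters literal.id, fmap.content and commit are standard Solr Cell parameters, but the host and port (localhost:8080 for Tomcat), the document id, the target field "body" and the class name SolrCellPost are assumptions for illustration. It also assumes the handler is enabled in your solrconfig.xml.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class SolrCellPost {
    // Builds the /update/extract URL. fmap.content renames Tika's "content"
    // output to targetField (e.g. a "body" field defined in the schema).
    static URI extractUri(String solrBase, String docId, String targetField) {
        String query = "literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
                + "&fmap.content=" + URLEncoder.encode(targetField, StandardCharsets.UTF_8)
                + "&commit=true";
        return URI.create(solrBase + "/update/extract?" + query);
    }

    // Sends the raw file as the request body and returns the HTTP status code.
    static int post(URI uri, Path file, String contentType) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Assumed base URL; adjust to wherever Solr is deployed.
        URI uri = extractUri("http://localhost:8080/solr", "fileName.odt", "body");
        System.out.println(uri);
        if (args.length == 1) {
            // application/vnd.oasis.opendocument.text is the ODT media type.
            System.out.println("HTTP "
                    + post(uri, Path.of(args[0]), "application/vnd.oasis.opendocument.text"));
        }
    }
}
```

Run without arguments, it only prints the URL it would use. Pass a path to an .odt file to actually send the request.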
Apache Tika was required. I found it at Apache Tika Download.
I want to add a new column that contains the raw HTML files. What configuration changes are required? I tried the segment reader, since the segment contains a content folder, but its output is a text file, whereas I want to index the HTML files in a column. How can I achieve this?
You may run into special-character issues when indexing raw HTML in Solr. In any case, first customize the index-basic plugin in Nutch. Its class is BasicIndexingFilter.java. Add the following to this class:
String htmlcontent = parse.getData();
doc.add("htmlContent", StringUtil.cleanField(htmlcontent));
After this, you also have to add an "htmlContent" field to the Solr schema. Hopefully that solves your issue.
There may be other options for this task as well.
I found another option, suggested in the comments, that works best. Use the Nutch CLI:
bin/nutch index crawldb-path -dir segments-directory -addBinaryContent -base64
I'm using the NodeJS elasticsearch package to interact with ElasticSearch. I have a document that has a file field. I want to be able to upload a file to the index but the only way that I have found is by using the elasticsearch-mapper-attachment plugin.
The problem is that if I use it, I have to load the whole file in memory, encode it to Base64 and then pass the String to ElasticSearch.
I'd like to be able to pass a Stream to ElasticSearch (referencing any binary file: pdf, xls, doc, ppt).
The elasticsearch-mapper-attachment plugin parses the uploaded binary file and extracts the text for indexing using its built-in Tika extractor.
What some applications do (for example, Search Technology's Aspire) is run the binaries through Tika locally, extract the text, and upload just that text with the documents to index.
It might not be the answer you are looking for, but you really have just two options: use the Elasticsearch plugin (and convert the binary to Base64 in your code before uploading the document), or parse the binary and extract the text in your own code and then upload just that text. The former is easier; the latter gives you more control over the process.
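On the Base64 step specifically: the encoding itself does not have to buffer the whole file. Below is a rough Java sketch (the question is about Node, so treat it as an illustration of the idea only) that encodes a stream chunk by chunk with java.util.Base64. The class name StreamingBase64 and the sample bytes are made up. Note this only covers the encoding side; the document you send to Elasticsearch still has to carry the encoded text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class StreamingBase64 {
    // Copies 'in' to 'out' as Base64. transferTo works through an internal
    // buffer, so the input is never held in memory as a whole.
    static void encode(InputStream in, OutputStream out) throws IOException {
        // Closing the wrapper writes the final padded group; it also closes 'out'.
        try (OutputStream b64 = Base64.getEncoder().wrap(out)) {
            in.transferTo(b64);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in practice 'in' would be a FileInputStream and
        // 'out' the stream that carries the request body.
        byte[] sample = "%PDF-1.4 sample bytes".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        encode(new ByteArrayInputStream(sample), sink);
        System.out.println(sink.toString(StandardCharsets.US_ASCII));
    }
}
```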
I am new to Nutch.
Nutch 1.7
I am looking at ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but it has been suspended since Tika took over.
How do I configure the Tika embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched all the Nutch documentation and wiki, but there's not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?
Any hints in the right direction are much appreciated.
thanks.
I don't think you can easily do this with Tika, but you can use these custom plugins to parse XML files based on XPath:
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
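For reference, the kind of XPath-based extraction those plugins are built around can be sketched with the JDK's own javax.xml.xpath API. This is not the plugins' code and not a Nutch parse plugin; the class name XPathExtract, the sample XML and the expression are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathExtract {
    // Returns the text content of every node matched by the XPath expression.
    static List<String> extract(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical custom XML; a real crawl would supply the fetched content.
        String xml = "<catalog><item><title>First</title></item>"
                + "<item><title>Second</title></item></catalog>";
        System.out.println(extract(xml, "/catalog/item/title")); // prints [First, Second]
    }
}
```

Inside Nutch, the extracted values would then be added as fields on the indexed document, which is the part the linked plugins take care of.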
I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several instances where my searches fail, i.e. a search does not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, and is not designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader). In general, Lucene does not parse file formats (except its own index files, of course). How to parse a file into meaningful content for Lucene is up to you to define.
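To see why that matters, here is a small sketch that roughly mirrors the reader chain described above, assuming UTF-8 decoding. The class name PlainTextRead and the temporary file are illustrative only. Whatever bytes are in the file get decoded as characters, so this yields searchable words for plain text and nothing else.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PlainTextRead {
    // Reads a file as raw bytes decoded to UTF-8 characters, line by line.
    // No format-specific parsing happens anywhere in this chain.
    static String readAsText(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.writeString(tmp, "lucene indexes text\n");
        System.out.println(readAsText(tmp).contains("lucene")); // prints true
        Files.delete(tmp);
    }
}
```

Pointed at a pdf or docx file, the same method returns the container's bytes decoded as characters, which is why the words you search for are usually not in the index.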
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.
I've been reading this but I was just wondering, does Solr have the capability to search static files (i.e. outside of a content management system or a database)?
Some of my files are just straight up html...or server side code with html "blocks"...
Solr can index any text input. The important bit is that it indexes text. So if your static files are not text files, you may need to run them through a tool like Tika first. Then Solr should have no problem indexing the extracted textual data.
There is the ExternalFileField field type, but its use looks limited.
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html