nutch parse custom xml with tika using xpath - nutch

I am new to Nutch and am using Nutch 1.7.
I am looking for ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but it has been suspended since Tika took over.
How do I configure the Tika parser embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched the Nutch documentation and wiki, but there is not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?
Any hints in the right direction are much appreciated.
Thanks.

I don't think you can easily do this with Tika, but you can use one of these custom plugins to parse XML files based on XPath:
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
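For orientation, the core step that any such plugin has to perform (evaluating XPath expressions against the fetched XML and collecting the results as named fields) can be sketched with the JDK's own javax.xml.xpath API. This is only an illustration, not the code of either plugin and not a Nutch or Tika API: the class XPathFieldExtractor, the rule map, and the sample XML are all hypothetical names made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Nutch or Tika): maps field names to XPath expressions. */
final class XPathFieldExtractor {

    /** Evaluates each XPath rule against the XML and returns field name -> string value. */
    static Map<String, String> extract(String xml, Map<String, String> rules) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations so crawled XML cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate() returns the string value of the first match, or "" if nothing matches.
            fields.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Sample document and rules are invented for this sketch.
        String xml = "<product><name>Widget</name><price currency=\"EUR\">9.50</price></product>";
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("name", "/product/name");
        rules.put("currency", "/product/price/@currency");
        System.out.println(extract(xml, rules));
    }
}
```

In a real plugin the rule map would presumably come from a configuration file, and the extracted values would be attached to the parse metadata so that an indexing filter can pick them up; how that wiring looks depends on the plugin you choose.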

Related

I want to add new column that contains html files in solr indexer using nutch 1.17 version

I want to add a new column that contains the raw HTML of each page. What configuration changes are required? I looked at the segment reader, which exposes the content folder, but its output is a text file; I want to index the HTML files in a column. How can I achieve this?
You may run into special-character issues when indexing raw HTML in Solr. In any case, first customize the index-basic plugin in Nutch. Its class is BasicIndexingFilter.java. Update this class with the following:
String htmlcontent = parse.getData();
doc.add("htmlContent", StringUtil.cleanField(htmlcontent));
After this, you also have to add an "htmlContent" field to the Solr schema. Hopefully this will solve your issue.
There may be other options for this task as well.
I found another option, suggested in a comment, that works best. Use the Nutch CLI:
bin/nutch index crawldb-path -dir segments-directory -addBinaryContent -base64
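With the -base64 option the raw content arrives in the index as a Base64 string, so whatever reads it back has to decode it before it looks like HTML again. A minimal sketch of that client-side step, using only java.util.Base64, is below. The class name RawHtmlDecoder is hypothetical, the input is assumed to be the stored field value you fetched from Solr yourself, and UTF-8 is an assumption: real pages may declare a different charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical client-side helper: turns a Base64 field value back into HTML text. */
final class RawHtmlDecoder {

    /** Decodes the Base64 value and interprets the bytes as UTF-8 (an assumption). */
    static String decode(String base64Value) {
        // The MIME decoder ignores line breaks, so chunked Base64 output is tolerated too.
        byte[] raw = Base64.getMimeDecoder().decode(base64Value);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for a value read from the index; encoded here only to have sample input.
        String stored = Base64.getEncoder()
                .encodeToString("<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(stored));
    }
}
```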

Transform XML to PDF using XSLT with nodejs

I have an XML file that contains a link to a remote XSL stylesheet, and I need to convert that XML to PDF. I have tried Prince, but it does not apply the desired style. Is there any other package that I can use for this purpose?
You can try the xml-pdf package (https://www.npmjs.com/package/xml-pdf); it also lets you set a custom template for the design. I have put together a GitHub repo for you at https://github.com/vivek9716/convert_pdf; please check whether it is helpful.

Heritrix 3.2.x: how to read content from WARC files?

Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help?
I tried using the Python warc tool and the Java-based warc-tools.jar.
To get an idea of what a WARC file consists of, just open it in a text editor. For a graphical view, you need a tool like webarchiveplayer, pywb, or openwayback.
Have you tried programming a reader using JWAT, or using the JWAT-Tools command line?
jwattools.cmd extract path.to.warc(.gz)
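If you only want to see how the records are laid out before reaching for JWAT, the format is simple enough to walk by hand: each record is a "WARC/1.0" line, named header fields, a blank line, and then exactly Content-Length bytes of payload. The sketch below is a hand-rolled reader built on that layout using only the JDK; it is not JWAT's API. MiniWarcReader, Record, and httpBody are hypothetical names. It assumes header names are spelled "WARC-Type" and "Content-Length" in that exact case, holds everything in memory, and does not undo chunked or compressed HTTP bodies. For .warc.gz input it relies on GZIPInputStream continuing across the per-record gzip members.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical minimal WARC walker for small files; use JWAT for anything serious. */
final class MiniWarcReader {

    /** One WARC record: its named header fields plus the raw content block. */
    static final class Record {
        final Map<String, String> headers = new LinkedHashMap<>();
        byte[] block = new byte[0];
    }

    /** Reads every record from an uncompressed WARC stream into memory. */
    static List<Record> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        List<Record> records = new ArrayList<>();
        String line;
        while ((line = readLine(data)) != null) {
            if (!line.startsWith("WARC/")) continue; // blank separator lines between records
            Record record = new Record();
            while ((line = readLine(data)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    record.headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // The block is read by length, so payload bytes are never mistaken for headers.
            int length = Integer.parseInt(record.headers.getOrDefault("Content-Length", "0"));
            record.block = new byte[length];
            data.readFully(record.block);
            records.add(record);
        }
        return records;
    }

    /** For a "response" record: returns the bytes after the HTTP headers (empty if none found). */
    static byte[] httpBody(byte[] block) {
        for (int i = 0; i + 3 < block.length; i++) {
            if (block[i] == '\r' && block[i + 1] == '\n' && block[i + 2] == '\r' && block[i + 3] == '\n') {
                return Arrays.copyOfRange(block, i + 4, block.length);
            }
        }
        return new byte[0];
    }

    /** Reads one CRLF- or LF-terminated line; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') buf.write(b);
        }
        if (b == -1 && buf.size() == 0) return null;
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String path = args[0]; // path to a .warc or .warc.gz file
        try (InputStream file = new FileInputStream(path)) {
            InputStream in = path.endsWith(".gz") ? new GZIPInputStream(file) : file;
            for (Record r : readAll(in)) {
                if ("response".equals(r.headers.get("WARC-Type"))) {
                    System.out.println(r.headers.get("WARC-Target-URI"));
                    System.out.println(new String(httpBody(r.block), StandardCharsets.UTF_8));
                }
            }
        }
    }
}
```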

Indexing and accessing odt files in solr

How can I post, index, and search for content within an ODT file stored in my solr_home directory?
I have tried to understand and apply the pages listed below, and I have included a body field in the schema:
Indexing text and html files
Simple Post Tool -Confluence
The resourcename field contains the file location, but the content field is blank.
I am still not able to search the file contents, even though Solr shows that the file is indexed and the changes are committed.
Is there any end-to-end documentation for such a requirement?
I am using Solr with Tomcat on a Linux machine.
I'm a newbie at Solr and might be missing details not mentioned in the above pages.
Use Apache Tika to extract the content and send it to Solr:
Tika tika = new Tika();
InputStream fileInputStream = new FileInputStream("d:\\fileName.odt");
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "fileName.odt");
String content = tika.parseToString(fileInputStream, metadata);
Alternatively, you can use Solr's ExtractingRequestHandler.
Apache Tika was required. I found it on the Apache Tika download page.

Lucene 4.2.0 index pdf

I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several cases where my searches fail, i.e. a search does not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, and is not designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader), so every file is read as if it were plain text. In general, Lucene does not parse file formats (except its own index files, of course). How to parse a file into meaningful content for Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.
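To make the failure concrete: a .docx file is a ZIP archive, so its text is stored compressed, and decoding the raw bytes as characters (which is effectively what the demo's reader chain does) never produces the words you search for. The sketch below reproduces that with the JDK alone. It is not Lucene or Tika code: DocxNaiveReadDemo and its methods are hypothetical, fakeDocx builds only a stand-in archive with a single word/document.xml entry rather than a valid Word file, and readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: why reading a .docx as plain text hides its words from an indexer. */
final class DocxNaiveReadDemo {

    /** Builds a tiny stand-in for a .docx: a ZIP archive holding word/document.xml. */
    static byte[] fakeDocx(String bodyText) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            String xml = "<w:document><w:t>" + bodyText + "</w:t></w:document>";
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** What a plain-text reader effectively does: decode the file's raw bytes as text. */
    static String naiveRead(byte[] file) {
        return new String(file, StandardCharsets.UTF_8);
    }

    /** A format-aware read: unzip the archive and return the document part's XML. */
    static String unzipDocumentXml(byte[] file) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(file))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = fakeDocx("The quarterly report mentions zeppelin maintenance schedules");
        System.out.println("naive read finds word:   " + naiveRead(docx).contains("zeppelin"));
        System.out.println("unzipped read finds word: " + unzipDocumentXml(docx).contains("zeppelin"));
    }
}
```

A parser library such as Tika does the format-aware part for you (and for PDF, DOC, and RTF as well); the text it returns is what you would hand to Lucene in place of the raw file reader.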
