I've noticed that parse plugins like Tika extract the outlinks from the content, but the WebPage object passed to the getParse/2 method already has two arrays containing outlinks and inlinks.
What's the difference between the extraction done in getParse and the link data that is already there after the fetch?
Thanks.
The WebPage object is created from the information in the Nutch database (in my case HSQL).
The WebPage field outlinks (and some others) is filled after the parse step, i.e. after the getParse method returns.
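A quick way to see this is to log the link maps from inside your own parser. Below is a minimal sketch, assuming the Nutch 2.x Gora-backed WebPage (the exact key/value types vary between versions, e.g. Utf8 vs CharSequence; the java.util.Map import and an SLF4J LOG are taken for granted):
// Inside a custom Parser.getParse(String url, WebPage page) implementation (sketch only):
// at this point the maps hold what earlier crawl rounds stored; the outlinks returned in
// the Parse object are written back to the page only after getParse completes.
Map<CharSequence, CharSequence> outlinks = page.getOutlinks();   // URL -> anchor text
Map<CharSequence, CharSequence> inlinks  = page.getInlinks();
LOG.info("outlinks stored on the page so far: {}", outlinks == null ? 0 : outlinks.size());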
I want to add a new column (field) that contains the raw HTML files. What configuration changes are required? I looked at the segment reader, which reads the content folder, but its output is a text file; I want to index the HTML files in a column. How can I achieve this?
You may run into special-character issues with raw HTML when indexing in Solr. Anyhow, first you should customize the index-basic plugin in Nutch. Its class is BasicIndexingFilter.java. Update this class with the following:
// parse.getText() is the text the parser extracted; the raw markup itself is only
// available through the binary-content indexing option mentioned further below.
String htmlContent = parse.getText();
doc.add("htmlContent", StringUtil.cleanField(htmlContent));
After this, you also have to add a field named "htmlContent" to the Solr schema. Hopefully that will solve your issue.
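For context, here is roughly where those two lines end up; a minimal sketch of the modified filter method, assuming the Nutch 1.x indexing-filter signature and keeping the stock plugin's existing logic in place:
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  // ... keep the stock title/host/content handling of BasicIndexingFilter here ...
  String htmlContent = parse.getText();                       // text produced by the parser
  doc.add("htmlContent", StringUtil.cleanField(htmlContent)); // new Solr field
  return doc;
}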
There may be other options for this task as well.
I found another option, as commented, that works best: use the Nutch CLI.
bin/nutch index crawldb-path -dir segments-directory -addBinaryContent -base64
I am not sure if I am going to be able to describe this right, but I'll give it a go.
We are working on implementing Azure Search. At the core level we have searchable PDF documents, and we want their text added to the index so all of them are searchable.
The initial thought was to just submit each document to the index via the Add Documents REST API. The thinking was that this would be the simplest and quickest path to getting the text of the document into the index. We also considered using an indexer: keeping all the searchable PDF docs in a blob store and having the indexer crawl them every 10-15 minutes.
We also looked into (based on a recommendation) submitting a standalone JSON file with the text from the PDF in it, pushing it to the index either via the same Add Documents API or by placing that file in a blob store. Within the JSON document we would need document identifiers that give the index the location of the PDF, so that when the text is found via search we can make the result clickable and open the PDF.
It seems to me the best option is pushing in the JSON file with the Add Documents API, indexing that, and, when it shows up in a search, using the doc ID to link back to the PDF and open it.
For those of you who have used Azure Search: how did you implement this?
If you're totally sure that only PDFs will live in this particular index, then the indexer approach is faster to implement, since the native indexer can both extract the content of the PDF documents and push it to the index.
Both approaches will work, but if you push the documents yourself, you would need to extract the PDF text with an external tool.
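If you do go with the push route, adding a document is a single HTTP POST against the index's docs endpoint. Here is a minimal sketch with Java's built-in HttpClient, where the service name, index name, and field names (id, content, pdfUrl) are all assumptions you would replace with your own:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushPdfText {
  public static void main(String[] args) throws Exception {
    // Batch of one document; "content" holds the text extracted from the PDF,
    // "pdfUrl" lets the UI link the hit back to the original file (field names are assumed).
    String body = "{ \"value\": [ {"
        + "\"@search.action\": \"mergeOrUpload\","
        + "\"id\": \"doc-001\","
        + "\"content\": \"extracted PDF text goes here\","
        + "\"pdfUrl\": \"https://example.blob.core.windows.net/docs/doc-001.pdf\""
        + "} ] }";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://my-service.search.windows.net/indexes/pdf-index/docs/index?api-version=2020-06-30"))
        .header("Content-Type", "application/json")
        .header("api-key", System.getenv("AZURE_SEARCH_ADMIN_KEY"))
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}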
I am new to Nutch, and I am using Nutch 1.7.
I am looking at ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but that has been suspended since Tika took over.
How do I configure the Tika parser embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched the Nutch documentation/wiki, but there is not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch configuration? Or do I have to override the Tika parser?
Any hints in the right direction are much appreciated. Thanks.
I don't think you can easily do this with Tika, but you can use these custom plugins to parse XML files based on XPath (a rough sketch of the XPath extraction itself follows after the links):
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
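If you end up writing your own parse plugin instead, the XPath part is just the standard javax.xml machinery applied to the raw page bytes (in Nutch you would get those from Content.getContent() inside a custom parser or parse filter). A minimal sketch, where the class name and the example expression are made up:
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExtractor {
  // Extract a single value from the raw XML bytes of a fetched page.
  public static String extract(byte[] rawContent, String expression) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(rawContent));
    XPath xpath = XPathFactory.newInstance().newXPath();
    return xpath.evaluate(expression, doc);   // e.g. "/catalog/book[1]/title" (example only)
  }
}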
We are creating a map application that has a list view of the result set, built with JSON and jQuery, and presents the results to the user. How can we give the user the ability to download the results as an Excel/CSV file?
Assuming a web application...
Convert JSON to CSV
Present it as an HTTP response (see the sketch below).
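If the conversion happens server side, it only takes a few lines with any JSON library. Here is a minimal sketch using Jackson, assuming a flat JSON array of objects; for the download you would then return the string with Content-Type: text/csv and a Content-Disposition: attachment header:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class JsonToCsv {
  // Convert a flat JSON array like [{"name":"A","lat":1.2,"lng":3.4}, ...] into CSV text.
  public static String convert(String json) throws Exception {
    JsonNode rows = new ObjectMapper().readTree(json);
    StringBuilder csv = new StringBuilder();

    // Header row from the first object's field names.
    List<String> columns = new ArrayList<>();
    rows.get(0).fieldNames().forEachRemaining(columns::add);
    csv.append(String.join(",", columns)).append('\n');

    // One CSV line per object (no quoting/escaping here -- fine for simple values only).
    for (JsonNode row : rows) {
      List<String> values = new ArrayList<>();
      for (String column : columns) {
        values.add(row.path(column).asText());
      }
      csv.append(String.join(",", values)).append('\n');
    }
    return csv.toString();
  }
}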
You mentioned JSON and jQuery. I assume this is being used on the client side? There are a number of utilities for making the JSON-to-CSV conversion, depending on the language you want to use. Here are some examples.
Javascript/jQuery
PHP
Python
I'm trying to get the raw HTML of crawled pages into separate files, each named after the URL of the page. Is it possible with Nutch to save the raw HTML pages in separate files while skipping the indexing part?
There is no direct way to do that. You will have to make a few code modifications.
See this and this.
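The usual modification is a small standalone tool that reads the segment's content data and writes each record out as its own file. Below is a minimal sketch against the Nutch 1.x sequence-file layout; the part-file path and the URL-to-filename mangling are assumptions, not something Nutch provides:
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpRawHtml {
  // args[0] = path to a segment, args[1] = output directory
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Raw fetched bytes are stored under <segment>/content/part-00000/data
    Path data = new Path(args[0], "content/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      // Turn the URL into a safe file name (very naive mangling, just for illustration).
      String fileName = url.toString().replaceAll("[^A-Za-z0-9._-]", "_") + ".html";
      Files.write(Paths.get(args[1], fileName), content.getContent());
      System.out.println("wrote " + fileName);
    }
    reader.close();
  }
}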