I'm trying to get the raw HTML of crawled pages into separate files, each named after the URL of the page. Is it possible with Nutch to save the raw HTML pages in separate files, skipping the indexing part?
There is no direct way to do that. You will have to make a few code modifications.
See this and this.
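If you'd rather avoid touching the Nutch code at all, one workaround is to dump the segment content with readseg and split the dump per URL. Below is a minimal Python sketch; the record markers (`Recno::`, `URL::`, `Content:`) and the readseg flags reflect the dump layout as I've seen it, so verify them against your own dump first, and the file names are placeholders.

```python
import re
from pathlib import Path

# First produce a content-only text dump, e.g.:
#   bin/nutch readseg -dump <segment_dir> <dump_dir> \
#       -nofetch -nogenerate -noparse -noparsedata -noparsetext
# then split the resulting file per URL.
out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)

current_url = None
html_lines = []
in_content = False

def flush():
    if current_url and html_lines:
        # Turn the URL into a safe file name.
        name = re.sub(r"[^A-Za-z0-9._-]", "_", current_url) + ".html"
        (out_dir / name).write_text("".join(html_lines), encoding="utf-8")

with open("dump", encoding="utf-8", errors="replace") as f:  # the file readseg wrote
    for line in f:
        if line.startswith("Recno::"):        # a new record begins
            flush()
            current_url, html_lines, in_content = None, [], False
        elif line.startswith("URL::"):
            current_url = line.split("URL::", 1)[1].strip()
        elif line.rstrip() == "Content:":     # raw page bytes follow this marker
            in_content = True
        elif in_content:
            html_lines.append(line)
flush()
```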
I am using python-docx-template and python-docx to create a DOCX file with one page. I need to duplicate that page in the document n times. How can I do this with Python?
python-docx doesn't have pages. However, it recognizes sections, so before you load the document with python-docx, make sure you insert section breaks before and after your target page.
However, python-docx currently has no API for grabbing the content of a section. If you really want it, you will have to walk through the underlying XML. You can start by inspecting the body element, e.g. print(document.element.body.xml).
You are basically looking for the contents delimited by w:sectPr elements. See the documentation here:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
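To give an idea of what that XML walk looks like, here is a minimal sketch. It assumes the page you want to duplicate ends with a section break, i.e. its last paragraph carries a w:sectPr inside its w:pPr; the file names and n are placeholders.

```python
import copy
from docx import Document

doc = Document("one_page.docx")
body = doc.element.body  # lxml element for <w:body>

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Collect every block-level element up to and including the paragraph
# that carries the first section break.
section_elems = []
for child in body:
    section_elems.append(child)
    if child.find(f"{W}pPr/{W}sectPr") is not None:
        break

# Insert deep copies of that section n-1 more times, right after the
# original (keeping them before the body-level w:sectPr at the end).
n = 3
insert_after = section_elems[-1]
for _ in range(n - 1):
    for elem in section_elems:
        dup = copy.deepcopy(elem)
        insert_after.addnext(dup)  # lxml: insert as the following sibling
        insert_after = dup

doc.save("duplicated.docx")
```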
I am new to nutch.
nutch 1.7
I am looking at ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but that has been discontinued since Tika took over parsing.
How do I configure the Tika parser embedded within Nutch 1.7 to parse the URL content based on XPath? I have searched all the Nutch documentation/wiki, but there's not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags selected by XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?
Any hints on the right direction much appreciated.
thanks.
I don't think you can easily do this with Tika, but you may use these custom plugins to parse XML files based on XPath:
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
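For a sense of what "extract based on XPath" means in practice, here is a standalone Python illustration (plain lxml, not Nutch plugin code; the XML layout and XPaths are invented for the example):

```python
from lxml import etree

# Toy document standing in for your custom XML format.
xml = b"""
<catalog>
  <product id="p1"><name>Widget</name><price>9.99</price></product>
  <product id="p2"><name>Gadget</name><price>19.99</price></product>
</catalog>
"""

root = etree.fromstring(xml)
for product in root.xpath("//product"):
    # Each field is pulled out by an XPath expression, exactly the idea
    # the plugins above let you configure.
    record = {
        "id": product.get("id"),
        "name": product.xpath("string(name)"),
        "price": product.xpath("string(price)"),
    }
    print(record)
```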
I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2 and extracted a segment dump with the readseg command, but the issue is that the dump contains a lot of information besides URLs and outlinks, so analysing it requires a manual approach.
It would be really great if there were a utility or plugin that exports each link with its outlinks in a machine-readable format like CSV or SQL.
Please suggest.
Generally you will have to write your own application to do this. You can provide additional flags to remove unnecessary data.
http://wiki.apache.org/nutch/bin/nutch%20readseg
Check there for the flags that can be used to reduce the dump.
Alternatively, it would be better to write your own application using the Hadoop FS library and extract the information programmatically.
http://wiki.apache.org/hadoop/SequenceFile
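For the common case of just exporting URL/outlink pairs to CSV, parsing the text dump directly is often enough. Here is a minimal Python sketch; it assumes the ParseData layout readseg prints (`URL::` and `outlink: toUrl:` lines), so check it against your dump, and the file names are placeholders.

```python
import csv
import re

# Produce a dump that keeps ParseData (which holds the outlinks), e.g.:
#   bin/nutch readseg -dump <segment_dir> <dump_dir> \
#       -nocontent -nofetch -nogenerate -noparse -noparsetext
url_re = re.compile(r"^URL::\s*(\S+)")
outlink_re = re.compile(r"^\s*outlink:\s*toUrl:\s*(\S+)")

rows = []
current_url = None
with open("dump", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = url_re.match(line)
        if m:
            current_url = m.group(1)
            continue
        m = outlink_re.match(line)
        if m and current_url:
            rows.append((current_url, m.group(1)))

with open("outlinks.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "outlink"])
    writer.writerows(rows)
```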
I need to show character data in HTML files. It works fine when the data is simple, but a problem arises when the data looks like markup.
Let me describe my problem.
I am showing data coming from database tables in HTML files (I am creating a table to show the data).
Now if the content in my table is something like <img src="445521.jpg">, it gives me an error while parsing,
since the parser tries to locate the image on my system.
In XML, we have <![CDATA["content"]]> to the rescue, but I don't know what to do for this in HTML.
Moreover, I am converting this HTML to PDF, and it gives me an error even when converting to PDF.
Can anybody tell me how to write the HTML so that the parser understands the content is character data?
Thanks in anticipation.
Try HttpUtility.HtmlEncode (you'll have to import System.Web). This will convert the special characters to HTML entities (e.g. < becomes &lt;).
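For illustration, the same idea in Python's standard library (the .NET call above does the equivalent):

```python
import html

# Special characters become HTML entities, so the browser renders the
# markup as plain text instead of interpreting it as a tag.
cell = '<img src="445521.jpg">'
print(html.escape(cell))  # &lt;img src=&quot;445521.jpg&quot;&gt;
```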
I've noticed that parse plugins like Tika extract the outlinks from the content, but the WebPage object passed into the getParse/2 method already has two arrays containing outlinks and inlinks.
What's the difference between the extraction in getParse and the one after the fetch?
Thanks.
The WebPage object is created from the information in the Nutch database (in my case HSQL).
The WebPage field outlinks (and some others) is filled after the parse process (after the getParse method returns).