How to crawl images in Nutch 2.3 with HBase as backend? - nutch

I want to crawl images from certain sites. So far I have tried modifying
regex-urlfilter.txt.
I changed:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
To:
-\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
But it didn't work. I am surprised that I couldn't find any documentation on crawling images with Nutch 2.3. A pointer to any existing documentation would be a great help.

To fetch and store images using Nutch, follow these steps:
1- Adjust the URL filter's regular expression so that it no longer rejects image formats such as jpg, jpeg, tif, gif and png (which you have already done).
2- Implement a parse plugin for parsing images. For more information about Nutch extension points and writing the required plugin, see these links:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
3- Tell Nutch about the implemented plugin and use it for image file formats:
This takes two steps. First, modify conf/parse-plugins.xml and map your implemented plugin to the image MIME types:
<mimeType name="image/jpeg">
<plugin id="parse-image" />
</mimeType>
<mimeType name="image/gif">
<plugin id="parse-image" />
</mimeType>
<mimeType name="image/png">
<plugin id="parse-image" />
</mimeType>
Second, add the implemented plugin to nutch-site.xml so that it is loaded at Nutch runtime: add its id (parse-image in the mapping above) to the value of the plugin.includes property.
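Steps 1 and 3 can be sanity-checked outside Nutch. The following is a minimal Python sketch, not Nutch code. It assumes that regex-urlfilter.txt is evaluated top to bottom, that the first rule whose pattern is found in the URL decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. It also assumes that plugin.includes is a regular expression that must match a plugin id in full. The catch-all `+.` rule and the `HYPOTHETICAL_INCLUDES` value are illustrative assumptions and are not taken from the question.

```python
import re


def url_allowed(rules, url):
    """Apply '+pattern' / '-pattern' rules in order; the first match decides."""
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


def plugin_enabled(includes_regex, plugin_id):
    """Assumption: the plugin id must match the plugin.includes regex in full."""
    return re.fullmatch(includes_regex, plugin_id) is not None


# Rule from the question before the edit, followed by an assumed catch-all.
ORIGINAL_RULES = [
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE"
    r"|jpeg|JPEG|bmp|BMP|js|JS)$",
    "+.",
]

# Rule from the question after the edit, followed by the same catch-all.
EDITED_RULES = [
    r"-\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS"
    r"|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$",
    "+.",
]

# Hypothetical plugin.includes value that lists the parse-image plugin.
HYPOTHETICAL_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|image)|index-(basic|anchor)"
)

if __name__ == "__main__":
    image_url = "http://example.com/photo.jpg"
    print("original rules allow image:", url_allowed(ORIGINAL_RULES, image_url))
    print("edited rules allow image:", url_allowed(EDITED_RULES, image_url))
    print("parse-image enabled:", plugin_enabled(HYPOTHETICAL_INCLUDES, "parse-image"))
```

If the edited rules accept the image URL but nothing is stored, the filter is probably not the problem, and the missing parse plugin (steps 2 and 3) is the more likely cause.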

Related

Does PDFBox execute JavaScript and executables?

I want to use PDFBox to convert PDFs to images. I have the following questions:
Does PDFBox look for executables and JavaScript while rendering?
If it identifies executables, does it ignore the script/executable, or does it not render at all?
Is there official documentation where I can read about security-related questions and how PDFBox rendering works?

How to have node convert `.emf` to `.jpg` (or anything I can place on a webpage)

I'm stuck in a weird situation at work. I have .doc files that I'm parsing with Node.js. They contain photos in .emf format that I want to display in my web app. I have no trouble getting the EMF file out of the Word doc, but I can't figure out how to display it on a webpage. Simply embedding it as-is didn't work. I tried to find a utility to convert them automatically, but with no luck. I thought of converting them myself, but I can't find any technical info on the .emf file format.
Any suggestions?
EMF and WMF are the SVG-like vector formats of the 1990s.
I can't give you the full solution in this space, but check out this thread that uses Apache Batik.
If you don't want to build it yourself, perhaps try a paid converter.
If you can't afford one, I would recommend hosting Batik behind a service endpoint and calling it to generate the desired format from EMF. It may actually turn out to be faster.

nutch parse custom xml with tika using xpath

I am new to Nutch.
Nutch 1.7
I am looking for ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but it has been suspended since Tika took over.
How do I configure the Tika embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched all the Nutch documentation and wiki, but there's not much information there.
Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?
Any hints in the right direction would be much appreciated.
Thanks.
I don't think you can easily do this with Tika, but you can use these custom plugins to parse XML files based on XPath:
https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
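The idea behind such plugins, mapping field names to XPath expressions and evaluating them against each record node, can be illustrated with a short standalone Python sketch. This is not a Nutch plugin and does not use Tika. The sample document, the record path and the field names are all hypothetical, and the standard library's ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom XML, standing in for the fetched page content.
SAMPLE_XML = """<catalog>
  <product sku="A1"><name>Widget</name><price>9.99</price></product>
  <product sku="B2"><name>Gadget</name><price>19.50</price></product>
</catalog>"""

# Hypothetical configuration: field name -> path relative to a record node.
FIELD_XPATHS = {"name": "name", "price": "price"}


def extract_records(xml_text, record_xpath, field_xpaths):
    """Return one dict per node matched by record_xpath."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(record_xpath):
        # findtext returns None when a field's path matches nothing.
        record = {field: node.findtext(path) for field, path in field_xpaths.items()}
        record["sku"] = node.get("sku")  # attributes are read with get()
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_XML, "./product", FIELD_XPATHS):
        print(record)
```

In a real crawl the equivalent logic would live inside a parse or indexing plugin, with the XPath mapping read from a configuration file, which is roughly what the linked projects describe.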

extracting information about outlink of url in nutch

I am quite new to Nutch. I have crawled a site successfully using Nutch 1.2 and extracted a segment dump with the readseg command, but the dump contains a lot of information other than URLs and outlinks, so analysing it requires a manual approach.
It would be really great if there were a utility or plugin that exports each link with its outlinks in a machine-readable format like CSV or SQL.
Please suggest.
Generally you will have to write your own application to do this. You can pass additional flags to readseg to remove unnecessary data.
http://wiki.apache.org/nutch/bin/nutch%20readseg
Check there for the flags that can be used to reduce the data.
Alternatively, it would be better to write your own application using the Hadoop FS library and extract the information directly and programmatically.
http://wiki.apache.org/hadoop/SequenceFile
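A lighter-weight option than reading the SequenceFiles directly is to post-process the text dump. The sketch below is a minimal Python example under stated assumptions: it expects each record in the dump to start with a `URL:: ...` line and each outlink to appear as `outlink: toUrl: <url> anchor: <text>`, which is how readseg dumps from Nutch 1.x are commonly laid out. Check this against your own dump, since the format is not taken from the question. The sample text and the function names are hypothetical.

```python
import csv
import io
import re

# Assumed line shapes in a readseg text dump (verify against your own output).
URL_LINE = re.compile(r"^URL::\s*(\S+)")
OUTLINK_LINE = re.compile(r"^\s*outlink: toUrl: (\S+) anchor: ?(.*)$")

# Hypothetical excerpt of a dump, used only to exercise the parser.
SAMPLE_DUMP = """Recno:: 0
URL:: http://example.com/

ParseData::
Status: success(1,0)
Title: Example
Outlinks: 2
  outlink: toUrl: http://example.com/a anchor: Page A
  outlink: toUrl: http://example.com/b anchor: Page B
"""


def extract_outlinks(dump_lines):
    """Yield (source_url, target_url, anchor) for every outlink line."""
    current_url = None
    for line in dump_lines:
        url_match = URL_LINE.match(line)
        if url_match:
            current_url = url_match.group(1)
            continue
        link_match = OUTLINK_LINE.match(line)
        if link_match and current_url is not None:
            yield current_url, link_match.group(1), link_match.group(2).strip()


def outlinks_to_csv(dump_lines, out):
    """Write a header row plus one CSV row per outlink to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["source_url", "target_url", "anchor"])
    for row in extract_outlinks(dump_lines):
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    outlinks_to_csv(SAMPLE_DUMP.splitlines(), buffer)
    print(buffer.getvalue())
```

For a real dump, open the dump file and pass the file object as `dump_lines`. Restricting the dump to parse data with the readseg flags described on the wiki page above keeps the input small.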

Is it possible to add itunes:image to an Atom feed with enclosures?

I have an Atom feed for my blog. I recently had some audio content that I wanted to publish, and I added audio enclosure link tags for each entry that has associated audio. Even though I realize that RSS is the canonical format for podcasting, iTunes is able to digest the Atom feed just fine and download the audio enclosures. However, I wanted to add an image for the resulting podcast (iTunes sort of treats this as album art), but my attempt at adding an itunes:image tag didn't seem to work.
I tried this:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<itunes:image href="http://files.mcgeary.org/avatars/ryan-mcgeary-headshot-black-grayscale-600.jpg" />
<title>...</title>
...
<entry>...</entry>
</feed>
Is what I'm trying to accomplish possible or is converting my feed to RSS the only way to get podcast album art in iTunes?
After some experimenting and googling, it seems to me that iTunes does not use <itunes:image> or <image> images from podcast feeds. Images embedded that way seem to be used only once you submit your podcast to the iTunes Store, which then fetches and uses them. So if you don't want to submit your feed to the iTunes Store, even converting your Atom feed to RSS won't help.
Luckily, there seems to be a solution: embed your artwork into the actual media files.

Resources