extracting information about outlink of url in nutch - nutch

I am quite new to nutch. I have crawled a site successfully using nutch 1.2 and extracted segment dump by readseg command but issue is that dump contains lot of information other than url and outlinks also if i want to analyse it, manual approach needs to be adopted.
It would be really great if there is any utiltiy, plugin which export link with out links in machine readable format like csv or sql.
Please suggest

Generally you will have to write your own application to do this. You can provide additional flags to remove unecessary data.
http://wiki.apache.org/nutch/bin/nutch%20readseg
check here for what flags can be used to reduce the data.
alternatively writing your own application using the hadoop FS library would be better, and then to extract the information directly programatically.
http://wiki.apache.org/hadoop/SequenceFile

Related

Crawler reading a pdf

i am trying to create a crawler that can read a pdf and extract certain information from it (to save in a database).
However, i am unsure which method / Tool to use.
My initial thought was to use PhantomJs but after reading a lot it doesn't seem that it has the capabilities. if I wanted to use Phantomjs I would have to download the pdf, convert it into an HTML page and then afterwards crawl it using Phantom which seems like a tedious task that should be able to be done faster.
So my question is, how can I read a pdf from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offer utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available both for c# and java lovers.
File inputFile = new File("");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

How to have node convert `.emf` to `.jpg` (or anything I can place on a webpage)

Stuck in this weird situation at work. I have .doc files I'm parsing with Node.JS. They have photos in them that are .emf I want to display in my web app. I have no issue getting the emf file out of the word doc, but I can't figure out how to display it on a webpage. Simply embedding as is didn't work. I tried to find a utility to convert them automatically but with no luck. I thought of converting them myself but can't find any tecnhical info on the .emf file.
Any suggestions?
EMF (WMF) are the SVG like formats of the 1990's.
I can't give you the full solution in this space but checkout this thread that uses Apache Batik
If you don't want to build it yourself perhaps try the paid version of converters
If you can't afford I would recommend to host the Batik and make a service endpoint and make calls to generate the desired format from EMF. It may turn out actually faster.

Lucene 4.2.0 index pdf

I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearcFiles.java and notice that I encounter several instances where my searches fail i.e. it does not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene won't handle how to parse different file formats (except it's own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.

how to parse the documents using Crawlers

I am new to this topic, but my requirement is to parse documents of different types(Html, pdf,txt) using a crawlers. please suggest me what crawler to use for my requirement and provide me some tutorial s or some example how to parse the document using crawlers.
Thankyou.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending if the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
If the data is on a dynamic page (for example, if the data is behind some forms that you need to do a database query to view it) then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away, you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library` if you're dealing with HTML source, but don't limit yourself to these options, you can use a language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706

Programmatically downloading a large number of <insert file type here>

I'm wondering if there's an easy way to download a large number of files of one arbitrary type, e.g., downloading 10,000 XML files. In the past, I've used Bing's API. It's free and offers unlimited queries. However, it doesn't index as many types of files as Google does. Google indexes XML files, CSV files, and KML files. (These can all be found by doing searches like "filetype:XML".) As far as I know, Bing doesn't index these in a way that's easily searchable. Is there another API that has these capabilities?
How about using wget? You can give wget a URL (for example, a google search result) and tell it to follow all the links on that page and download them (I bet you could also give it a filter).
Just tried it and got an ERROR 403: Forbidden. Apparently Google blocks requests from Wget. You'll have to provide a different user agent. Quick search provided this example:
http://www.mail-archive.com/wget#sunsite.dk/msg06564.html
Then it worked with the example given.

Resources