I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).
However, I am unsure which method or tool to use.
My initial thought was to use PhantomJS, but after reading a lot it doesn't seem to have the capabilities. If I wanted to use PhantomJS, I would have to download the PDF, convert it into an HTML page and then crawl that with Phantom, which seems like a tedious task that should be possible to do faster.
So my question is: how can I read a PDF from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offers utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available for both C# and Java.
File inputFile = new File(""); // put the path to your PDF here
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1)); // text of the first page
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction
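Since the PDF lives at an online source, one approach is to stream it straight from the URL. Here is a minimal sketch with iText 7 (the URL is a placeholder, and the regex/database step is left as a TODO):

import java.io.InputStream;
import java.net.URL;

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

public class ReadRemotePdf {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/report.pdf"); // placeholder URL
        try (InputStream in = url.openStream();
             PdfDocument pdf = new PdfDocument(new PdfReader(in))) {
            for (int i = 1; i <= pdf.getNumberOfPages(); i++) {
                String pageText = PdfTextExtractor.getTextFromPage(pdf.getPage(i));
                // TODO: run your regular expressions over pageText and store the matches in your database
                System.out.println(pageText);
            }
        }
    }
}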
I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).
If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?
Specifically, I'm asking because I've previously looked into writing a text extraction library, but in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work to get around server-side processing in this case.
To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).
To perform even a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the things you would need to handle include:
basic parsing of PDF syntax
indexing via the cross-reference table and/or cross-reference streams and object streams
objects (numbers, byte strings, hex strings, dictionaries, arrays, booleans, ...)
filters and variants (LZW, Flate, RunLength, Predictors)
encryption (RC4, AES, Custom security handlers)
page tree traversal
basic handling of page content streams
image handling
serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.
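For what it's worth, the server-side route the question mentions (iTextPDF) is only a few lines. A rough sketch with iText 7, where the file paths, page number and coordinates are all placeholders:

import com.itextpdf.io.image.ImageData;
import com.itextpdf.io.image.ImageDataFactory;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Image;

public class StampImage {
    public static void main(String[] args) throws Exception {
        // All paths are placeholders
        PdfDocument pdfDoc = new PdfDocument(new PdfReader("input.pdf"), new PdfWriter("output.pdf"));
        Document document = new Document(pdfDoc);

        // Place the image at an absolute position (x=36, y=700) on page 1
        ImageData imageData = ImageDataFactory.create("logo.png");
        document.add(new Image(imageData).setFixedPosition(1, 36, 700));

        document.close(); // closing the Document also closes the underlying PdfDocument
    }
}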
I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of RTF, PDF, DOC, and DOCX files. I then run SearchFiles.java and notice several instances where my searches fail, i.e. it does not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank you.
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene won't handle parsing different file formats (except its own index files, of course). How to parse a file into meaningful content for Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
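Roughly, you let Tika detect the format and extract plain text, then feed that text to Lucene. A minimal sketch, assuming the Tika and Lucene 4.x jars are on the classpath (the field names here are just examples):

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;

public class TikaIndexer {
    private final Tika tika = new Tika();

    // Tika detects the file type (PDF, DOC, DOCX, RTF, ...) and extracts plain text;
    // that text, not the raw bytes of the file, is what gets handed to Lucene.
    public void indexFile(IndexWriter writer, File file) throws Exception {
        String text = tika.parseToString(file);

        Document doc = new Document();
        doc.add(new StringField("path", file.getPath(), Field.Store.YES));
        doc.add(new TextField("contents", text, Field.Store.NO));
        writer.addDocument(doc);
    }
}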
You might also consider using Solr.
I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest a crawler for my requirement and provide some tutorials or examples of how to parse documents with crawlers.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
If the data is on a dynamic page (for example, if the data sits behind a form that triggers a database query before you can view it), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away, you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
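For example, a command like pdftotext -layout report.pdf report.txt should write the text of report.pdf into report.txt while trying to keep the original column layout (the file names here are just placeholders).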
1c. Extracting data from spreadsheets
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source, but don't limit yourself to these options; you can use any language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706
I want to export the MediaWiki markup for a number of articles (but not all articles) from a local MediaWiki installation. I want just the current article markup, not the history or anything else, with an individual text file for each article. I want to perform this export programmatically and ideally on the MediaWiki server, not remotely.
For example, if I am interested in the Apple, Banana and Cupcake articles I want to be able to:
article_list = ["Apple", "Banana", "Cupcake"]
for a in article_list:
    get_article(a, a + ".txt")
My intention is to:
extract required articles
store MediaWiki markup in individual text files
parse and process in a separate program
Is this already possible with MediaWiki? It doesn't look like it. It also doesn't look like Pywikipediabot has such a script.
A fallback would be to be able to do this manually (using the Export special page) and easily parse the output into text files. Are there existing tools to do this? Is there a description of the MediaWiki XML dump format? (I couldn't find one.)
On the server side, you can just export from the database. Remotely, Pywikipediabot has a script called get.py which gets the wikicode of a given article. It is also pretty simple to do yourself, something like this (writing this from memory, errors might occur):
import codecs
import wikipedia as pywikibot  # the old "compat" Pywikipediabot

site = pywikibot.getSite()  # assumes you have a user-config.py with a default site/user
article_list = ["Apple", "Banana", "Cupcake"]
for title in article_list:
    page = pywikibot.Page(site, title)
    text = page.get()  # handling of "page not found" etc. exceptions omitted
    with codecs.open(title + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
Since MediaWiki's markup language is not well-defined, the only reliable way to parse/process it is through MediaWiki itself; there is no support for that in Pywikipediabot, and the few tools which try to do it fail on complex templates.
It looks like getText.php is a builtin server-side maintenance script for exporting the wikitext of a specific article. (Easier than querying the database.)
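For example, running something like php maintenance/getText.php Apple > Apple.txt from the wiki's installation directory should dump the current wikitext of the Apple article into Apple.txt (the article title and output file name are just placeholders).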
Found it via Publishing from MediaWiki which covers all angles on exporting from MediaWiki.
At the moment, we use MS Word and MS Excel to mail merge documents that need to be sent to multiple recipients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details in an Excel file, from which we can happily mail merge to generate all or just the necessary forms/documents.
However, I would like to automate this process, like a form on a website where the complainant can fill in his/her name, address and other details, which we could then use to generate the complaint form automatically and offer it for download (preferably as a PDF).
Now, the only solution that comes to mind is LaTeX, so that I can just replace the needed entities and compile to PDF. However, that bit has to be negotiated with the web host, depending on whether they offer LaTeX or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most webhosting solutions to offer?
EDIT: I would prefer a non-.NET, or rather non-Microsoft, solution since the servers run Linux and, while Mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required we might have to delve into it.
You can generate the PDF using XSL-FO. Check the following: Apoc XSL-FO.
You will need to create an XML file with the required fields and transform that with this tool.
If you wish to avoid .NET then XSL-FO is worth a look. Try the FOray project.
XSLT can have a steep learning curve if you don't already have experience with it. Also, users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel then I would stick with generating MS docs on the server. These are now easy to work with from code since OpenXML - check out OfficeOpenXML and OpenXMLDeveloper
Apache FOP : http://xmlgraphics.apache.org/fop/
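If you go the XSL-FO route, the FOP embedding code itself is short. A minimal Java sketch, assuming FOP 2.x and placeholder file names (complaint-form.xsl is the stylesheet, complainant.xml the data produced from the web form):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

public class ComplaintFormPdf {
    public static void main(String[] args) throws Exception {
        FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());

        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("complaint.pdf"))) {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);

            // XSLT turns the submitted form data (XML) into XSL-FO; FOP renders the FO to PDF
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(new StreamSource(new File("complaint-form.xsl")));
            transformer.transform(new StreamSource(new File("complainant.xml")),
                                  new SAXResult(fop.getDefaultHandler()));
        }
    }
}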
I suggest generating RTF on the server: it's easy enough to generate automatically using CPAN's RTF::Writer, has converters that produce good PDF, can be edited by hand in Word, OO Writer and TextEdit, doesn't have any really bad compatibility issues between the main editing applications, and has decent text and resource extraction tools, with text extraction being rather better than PDF's.
There's some support for moving between RTF and LaTeX, although the best RTF -> LaTeX converter, docx2tex, depends on the System.IO.Packaging .NET module, whose Mono implementation isn't yet rock solid.
Postscript — Not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the pdf directly from the form data. If you wanted to do fancy things like signed pdf, that would be the way to go.
Postscript #2 — If you break up the Word document into individual files using Word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on Word .doc files using just standard file utilities and some trivial RTF -> DOC tweaking.