how to parse the documents using Crawlers - linux

I am new to this topic. My requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use, and point me to some tutorials or examples that show how to parse documents with a crawler.
Thank you.

This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
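As a rough illustration of the automated download step, here is a minimal Python sketch using only the standard library. The function names, the page_00000.html naming scheme, and the `fetch` parameter are assumptions made for this example; wget or curl in a loop would do the same job.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return the raw response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls, folder, fetch=fetch_bytes):
    """Save each URL's HTML source as a numbered file inside folder.

    `fetch` can be swapped out (hypothetical hook, handy for testing
    or for adding retries and delays between requests).
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        target = folder / f"page_{index:05d}.html"
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Once the files are on disk, the parsing step can be rerun as often as needed without hitting the site again.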
If the data is on a dynamic page (for example, if it sits behind a form that triggers a database query), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. With such a tool you can usually extract the data right away, without the intermediate step of explicitly saving the HTML source to disk and parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
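If there are many PDFs, the conversion can be scripted. The following is a minimal sketch that assumes pdftotext is installed and on the PATH; the helper names and the choice to write each .txt next to its .pdf are assumptions for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for one pdftotext call."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical layout
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_folder(folder):
    """Convert every *.pdf in folder to a .txt file next to it."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = build_pdftotext_cmd(pdf, pdf.with_suffix(".txt"))
        subprocess.run(cmd, check=True)
```

Passing keep_layout=False drops the -layout option, which gives the default reading-order output.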
1c. Extracting data from a spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. Don't limit yourself to these options, though; use whatever language or tool you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof; there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706
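For comparison, the same kind of pattern matching can be sketched in Python with the standard re module. This is only an illustration of the regular-expression approach with the same limitations as footnote [2]; the names EMAIL_RE and find_emails are made up for the example, and this is not the script from the gist.

```python
import re

# Same idea as the egrep pattern: local part, "@", domain, short TLD.
# re.IGNORECASE plays the role of egrep's -i flag.
EMAIL_RE = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "Contact alice@example.com or Bob.Smith@mail.example.org today."
    for address in find_emails(sample):
        print(address)
```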

Related

Add Bookmarks to pdf using Pymupdf

How can I add bookmarks to a PDF using PyMuPDF? I have seen many ways of doing it with PyPDF2, but since I'm already using PyMuPDF for other annotations I would prefer to use it for bookmarks too. I would also like to highlight text and add bookmarks to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at the details there, or rather in the PDF specification, you will have seen that this is an unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC).
Every entry in the list contains the hierarchy level, the text to display, and the page number, optionally followed by some information about where on the target page the bookmark points.
Then use doc.set_toc(toc_list). All pesky detail is taken care of for you.
If the PDF already has a TOC, extract it to a list of that same structure via toc_list = doc.get_toc().
Then modify as required.
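A minimal sketch of that workflow might look as follows. The helper names, file paths, and bookmark titles are hypothetical. The level check mirrors my understanding of what set_toc expects (the first entry at level 1, and no jump of more than one level downwards); treat that as an assumption and check it against the PyMuPDF documentation.

```python
def build_toc(entries):
    """Turn (level, title, page) tuples into the list-of-lists TOC format.

    Pages are 1-based. Raises ValueError if the hierarchy starts below
    level 1 or skips a level.
    """
    toc = []
    previous_level = 0
    for level, title, page in entries:
        if level < 1 or level > previous_level + 1:
            raise ValueError(f"bad hierarchy level {level} for {title!r}")
        toc.append([level, title, page])
        previous_level = level
    return toc


def write_bookmarks(src_path, dst_path, entries):
    """Replace the TOC of src_path and save the result to dst_path."""
    import fitz  # PyMuPDF; imported here so build_toc works without it

    doc = fitz.open(src_path)
    doc.set_toc(build_toc(entries))
    doc.save(dst_path)
    doc.close()


if __name__ == "__main__":
    # Hypothetical file names and titles, for illustration only.
    write_bookmarks(
        "input.pdf",
        "output.pdf",
        [(1, "Introduction", 1), (2, "Scope", 2), (1, "Methods", 5)],
    )
```

To add to an existing TOC rather than replace it, start from doc.get_toc(), append or insert entries, and pass the modified list back to doc.set_toc().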

Crawler reading a pdf

I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).
However, I am unsure which method or tool to use.
My initial thought was to use PhantomJS, but after a lot of reading it doesn't seem to have the capabilities. If I wanted to use PhantomJS, I would have to download the PDF, convert it into an HTML page, and then crawl that page with PhantomJS, which seems like a tedious way to do something that should be possible more directly.
So my question is, how can I read a pdf from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offers utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available for both C# and Java.
File inputFile = new File("");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

Using a list for a feature in an ML model

I want to run a machine learning algorithm on some data, so I'm exporting the data into a file first.
But one of my features for the text I'm classifying is a list of tags,
and each text can have multiple tags ex. (["mystery", "thriller"]).
When I write my CSV file for exporting the data, is it recommended to write that entire list as one of the features (the "tags" feature)?
Or is it better to make a separate feature for each tag? The only problem then is that most examples will only have one tag, so the other feature columns for those will be blank.
So it seems like writing this list of tags as one feature makes the most sense, but then when parsing it for training, would I then treat every element of that list as its own feature still or no?
If you do it as a single feature just make sure to use some delimiter to separate the tags that won't occur in any of the tags, and also isn't a comma (as that will mess with the csv format), something like | would probably do fine. When you go to build your models and read in that list of tags you can then split it based on that delimiter. In Java this would look like:
String[] tagList = inputString.split("\\|"); // split() takes a regex, so | must be escaped
I'm sure most languages will have a similar method to do this.
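As a sketch of the whole round trip in Python: the tags are joined with | on export, split again on import, and then, as one common option that the answer above does not prescribe, expanded into one 0/1 column per tag (multi-hot encoding). The column names and function names are assumptions for the example.

```python
import csv
import io

TAG_DELIMITER = "|"  # must not occur inside any tag


def write_rows(rows):
    """Serialise (text, tags) pairs to CSV text, tags in one column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "tags"])
    for text, tags in rows:
        writer.writerow([text, TAG_DELIMITER.join(tags)])
    return buffer.getvalue()


def read_rows(csv_text):
    """Parse the CSV text back into (text, tags) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["text"], row["tags"].split(TAG_DELIMITER) if row["tags"] else [])
        for row in reader
    ]


def multi_hot(rows):
    """Return the sorted tag vocabulary and one 0/1 vector per row."""
    vocabulary = sorted({tag for _, tags in rows for tag in tags})
    vectors = [[1 if tag in tags else 0 for tag in vocabulary] for _, tags in rows]
    return vocabulary, vectors
```

With this layout the file keeps a single "tags" column, and the sparse per-tag columns only exist in memory at training time.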

Scraping data into Stata

I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all these data into Stata. (Or alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and details of their balance sheet (revenue, profit, etc.)
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring anything usable.
Stata is not a good tool for this job, although in principle it is possible. I have done similar things myself: reading ASCII files into Stata, parsing them, and extracting information from them. I dumped the data into Stata using insheet and then processed it with Stata's string functions. It was a bit cumbersome, even though the files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think that the best strategy is to use a scripting language such as Python, Perl, or Ruby to extract the information contained in the HTML tables. The results can easily be written to a CSV, Excel, or even a Stata (.dta) file.
You should use the Python BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:
http://www.crummy.com/software/BeautifulSoup/
The documentation covers many commands, but only a few of them are important. These are the important ones:
from bs4 import BeautifulSoup
# read the file
with open(file_name, 'r') as fp:
    data = fp.read()
# pass the data to BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
# extract the html elements by id and write result into file
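To show what the extraction and export step could look like end to end, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependency. It assumes each file holds a simple table of tr/td/th elements without nesting; the class and function names are made up for the example, and real profit and loss pages will likely need extra handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html_text):
    """Return the table rows found in html_text as CSV text."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Looping this over all 40,000 files and appending the rows to one CSV gives a file that Stata or Excel can import directly.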

How to generate application forms/documents programmatically?

At the moment, we use MS Word and MS Excel to mail merge documents that need to be sent to multiple recipients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details put in an excel file, from where we can happily mail merge to generate all or just the necessary forms/documents.
However, I would like to automate this process, for example with a form on a website where the complainant can fill in his/her name, address and other details, which we could then use to generate the complaint form automatically and offer it for download (preferably as a PDF).
Now, the only solution that comes to mind, is Latex, so that I can just replace the needed entities and just compile to PDF. However, that bit has to be negotiated with the webhost, if they are offering Latex or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most webhosting solutions to offer?
EDIT: I would prefer a non-.NET, or rather non-Microsoft, solution, since the servers are running Linux, and while Mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required, we might have to delve into it.
Generating PDF using an XSL. Check the following: Apoc XSL-FO
You will need to create an XML file with the required fields and transform that with this tool.
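Creating that XML file from submitted form data is straightforward in most languages. Here is a minimal Python sketch; the root element name "complaint" and the field names are assumptions, and the real element names would have to match whatever the XSL stylesheet expects.

```python
import xml.etree.ElementTree as ET


def form_to_xml(fields, root_name="complaint"):
    """Build an XML string with one child element per form field.

    Keys must be valid XML element names; values are escaped
    automatically by ElementTree.
    """
    root = ET.Element(root_name)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical form submission, for illustration only.
    print(form_to_xml({"name": "Jane Doe", "address": "1 Main St & Co"}))
```

The resulting string can be written to a file and handed to the XSL-FO processor together with the stylesheet.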
If you wish to avoid .NET then XSL-FO is worth a look. Try the FOray project.
XSLT can have a steep learning curve if you do not already have experience with it. Also, users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel then I would stick with generating MS docs on the server. These are now easy to work with from code since OpenXML - check out OfficeOpenXML and OpenXMLDeveloper
Apache FOP : http://xmlgraphics.apache.org/fop/
I suggest generating rtf on the server: it's easy enough to automatically generate using cpan's RTF::Writer, has converters generating good pdf, can be edited by hand in word, oo-writer & TextEdit, doesn't have any really bad compatibility issues between the main editing applications, and has decent text & resource extraction tools, with text extraction being rather better than pdf.
There's some support for moving between rtf & latex, although the best rtf -> latex converter, docx2tex, depends on the System.IO.Packaging .net module, whose mono implementation isn't yet rock solid.
Postscript — Not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the pdf directly from the form data. If you wanted to do fancy things like signed pdf, that would be the way to go.
Postscript #2 — If you break up the Word document into individual files using word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on word .doc files using just standard file-utils and some trivial rtf->doc tweaking.

Resources