Azure resource to handle unstructured data sources - azure

We have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest an Azure resource to extract data from common document formats (DOC, DOCX, PDF, RTF, TXT, HTML, etc.) and then analyze the extracted data.

It sounds like you just want to extract raw text or images from these rich-text documents. If that is all you need, what you really want is a set of parsing libraries for the different document formats.
Here are some libraries for Java and Python. If you are using .NET, which I'm not familiar with, you can search Google or Bing for .NET alternatives.
To parse Office documents such as DOC and DOCX: for Java, Apache POI is a good library for extracting data from MS Office files; for Python, there does not seem to be a package for this, other than driving a COM object such as Word.Application, or using IronPython on .NET under Windows (see "Reading/Writing MS Word files in Python").
To parse PDF files: Apache PDFBox and jPDFText for Java, and PyPDF2 for Python.
To read RTF files: Java supports this natively via javax.swing.text.rtf.RTFEditorKit, for which sample code is easy to find by searching; as with Office documents, there seems to be nothing comparable for Python.
To parse HTML files: jsoup for Java, and BeautifulSoup and HTMLParser for Python, are good choices for extracting data from HTML.
Reading TXT files is simple in any language. To extract valuable information from the text content, Stanford NLP for Java and NLTK for Python are useful; the Azure Text Analytics API in Cognitive Services can also help with tasks such as key phrase extraction and language detection.
The Apache Tika toolkit for content analysis is a good solution too. You can even deploy it standalone and invoke its REST API from Python or other languages.
To extract text from images, you can use the Azure Computer Vision API in Cognitive Services, which handles printed or handwritten text, or a third-party library such as Tess4J or others you find on GitHub.
Almost all of the above rely on third-party dev kits rather than Azure resources. However, you can store the documents in Azure Storage and process them on an Azure VM or Azure Batch, and then analyze the extracted data in Azure Jupyter Notebooks or use Azure ML for deeper analysis.
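As a small illustration of the HTML case mentioned above, here is a minimal sketch that pulls visible text out of an HTML document using only the standard-library HTMLParser. The class name TextExtractor and the helper html_to_text are made up for this example, and skipping only script and style content is an assumption about what counts as "visible"; a real pipeline would probably want BeautifulSoup or Tika instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string can then be handed to whatever analysis step you choose, for example NLTK or the Text Analytics API.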

Related

Computer Vision 2.0 PDF to text not working

I am trying to parse text from a PDF file using Computer Vision 2.0. I am following the example and have changed the MediaTypeHeaderValue to "application/pdf". I get an error that the content type is not supported. When I change it to "multipart/form-data", I get a processing error instead. How do I use Computer Vision to process PDF files?
Kevin,
You are using the legacy "OCR" API, which does not support PDF input. Please use the new OCR technology available as the "Read" API; see the overview for processing PDF documents. Version 3.0 has been GA since May. Read supports large images and multi-page, mixed-language documents up to 2000 pages long.
Please see the Read REST API QuickStart in C#.
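For readers not using C#, the Read flow can be sketched in Python with only the standard library. Read is asynchronous: you submit the document, receive an Operation-Location URL, and poll it until the status is final. The URL path, the Ocp-Apim-Subscription-Key header, and the JSON field names below reflect my understanding of the v3.0 REST API and should be checked against the QuickStart; the function names, the polling interval, and the use of application/octet-stream as the content type are assumptions made for this example.

```python
import json
import time
import urllib.request


def build_read_request(endpoint, key, pdf_bytes):
    """Build the POST that submits a document to Read (assumed v3.0 path)."""
    url = endpoint.rstrip("/") + "/vision/v3.0/read/analyze"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
    )


def lines_from_result(result):
    """Flatten the recognized text lines of every page into one list."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]


def read_pdf(endpoint, key, pdf_bytes, poll_seconds=1.0, max_polls=60):
    """Submit a PDF, poll the operation URL, and return the text lines."""
    with urllib.request.urlopen(build_read_request(endpoint, key, pdf_bytes)) as resp:
        operation_url = resp.headers["Operation-Location"]
    poll = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_polls):
        with urllib.request.urlopen(poll) as resp:
            result = json.load(resp)
        status = result.get("status")
        if status == "succeeded":
            return lines_from_result(result)
        if status == "failed":
            raise RuntimeError("Read operation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Read operation did not finish in time")
```

Calling read_pdf requires a real endpoint and key; build_read_request and lines_from_result can be exercised offline.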
Note that Form Recognizer is great if you want to extract not just text, but layout insights such as tables, check-boxes, and key value pairs from forms, use pre-built models, and build custom models to process your documents. It's now in GA.
Take a look at the Form Recognizer service for extracting data from the PDF.
https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/

Azure Read API for Vector PDFs

I am working on an OCR solution using the Azure Read API, which provides an out-of-the-box solution for raster PDFs
https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#read-api
but I don't see whether it supports vector-based PDFs. I have another solution using third-party libraries such as Aspose and PDFxStream, but I would prefer to stay within the Azure Vision API ecosystem.
So my question is: is it possible to use the Read API for vector PDFs, and if not, what is the most practical approach?
To answer my own question: yes, it supports vector-based PDFs, although this is not explicitly mentioned in the API documentation. We checked both through the Azure portal and through API code, and it works. Mixing raster and vector-based PDFs is no problem either.

Node.js - PDF, DOC, DOCX to PNG

Is there any library that can parse and generate a PNG from a Doc, Docx and PDF file?
We're implementing a training system using Node, Sails.js, Express and SQL and would like to generate some PNG image tiles for training modules based on a file upload.
I've done some searching and found some libraries in C# that can do all three, as well as a PDF-only implementation for Node, but I can't find anything that does more than that.
A point towards any 3rd party libraries or standard implementations of this method would be great.
Thanks
You can do that sort of thing with C# (probably only on Windows) because C# comes from the MS stable, the same one that churns out DOC and DOCX. I am not sure whether the same implementation would work on Linux or Mac (even with Mono).
If you want to achieve this in Node.js, just create the app in C#, wrap it in a RESTful service, and call that service from Node.js (via Kue or something similar).
Honestly, converting file formats is a compute-intensive process. I wouldn't recommend doing it on the main thread anyway. If you're going to spawn a worker regardless, you might as well do it in C#, where it's perhaps faster.
Not necessarily an exact match for your requirement, but since you mentioned training purpose, I would recommend Watson Developer Cloud - it has document conversion among many other features which may be relevant and useful for your objective as a whole.
Speaking of the current problem, please see Document conversion overview to see how we can convert a PDF into a desired format such as HTML. Then you could actually get the PNG files from the HTML resource bundle.
Hope this helps.

html or latex to pdf parser

I am trying to generate receipts and invoices in PDF format.
Which is the best method:
Latex to pdf
Html to pdf
I have tried using PhantomJS to generate PDF from HTML, but getting the alignment right is difficult. I am thinking about using LaTeX to generate the invoices.
Is there a good LaTeX-to-PDF tool that works from Node.js hosted on Azure?
It seems there is currently no application or service on Azure that provides a LaTeX server. To use LaTeX from Node.js, you need to install a TeX distribution such as TeX Live or MiKTeX.
So you can create an Azure VM and build the application for your requirement.

To study & analyse vulnerability on Pdf

This is my project topic given by my college. Can somebody please give me an idea on where to start with this topic.
I have seen a lot of topics on pdf vulnerability but the problem is they require knowing a lot of security stuff beforehand. I have less than a week to submit the project.
If somebody could just guide me to where I should start I would be really grateful.
I have already looked at Didier Stevens's site, but it's getting really tough for me to understand since there is so little time.
The most important point about PDF security is that most 'popular' attacks target:
application-related vulnerabilities in the most popular free PDF readers, Adobe Reader and Foxit Reader;
humans, getting them to click on a malicious attachment inside a PDF to initiate the attack.
Check these analysis and parsing utilities and documents:
Didier Stevens's PDF tools, which include make-pdf-javascript.py (a JavaScript injection tool), pdfid.py (which scans a PDF for keywords, including those for embedded JavaScript), and others;
PDF Stream Dumper and its source code;
PDF Miner Py, a PDF parsing library written in Python;
PDF.js, a JavaScript-based PDF renderer that can help you learn PDF structure parsing right from your browser console (widely used in online services such as Dropbox);
Official PDF Format Specification from Adobe for PDF 1.4 and PDF 1.7.
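To get a feel for what keyword scanning means before reading the tools' source, here is a minimal sketch that counts a few PDF name objects commonly associated with active content. It is not how pdfid.py is implemented; the name list, the function name count_pdf_names, and the plain byte search are assumptions chosen for illustration, and the sketch ignores obfuscated names and compressed object streams, which real tools have to deal with.

```python
import re

# PDF names often checked when triaging a file for active content.
SUSPICIOUS_NAMES = (
    "/JS",
    "/JavaScript",
    "/OpenAction",
    "/AA",
    "/Launch",
    "/EmbeddedFile",
    "/AcroForm",
)


def count_pdf_names(data, names=SUSPICIOUS_NAMES):
    """Count occurrences of each name in the raw bytes of a PDF file."""
    counts = {}
    for name in names:
        # The lookahead stops "/AA" from matching a longer name like "/AAPL".
        pattern = re.compile(re.escape(name.encode("ascii")) + rb"(?![A-Za-z0-9])")
        counts[name] = len(pattern.findall(data))
    return counts
```

A non-zero count for /OpenAction together with /JavaScript is a reason to look closer, not proof that a file is malicious. To scan a file, pass in the bytes read with open(path, "rb").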
