html or latex to pdf parser

html or latex to pdf parser - node.js

I am trying to generate receipt and invoice in pdf format
What is the best method:
Latex to pdf
Html to pdf
I have tried using photomjs to generate html to pdf, but the alignment is difficult. I am thinking about using latex to generate invoice.
Any good latex parser to generate pdf using nodejs hosting on AZURE?

It seems currently there is not an application or service on Azure present provides Latex server. And to use latex in node.js, we need to install a TeX distribution like TeX Live or MikTeX.
So you can create an Azure VM and build the application for your requirement.

Related

Computer Vision 2.0 PDF to text not working

I am trying to parse text from a PDF file using Computer Vision 2.0. I am following the example and have changed the MediaTypeHeaderValue to "application/pdf". I get an error that the content type is not supported. I change it to "multipart/form-data" and get an error in processing. How do I use Computer-Vision to process PDF files?

Kevin,
You are using the legacy "OCR" API that does not support PDF input. Please use the new OCR technology available as the "Read" API - see overview for processing PDF documents. The version 3.0 is in GA since May. Read supports large images and multi-page and mixed languages documents up to 2000 pages long.
Please see the Read REST API QuickStart in C#.
Note that Form Recognizer is great if you want to extract not just text, but layout insights such as tables, check-boxes, and key value pairs from forms, use pre-built models, and build custom models to process your documents. It's now in GA.

Take a look at the Form Recognizer service for extracting data from the PDF.
https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/

Azure resource to handle unstructured data sources

we have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest azure resource to extract data from common document formats: DOC, DOCX, PDF, RTF, TXT, HTML, etc and then to do analysis on the extracted data.

It sounds like you just want to extract raw text or images from these rich text format documents. If only do these, some libraries of parsing different documents is your real needs.
Here is some libraries in Java or Python to do that. If you are using .NET which I'm not familiar with, you can search in Google or Bing to find these alternative for .NET.
To parse the office document like DOC, DOCX: for Java, Apache POI is a good library for extracting data from MS office files; for Python, there seems to be not any package to do that, except using COM object like Word.Application or IronPython (Reading/Writing MS Word files in Python) in .NET on Windows.
To parse PDF files: there are Apache PDFBox, jPDFText for Java and PyPDF2 for Python.
To read RTF format file: Java natively supports via javax.swing.text.rtf.RTFEditorKit which you can get some sample code via search; like #1, also seems none for Python.
To parse HTML files: jsoup for Java and BeautifulSoup & HTMLParser for Python are best for extracting data from HTML.
For reading TXT format files, I think it's simple for any languages. But to extract valuable information from text content, Stanford NLP for Java and NLTK for Python are useful, also using Azure Text Analytics API of Cognitive Service can help doing some like key phrase extraction, and language detection.
Apache Tika toolkit for content analysis is a good solution, too. Even you can deploy it alone and to invoke its REST APIs by Python, other languages.
If you want to extract text from images, you can use Azure Computer Vision API of Cognitive Services to extract printed text or handwritten text, or use the third party library such as Tess4J or others you searched in GitHub.
All of above are almost depended on the third party dev kits without Azure resources. However, you can store these documents in Azure Storage and process them on Azure VM or Batch services, even to analyze the extract data in Azure Jupyter Notebook or use Azure ML to do more deeper research.

NodeJS - PDF generate from URL

Wanted to generate a PDF from a URL
(https://10.1.40.117/print/e71b7c0f-4ed1-4d0d-b868-87418d398a4a).
Please help me with the links which is used to do this using nodeJS

I use Puppeteer to generate PDFs and their documentation has many examples. Since it uses Chrom(e|ium), it closes match my development environment as well which is nice when building the web pages.

For those who might stumble on this question nowadays:
There is cool tool called Gotenberg — Docker-powered stateless API for converting HTML, Markdown and Office documents to PDF. It supports converting URLs via Google Chrome headless.
And I am happen to be an author of JS/TS client for Gotenberg — gotenberg-js-client
I welcome you to use it :)
UPD:
Gotenberg has new website now — https://gotenberg.dev

Node.js - PDF, DOC, DOCX to PNG

Is there any library that can parse and generate a PNG from a Doc, Docx and PDF file?
We're implementing a training system using Node, Sails.js, Express and SQL and would like to generate some PNG image tiles for training modules based on a file upload.
I've done some searching and found some libraries in C# that can do all 3, as well as a just PDF impementation for Node but can't find anything that does more than that.
A point towards any 3rd party libraries or standard implementations of this method would be great.
Thanks

You can do that sort of stuff with C# (probably only on Windows) because C# is from MS stables, the same stable that churns out doc and docx. I am not sure whether the same implementation would work on Linux or Mac (even with Mono).
If you want to achieve this in NodeJS, just create the app in C#, wrap it in a ReSTful cover and call this ReSTful service in NodeJS (via Kue or something similar).
Honestly, converting file formats is a compute intensive process process. I wouldn't recommend it doing it the same main thread any way. If you're anyway gonna spawn a worker, you might as well do it in C# where it's perhaps faster.

Not necessarily an exact match for your requirement, but since you mentioned training purpose, I would recommend Watson Developer Cloud - it has document conversion among many other features which may be relevant and useful for your objective as a whole.
Speaking of the current problem, please see Document conversion overview to see how we can convert a PDF into a desired format such as HTML. Then you could actually get the PNG files from the HTML resource bundle.
Hope this helps.

JBOSS create PDF, XLS

I am writing a JBOSS web app with Struts2 and would like to produce reports in PDF and XLS format. How can I do this? Are there popular packages that can do this for me?

Here's a list of PDF libraries for Java. We use iText extensively.
jFreeReport (scroll down on linked page to find) also offers Excel generation, though I have not used that.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string