I have various PDF files that fail a certain processing step because they are invalid.
I validate them with https://www.pdf-online.com/osa/validate.aspx
and when I validate a PDF, I get a message saying that it does not conform to the PDF 1.3 or PDF 1.4 standard.
I'm familiar with converting the PDF to text/JSON/a buffer, rebuilding it, and saving it as a new PDF file, but I was wondering whether there is an alternative. Each PDF is different and is essentially user input, so rebuilding it with jsPDF, for example, would have to be done differently for every file.
Is it possible to convert such a PDF document so that it conforms to the PDF 1.3/1.4 standards?
Related
I am currently working on a project to extract text from multi-page PDFs (generally circulars or application forms) using Tesseract OCR in Node.js. Since Tesseract only takes images as input, I cannot pass it the PDF directly. I need code that passes each page of the PDF to Tesseract and returns the result (it is not a problem if the results for the pages are appended together).
I tried using pdf-poppler, but I don't necessarily need to use it.
Technologies used: Tesseract OCR for JS, Node.js
Additional/optional info: Can I get suggestions for a free, open-source OCR engine to use, and for how to parse the text I get?
I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in Node.js. Here is a list of libraries that I have tried without success:
pdf-lib: This library fulfils almost all the requirements, except that it doesn't redact the text permanently, which is one of its limitations: https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, I tried to convert the PDF to an image so that the black redaction boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this library is not reliable, as it cannot extract the contents of most PDFs during the process.
Finally, pdf2pic: This library helps to overcome the limitation of the first library (pdf-lib) by converting the PDF into images. But it internally uses two non-Node libraries (GraphicsMagick and Ghostscript), which I am trying to avoid.
Is there a Node.js-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to overcome the limitation of pdf-lib?
I already have an Excel file in .xlsx format. I am trying to convert it to PDF using the Microsoft Graph API (by uploading the file to OneDrive and then downloading it as PDF). I am using the following API call:
https://graph.microsoft.com/v1.0/me/drive/items/[item-id]/content?format=pdf
I see that the PDF conversion performed by the above API doesn't honor all the page setup parameters that are set in the underlying .xlsx file. More specifically, the converted PDF is always rendered in landscape mode and seems to ignore the fit to width/height/page settings. If I open the same Excel file locally in Excel and save the document as PDF, it renders the document correctly and interprets all the page setup parameters properly.
Any help would be greatly appreciated on how I can get the PDF conversion API to render the PDF according to the orientation (portrait/landscape) and page width/height settings in the .xlsx file.
I have tried multiple smaller files with different page setup parameters, but the PDF conversion (using the REST API) always returns the document in landscape mode and seems to ignore the fit to page/width/height settings.
I want to convert any document, image, or text file to PDF on all operating systems.
I tried node-msoffice-pdf; it works fine on Windows but does not work on other operating systems.
Question:
How can I convert documents, images, and text files to PDF in Node.js?
I have used wkhtmltopdf for years to handle PDF conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an HTML file and pass it to the module, or render a PDF directly from a URL.
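As a rough sketch of those two modes, here is how the arguments for the underlying wkhtmltopdf command line could be assembled. This targets the wkhtmltopdf binary itself rather than the node-wkhtmltopdf wrapper, and the helper name wkhtmltopdfArgs, the PdfOptions type, and the file names are made up for illustration. Only the --page-size and --orientation flags are used.

```typescript
// Hypothetical options type; wkhtmltopdf has many more flags than these two.
interface PdfOptions {
  pageSize?: string; // e.g. "A4" or "Letter"
  orientation?: "Portrait" | "Landscape";
}

// Build the argument list for: wkhtmltopdf [options] <source> <output>
// `source` may be a local HTML file or a URL.
function wkhtmltopdfArgs(
  source: string,
  output: string,
  opts: PdfOptions = {}
): string[] {
  const args: string[] = [];
  if (opts.pageSize) args.push("--page-size", opts.pageSize);
  if (opts.orientation) args.push("--orientation", opts.orientation);
  args.push(source, output);
  return args;
}

// Mode 1: render straight from a URL.
const fromUrl = wkhtmltopdfArgs("https://example.com/", "page.pdf");

// Mode 2: render a local HTML file with explicit page settings.
const fromFile = wkhtmltopdfArgs("report.html", "report.pdf", {
  pageSize: "Letter",
  orientation: "Landscape",
});

console.log(["wkhtmltopdf", ...fromUrl].join(" "));
console.log(["wkhtmltopdf", ...fromFile].join(" "));
```

In a Node.js process the resulting array could be handed to child_process.execFile("wkhtmltopdf", args), assuming the wkhtmltopdf binary is installed and on the PATH.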
If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/@nativedocuments/docx-wasm which performs the conversion locally (i.e. where Node is running), without the need for LibreOffice etc.
In my application I upload a PDF file. After uploading, I need to display the information in the PDF file in an HTML form. We are using Angular 2 for the frontend and Node.js for the backend. Can anyone help me with this?
Please keep in mind that I need PDF to HTML.
One option is to convert your PDF to JSON using pdf2json.
pdf2json is a node.js module that parses and converts PDF from binary to json format, it's built with pdf.js and extends it with interactive form elements and text content parsing outside browser. The goal is to enable server side PDF parsing with interactive form elements when wrapped in web service, and also enable parsing local PDF to json file when using as a command line utility.
Run npm install pdf2json.
Create an empty JSON object whose keys are the main headings from the PDF, such as customer, age, etc. Its values are obtained from the uploaded PDF.
Use the values in this JSON to fill your form; when the form is saved, use Node.js to store it in your DB. Is this what you want?
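The steps above could look roughly like the sketch below. It does not call pdf2json; it works on an object shaped like the parsed output, and that shape is an assumption here (a Pages array whose Texts entries carry x/y positions and runs R with URI-encoded text in T), so check it against what your installed version emits. The helper names extractLines and fillForm, the "Label: value" line layout, and the sample data are invented for illustration.

```typescript
// Assumed subset of pdf2json's parsed output (verify against your version).
interface TextRun { T: string } // text, assumed URI-encoded
interface TextItem { x: number; y: number; R: TextRun[] }
interface Page { Texts: TextItem[] }
interface PdfData { Pages: Page[] }

// Decode a run, falling back to the raw string if it is not valid URI encoding.
function safeDecode(s: string): string {
  try {
    return decodeURIComponent(s);
  } catch (_e) {
    return s;
  }
}

// Collect the text items of every page, top to bottom, then left to right.
function extractLines(pdf: PdfData): string[] {
  const lines: string[] = [];
  for (const page of pdf.Pages) {
    const sorted = [...page.Texts].sort((a, b) => a.y - b.y || a.x - b.x);
    for (const item of sorted) {
      const text = item.R.map((r) => safeDecode(r.T)).join("").trim();
      if (text.length > 0) lines.push(text);
    }
  }
  return lines;
}

// Start from an "empty" form keyed by heading, then fill each value from the
// first line that looks like "Heading: value" (an assumed layout).
function fillForm(lines: string[], headings: string[]): Record<string, string> {
  const form: Record<string, string> = {};
  for (const heading of headings) {
    form[heading] = "";
    const prefix = heading.toLowerCase() + ":";
    const hit = lines.find((l) => l.toLowerCase().startsWith(prefix));
    if (hit) form[heading] = hit.slice(prefix.length).trim();
  }
  return form;
}

// Made-up sample standing in for a parsed upload.
const sample: PdfData = {
  Pages: [
    {
      Texts: [
        { x: 1, y: 2, R: [{ T: "Age%3A%2042" }] },
        { x: 1, y: 1, R: [{ T: "Customer%3A%20Jane%20Doe" }] },
      ],
    },
  ],
};

const form = fillForm(extractLines(sample), ["Customer", "Age"]);
console.log(form); // { Customer: "Jane Doe", Age: "42" }
```

The resulting object can be returned from the Node.js backend and bound to the form fields on the frontend. Real PDFs often split a label and its value into separate text items, in which case the matching in fillForm would need to pair items by position instead.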
What you need, simply, is to render a PDF in your application.
You could use the ng2-pdf-viewer library.
Almost all the basic functionality is available as properties of this component. You can adapt it to your requirements.