Is it possible to add an image to a PDF without rendering the PDF? - node.js

I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).
If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?
Specifically, I'm asking because I've previously looked into writing a text extraction library, put in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work to get around server side processing in this case.
To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).

To perform a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the thing you would need to handle include:
basic parsing of PDF syntax
indexing via the cross reference index, and/or cross reference index and object streams
objects (num, byte-string, hex string, dictionary, arrays, booleans...)
filters and variants (LZW, Flate, RunLength, Predictors)
encryption (RC4, AES, Custom security handlers)
page tree traversal
basic handling of page content streams
image handling
serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.

Related

A Study on the Modification of PDF in nodejs

Project Environment
The environment we are currently developing is using Windows 10. nodejs 10.16.0, express web framework. The actual environment being deployed is the Linux Ubuntu server and the rest is the same.
What technology do you want to implement?
The technology that I want to implement is the information that I entered when I joined the membership. For example, I want to automatically put it in the input text box using my name, age, address, phone number, etc. so that the user only needs to fill in the remaining information in the PDF. (PDF is on some of the webpages.)
If all the information is entered, the PDF is saved and the document is sent to another vendor, which is the end.
Current Problems
We looked at about four days for PDFs, and we tried to create PDFs when we implemented the outline, structure, and code, just like it was on this site at https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF
However, most PDFs seem to be compressed into flatDecode rather than this simple. So I also looked at Data extraction from /Filter /FlateDecode PDF stream in PHP and tried to decompress it using QPDF.
Unzip it for now.Well, I thought it would be easy to find out the difference compared to the PDF without Kim after putting it in the first name.
However, there is too much difference even though only three characters are added... And the PDF structure itself is more difficult and complex to proceed with.
Note : https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (PDF official document in English)
Is there a way to solve the problem now?
It sounds like you want to create a PDF from scratch and possibly extract data from it and you are finding this a more difficult prospect than you first imagined.
Check out my answer here on why PDF creation and reading is non-trivial and why you should reach for a tool you help you do this:
https://stackoverflow.com/a/53357682/1669243

Crawler reading a pdf

i am trying to create a crawler that can read a pdf and extract certain information from it (to save in a database).
However, i am unsure which method / Tool to use.
My initial thought was to use PhantomJs but after reading a lot it doesn't seem that it has the capabilities. if I wanted to use Phantomjs I would have to download the pdf, convert it into an HTML page and then afterwards crawl it using Phantom which seems like a tedious task that should be able to be done faster.
So my question is, how can I read a pdf from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offer utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available both for c# and java lovers.
File inputFile = new File("");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

Question of UIWebView and Core Data

I want to develop a news App such as Engadge etc. The news had loaded from the server, and now I'll save the news included body text and pictures into database(Core Data). Can UIWebView read the datas from Core Data directly, and shows in UIWebview?
Thanks.
Yes and no. You can store the HTML content in the database (CoreData), to show an article you use: loadHTMLString:baseURL: of UIWebView to show the textual content. The images however is probably best stored outside the database as file because you will have to point the image references in your HTML to an actual file.
You could store the images as BLOBs but then you need to pull those blobs and write as files later for UIWebView to be able to pick them up.
I think the easiest way is to store images as files and but place references to them inside CoreData. That way you can also delete them accordingly later on.
If by directly, you mean without any glue code, then no, not on iOS as of present.
You could however use Core data to store text and image objects as desired. It might not be the best idea to use a UIWebView, but to answer your question, it's definitely possible, and in fact quite easy to do so.

Append a dynamically changing watermark to a PDF in SharePoint

This is primarily a question of possibilities more than instructions. I'm a programming consultant working on a WSS project site system for my client. We have a document library in which files are uploaded to go through a complex approval process. With multiple stages in this process, we have an extra field which dictates what the current status of the document is.
Now, my client has become enamored with the idea of PDF watermarking. He wants the document (which is already a PDF) to be affixed with a watermark corresponding to the current status, such that with each stage of the approval process the watermark will change.
One method, the traditional method for PDF watermarking, of accomplishing this is to have one "clean" copy of the document somewhere hidden on the site, and create a new PDF from it that has the watermark at each stage of the approval process. Since the filename will never change, this new PDF can be uploaded continually to a public library, always overwriting the old version and simulating a "dynamically changing watermark". However, in the various stages there will also be people uploading clean copies with corrections and suggestions, nevermind the complex nature of juggling around two libraries and the fact we double the number of files stored. My client and I agree that this is not a practical path to choose.
What we would like to do is be able to "modify" the watermark in a PDF, so that we only have to keep one copy of the file. Unfortunately, from what I've seen, in most cases when you make something like a watermark, which in its nature is supposed to be "unmodifyable", you won't be able to edit it later. So, is it possible to have a part of a PDF which cannot be changed by anyone who downloads the file, but can be changed as part of a workflow or other object model process?
PDF Watermarking in SharePoint is a common request. I have written extensively on this topic. See:
Adding a dynamic watermark to a PDF file from a SharePoint Workflow
Adding a (static) watermark to a PDF file from a SharePoint Workflow
Use SharePoint Workflows to inject JavaScript into PDFs and print the ‘open date’
You could use Event Handlers such that code was run every time a document was checked in. In that code you could perform the fixup/check that made the watermark be what you wanted it to be. This assumes you can write code that manipulates a PDF's internal structure such that it has the watermark that you desire.
It sounds to me like you want to allow people to modify the PDF they download, but not modify its watermark. This is probably going to be nigh on impossible if the watermark is embedded in the PDF (afaict) but what if the watermark image is external to the PDF; is it possible to embed a watermark in a PDF that is sourced via HTTP? Then you could embed:
<watermark image="http://sharepoint/site/_vti_bin/docstatus.asmx?id=5">
Of course, I have no idea about PDFs, so this might not be possible but you get the concept.
-Oisin
It is possible to do so if you use third party tool. Then you can put dynamically binded value from your SharePoint metadata, conditions, rules etc: http://www.pdfsharepoint.com

How to generate application forms/documents programmatically?

At the moment, we use MS WORD and MS EXCEL to mail merge documents that needs to be sent to multiple recepients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details put in an excel file, from where we can happily mail merge to generate all or just the necessary forms/documents.
However, I would like to automate this process, like a form in a website where the complainant can fill in his/her name, address and other details, and we could use that to generate the complaint form automatically and offer it to be downloaded (preferrably as a pdf).
Now, the only solution that comes to mind, is Latex, so that I can just replace the needed entities and just compile to PDF. However, that bit has to be negotiated with the webhost, if they are offering Latex or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most webhosting solutions to offer?
EDIT: I would prefer a non .NET or rather non microsoft solution since, the servers are running linux and while mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required we might have to dwelve into it.
Generating PDF using an XSL. Check the following: Apoc XSL-FO
You will need to create an XML file with the required fields and transform that with this tool.
If you wish to avoid .NET then XSL-FO is worth a look. Try the FOray project.
XSLT can be a steep learn if you do not have experience already. Also users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel then I would stick with generating MS docs on the server. These are now easy to work with from code since OpenXML - check out OfficeOpenXML and OpenXMLDeveloper
Apache FOP : http://xmlgraphics.apache.org/fop/
I suggest generating rtf on the server: it's easy enough to automatically generate using cpan's RTF::Writer, has converters generating good pdf, can be edited by hand in word, oo-writer & TextEdit, doesn't have any really bad compatibility issues between the main editing applications, and has decent text & resource extraction tools, with text extraction being rather better than pdf.
There's some support for moving between rtf & latex, although the best rtf -> latex converter, docx2tex, depends on the System.IO.Packaging .net module, whose mono implementation isn't yet rock solid.
Postscript — Not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the pdf directly from the form data. If you wanted to do fancy things like signed pdf, that would be the way to go.
Postscript #2 — If you break up the Word document into individual files using word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on word .doc files using just standard file-utils and some trivial rtf->doc tweaking.

Resources