I am looking at Datalogics' Adobe PDF Library to repair and optimize PDF files for printing. The APDFL v15.0.0PlusP1a (5/18/2016) release notes make reference to PDFProcessor for C++, but that seems to be missing from the sample files. The PDFOptimizer looks promising, but it does not repair known badly formed PDF files.
The Adobe PDF Library's PDDocOpenWithParams() method allows you to set a doRepair flag:
doRepair: If true, attempt to repair the file if it is damaged; if false, do not attempt to repair the file if it is damaged.
Will it fix a badly formed PDF? How bad is bad? If Acrobat is able to resolve the issues and display the document, then the Adobe PDF Library should be able to deal with the document as well.
Regarding the PDFProcessor sample: in release v15.0.4PlusP2b the samples were restructured, and the samples listed on our website reflect those changes. Some of the old samples were removed or rewritten. PDFProcessor has been temporarily removed but is available if needed for evaluations or customer use. The PDFProcessor sample shows how to convert PDF documents to PDF/A- and PDF/X-compliant PDF files.
I'm developing a web page. I want to create an option to generate a .pdf file and allow the user to download it.
Currently I'm using jsPDF but I'm finding it very hard to properly format the document.
I was hoping to find a new way: build the document in markdown format, compile it, and then download it.
Is there a way I can do this in node.js where, say, I have a string in memory (the markdown text), compile that into a PDF, and then download it from the page?
I haven't found any package that really does this; if you know of one that can achieve it, feel free to let me know and I'll figure it out.
For such a thing, I recommend building your .pdf file as an HTML file first, so you can edit it easily (hardcoded or dynamically), and then converting the HTML file to a .pdf file.
There are a lot of packages that do this; have a closer look at this one:
https://www.npmjs.com/package/html-pdf-node
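For instance, here is a rough sketch of the full flow from the question: compile an in-memory markdown string to HTML, then hand the HTML to html-pdf-node. The marked package for the markdown-to-HTML step and the Express-style download at the end are my own assumptions, not something the question or the package prescribes:

// npm install marked html-pdf-node
const { marked } = require('marked');
const htmlToPdf = require('html-pdf-node');

async function markdownToPdf(markdownString) {
  // 1. Compile the in-memory markdown string to HTML.
  const html = marked.parse(markdownString);

  // 2. Render the HTML to PDF; html-pdf-node drives headless Chromium,
  //    so ordinary CSS can be used to format the document.
  const file = { content: '<html><body>' + html + '</body></html>' };
  const options = { format: 'A4' };
  return htmlToPdf.generatePdf(file, options); // resolves to a PDF Buffer
}

// Hypothetical Express route offering the result as a download:
// app.get('/report', async (req, res) => {
//   const buf = await markdownToPdf('# Report\n\nSome **bold** text.');
//   res.type('application/pdf').attachment('report.pdf').send(buf);
// });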
I have a task where I need to fill in PDF forms. I think they are called AcroForms, but I am not sure. As opposed to XFA forms (embedded in PDFs), AcroForms are less dynamic; they don't have many features compared to XFA.
I am coding for NodeJS, so I tried its pdffiller module. This library is only a wrapper around so-called pdftk, the PDF Toolkit.
It took me almost a day to figure out how to use it on my Ubuntu 18.04 development laptop. I couldn't install or compile it, so I had to download a Docker image, unpack it, and place the compiled pdftk in specific system folders to allow pdffiller to work (the lib goes in /usr/lib/x86_64-linux-gnu/libgcj.so.16.0.0, the binary goes in /usr/local/bin/pdftk).
Then there are the forms themselves. Those downloaded from the official government web page (these are tax return forms) are XFA forms, and they do not work with pdftk - all I get is a request for a password which I don't have:
Error: Failed to open PDF file:
tax-return-form.pdf
OWNER PASSWORD REQUIRED, but not given (or incorrect)
Done. Input errors, so no output created.
We took the other approach: we bought Adobe Acrobat DC to convert the flat PDFs to simple AcroForms. Then we tested again what PDF Toolkit can do. Two problems are blockers and render pdftk unusable for us:
No output for Polish diacritic signs (ąęćśżźnó)
No ability to check a checkbox
And the tax form has plenty of those checkboxes!
I would like to ask what tool we should use. Is there anything open source, or free for commercial use, that will fill PDF forms properly?
Edit:
I found a way to select checkboxes. After using pdftk's dump_data_fields_utf8 operation, I got a file with a lot of information about the fields:
---
FieldType: Button
FieldName: checkbox3
FieldFlags: 0
FieldJustification: Left
FieldStateOption: Off
FieldStateOption: Yes
---
The checkbox above has a FieldStateOption of either Off or Yes. Just putting Yes in the JSON field-value map made the checkboxes selected.
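For reference, a minimal sketch of that field-value map, driven through the pdffiller wrapper mentioned above. The file and field names are hypothetical; fillFormWithFlatten with its last flag set to false is pdffiller's non-flattening path:

const pdfFiller = require('pdffiller');

const fieldValues = {
  // Text fields take plain strings.
  full_name: 'Jan Kowalski', // hypothetical field name
  // Checkboxes take one of their FieldStateOption values:
  // 'Yes' checks the box, 'Off' clears it.
  checkbox3: 'Yes',
};

pdfFiller.fillFormWithFlatten(
  'tax-return-acroform.pdf', // source form (hypothetical name)
  'tax-return-filled.pdf',   // destination
  fieldValues,
  false,                     // false = don't flatten (see the caveats below)
  (err) => { if (err) throw err; }
);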
It is doable, but:
- there are no UTF-8 fonts if one uses the 'flatten' option
- one has to use 'need_appearances', which excludes 'flatten'
- Ubuntu's PDF viewer Evince 3.28.4 doesn't know how to display the Polish characters
- Firefox, Chrome, and Adobe Reader 9 for Linux do display the UTF-8 fonts properly
And thanks for the downvote without an explanation why; SO sucks as usual.
I'm trying to access my iCloud Notes with a Python script using the pyiCloud framework, but when I try to list the notes, it seems that the Documents folder is empty. Does anyone know how I should do that?
>>> api.files['com~apple~Notes']['Documents'].dir()
It returns:
[]
It sounds like you have an authentication problem that prevents you from getting access to your Notes from the file storage (the UbiquityService). This issue might give you some more clues.
On the other note (!), I found the following to be a better way to get my Notes exported. I have tried a couple of ways mentioned around the web. Although there is no in-app solution to export Notes in a format other than PDF files, I have stumbled upon the following two solutions:
Export in Markdown (or in other formats in the paid version) via the Bear app. I found this way easier and of higher quality in terms of keeping the formatting, attachments, etc.:
Download the Bear migration Workflow script from here and follow the instructions.
[optional] At this point, you have the HTML files with inline encoded images. Use my script to decode images to get regular HTML files with the images in an accompanying directory.
Install Bear and import the exported files from Notes.
Export the files as Markdown, HTML, or whatever format you desire from File -> Export Notes within the Bear app. Don't forget to check the "Export attachments" box in the export dialog.
Export in HTML (and then convert to Markdown if you want) via the Notes Exporter app. The app gives you HTML files with inline Base64-encoded images, saved with a .txt extension (?!). Although I personally like this way, as it generates output files that closely mimic the original Notes, the hyperlinks are missing in the exported files (it still keeps the hyperlink coloring, though):
Download Notes Exporter from here.
Export Notes to the path you choose.
[optional] Rename file extensions to .html.
Voilà, now you have your Notes as HTML files with the same formatting and images.
Decode the inline Base64 images and save HTML files with the images in a separate adjacent directory, using the script I wrote for this:
https://gist.github.com/SHi-ON/945ea2272ea4bb29e13bd0305370da90
Hope this helps to give you an idea!
Suppose we are migrating a set of MS Office files from (say) a shared drive to SharePoint (e.g. SharePoint Online). We are limited to Office 2007 onwards, so file extensions like DOCX and XLSX.
We see that the size of a file changes when it is saved to SharePoint, as certain metadata is added.
(The file sizes of non-MS Office files such as PDF or JPEG do NOT change, though.)
These MS Office files are "containers" in which a number of component parts are placed - this can be viewed crudely by changing the extension of an XLSX file (say) to ZIP and opening it with WinZip.
For sound integrity reasons, we want to assure ourselves that the "file content" component parts have not changed.
How can we identify the component parts within those containers which represent the Content?
Are such component parts invariant when saved to SharePoint as described?
If so, are there any utilities which could analyse a pair of such files and confirm that the content is the same, or tell us if it has been changed? Is there perhaps some checksum we could generate from both files and compare?
If no such utility exists, what sort of environment would be best for creating one - could it be done in VB.NET and/or C#, for instance?
Thanks.
This previous post relates to the same issue but does not provide the sort of answer we need: C# - Hash contents of MS Office documents without metadata
Interesting topic.
How can we identify the component parts within those containers which represent the Content?
Within the docx you will need to assess each of the content files. Please be aware that the files within a docx are compressed using deflate, so you will probably have to inflate them. This means not only the document.xml and document.xml.rels files but also:
- the header xml files (there can be more than 1)
- the header .rels files
- the footer xml files (again, multiple files)
- the footer .rels files
- the media files (containing images)
You even have to check the core.xml file, in case SharePoint property demotion alters a field like the title.
To summarise, you cannot compare the docx files at the docx level. You will need to unpack them and compare (use e.g. CRC32 or MD5) each of the "content" files.
I am not aware of utilities providing this functionality.
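Rolling your own is not much work, though. Here is a rough sketch in Node.js (the adm-zip package is my own choice; the same approach ports to VB.NET/C# via System.IO.Packaging, as asked): unpack both containers, hash only the content parts, and compare the hashes:

const AdmZip = require('adm-zip');
const crypto = require('crypto');

// Content parts per the list above: document/header/footer XML, their
// .rels files, and media. core.xml and app.xml (the property parts that
// SharePoint property demotion rewrites) are deliberately excluded.
const isContentPart = (name) =>
  /^word\/(document|header\d*|footer\d*)\.xml$/.test(name) ||
  /^word\/_rels\/(document|header\d*|footer\d*)\.xml\.rels$/.test(name) ||
  /^word\/media\//.test(name);

function contentHashes(path) {
  const hashes = {};
  for (const entry of new AdmZip(path).getEntries()) {
    if (isContentPart(entry.entryName)) {
      // getData() inflates the deflate-compressed part for us.
      hashes[entry.entryName] =
        crypto.createHash('md5').update(entry.getData()).digest('hex');
    }
  }
  return hashes;
}

function sameContent(fileA, fileB) {
  const ha = contentHashes(fileA), hb = contentHashes(fileB);
  const names = new Set([...Object.keys(ha), ...Object.keys(hb)]);
  return [...names].every((n) => ha[n] === hb[n]);
}

console.log(sameContent('before.docx', 'after.docx')
  ? 'content unchanged' : 'content differs');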
Note: if you just need to upload the files to SharePoint for archiving then placing them into separate zip files might be an alternative. This is of course only an option if you just need to store the content and do not expect the users to make any changes.
Paul
At the moment, we use MS Word and MS Excel to mail merge documents that need to be sent to multiple recipients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details put in an Excel file, from which we can happily mail merge to generate all, or just the necessary, forms/documents.
However, I would like to automate this process: a form on a website where the complainant can fill in his/her name, address, and other details, which we could use to generate the complaint form automatically and offer it for download (preferably as a PDF).
Now, the only solution that comes to mind is LaTeX, so that I can just replace the needed entities and compile to PDF. However, that bit has to be negotiated with the web host - whether they offer LaTeX or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most web hosting providers to offer?
EDIT: I would prefer a non-.NET, or rather non-Microsoft, solution since the servers are running Linux, and while Mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required, we might have to delve into it.
Generate the PDF using XSL. Check the following: Apoc XSL-FO.
You will need to create an XML file with the required fields and transform that with this tool.
If you wish to avoid .NET, then XSL-FO is worth a look. Try the FOray project.
XSLT can be a steep learning curve if you do not already have experience with it. Also, users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel, then I would stick with generating MS docs on the server. These are now easy to work with from code thanks to OpenXML - check out OfficeOpenXML and OpenXMLDeveloper.
Apache FOP: http://xmlgraphics.apache.org/fop/
I suggest generating RTF on the server: it's easy enough to generate automatically using CPAN's RTF::Writer, it has converters that produce good PDF, it can be edited by hand in Word, OO-Writer, and TextEdit, it doesn't have any really bad compatibility issues between the main editing applications, and it has decent text and resource extraction tools, with text extraction being rather better than PDF's.
There's some support for moving between RTF and LaTeX, although the best RTF -> LaTeX converter, docx2tex, depends on the System.IO.Packaging .NET module, whose Mono implementation isn't yet rock solid.
Postscript - not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the PDF directly from the form data. If you wanted to do fancy things like signed PDFs, that would be the way to go.
Postscript #2 - if you break up the Word document into individual files using Word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on Word .doc files using just standard file utilities and some trivial RTF -> DOC tweaking.