Bash script printing a PDF to a PDF in Linux

The question probably sounds a little odd, but the actual task is relatively simple, I swear!
I'm automatically generating some PDFs from a webform, using PDFCreator to merge a generated FDF into a preexisting PDF. I created the preexisting PDF in NitroPDF. This setup works great - almost. The problem is that when you view the generated PDFs in Adobe Reader 9 (the most common reader) a subset of the fields are just blank. The information is still there; using previous versions of Adobe Reader or a different reader like Foxit Reader shows the entire PDF. No clue what's going on, and Adobe tech support was useless since I didn't create the PDF with Adobe software. (If you'd like to help fix this problem instead of the following, feel free to email me.)
However, if I take the resultant PDF and print it to a fresh PDF using a PDF printer driver, it works great everywhere. This is time-consuming and annoying for our sales department to do themselves, so I want to perform this step automagically upon creating the first PDF.
I'm on Ubuntu and have command-line root access to the server. The program is written in PHP and can easily make system calls. I'm just having trouble figuring out how to tie things together so that I can automatically print a known file, using a specific printer driver, to another known file.

You could try putting your PDF files through Ghostscript. I have found that this is enough to fix many problematic PDFs.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
(The same command can also merge several PDF files into one: just specify multiple input files.)
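If you want to call this from a script rather than by hand, here is a minimal sketch in Python that builds the same argument list and runs it. The helper names build_gs_command and rewrite_pdf are made up for illustration, and it assumes the gs binary is on the PATH. From PHP you would hand the same arguments to your usual system call.

```python
import subprocess
from pathlib import Path


def build_gs_command(inputs, output, gs_binary="gs"):
    """Return the argv list that re-writes PDFs through Ghostscript's pdfwrite device."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return [
        gs_binary,
        "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output}",
        *map(str, inputs),  # several inputs are merged into one output
    ]


def rewrite_pdf(inputs, output):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(build_gs_command(inputs, output), check=True)
    return Path(output)
```

Passing the arguments as a list (instead of one shell string) avoids quoting problems when file names contain spaces.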

Related

How to remove PDF/A markers without using Adobe Acrobat in GNU/Linux

I am trying to remove the PDF/A markers from a file (I have no access to Adobe Acrobat), as some tools balk at PDF/A files. Is there a way to revert a PDF/A to a normal PDF with free software tools? I am running Debian testing.
The indicators for PDF/A are in the Metadata entry, but you do not want to erase that entire entry. Instead you would want to modify it.
To modify, you can extract the XML string, modify using whatever XML tool is handy for you, and then "update".
These three entries are the ones you want to erase.
pdfaid:part
pdfaid:amd
pdfaid:conformance
Of course this still leaves you with the following tasks, with 1 and 3 normally done using a PDF SDK library.
1. Find and extract the Metadata entry (it could be compressed in the PDF)
2. Reading and editing the XML (should be trivial)
3. Updating the Metadata entry with your modified XML
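The XML-editing task in the list above can be sketched with Python's standard library alone. This assumes you already have the XMP packet as a string (extracting it from the PDF and writing it back still need a PDF library), and that the pdfaid prefix is bound to the usual PDF/A identification namespace. The function name strip_pdfa_markers is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the pdfaid prefix (PDF/A identification schema).
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
PDFA_MARKERS = ("part", "amd", "conformance")


def strip_pdfa_markers(xmp_xml):
    """Return the XMP XML with pdfaid:part, pdfaid:amd and pdfaid:conformance removed."""
    root = ET.fromstring(xmp_xml)
    qualified = {f"{{{PDFAID_NS}}}{name}" for name in PDFA_MARKERS}
    # Snapshot the element list first so removals do not disturb the traversal.
    for element in list(root.iter()):
        # Markers written as attributes, e.g. <rdf:Description pdfaid:part="1">
        for key in [k for k in element.attrib if k in qualified]:
            del element.attrib[key]
        # Markers written as child elements, e.g. <pdfaid:part>1</pdfaid:part>
        for child in [c for c in element if c.tag in qualified]:
            element.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree rewrites namespace prefixes on output and does not keep the xpacket processing instructions that wrap an XMP packet, so treat this as a starting point, not a drop-in tool.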
Since you gave no indications of platform+OS I can't advise any further.

Update linked excel path in PowerPoint via Python

I want to automate creating a PowerPoint presentation by linking template charts to some Excel files, so that updating the Excel values changes the slides automatically. I have created my PowerPoint template and linked its charts to sample Excel data.
I want to send the folder with the PowerPoint and Excel files to someone else, but this will break the links to the Excel files because the paths are absolute, not relative. I can edit the paths manually via the "Edit Links to Files" option under the File menu, but this is tedious because there are numerous charts spread over multiple files.
I want to update the paths via Python code using the python-pptx package.
Please help!
There's no API support for this in the current version of python-pptx.
You would need to modify the underlying XML directly, perhaps using python-pptx internals as a starting point and using lxml calls on the appropriate element objects. If you search on "python-pptx workaround function" you will find some examples.
Another thing to consider is modifying the XML by cruder but still possibly effective means by accessing the XML files in the .pptx package directly (the .pptx file is a Zip archive of largely XML files) and using regular expressions or perhaps a command line tool like sed or awk to do simple text substitution.
Either way you're going to need to want it pretty badly, depending on your Python skill level. You'll also of course need to discover just which strings in which parts of the XML are the ones that need changing. opc-diag can be helpful for that, but it's a bit of detective work even with the best tools.
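As a rough sketch of the "cruder" route, the following copies the .pptx Zip package and does a plain text substitution in its relationship parts. It assumes the linked-workbook paths live in the package's .rels files, which you should confirm for your own file (for example with opc-diag). The function name relink_pptx and its parameters are invented for this example.

```python
import zipfile


def relink_pptx(src_path, dst_path, old_prefix, new_prefix):
    """Copy a .pptx, replacing old_prefix with new_prefix in every .rels part.

    Returns the number of parts that were changed.
    """
    old, new = old_prefix.encode("utf-8"), new_prefix.encode("utf-8")
    changed = 0
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only touch relationship parts; all other parts are copied as-is.
            if item.filename.endswith(".rels") and old in data:
                data = data.replace(old, new)
                changed += 1
            # Passing the ZipInfo keeps each member's name and compression type.
            dst.writestr(item, data)
    return changed
```

Paths inside the XML may be URL-encoded (for example %20 for spaces), so the old_prefix you pass has to match what is actually stored, not what the PowerPoint dialog shows.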

Adobe Acrobat/Python PDF Outputs Varying

I've noticed that when I use an OCR to transform a scanned PDF document into text, in this case Adobe Acrobat Pro, I'm getting very different outputs depending on how I extract the data.
In the above photo you can see a piece of a PDF that has been OCR'ed into fairly good-quality text. If I select it in Adobe and copy it to, say, a Word or .txt document, it pastes over perfectly fine.
However, if I export it to Rich Text Format using Adobe, or extract it with Python's PDFMiner or Apache Tika for Python, I get the result in the above photo, which, as you can see, is completely jumbled. The extraction results are very consistent between the approaches: all three jumble it in exactly the same way.
Would any of you have any idea as to why an OCR'd PDF can be copied just fine to a text editor but is extracting in such a bizarre way?
Thank you!
Regards,
Mano
What ended up working for me was running the initial parsing with Apache Tika and then passing the few files it failed on through PyPDF2. My theory is that PyPDF2 uses a different parsing mechanism that, unlike Tika's, doesn't rely on the root of the PDF, and the root is what seems to have become corrupted in a few of these OCR'd docs.
Not sure of the initial cause but that was my solution.
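For anyone wanting to reproduce the try-Tika-then-PyPDF2 flow, here is a minimal sketch of the fallback logic. How to decide that a result "didn't work" is an assumption here (by default: the text is empty, or a parser raised), so you will probably want to pass your own is_usable check for jumbled output. The helper names are hypothetical; the Tika and PyPDF2 wrappers need the tika and PyPDF2 packages installed.

```python
def extract_with_fallback(path, extractors, is_usable=None):
    """Try each (name, extractor) in order; return (name, text) of the first usable result."""
    if is_usable is None:
        # Assumed default: any non-blank text counts as a success.
        is_usable = lambda text: bool(text.strip())
    for name, extractor in extractors:
        try:
            text = extractor(path) or ""
        except Exception:
            # A parser that chokes on a damaged file should not stop the chain.
            continue
        if is_usable(text):
            return name, text
    return None, ""


def tika_text(path):
    from tika import parser  # third-party: tika-python (needs Java)
    return parser.from_file(path).get("content")


def pypdf2_text(path):
    from PyPDF2 import PdfReader  # third-party: PyPDF2 2.x or later
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
```

Usage would be something like extract_with_fallback("scan.pdf", [("tika", tika_text), ("pypdf2", pypdf2_text)]), where "scan.pdf" is a placeholder file name.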

Checking if PDF is searchable

I wrote a bash script that extracts plain text from scanned PDF files. I've got lots of PDFs; some are scanned and others are not. My main goal now is to improve the script by checking whether a PDF is already searchable, so that no OCR extraction is needed.
I've tried:
pdftotext -nopgbrk pdf_file.pdf wordlist
to store possible OCR'ed text in wordlist, so then I can check if it's empty and figure out whether it's a searchable PDF or not.
I've also tried pdffonts pdf_file.pdf to check whether there are fonts in the PDF and therefore whether it contains text.
Both approaches work fairly well but fail in some cases.
For example, some of the PDFs I need to OCR are digitally signed, and those signatures always add a text layer to the PDF. So when I run either of those two commands, it outputs either the signature's text or the font the signature uses, as if it had found plain text, just because of the signing. The file might be a scanned PDF with a digital signature, but it will be detected as a plain-text PDF.
Digital signings always add text this way (using Helvetica font):
Signed by: Name
Date: Date CEST
Company: Company Name
So with:
pdftotext -nopgbrk pdf_file.pdf - | grep -v -E 'Signed|Date|Company'
I can manage to remove those lines so if it's really a scanned PDF, the output will be empty.
It worked for some PDFs until I noticed a signature with a different format, so I feel this is a workaround rather than a real solution.
Is there any way to check whether a PDF is fully searchable? I just need a way to extract a PDF's text while omitting digital signatures. The grep -v approach will always depend on our digital signature's format, and if that changes it will break my script.
Thanks.
Unfortunately, there really isn't an easy, "non-hacky" way to do this without a significantly more involved analysis of the file, which would be far beyond the scope and scale of a bash script.
When pdftotext outputs the text for the digital signature, that text is not coming from the digital signature itself. That is stored as an object in the PDF with metadata that pdftotext ignores. Instead, what pdftotext picks up is just that: text which has also been added to the file.
Here's an example from Adobe's sample signed PDF document. First, the digital signature's metadata:
And here is the text which is inserted into the document:
Technically, you can have one without the other, and there is no established format for the text that generally accompanies a digital signature. Therefore, you're stuck either:
Ignoring specific text with grep, as you are doing now, which can be unreliable.
Running OCR on all files and then checking if there is a difference in the text before/after OCR, but then this defeats the whole purpose of checking in the first place.
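If you stay with the first option, the filtering can at least be made a bit stricter and easier to adjust than a bare grep. The sketch below is the same heuristic, not a robust detector: the line pattern is taken from the signature text quoted in the question, the min_chars threshold is an arbitrary assumption, and the function names are made up. It calls pdftotext with "-" as the output file so the text goes to stdout.

```python
import re
import subprocess

# Assumed stamp format, based on the lines quoted in the question.
SIGNATURE_LINE = re.compile(r"^\s*(Signed by|Date|Company)\s*:", re.IGNORECASE)


def text_outside_signature(text, signature_line=SIGNATURE_LINE):
    """Drop blank lines and lines that look like part of the signature stamp."""
    kept = [
        line for line in text.splitlines()
        if line.strip() and not signature_line.match(line)
    ]
    return "\n".join(kept)


def looks_searchable(pdf_path, min_chars=20):
    """Heuristic: the PDF has a text layer beyond the signature stamp."""
    result = subprocess.run(
        ["pdftotext", "-nopgbrk", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return len(text_outside_signature(result.stdout)) >= min_chars
```

Anchoring the pattern at the start of the line and requiring the colon means ordinary body text that merely contains the word "Date" is no longer thrown away, but a signature with a different layout will still slip through.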

make swf from fla without ever opening it

Is it possible to change text and images in an FLA file without ever opening it, and then build the SWF from the command line? I want to make a Flash template and save the FLA, then be able to update my text and image names and convert it to SWF. I have one template but tons of different text options and background images. It would be nice to be able to copy the master.fla twenty times, change the source in each copy from the command line, and then convert to SWF, also from the command line.
Any help would be appreciated.
With CS5, you can do half of what you're asking today, by using the XFL file format instead of FLA. Instead of a binary blob, you get an editable XML file and a tree of separate asset files: PNGs, AS3 files, etc. You can then modify the XML or AS3 files programmatically to get your variants.
(A CS5 FLA file is really just a zipped up version of the XFL, but there's no advantage to using that instead of an XFL. In CS4 and previous, FLA was a proprietary binary format.)
The missing piece is an XFL compiler. Adobe currently provides no such thing, and the third party market hasn't yet produced one.
You could use a systems automation tool to drive the Flash Professional environment through the compilation steps. On OS X, for example, either Automator or AppleScript should be able to do what you want. It'll just have more overhead than the command line compiler you were hoping for.
I agree with Jason, there are a lot of alternatives to what you suggest. Keeping content out of the SWF is good practice actually. This is a good way to avoid large files!
Depending on what you're looking to achieve, there are a lot of solutions available. XML is one option, JSON another.
If you're looking to build a template, any of the above would seem appropriate.
It sounds like you're working from the Flash IDE. As Jason suggests, you may want to have a look at another IDE, such as FlashDevelop, FDT or FlashBuilder, as they make coding with AS3 a lot easier.

Resources