Why is the % sign used in PDF structure?

I have a question about PDF structure: why is the % sign used in PDF files?
From what I found, the % sign is used for comments. But if % marks a comment, what about the % sign in %PDF-1.5 and %%EOF?
%PDF-1.5 defines the header of the file, and
%%EOF marks the end of the PDF structure.
Why is the % sign used in %PDF-1.5, and why is it used twice in %%EOF?
If the % sign is used for comments, why is it different for these two markers?
Your help will be appreciated - thank you.

I actually know nothing about PDF structure or the correct use of %, but it seems to exist for the same reason that the shebang #! followed by an executable is required at the start of shell scripts, whether bash, perl, or even python.
More can be read in this Stack Overflow answer on why bash scripts need the #! at the beginning: https://stackoverflow.com/a/8968514/6037755

Why is the % sign used in %PDF-1.5, and why is it used twice in %%EOF?
If the % sign is used for comments, why is it different for these two markers?
You can indeed consider those entries to be comments (after all, they do not contain any PDF objects used for rendering), just comments you are required to put at certain positions in a PDF file.
According to the specification ISO 32000-1:
7.5.2 File Header
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.
and
7.5.5 File Trailer
The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF.
As the use of "shall" here indicates, these are requirements.
And it indeed makes sense that, in all other respects, these markers are comments.
These markers have a special meaning only for identifying the start and the end of a PDF, before a PDF processor starts working with actual PDF objects. As soon as start and end are identified, the markers have to be ignored, so making them comments is an obvious choice.
This holds for unusual processing types, too. If, for example, the cross references of a PDF are broken and some program tries to re-create them by searching for indirect PDF objects, it does not need to treat these markers specially; it automatically ignores them as comments.
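To make this concrete, here is a minimal sketch (my illustration, not part of the original answer) of how a tool might locate both markers before doing any real parsing; the file name is hypothetical, and real readers are more lenient (they typically search the last 1024 bytes for %%EOF):

# Minimal sketch: locate the %PDF- header and the %%EOF marker.
def check_pdf_markers(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    has_header = data.startswith(b"%PDF-")   # e.g. %PDF-1.5
    has_eof = b"%%EOF" in data[-1024:]       # end-of-file marker near the end
    return has_header and has_eof

print(check_pdf_markers("sample.pdf"))  # hypothetical file name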
PS: According to Adobe's Implementation Notes in Annex H of their PDF Reference, their tools also accept an alternative header:
3.4.1, “File Header”
[...]
14.Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
If you want to find out why the marker comment contents were chosen exactly as they are, therefore, you should look into the history of PDF and PostScript.

Related

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template DXF file and I'd like to replace placeholder text with the user input. My problem is that the DXF file format is very unreadable in its text form. Is there any way to make sense of the numeric data? If not, are there any other formats (SVG, etc.) that would be easier to work with?
EDIT: The reason I've found it unreadable as text is that the program (SolidWorks) converted the text to curves. At this point I'm trying to figure out how to prevent that.
Autodesk was nice enough to document DXF syntax in great detail. Spend a couple of hours understanding the documentation at the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be as simple as reading the DXF file into a string (a DXF file is no different from a txt file), performing a text replacement, and saving it back to file; a sketch follows after the link. Just make sure that your placeholder text is very unique and is not contained in any of the keywords in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
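The question asks about node.js, but the replace-and-save idea is language-agnostic; purely as an illustration, here is a minimal sketch in Python with hypothetical file and placeholder names:

# Minimal sketch: swap a unique placeholder for user text in a DXF template.
# DXF is plain text, so a straight string replacement is enough as long as
# the placeholder never collides with a DXF keyword.
def fill_label(template_path: str, output_path: str, user_text: str) -> None:
    with open(template_path, "r") as f:
        content = f.read()
    content = content.replace("PlaceHolderText", user_text)
    with open(output_path, "w") as f:
        f.write(content)

fill_label("template.dxf", "label.dxf", "Serial 00042")  # hypothetical names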
Edit: More Info
I do a lot of work with Autodesk Inventor, which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem: we needed to place text onto sheet-metal flat-pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!).
One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated, he wrote some code that parsed the DXF file and replaced the geometry with a text entity. More specifically, we used a triangle with sides whose lengths were defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including rotation.
This process is automatic, so once you write the code with the help of the document above (which won't take that long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

Checking if PDF is searchable

I wrote a bash script that extracts plain text from scanned PDF files. I've got lots of PDFs, but some are scanned and some are not. So now my main goal is to improve my script by checking whether a PDF is already searchable, so that no OCR extraction is needed.
I've tried:
pdftotext -nopgbrk pdf_file.pdf wordlist
to store any extractable text in wordlist, so I can then check whether it's empty and figure out whether it's a searchable PDF or not.
I've also tried pdffonts pdf_file.pdf to check whether there are fonts in that PDF, and therefore whether there is text in it or not.
Both ways work pretty well but fail in some cases.
For example, some of the PDFs I need to OCR are digitally signed, and those signatures always add a text layer to the PDF. So when I run either of those two commands, the output contains either the signature's text or the font it uses. It looks as if plain text had been found just because of the signing: the file might be a scanned PDF with a digital signature, but it will be detected as a text PDF.
These signatures always add text in this form (using the Helvetica font):
Signed by: Name
Date: Date CEST
Company: Company Name
So with:
pdftotext -nopgbrk pdf_file.pdf - | grep -v -E 'Signed|Date|Company'
I can manage to remove those lines, so if it's really a scanned PDF the output will be empty.
It worked for some PDFs until I noticed a signature in a different format, so I feel this is pretty much a work-around and not a great solution.
Is there any way to check whether a PDF is fully searchable? I just need a way to extract a PDF's text while omitting digital signatures. Also, grep -v will always depend on the digital signature's format; if that changes, it will break my script.
Thanks.
Unfortunately, there really isn't an easy way to do this in a non-hacky way without significantly more involved analysis of the file, which would be far beyond the scope and scale of a bash script.
When pdftotext outputs the text of the digital signature, that text is not coming from the digital signature itself. The signature is stored as an object in the PDF, with metadata that pdftotext ignores. What pdftotext picks up is exactly that: text that has also been added to the file.
In Adobe's sample signed PDF document, for example, the digital signature's metadata is stored in its own object, while the "Signed by" lines are ordinary text inserted into the page content.
Technically, you can have one without the other, and there is no established format for the text that generally accompanies a digital signature. Therefore, you're stuck either:
ignoring specific text with grep, as you are doing now, which can be unreliable (a sketch of this option follows below), or
running OCR on all files and then checking whether the text differs before and after OCR, which defeats the whole purpose of checking in the first place.
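For what it's worth, here is a minimal sketch of the first option, assuming pdftotext is installed; the signature keywords are the hypothetical ones from the question and remain exactly as unreliable as noted above:

# Minimal sketch: extract text, strip (hypothetical) signature lines,
# and treat the file as scanned if nothing is left.
import re
import subprocess

SIGNATURE_LINE = re.compile(r"^(Signed by|Date|Company)\b")

def looks_searchable(pdf_path: str) -> bool:
    result = subprocess.run(
        ["pdftotext", "-nopgbrk", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return any(
        line.strip() and not SIGNATURE_LINE.match(line)
        for line in result.stdout.splitlines()
    )

print(looks_searchable("pdf_file.pdf"))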

How does PDF securing work?

I'm curious how PDF securing works. I can lock a PDF file so that a system can't recognize the text or manipulate the file. Everything I found was about how to lock/unlock, but nothing about how it actually works. Is there anyone who could explain it to me? Thanks.
The OP clarified in a comment
I mean a lock on text recognition or manipulation of the PDF file. There should be nothing about cryptography, imho, just some trick.
There are some options, among them:
You can render the text as a bitmap and include that bitmap in the PDF
-> no text information.
Or you can embed the font in question using a non-standard encoding without using standard glyph names
-> text information in an unknown encoding.
E.g. cf. the PDF analysed in this answer.
A special case: make the encoding wrong for only a few characters, maybe just one, probably a digit. This way an unalert person thinks everything was extracted ok, and only when the data is used do the errors start screwing things up, something which especially in the case of wrong digits is hard to fix. E.g. cf. the PDF analysed in this answer.
Or you can put text in structures where text extraction software or copy&paste routines usually don't look, like creating a large pattern tile containing the text for some text area and filling the area with the matching pattern color.
-> text information present but not seen by most extractors.
E.g. cf. this answer; the technique here is used to make the text of a watermark non-extractable.
Or you can put extra text all over the page but make it invisible, e.g. under images, drawn in rendering mode 3 (invisible), or located in some disabled optional content group (layer). Text extractors often do not check whether the text they extract actually is visible; see the sketch after this list.
-> text information present but polluted by garbage text bits.
...
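To make the rendering-mode trick concrete, here is a minimal sketch using the reportlab package (my choice of library, not something from the original answer): it draws one visible line and one line in text rendering mode 3, which many extractors will happily pick up even though no reader ever displays it.

# Minimal sketch: visible text plus invisible text (render mode 3).
from reportlab.pdfgen import canvas

c = canvas.Canvas("tricky.pdf")  # hypothetical output name
text = c.beginText(72, 720)
text.textLine("This line is visible.")
text.setTextRenderMode(3)  # 3 = neither fill nor stroke, i.e. invisible
text.textLine("This line is invisible but still extractable.")
c.drawText(text)
c.save()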

restructuredText inline literal that can wrap to the next line (for PDF output)

I am using restructuredText to create a report which includes some log file output.
What I have is a number of sections with numbered lists of literals.
This looks like this:
#. ``some log file output``
#. ``more output``
Now the problem with this is that when I convert this to a PDF using rst2pdf, the literals can sometimes be quite long and flow off the page.
What I would love is a way to mark a section of text as a code literal that can flow onto the next line just like regular text.
I want this because if I don't mark the log file output as a literal, there is sometimes some crud within the log output which rst interprets as inline markup or other rst-related commands.
Any other suggestions as to how this can be best done?
I know that I could ensure that the source rst file only has lines of a certain width, but this would make the source file look horrible and unwieldy to edit.
I have tried the following two things, neither of which helps:
I found an rst2pdf option:
--fit-literal-mode=MODE
What to do when a literal is too wide.
One of error,overflow,shrink,truncate.
Default="shrink"
After some research, I found mention of a wrapping option for literals.
I got rst2pdf to dump its default stylesheet using rst2pdf --print-stylesheet, which I then saved and modified so that the wordWrap option under literal was changed to CJK; a snippet of that edit follows below.
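For reference, the relevant fragment of the saved stylesheet would look roughly like this (a sketch of the rst2pdf JSON stylesheet format, all other entries omitted), passed back in with something like rst2pdf report.txt -s my.style -o report.pdf:

{
  "styles": [
    ["literal", {
      "wordWrap": "CJK"
    }]
  ]
}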

Linux pdftotext returns a blank text file

I've used a Linux command to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files, but for a small number of them it returns a blank text file.
The unsuccessful PDF files were not encrypted, not protected by a user password, and not read-only.
Converting PDFs to text is not a well-defined process. It can work great or not at all, depending on the PDF input.
Why is this? Because a PDF's task is mainly to represent the optics of a document, not its textual contents. A PDF can be anything from pure text with positional information up to pure graphics of the glyphs of the letters of the text. In the latter case one would need to run OCR on the input in order to obtain text information; this is not done by tools like pdftotext.
Sometimes the text in the PDF is scattered throughout the file, e.g. because first all standard-font letters are mentioned in the PDF and then, later in the file, all the italics-font letters (with positional information, of course, so a reader of the optical representation won't notice this, even if standard and italics are mixed throughout the text on the page). Rearranging this mess into fluent text is a major task that not very many converters are capable of.
So I guess all you can do is try some more PDF-to-text converters (some are better than others, and some are better only for specific input) or see whether you can get the text from another source than the PDF files; a sketch of the first route follows below.
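As a minimal illustration of the try-more-converters route (my sketch, not part of the original answer), assuming pdftotext and the pdfminer.six package are installed and using a hypothetical file name:

# Minimal sketch: try pdftotext first, fall back to pdfminer.six when
# the output is blank.
import subprocess
from pdfminer.high_level import extract_text

def pdf_to_text(pdf_path: str) -> str:
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True,
    )
    if result.stdout.strip():
        return result.stdout
    # pdftotext found nothing; a different converter may still find text.
    return extract_text(pdf_path)

text = pdf_to_text("input.pdf")
if not text.strip():
    print("No extractable text; this PDF probably needs OCR.")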
