Connect textboxes in PDF automatically - linux

I have a document set in a Fraktur font and ran OCR on it with Tesseract (language deu-frak). It took about 10 days (running 24 hours a day) to convert the 23 issues, each about 400 pages.
The result is a searchable PDF with the original image embedded and the invisible text on top.
I then removed the image with Master PDF Editor and changed the text type from "invisible" to "Full text". It turned out that Tesseract did not recognize some words as words, so each of their letters is positioned separately.
Notice that "kommen" was recognized as a word, but "fruchtbaren" only as a sequence of characters. This makes it impossible to find "fruchtbaren" with text search, and when the font size is changed the letters overlap or leave ugly gaps.
I'm using Linux and am looking for a command-line tool so that I can script the processing of all 23 PDF documents.
Is it possible to connect text boxes that are closer than a minimum distance? Even connecting everything on one line would be great.
Thanks.

Probably not what you want to hear, but I'd go back and experiment with pre-processing, Tesseract parameters, etc. on a small representative sample until the initial OCR is as good as possible (including word segmentation), and then re-run the OCR with the new settings. If you still find that you need some kind of post-processing, I'd again build and refine the entire pipeline on a small sample before running it on your full dataset.
On the surface, it looks like something Tesseract could do a better job at, provided you're giving it clean images with enough scan resolution.
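If post-processing still turns out to be necessary, the merging rule the question asks for (join text boxes that sit closer than a minimum distance) can be sketched independently of any PDF library. This is only a minimal illustration, assuming the boxes of one line have already been extracted as left edge, right edge and text; the `Box` type, `merge_boxes` and the `max_gap` threshold are hypothetical names, not part of Tesseract or any PDF tool, and the coordinates are made up.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One text box on a line: left edge, right edge, text (hypothetical type)."""
    x0: float
    x1: float
    text: str


def merge_boxes(boxes: List[Box], max_gap: float) -> List[Box]:
    """Join neighbouring boxes whose horizontal gap is at most max_gap."""
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: b.x0):
        if merged and box.x0 - merged[-1].x1 <= max_gap:
            prev = merged[-1]
            merged[-1] = Box(prev.x0, max(prev.x1, box.x1), prev.text + box.text)
        else:
            merged.append(box)
    return merged


# Illustrative coordinates only: "kommen" is one box, "fruchtbaren" is split into letters.
line = [Box(0, 40, "kommen")] + [
    Box(50 + 6 * i, 55 + 6 * i, ch) for i, ch in enumerate("fruchtbaren")
]
words = [b.text for b in merge_boxes(line, max_gap=2.0)]
```

The threshold has to be smaller than a normal word space and larger than the letter spacing, so it would need tuning per font size on a small sample first.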


How to identify if text encoding issue is my processing error or carried from the source pdf

I have a selection of PDFs that I want to text mine. I use Tika to parse the text out of each PDF and save it to a .txt file with UTF-8 encoding (I'm using Windows).
Most of the PDFs were OCR'd before I got them, but when I view the extracted text I see "pnÁnn¿¡c" where the PDF shows "Phádraig".
Is it possible for me to verify the text layer of the PDF (forgive me if that's the incorrect term), ideally without needing the full version of Acrobat?
It sounds like you are dealing with scanned books with "hidden OCR", i.e. the PDF shows an image of the original document, behind which there is a layer of OCRed text.
That allows you to use the search function and to copy-paste text out of the document.
When you highlight the text, the hidden characters become visible (though this behaviour may depend on the viewer you use).
To be sure, you can copy-paste the highlighted text to a text editor.
This will allow you to tell if you are actually dealing with OCR quality this terrible, or if your extraction process caused mojibake.
Since OCR quality heavily depends on language resources (dictionaries, language model), I wouldn't be surprised if the output was actually that bad for a low-resource language like Gaelic (Old Irish?).
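One way to tell the two cases apart in code is to check whether the garbled string can be explained as UTF-8 bytes that were decoded with the wrong codec. The sketch below assumes Windows-1252 as the wrong codec (a guess based on the asker being on Windows); `undo_mojibake` is a hypothetical helper name.

```python
from typing import Optional


def undo_mojibake(text: str, wrong_codec: str = "cp1252") -> Optional[str]:
    """Try to reverse UTF-8 bytes that were mistakenly decoded with wrong_codec.

    Returns the repaired string, or None if the text cannot be explained that way.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# What a UTF-8 "Phádraig" looks like after being read as Windows-1252.
garbled = "Phádraig".encode("utf-8").decode("cp1252")
repaired = undo_mojibake(garbled)

# The string from the question does not round-trip through this codec pair.
from_question = undo_mojibake("pnÁnn¿¡c")
```

Here `garbled` still resembles the original word and can be repaired, whereas the string from the question cannot, which is more consistent with poor OCR than with this particular kind of encoding mix-up. Pure ASCII text passes through unchanged, so the check only says something about strings containing non-ASCII characters.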

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template DXF file and I'd like to replace placeholder text with the user input. My problem is that the DXF file format is very unreadable in its text form. Is there any way to make sense of the numeric data? If not, are there any other formats (SVG, etc.) that would be easier to work with?
EDIT: The reason I've found it unreadable in terms of text is that the program (SolidWorks) converted the text to curves. At this point I'm trying to figure out how to prevent that.
Autodesk was nice enough to document the DXF syntax in great detail. Spend a couple of hours understanding the documentation at the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be as simple as reading the DXF file into a string (a DXF file is no different from a txt file), performing a text replace operation and saving it back to a file. Just make sure that your placeholder text is unique and does not occur in any of the keywords in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
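The question asks about node.js, but the replace step is the same in any language; here is a minimal sketch in Python. It assumes the template contains a real TEXT entity (not text converted to curves) whose value is the placeholder. The function name `fill_placeholder` and the DXF fragment are illustrative only, not taken from a real SolidWorks export.

```python
def fill_placeholder(dxf_text: str, placeholder: str, value: str) -> str:
    """Replace a placeholder in DXF text, refusing ambiguous or unsafe input."""
    if "\n" in value or "\r" in value:
        # DXF is a sequence of code/value line pairs; a line break would shift them.
        raise ValueError("value must be a single line")
    count = dxf_text.count(placeholder)
    if count != 1:
        raise ValueError(f"expected exactly one {placeholder!r}, found {count}")
    return dxf_text.replace(placeholder, value)


# Hand-written fragment of a TEXT entity; group code 1 carries the text value.
sample = "\n".join([
    "0", "TEXT",
    "8", "0",
    "10", "0.0",
    "20", "0.0",
    "40", "2.5",
    "1", "PlaceHolderText",
])
result = fill_placeholder(sample, "PlaceHolderText", "Hello")
```

Checking that the placeholder occurs exactly once is a cheap guard against the corruption risk mentioned above; the file would be read into a string before the call and written back afterwards.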
Edit: More Info
I do a lot of work with Autodesk Inventor, which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem of needing to place text onto sheet metal flat pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated, he wrote some code that parsed the DXF file and replaced the geometry with a text entity. More specifically, we used a triangle with each side length defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including rotation. This process would be automatic, so once you write the code with the help of the document above (which won't take long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

Hidden/Open words in an Image file such as PNG or JPG

As far as I can tell, my question is not related to steganography or to the WinRAR solutions I've seen, where you are essentially hiding messages.
I am trying to figure out whether there is a way to insert a simple message into a file such as a JPG or PNG so that it could later be extracted by a program reading the file, without encoding it into the image itself through slight differences in pixels or similar steganographic tricks.
I basically just want a tag-along message that is part of the file itself, is not shown by the image viewer, but could be seen by a text reader of some kind.
I'm not sure how possible this is because, for the most part, I don't understand the order/layout of a PNG/JPG/etc. file aside from the RGB pixel data. How does it start? How does the image display tool know when to stop displaying, etc.?
The way I'm envisioning it would be something like:
pngStartCode -> RGBinfo --> png end code so image reader knows to stop -> start sequence that some kind of reader will recognize (possibly a new text reader) -> written text wanted to be communicated -> endcodeforreader
I may just be rambling about something ridiculous here but please let me know if this is at least possible.
You can use the following command (Windows command prompt).
Create a text file with your message, say "message.txt".
Now choose a target file (it can be any file, like a.jpg, a.png, a.exe, etc.), say "image.jpg".
Now execute the following command:
copy /b "image.jpg"+"message.txt" "NewImage.jpg"
The above command combines the files in binary mode and creates a new file (in this case NewImage.jpg). If anyone opens the image, they will just see a normal image. To look at the text, open the file with any text editor (Notepad) and scroll down to the end, where you will find the text.
This does not change any pixels or anything else in the image; it just appends the text to the image file.
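The same append-and-read-back idea can be sketched in Python. This is a minimal illustration working on byte strings; the helper names are hypothetical, and reading the message back assumes you know the size of the original image, since nothing in the file marks where the appended text begins.

```python
def append_message(image_bytes: bytes, message: str) -> bytes:
    """Return the image bytes with the UTF-8 message added after them (like copy /b)."""
    return image_bytes + message.encode("utf-8")


def trailing_message(combined: bytes, original_length: int) -> str:
    """Recover the appended text, given the size of the original image in bytes."""
    return combined[original_length:].decode("utf-8")


# Stand-in bytes, not a decodable image; a real file would be read in binary mode.
fake_image = b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9"
combined = append_message(fake_image, "hello from the end of the file")
recovered = trailing_message(combined, len(fake_image))
```

Whether a given viewer or editor tolerates or preserves the trailing bytes depends on the program.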
It sounds like OP is asking about comment tags in the PNG specifications (i.e. adding data but without intent to hide it).
PNG files are broken into "Chunks". The image part is usually divided into several IDAT chunks; the color, size, etc are stored in an IHDR chunk, etc.
The iTXt, tEXt, and zTXt chunks are used for conveying text information associated with the image, so typically you'd look into using a tool to add those types of chunks. tEXt is for just plain text, zTXt is compressed.
More information, including which kinds of chunks are available, can be found in the PNG specification, and chunk viewers are easy to find with a web search.
For convenience, at the present time (January 2021) here are a couple of tools that will let you view, edit, and add chunks:
Windows 10: http://entropymine.com/jason/tweakpng/
Linux: https://www.systutorials.com/docs/linux/man/n-png/
Mac: https://apps.apple.com/us/app/inspectpng/id498851708?mt=12
NOTE: I do not vouch for the safety of any of the above links. Please use standard caution when downloading any file from the internet. If you don't have your own anti-virus, Virustotal has one online you can upload individual files to for free.
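For anyone who would rather script it than install a tool, the chunk layout is simple enough to handle with the Python standard library. The following is a minimal sketch, not a full PNG implementation: each chunk is a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over type plus data, and a tEXt chunk holds a keyword, a NUL byte and the text (both Latin-1). The function names are my own, and the 1x1 test image is built in memory so that no input file is needed.

```python
import struct
import zlib
from typing import Iterator, List, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, data) for every chunk up to and including IEND."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        yield chunk_type, png[pos + 8:pos + 8 + length]
        if chunk_type == b"IEND":
            return
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png: bytes, keyword: str, text: str) -> bytes:
    """Insert a tEXt chunk just before IEND (keyword should be 1 to 79 characters)."""
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, chunk_data in iter_chunks(png):
        if chunk_type == b"IEND":
            out.append(make_chunk(b"tEXt", data))
        out.append(make_chunk(chunk_type, chunk_data))
    return b"".join(out)


def read_text(png: bytes) -> List[Tuple[str, str]]:
    """Return (keyword, text) pairs from all tEXt chunks."""
    pairs = []
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            pairs.append((keyword.decode("latin-1"), text.decode("latin-1")))
    return pairs


# A 1x1 grayscale PNG built in memory: IHDR, one IDAT, IEND.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
tiny = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
        + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
        + make_chunk(b"IEND", b""))
tagged = add_text(tiny, "Comment", "hello from a tEXt chunk")
```

Calling `read_text(tagged)` then returns the keyword and message, while an image viewer ignores the extra chunk. Text outside Latin-1 would need an iTXt chunk instead, which this sketch does not cover.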

Creating a Print Monitor / Print Handler

I'm having trouble getting started with building a Print Monitor / Print Handler for Windows using Visual Studio 2012 Ultimate with WDK 8. Basically, this is what I am trying to accomplish:
Create a print monitor (something an application can print to) that will generate a file with the content that should be printed (like the default XPS printer or a PDF printer), and then invokes the print handler
Create a print handler that will parse the generated file and do certain actions with it (check to see if certain text is present, upload the file online, etc)
I feel like the print handler part should not be too hard, but starting with the print monitor is what I'm stuck at. What would I do within VS12? I see options for "Printer Driver V4", "Printer Driver V4 Property Bag", and "Printer XPS Render Filter". Should I use one of those templates, and, if so, what would I do within them? Anything pointing me in the right direction would be appreciated!
EDIT:
Just some more clarification - I only need the text from the print output, but I've read from various sources that getting text-only output leads to no output at all from sources like Firefox, etc since they print text as glyphs.
I will be using the print handler to parse the text for keywords and then upload that information to a web server in a specific format. The print monitor just needs to capture and save the text information from whatever application is printing.
As you pointed out in your comments, some applications such as Firefox print using glyph indices instead of characters. In fact, quite a few do and it's becoming more common. What you need is a print driver. The good news is Microsoft has already written it for you and provided you with sample source code in the WDK. Start by reviewing this to understand your options. The Unidriver is perhaps a little simpler but the Postscript driver has the advantage of generating output that can readily be transformed to PDF or other formats that retain text information (as opposed to raster page images that lose all text information). As far as I'm concerned, don't even think about XPS; it's just an all around disaster.
To handle glyph indices, what you'll need to do is add code to the driver's OEMTextOut function that uses the font's cmap tables to translate glyph indices back into character codes. I'm unaware of any public domain libraries that parse font files, so you'll likely have to write your own code to do this. (Hint: If you support only OpenType/TrueType fonts, you'll cover 99% of all printing applications).
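The reverse lookup itself is conceptually small, even though the driver code around it is not. A cmap table maps character codes to glyph indices, so translating back means inverting that mapping. The sketch below shows only that logic in Python, not driver code: it assumes the cmap has already been parsed into a dictionary, and the function names and sample table are made up for illustration.

```python
from typing import Dict, Iterable


def invert_cmap(cmap: Dict[int, int]) -> Dict[int, int]:
    """Turn a code point -> glyph index table into glyph index -> code point.

    Several code points may share one glyph; the lowest code point is kept here,
    which is an arbitrary tie-break chosen for this sketch.
    """
    reverse: Dict[int, int] = {}
    for code_point in sorted(cmap):
        reverse.setdefault(cmap[code_point], code_point)
    return reverse


def glyphs_to_text(glyph_indices: Iterable[int], reverse: Dict[int, int]) -> str:
    """Map glyph indices back to characters, using U+FFFD for unmapped glyphs."""
    return "".join(chr(reverse.get(g, 0xFFFD)) for g in glyph_indices)


# Made-up table: real values come from the cmap table of the font being printed.
sample_cmap = {0x48: 43, 0x69: 76, 0x20: 3, 0xA0: 3}
text = glyphs_to_text([43, 76, 3, 999], invert_cmap(sample_cmap))
```

Glyphs that have no cmap entry at all, such as many ligatures, cannot be recovered this way and show up as the replacement character.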
Getting the Microsoft sample code to build, install and run is mostly straightforward, but if you're new to the WDK and installing print drivers, plan on spending a week or more on just that. The glyph index translation part is far more complex and you should plan on spending a lot more time on that.

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
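For reference, the diacritic-stripping step can be done with the Python standard library alone. This is a minimal sketch of one common approach (canonical decomposition, then dropping combining marks), not necessarily the exact script used above; `strip_diacritics` is just an illustrative name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_diacritics("ṣ ś ṝ ǎ")
```

Letters that have no canonical decomposition, such as ø or ł, pass through unchanged and would need an explicit mapping table if they occur.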
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, so it would be worth a look if you had more than a few PDFs to convert.
