Extracting badly formatted text from a PDF

Extracting badly formatted text from a PDF - python-3.x

I'm trying to extract some entries from a PDF, but the bad formatting is making it inconvenient to simply parse through like a normal document. There isn't any consistent positioning for the text, so each entry is a unique scramble with no consistent pattern I can find. I only want the entry name and the info on the right, not the field name or description.
I've tried experimenting with headers and layout info using the PyPDF2 Module but there doesn't seem to be any metadata for the PDF besides basic author info.
My idea was using the Google Cloud Vision API to transcribe the text, but that brings up issues of auto-positioning.
Does anyone know of a better methodology for this, or if not, simply how to execute the positioning for the Cloud Vision API?

Related

Azure Search - Highlights - Locating in image

Just looking for guidance or even a general outline on approach here.
I am using azure search to OCR a batch of pdfs. I have turned on hit highlighting and I am successfully getting results back there that I am looping through / displaying in my view for the end user. I was looking on expanding that functionality to show the pdf images with the highlighting on the images themselves like in the JFK azure example. I am not proficient in react and seem to be getting lost there.
I am assuming I need to save off the OCR images to a data store for reference using the normalized_images that are created? I do have pdfs locally I can load but assume the OCR images maybe different. Have turned on GeneratedNormalizedImagesPerPage and turned on cache which creates files in my storage account.
Then I assume I need to pull the associated image, display it, use the highlight results and pull a corresponding bounding box where the phrase was detected? Problem with that approach is that I do not see any association between the highlight hit and the location (bounding box) of the hit nor the associated image file the hit was on.
Probably way off on approach here but any guidance is appreciated.
Edit 1
I did noticed the items on this page in the JFK example: https://github.com/microsoft/AzureSearch_JFK_Files/tree/master/JfkWebApiSkills/JfkWebApiSkills
Would trying to replicate the ImageStore (so those are stored in my storage account) and then the HocrGenerator (appears to handle points in a doc) into my skillset for my index be the approach?

There are a few steps here:
you need to save the layoutText from the OCR skill somewhere the UI can access it. The JFK Files demo converts it to a HOCR (to display in the UI) and saves it in index as a field in the index so that it is retrieved in the search results. HOCR isn't necessary and you may find it more efficient to store the layout in blobs using a knowlege store object projection.
save the extracted images into blob storage using a file projection into the knowledge store. Keep in mind that the images may be resized in the process and the coordinates will match the resized image saved to the store. If you want to map the coordinates to the original image see this.
At search time, map the highlight to the the metadata. You will find this code in the nodejs frontend, however it may be simpler to follow in the original demo by following the code here. Essentially you just find the first occurrence of the highlighted word in the metadata, display the associated image, and calculate the bounding region of the word.

A Study on the Modification of PDF in nodejs

Project Environment
The environment we are currently developing is using Windows 10. nodejs 10.16.0, express web framework. The actual environment being deployed is the Linux Ubuntu server and the rest is the same.
What technology do you want to implement?
The technology that I want to implement is the information that I entered when I joined the membership. For example, I want to automatically put it in the input text box using my name, age, address, phone number, etc. so that the user only needs to fill in the remaining information in the PDF. (PDF is on some of the webpages.)
If all the information is entered, the PDF is saved and the document is sent to another vendor, which is the end.
Current Problems
We looked at about four days for PDFs, and we tried to create PDFs when we implemented the outline, structure, and code, just like it was on this site at https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF
However, most PDFs seem to be compressed into flatDecode rather than this simple. So I also looked at Data extraction from /Filter /FlateDecode PDF stream in PHP and tried to decompress it using QPDF.
Unzip it for now.Well, I thought it would be easy to find out the difference compared to the PDF without Kim after putting it in the first name.
However, there is too much difference even though only three characters are added... And the PDF structure itself is more difficult and complex to proceed with.
Note : https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (PDF official document in English)
Is there a way to solve the problem now?

It sounds like you want to create a PDF from scratch and possibly extract data from it and you are finding this a more difficult prospect than you first imagined.
Check out my answer here on why PDF creation and reading is non-trivial and why you should reach for a tool you help you do this:
https://stackoverflow.com/a/53357682/1669243

Extract Keywords from Office Documents with Sharepoint Flow

I am trying to implement a document management system using Sharepoint. One major issue is that colleagues cannot find documents in the current setup (local fileserver). They have asked that we have a system that scans uploaded documents and automatically looks for keywords in them and then populates a "Meta" column.
I have had sort of success with OCR on image files, but getting keywords out of office documents (doc, xls etc.) I have had no success until now.
Is there a way to setup a flow to do this task for me?
any help is much aprechiated.
i tried "Get file metadata" and Azure "Text analysis", but it seems to take the raw data of the files (XML I assume) and returns that the document is to large to analyse.

There is something vague about this requirement - how is a keyword defined in a document?
Therefore, first obvious solution would be to assign keywords for each file upon uploading it. You may create a process for this with flow - have tasks, reminders and so on.
Automating this with OCR first means that you need to user OCR that works with MS flow you have only one choice - ElasticOCR. Then, in your flow
- feed the document content to the ElasticOCR action
- keep in mind that OCR is not 100% accurate
- analyze the generated text content according to your keyword definition
- finally write the meta back to the library in the corresponding columns.
Having worked on a similar requirement, we asked uploaders to publish their documents with a short abstract(column from the content type). The assumption is the abstract contains the keywords and is stored in a multi-line column - making it searchable site wide.

Is it possible to add an image to a PDF without rendering the PDF?

I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).
If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?
Specifically, I'm asking because I've previously looked into writing a text extraction library, put in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work to get around server side processing in this case.
To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).

To perform a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the thing you would need to handle include:
basic parsing of PDF syntax
indexing via the cross reference index, and/or cross reference index and object streams
objects (num, byte-string, hex string, dictionary, arrays, booleans...)
filters and variants (LZW, Flate, RunLength, Predictors)
encryption (RC4, AES, Custom security handlers)
page tree traversal
basic handling of page content streams
image handling
serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.

How to generate application forms/documents programmatically?

At the moment, we use MS WORD and MS EXCEL to mail merge documents that needs to be sent to multiple recepients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details put in an excel file, from where we can happily mail merge to generate all or just the necessary forms/documents.
However, I would like to automate this process, like a form in a website where the complainant can fill in his/her name, address and other details, and we could use that to generate the complaint form automatically and offer it to be downloaded (preferrably as a pdf).
Now, the only solution that comes to mind, is Latex, so that I can just replace the needed entities and just compile to PDF. However, that bit has to be negotiated with the webhost, if they are offering Latex or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most webhosting solutions to offer?
EDIT: I would prefer a non .NET or rather non microsoft solution since, the servers are running linux and while mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required we might have to dwelve into it.

Generating PDF using an XSL. Check the following: Apoc XSL-FO
You will need to create an XML file with the required fields and transform that with this tool.

If you wish to avoid .NET then XSL-FO is worth a look. Try the FOray project.
XSLT can be a steep learn if you do not have experience already. Also users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel then I would stick with generating MS docs on the server. These are now easy to work with from code since OpenXML - check out OfficeOpenXML and OpenXMLDeveloper

Apache FOP : http://xmlgraphics.apache.org/fop/

I suggest generating rtf on the server: it's easy enough to automatically generate using cpan's RTF::Writer, has converters generating good pdf, can be edited by hand in word, oo-writer & TextEdit, doesn't have any really bad compatibility issues between the main editing applications, and has decent text & resource extraction tools, with text extraction being rather better than pdf.
There's some support for moving between rtf & latex, although the best rtf -> latex converter, docx2tex, depends on the System.IO.Packaging .net module, whose mono implementation isn't yet rock solid.
Postscript — Not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the pdf directly from the form data. If you wanted to do fancy things like signed pdf, that would be the way to go.
Postscript #2 — If you break up the Word document into individual files using word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on word .doc files using just standard file-utils and some trivial rtf->doc tweaking.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string