How to extract unstructured content from a pdf using python? - nlp

I was hoping someone could point me to tools that allow content extraction from unstructured PDFs such as slide decks.
Unlike a document with the usual, expected structure and delimiters, I need to extract content from slide PDFs that may contain text boxes, graphs, charts, etc.
Also, if you know of a tool that can translate plot images into time series data, please let me know.
Thanks in advance!
I just started working on this and wasn't able to find much information on the web. I tried tika, PyPDF2, and a few more, but they all seem to be linear and better suited to traditional text documents.

Related

Is there a way to extract text with comments from PDF?

There are lots of ways to extract text from pdfs, and to extract comments from pdfs. But are there any ways to extract the text+comments together from pdf files? So that the comment associated with each segment of text is clear.
So far, I have been able to do this using google docs: Export Google Docs comments into Google Sheets, along with highlighted text?
but not using PDFs. Converting the PDF to a docx messes up the formatting very badly, so it doesn't seem to be a viable option.

Extracting specific data from a pdf into excel

I need to extract selected data from a PDF form into Excel. Eventually, the gathered data will be used in another step (an Excel table) as part of an additional calculation.
I am hoping to automate this process, so I tried importing the PDF file into Excel using Power Query. Unfortunately, each time I load the PDF, I get the message "Page is blank".
After some initial searching, I found out that this may be due to the way the PDF file was originally built (it was not a table converted to a PDF).
I went back and converted the PDF file into a spreadsheet, and now I can see the data I need to extract in Excel, but it needs a lot of cell formatting and rearranging.
I would really like to know if there is an alternative way of solving this problem. More importantly, I'd very much appreciate any ideas or recommendations on how best to tackle this task, since I have to repeat the same process 30+ times.
Also, I have very little coding experience or knowledge.
Thank you so much

How to retrieve shapes (triangle, square, lines, textbox) from a Word document in Python using the docx or docx2python library?

I am working with Word documents using the python docx library. I am able to extract all the paragraphs, tables, and images from a Word document. Along with these, I also want to detect non-image shapes, but I cannot figure out how to extract them. Can somebody help me out?
I am already going through multiple StackOverflow questions but have not been able to find a good way to do this.

Automate text extraction using python

I have a bunch of documents in Excel, PDF, and docx formats, and they all have different shapes/layouts. I want to automate writing these documents to a database.
So far I have been reading them into pandas and processing them manually. The problem is that even the Excel files have different shapes and topics, such as balance sheets and income statements, which produce heterogeneous dataframes. The PDFs can be bank statements, application forms, invoices, etc.
What would be the best way to go about this using python?
Since the document types vary, you will want a different way of processing each type of document.
Excel Documents: You can read the Excel sheets into a pandas data frame and then dump the records into the database with simple database queries. This link should be helpful for this purpose.
PDF Documents: There are quite a few utilities for extracting text from PDF documents. PyPDF and pdfminer are two libraries that should help you extract the text.
Image Documents: You can use the pytesseract library to extract text from images.
I hope it helps.
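To make the per-type approach above more concrete, here is a minimal sketch of a dispatcher that picks a reader based on the file extension. The function names (`read_excel`, `read_pdf`, `read_image`, `pick_handler`) and the extension table are made up for illustration, and the sketch assumes that pandas, pdfminer.six, Pillow, and pytesseract (with the Tesseract binary) are installed. The third-party imports are done inside each reader so that a missing library only matters for the file type that needs it.

```python
from pathlib import Path


def read_excel(path):
    """Return a dict mapping sheet name to DataFrame (assumes pandas is installed)."""
    import pandas as pd
    # sheet_name=None loads every sheet instead of only the first one
    return pd.read_excel(path, sheet_name=None)


def read_pdf(path):
    """Return the text of a PDF as one string (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text
    return extract_text(path)


def read_image(path):
    """Return OCR text from an image (assumes Pillow, pytesseract, and Tesseract)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))


# Hypothetical mapping from file extension to reader; extend as needed.
HANDLERS = {
    ".xlsx": read_excel,
    ".xls": read_excel,
    ".pdf": read_pdf,
    ".png": read_image,
    ".jpg": read_image,
}


def pick_handler(path):
    """Choose a reader from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"No handler registered for {suffix!r}") from None
```

Calling `pick_handler(path)(path)` then returns the raw content for one file. Turning that content into database rows still depends on the document kind (balance sheet, invoice, bank statement, and so on), so that step is deliberately left out of the sketch.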

Program or method to standardize figure format and create multiple, similar figures

I couldn't find documentation to help me with this problem - [EDIT: ...and the only recommendations I've gotten from colleagues are the ones I mention below].
I'm developing 23 figures that must each include a PNG image of a state and two charts with accompanying state data. Each chart has a slightly different scale, though the units will be the same (I mention this because it makes formatting a headache). The only way I know to make these figures is to create a slide in PowerPoint, insert the charts and PNG, and format each slide "by hand" to match all the others. This is time-consuming and aggravating, and I'm optimistic there's a better way.
Question: What program/methods do you recommend for creating a standard figure format?
Ideally I would like to copy-and-paste charts from Excel, and insert a PNG image.
You could look into R to do this. I'm not sure what the PNG image is, but you can also use R to plot data on maps (using heatmaps or otherwise) if you want to. You can import an image, so you can also work with your PNG image if needed.
As for the charts, you could then create them in R instead of Excel and put everything together.
This way you can write a piece of code once and generate all the figures you need.
