Automate text extraction using python - python-3.x

I have a bunch of documents in Excel, PDF, and DOCX formats, and they all have different shapes and layouts. I want to automate writing these documents to a database.
So far I have been reading them into pandas and processing them manually. The problem is that even the Excel files have different shapes and topics, such as balance sheets and income statements, so the dataframes are heterogeneous. The PDFs can be bank statements, application forms, invoices, etc.
What would be the best way to go about this using Python?

Since the document types vary, you will want a different way of processing each type of document.
Excel documents: You can read the Excel sheets into a pandas DataFrame and then write the records to the database with simple database queries. This link should be helpful for this purpose.
PDF documents: There are quite a few utilities for extracting text from PDF documents. PyPDF and pdfminer are two libraries that should help you extract the text.
Image documents: You can use the pytesseract library to extract text from images.
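To make the per-type idea concrete, here is a minimal sketch that routes each file to an extractor by its extension and stores the resulting text in SQLite. The handler names, the `documents` table, and the choice of SQLite are my own assumptions for illustration, not something from the question. The third-party imports (pandas, pdfminer.six, Pillow, pytesseract) are deferred into the handlers, so the routing itself needs only the standard library.

```python
import sqlite3
from pathlib import Path


def extract_excel(path):
    # Assumes pandas plus an Excel engine such as openpyxl is installed.
    import pandas as pd
    # sheet_name=None loads every sheet as {sheet name: DataFrame}.
    sheets = pd.read_excel(path, sheet_name=None)
    return "\n".join(df.to_csv(index=False) for df in sheets.values())


def extract_pdf(path):
    # Assumes the pdfminer.six package is installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def extract_image(path):
    # Assumes Pillow, pytesseract and a local Tesseract binary are installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))


# Hypothetical routing table: file suffix -> extractor function.
HANDLERS = {
    ".xlsx": extract_excel,
    ".xls": extract_excel,
    ".pdf": extract_pdf,
    ".png": extract_image,
    ".jpg": extract_image,
}


def pick_handler(path):
    """Return the extractor registered for this file's suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return HANDLERS[suffix]


def store_text(conn, path, text):
    """Insert one extracted document into a simple 'documents' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(filename TEXT, kind TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)",
        (Path(path).name, Path(path).suffix.lower(), text),
    )
    conn.commit()
```

For each file you would call `pick_handler(path)(path)` and pass the result to `store_text`. This only gets raw text into the database; turning heterogeneous balance sheets or invoices into structured columns would still need per-layout parsing on top.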
I hope it helps.

Related

How to extract unstructured content from a pdf using python?

I was hoping someone could point me to tool/s that allow content extraction from unstructured pdfs like a slide deck.
Unlike a document where we have the usual/expected structuring and delimiters, I need to extract content from slide pdfs where I could have text boxes, graphs, charts, etc.
Also, if you know of a tool that can translate plot images to time series data, please let me know.
Thanks in advance!
I just started working on this and wasn't able to find too much information on the web. I tried tika, PyPDF2, and a few more but they all seem to be linear and more suited for traditional text documents.

How to automate data extraction from many PDFs into Excel without Power Automate?

I have a set of PDF files that contain multiple tables but are all in the exact same format. I have tested data extraction on one file myself and have found the table of interest, and although the data extraction itself is messy and full of NAs, it's good enough to be salvaged with some cleaning.
My question then is: how do I automate data extraction from these PDF files into a single table? I have tried some Python PDF extraction libraries, but the built-in Excel tool seems to do the best job. Will this require VBA? I want this program to run on work computers and be run by other people.
Thanks.

Is it possible to import Excel file into OBIEE or OAS and use it with other subject areas?

It is known that you could upload an Excel file in Visual Analyzer as a dataset and use that Excel file in Analysis as a separate Subject Area.
However, there was no way (or at least we couldn't find one) to make any connection between this Excel dataset and other subject areas, for example joining the Excel file's date column to OBIEE's Caledar.Day column.
With the new OAS, is there any update on this? Can we somehow create relationships between user-defined datasets and subject areas from the rpd? Or is this feature not implemented?
Once you're on OAS you can create data sets which mash up any data source you want. Excel uploaded as a data set can be combined with other uploaded data sets, data sets created by data flows as well as Subject Areas. You have full freedom.
I believe the only way to create a relationship between data sources is to import them into the Analytics repository.
If you can import the Excel file as a data source into the repository, you may be able to relate it to other data sources. Here are some links:
https://datacadamia.com/dat/obiee/obis/obiee_excel_importation
https://www.ascentt.com/importing-excel-file-into-obiee-11g/
I hope these help.
Hakan

RTF doc to Excel

I have several word docs (RTF) that have raw data in them that I need to get into Excel. Other than doing a ‘save as’ to text and importing each and every one of them, is there an easier way to import all the data into Excel?
These RTF docs are system generated and contain headers. There are no tables in the docs, but it appears to be set up in column and row format. I have seen some code examples, but I am unable to manipulate the code to get it to work for me.

Reading custom data format into Excel VBA

I have a file of extension *.qub that I would like to read into Excel. The file contains spacecraft instrument data. The file has two parts, a SFDU-defined text header followed by binary data.
Ideally, we would like the user to be able to access the files using Excel's built-in File-->Open and File-->Save functionality to import/export this *.qub format.
What is the best way to create a custom file reader/writer in Excel?
Thank you!
