Identifying non-grid tables in a Word docx file in Python - python-3.x

This is an image of the docx file. I have some tables in the Word file from which I would like to extract data.
I have tried extracting the text from the file and using regex to split the text into tables:
import docx
document = docx.Document('2.docx')
# Join paragraphs with newlines so that paragraph boundaries survive for later splitting
docText = '\n'.join(paragraph.text for paragraph in document.paragraphs)
print(docText)
I don't really understand how to go about it using regex. Is there any other method? I do want to specify the area and extract its contents.
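One way the regex idea could be sketched, assuming the "non-grid" tables are ordinary paragraphs whose columns are separated by tabs or by runs of two or more spaces (the question does not confirm this). The names COLUMN_SEP and group_into_tables are hypothetical, and the sample lines are made up for illustration:

```python
import re

# Assumption: columns are separated by tabs or by two or more spaces.
COLUMN_SEP = re.compile(r"\t+| {2,}")


def group_into_tables(lines, min_columns=2):
    """Group consecutive multi-column lines into tables (lists of rows)."""
    tables, current = [], []
    for line in lines:
        cells = [c.strip() for c in COLUMN_SEP.split(line.strip()) if c.strip()]
        if len(cells) >= min_columns:
            current.append(cells)
        elif current:
            # A line with fewer columns ends the table being collected.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Made-up paragraph texts standing in for the document's paragraphs.
sample = [
    "Report",
    "Name\tScore",
    "Alice\t4",
    "",
    "Item   Qty",
    "Bolt   10",
]
tables = group_into_tables(sample)
print(tables)
```

With python-docx, the input list could be built as [p.text for p in document.paragraphs]. Tables that were inserted as real Word tables do not show up in document.paragraphs at all; they would be found under document.tables instead.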

Related

Python reading pdf files

How can I use JupyterLab to read and extract tables from PDF files?
I have a typical PDF file with text, titles, subtitles and tables in between. I need code to extract the table under a specific title and to clean out unwanted text such as page numbers.
What code would do that?
Tabula-py: you can parse the tables in a PDF and convert them into CSV, TSV, JSON, or a pandas DataFrame.
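The cleanup part of the question (dropping page numbers, keeping only what sits under one title) could be sketched on already-extracted text lines as below. This is not tabula-py code. It assumes page numbers stand alone on a line as "12", "Page 12" or "Page 12 of 40", and that the wanted block ends at the next blank line. The function names and the sample lines are hypothetical.

```python
import re

# Assumption: a page number is a whole line such as "12", "Page 12" or "Page 12 of 40".
# A table row consisting of a single bare number would also be dropped by this pattern.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def drop_page_numbers(lines):
    """Remove lines that contain nothing but a page number."""
    return [line for line in lines if not PAGE_NUMBER.match(line)]


def lines_under_title(lines, title):
    """Return the lines after `title`, up to the next blank line or end of input."""
    cleaned = drop_page_numbers(lines)
    for index, line in enumerate(cleaned):
        if line.strip() == title:
            break
    else:
        return []  # title not found
    block = []
    for line in cleaned[index + 1:]:
        if not line.strip():
            break
        block.append(line)
    return block


# Made-up text lines standing in for text extracted from one PDF.
sample = [
    "Intro text",
    "Table 1: Sales",
    "Region  Total",
    "North  10",
    "Page 3",
    "South  12",
    "",
    "Footer",
]
rows = lines_under_title(sample, "Table 1: Sales")
print(rows)
```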

How does MS Word VBA detect the end of a paragraph?

I converted an HTML text into a docx document using several different online converters.
Then I analysed the number of paragraphs using an Excel VBA macro which opens the document and examines it. Supplied with an original docx document (i.e. one not converted from another format), this macro always gives the correct number of paragraphs.
Only one converter yielded a docx from which the number of paragraphs could be determined. All the others simply reported a single paragraph with hundreds of words in it.
Somehow the HTML to docx converters are missing something. What is missing? Can I add it in?
Tools / Options / View: show the formatting marks.
Examine the characters that Word uses to delimit paragraphs in the original docx and in the document converted from HTML.
I suspect that the "paragraphs" in the converted document are actually manual line breaks. If so, that would account for the incorrect paragraph count in the converted document.
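The suspicion can also be checked outside Word. A docx file is a zip archive whose main text lives in word/document.xml, where a paragraph is a w:p element and a manual line break is a w:br element. Below is a minimal standard-library sketch in Python rather than VBA. The function name count_breaks is hypothetical, the demo document is built in memory purely for illustration, and the count is not guaranteed to match what Word's own object model reports.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def count_breaks(docx_file):
    """Return (paragraph count, manual line break count) for a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = sum(1 for _ in root.iter(W + "p"))
    # w:br without a type (or with type textWrapping) is a line break;
    # page and column breaks carry a different w:type and are skipped.
    line_breaks = sum(
        1 for node in root.iter(W + "br")
        if node.get(W + "type") in (None, "textWrapping")
    )
    return paragraphs, line_breaks


# Demo: one paragraph whose three "lines" are separated by manual line breaks.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r>"
    "<w:t>one</w:t><w:br/><w:t>two</w:t><w:br/><w:t>three</w:t>"
    "</w:r></w:p></w:body></w:document>"
)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", DEMO_XML)
demo.seek(0)
counts = count_breaks(demo)
print(counts)
```

A converted file that reports one paragraph and many line breaks would support the explanation above.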

How to parse a pdf file and extract tables with their titles using python-camelot?

I am trying to parse some PDF files in order to extract some key information. Each PDF contains a number of tables that hold part of this information. I used camelot to extract the tables and got good results, but I also want to extract the title of each table so that I can map each table to its title.
Can anyone tell me how to extract the title of a table from a PDF using Python?
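One possible heuristic, sketched without any camelot calls: if the vertical position of the table's top edge and the positions of the page's text lines are known, the title is often the nearest text line just above the table. How those coordinates are obtained is left open here and is an assumption. The function title_for_table, the max_gap threshold and the sample values are all hypothetical.

```python
def title_for_table(table_top, text_lines, max_gap=50.0):
    """Pick the closest non-empty text line above the table.

    table_top:  y coordinate of the table's top edge.
    text_lines: iterable of (y, text) pairs for the same page.
    Assumption: PDF-style coordinates, where y grows upward, so a line
    above the table has a y value larger than table_top.
    """
    candidates = [
        (y - table_top, text.strip())
        for y, text in text_lines
        if text.strip() and 0 <= y - table_top <= max_gap
    ]
    if not candidates:
        return None
    # The smallest gap is the line sitting closest above the table.
    return min(candidates)[0:2][1]


# Made-up positions for illustration only.
page_lines = [
    (700.0, "Annual report"),
    (520.0, "Table 2: Costs"),
    (300.0, "Footnote text"),
]
title = title_for_table(500.0, page_lines)
print(title)
```

The threshold matters: too small and a title separated by extra spacing is missed, too large and an unrelated paragraph may be picked up.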

How to read individual slide from ppt using tika package in python?

I want to compare the data in two pptx files and show the differences, if any, using Python.
I have tried the code below, but it returns all the content as a single block, with no way to separate the data by slide.
I am able to read all the content of the pptx using tika, but I need slide-wise content to compare with the other pptx file.
from tika import parser
parsed = parser.from_file('act.pptx')
act = parsed['content']
act = act.strip().replace('\n', ' ')
Expected result: each slide is stored in its own text file.
Actual result: the data from all slides ends up in one text file.
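A way to get slide-wise text without tika, sketched with the standard library only: a pptx file is a zip archive that stores each slide as ppt/slides/slideN.xml, with the visible text in a:t elements. The function name slide_texts is hypothetical and the demo file is built in memory for illustration. One assumption to note: the number in the file name is used as the slide key, and it does not always match the order in which slides are shown.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_file):
    """Return {slide file number: text of that slide} for a .pptx path or file object."""
    texts = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                root = ET.fromstring(archive.read(name))
                runs = [node.text or "" for node in root.iter(A_T)]
                texts[int(match.group(1))] = " ".join(runs)
    return dict(sorted(texts.items()))


def make_slide(*runs):
    """Build a tiny stand-in slide XML for the demo (not a complete slide)."""
    body = "".join(f"<a:t>{run}</a:t>" for run in runs)
    return (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        f"{body}</p:sld>"
    )


demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("ppt/slides/slide2.xml", make_slide("Second"))
    archive.writestr("ppt/slides/slide1.xml", make_slide("Hello", "world"))
demo.seek(0)
per_slide = slide_texts(demo)
print(per_slide)
```

Each value could then be written to its own text file, or the dictionaries from two presentations could be compared key by key.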

Is there a way to extract the highlighted text from a table in a docx document with Python?

Image of the table I need to extract the highlighted text from.
I need to write a Python script that converts this docx table to CSV, writing only the highlighted information from each row to the CSV.
For example, the column name would be "overall verdict", so the value underneath it must be "4".
If anyone can help, it would be appreciated.
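A minimal sketch of one approach, using only the standard library. It assumes the highlighting was applied with Word's text highlight tool, which is stored as a w:highlight element in the run properties; cell shading is stored differently and would not be detected. It also assumes that the first cell of each row holds the label and the remaining cells hold the options, which the question's image may or may not match. All function names are hypothetical and the demo table is made up.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def cell_texts(cell):
    """Return (all text, highlighted text only) for one table cell element."""
    full, marked = [], []
    for run in cell.iter(W + "r"):
        text = "".join(node.text or "" for node in run.iter(W + "t"))
        full.append(text)
        highlight = run.find(f"{W}rPr/{W}highlight")
        if highlight is not None and highlight.get(W + "val") != "none":
            marked.append(text)
    return "".join(full), "".join(marked)


def highlighted_by_label(docx_file):
    """Map the first cell's text of each row to the highlighted text in the other cells."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    result = {}
    for row in root.iter(W + "tr"):
        cells = [cell_texts(cell) for cell in row.findall(W + "tc")]
        if len(cells) >= 2:
            label = cells[0][0].strip()
            result[label] = "".join(marked for _, marked in cells[1:]).strip()
    return result


# Demo document: one row, label cell plus an options cell with "4" highlighted.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>Overall verdict</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>3 </w:t></w:r>"
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>4</w:t></w:r>'
    "<w:r><w:t> 5</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl></w:body></w:document>"
)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", DEMO_XML)
demo.seek(0)
verdicts = highlighted_by_label(demo)

# Write labels as the header row and highlighted values as the data row.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(verdicts.keys())
writer.writerow(verdicts.values())
csv_text = output.getvalue()
print(csv_text)
```

The same check is available through python-docx as run.font.highlight_color, which may be more convenient when that library is already in use.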
