Python to search for a specific table in a Word document - python-3.x

I am new to Python and have done a small hands-on exercise with the python-docx module.
I have a requirement in which I have to read a Word document that contains multiple tables and text.
From this document I have to select a specific table to read, and that selection depends on the text written in the line just above the table; I then have to process the data of that table.
I am able to read table data by referring to a table by its index, but in this case the table index is unknown and the table can be at any position in the document. The only thing by which I can identify the table is the text written in the line just above it.
Can you please help me achieve this?

I have a solution that uses BeautifulSoup rather than python-docx. What I have done here is traverse the OOXML of the Word (.docx) document.
from bs4 import BeautifulSoup
import zipfile

wordoc = input('Enter your file name here or name with path: ')
text1 = 'Enter your text written above the table'
text1 = ''.join(text1.split())

document = zipfile.ZipFile(wordoc)
xml_content = document.read('word/document.xml')
document.close()

soup = BeautifulSoup(xml_content, 'xml')
for document in soup.children:
    for body in document.children:
        for tag in body.children:
            if tag.name == 'p' and (''.join(tag.text.split())) == text1:
                table = tag.find_next_sibling('w:tbl')
                table_contents = []
                for wtc in table.findChildren('w:tc'):
                    cell_text = ''
                    for wr in wtc.findChildren('w:r'):
                        # We want to exclude struck-out text
                        if not wr.findChildren('w:strike'):
                            cell_text += wr.text
                    table_contents.append(cell_text)
                print(table_contents)
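For readers who prefer to stay inside python-docx, here is a minimal sketch of the same idea. It walks the document body in order, pairing each paragraph with the table that follows it. The file name and marker text are placeholders, and the block-iteration recipe (wrapping raw w:p / w:tbl elements in Paragraph and Table) is the commonly used workaround, since python-docx does not expose interleaved body order directly:

from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph

doc = Document('example.docx')          # placeholder file name
marker = 'Enter your text written above the table'

# Wrap every top-level body element in its python-docx object,
# preserving document order (paragraphs and tables interleaved).
blocks = []
for child in doc.element.body.iterchildren():
    if child.tag.endswith('}p'):
        blocks.append(Paragraph(child, doc))
    elif child.tag.endswith('}tbl'):
        blocks.append(Table(child, doc))

for i, block in enumerate(blocks):
    if isinstance(block, Paragraph) and block.text.strip() == marker:
        # The next block in document order should be the wanted table.
        if i + 1 < len(blocks) and isinstance(blocks[i + 1], Table):
            table = blocks[i + 1]
            for row in table.rows:
                print([cell.text for cell in row.cells])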

Related

Extracting data from multiple PDFs and putting that data into an Excel table

I am working with data extracted from multiple PDFs that were merged into one PDF.
The data is based on clinical measurements taken from a sample at different time points. Some time points have certain measurement values while others are missing.
So far I've been able to merge the PDFs and extract the text and specific data from the text, but I want to put it all into a corresponding Excel table.
Below is my current code:
import PyPDF2
from PyPDF2 import PdfFileMerger
from glob import glob

# merge all pdf files in current directory
def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs1.pdf", "wb") as new_file:
        merger.write(new_file)

if __name__ == "__main__":
    pdf_merge()

# scan pdf
text = ""
with open("Merged_pdfs1.pdf", "rb") as pdf_file, open("sample.txt", "w") as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(0, number_of_pages):
        page = read_pdf.getPage(page_number)
        text += page.extractText()
    text_file.write(text)

# turn text script into list, separated by newlines
def Convert(text):
    li = list(text.split("\n"))
    return li

li = Convert(text)
filelines = []
for line in li:
    filelines.append(line)
print(filelines)

# extract data from text and put into dictionary
full_data = []
test_data = {"Sample": [], "Timepoint": [], "Phosphat (mmol/l)": [], "Bilirubin, total (µmol/l)": [],
             "Bilirubin, direkt (µmol/l)": [], "Protein (g/l)": [], "Albumin (g/l)": [],
             "AST (U/l)": [], "ALT (U/l)": [], "ALP (U/l)": [], "GGT (U/l)": [], "IL-6 (ng/l)": []}
for line2 in filelines:
    # For each data item, extract it from the line and strip whitespace
    if line2.startswith("Phosphat"):
        test_data["Phosphat (mmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,total"):
        test_data["Bilirubin, total (µmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,direkt"):
        test_data["Bilirubin, direkt (µmol/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Protein "):
        test_data["Protein (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Albumin"):
        test_data["Albumin (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("AST"):
        test_data["AST (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("ALT"):
        test_data["ALT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Alk."):
        test_data["ALP (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("GGT"):
        test_data["GGT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Interleukin-6"):
        test_data["IL-6 (ng/l)"].append(line2.split(" ")[-4].strip())
    for sampnum in range(100):
        num = str(sampnum)
        sampletype = "T" and "H"
        if line2.startswith(sampletype + num):
            sample = sampletype + num
            test_data["Sample"] = sample
    for time in range(0, 360):
        timepoint = str(time) + "h"
        word_list = list(line2.split(" "))
        for word in word_list:
            if word == timepoint:
                test_data["Timepoint"].append(word)
full_data.append(test_data)

import pandas as pd
df = pd.DataFrame(full_data)
df.to_excel("IKC4.xlsx", sheet_name="IKC", index=False)
print(df)
The issue is that I'm wondering how to move the individual items in the list to their own cells in Excel, with the proper timepoint, since they don't necessarily correspond to the right timepoint. For example, timepoints 1 and 3 can have protein measurements while timepoint 2 is missing this info, but timepoint 3's measurements are found at position 2 in the list and will likely end up in the wrong row of the Excel table.
I figured maybe I need to make an alternative dictionary keyed by the timepoints and attach the corresponding measurements to the proper timepoint. I'm starting to get confused about how to do all this, so I'm now asking for help!
Thanks in advance :)
I tried adding an "else" branch after every if statement to append a "-" when a measurement wasn't present for that timepoint, but I got far too many dashes since it iterates through the lines of the entire PDF.
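One way to realise the "alternative dictionary for the timepoints" idea is to collect one dict of measurements per timepoint and let pandas fill the gaps. A minimal sketch is below; the timepoint detection and the per-measurement parsing lines are placeholders standing in for the existing split/strip logic above:

import pandas as pd

# rows keyed by timepoint, e.g. {"24h": {"Protein (g/l)": "65", ...}, ...}
rows = {}
current_timepoint = None

for line in filelines:                       # filelines from the existing code above
    for t in range(0, 360):                  # detect a timepoint marker on the line
        if f"{t}h" in line.split(" "):
            current_timepoint = f"{t}h"
            rows.setdefault(current_timepoint, {})
    if current_timepoint is None:
        continue
    if line.startswith("Protein "):          # placeholder: same split/strip as before
        rows[current_timepoint]["Protein (g/l)"] = line.split(" ")[-2].strip()
    # ... repeat for the other measurements ...

# Missing measurements simply stay absent and become NaN in the DataFrame,
# so every value lands in the row of its own timepoint.
df = pd.DataFrame.from_dict(rows, orient="index")
df.index.name = "Timepoint"
df.to_excel("IKC4_by_timepoint.xlsx", sheet_name="IKC")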

Save text in JSON format from Python Selenium

I am trying to scrape data from a webpage and save the scraped text in JSON format.
I have reached the step where I can gather the text I want, but I can't save it in the expected format. CSV or txt format would also be sufficient.
Please help me with how to save the scraped text as JSON. Here is the code I have so far:
for k in range(0, len(op3)):
    selectweek.select_by_index(k)
    table = driver.find_element_by_xpath("//table[@class='list-table']")
    for row in table.find_elements_by_xpath('//*[@id="dvFixtureInner"]/table/tbody/tr[2]/td[6]/a'):
        row.click()
        mainpage = driver.window_handles[0]
        print(mainpage)
        popup = driver.window_handles[1]
        driver.switch_to.window(popup)
        time.sleep(3)
        # Meta details of match
        team1 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[1]/div[1]/a')  # Data to save
        team2 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[3]/div[1]/a')  # Data to save
        ht = driver.find_element_by_xpath('//*[@id="dvHTScoreText"]')  # Data to save
        ft = driver.find_element_by_xpath('//*[@id="dvScoreText"]')  # Data to save
Create a dictionary and convert it into JSON format using the json module. Note that team1, team2, ht and ft are WebElements, so take their .text before serialising:
import json

dictionary = {"team1": team1.text, "team2": team2.text, "ht": ht.text, "ft": ft.text}
json_dump = json.dumps(dictionary)
with open("YourFilePath", "w") as f:
    f.write(json_dump)
You can create a dictionary and add key-value pairs to it. I don't know the structure of the JSON you want, but this can give an idea:
json_data = dict()
ht = 1
ft = 2
json_data["team1"] = {"ht": ht, "ft": ft}
print(json_data)
>>> {'team1': {'ht': 1, 'ft': 2}}
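Since the question mentions that CSV would also be acceptable, here is a minimal sketch of the same idea using the standard csv module. The field names and the scraped values mirror the JSON example above, and the output path is a placeholder:

import csv
import os

row = {"team1": team1.text, "team2": team2.text, "ht": ht.text, "ft": ft.text}
csv_path = "matches.csv"                      # placeholder output path
write_header = not os.path.exists(csv_path)   # write the header only once

# Append one row per scraped match.
with open(csv_path, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["team1", "team2", "ht", "ft"])
    if write_header:
        writer.writeheader()
    writer.writerow(row)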

Python3: How to get the English word from a URL?

I use this code:
import urllib.request

fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
x = 'alt'
for item in mystr.split():
    if (x) in item:
        print(item.strip())
I get the Thai words from this code but I don't know how to get the English words. Thanks.
If you want to get words from the table you should use a parsing library like BeautifulSoup4. Here is an example of how you can parse this (I'm using requests to fetch the page and BeautifulSoup to parse the data):
First, using the dev tools in your browser, identify the table with the content you want to parse. The table with the translations has a servicesT class attribute, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that you need to get all the rows that contain translations for Thai words. If you look at the page's source you will notice that the first few <tr> rows contain only headers, so we will omit them. Then we get all <td> elements from each row (in this table there are always 3 <td> elements) and fetch the words from them (the words are actually nested in <span> and <a> tags).
table_rows = table.findAll('tr')
# We will skip the first 3 rows because they do not
# contain the information we need
for tr in table_rows[3:]:
    # Finding all <td> elements
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        # Get tag with Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get tag with English word
        english_word_tag = row_columns[1].find('span')
        if thai_word_tag:
            thai_word = thai_word_tag.text
        if english_word_tag:
            english_word = english_word_tag.text
        # Printing our fetched words
        print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I've also noticed that the data inside the table does not always have translations, so keep that in mind when scraping. You could also use the Requests-HTML library to parse the data (it supports pagination, which is present in the table on the page you want to scrape).
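For reference, a rough sketch of the same table lookup with Requests-HTML might look like the following; this is an assumption about how that library would be applied here, reusing the selectors from the BeautifulSoup example above:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then')

# Same table as above, located by its servicesT class.
table = r.html.find('table.servicesT', first=True)
for tr in table.find('tr')[3:]:
    cells = tr.find('td')
    if len(cells) >= 2:
        thai = cells[0].find('span > a', first=True)
        english = cells[1].find('span', first=True)
        if thai and english:
            print(thai.text, english.text)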

Why is the .get('href') returning "None" on a bs4.element.tag?

I'm pulling together a dataset for analysis. The goal is to parse a table on an SEC webpage and pull out the link in the row that has the text "SC 13D" in it. This needs to be repeatable so I can automate it across a large list of links I have in a database. I know this code is not the most Pythonic, but I hacked it together to get what I need out of the table, except for the link in the table row. How can I extract the href value from the table row?
I tried doing a .findAll on 'tr' instead of 'td' in the table (line 15) but couldn't figure out how to search on "SC 13D" and pop the element from the list of table rows as I could with .findAll('td'). I also tried to get just the anchor tag with the link in it using .get('a') instead of .get('href') (included in the code, line 32), but it also returns "None".
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.sec.gov/Archives/edgar/data/1050122/000101143807000336/0001011438-07-000336-index.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', {'summary': 'Document Format Files'})
rows = table.findAll("td")
i = 0
pos = 0
for row in rows:
    if "SC 13D" in row:
        pos = i
        break
    else:
        i = i + 1
linkpos = pos - 1
linkelement = rows[linkpos]
print(linkelement.get('a'))
print(linkelement.get('href'))
The expected result is that the link in linkelement is printed. The actual result is "None".
It is because your <a> tag is inside your <td> tag.
You just have to do:
linkelement = rows[linkpos]
a_element = linkelement.find('a')
print(a_element.get('href'))
Switch your .get to .find.
You want to find the <a> tag and print its href attribute:
print(linkelement.find('a')['href'])
Or you need to use .get with the tag:
print(linkelement.a.get('href'))
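As an aside, the row-based search the question attempted (findAll('tr') plus a check for "SC 13D") can also work, since the matching row already contains its own link; a small sketch, assuming the same table variable as above:

# Search whole rows instead of individual cells; the matching row
# carries its own <a> tag, so no index arithmetic is needed.
for tr in table.findAll('tr'):
    if "SC 13D" in tr.get_text():
        link = tr.find('a')
        if link is not None:
            print(link['href'])
        break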

(Python) How to store text extracted from an HTML table using BeautifulSoup in a structured Python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to xls and operate on the elements (for example, if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text

text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33 but I get
n
That means the list takes in individual characters and not the full words.
Please let me know where I am going wrong.
That is because your Ear_yield_text is a string rather than a list. Assuming that the text has newlines, you can do this directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will be given this result
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
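To go one step further toward the "operate on the elements and write to xls" goal, a small sketch (assuming the split list from above, with pandas and openpyxl installed) could convert the numeric entries to floats and write them out:

import pandas as pd

list_Ear_yield = ['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']

label = list_Ear_yield[0]
values = [float(v) for v in list_Ear_yield[1:]]   # now numeric comparisons work

if values[0] > values[1]:
    print(f"{label}: first value is higher than the second")

# One row, one column per period.
df = pd.DataFrame([values], index=[label])
df.to_excel("earnings_yield.xlsx", header=False)  # requires openpyxl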
