Extracting data from multiple PDFs and putting that data into an Excel table

I am taking data extracted from multiple pdfs that were merged into one pdf.
The data is based on clinical measurements taken from a sample at different time points. Some time points have certain measurement values while others are missing.
So far, I've been able to merge the pdfs, extract the text and specific data from the text, but I want to put it all into a corresponding excel table.
Below is my current code:
import PyPDF2
from PyPDF2 import PdfFileMerger
from glob import glob
#merge all pdf files in current directory
def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs1.pdf", "wb") as new_file:
        merger.write(new_file)
if __name__ == "__main__":
    pdf_merge()
#scan pdf
text =""
with open ("Merged_pdfs1.pdf", "rb") as pdf_file, open("sample.txt", "w") as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(0, number_of_pages):
        page = read_pdf.getPage(page_number)
        text += page.extractText()
    text_file.write(text)
#turn text script into list, separated by newlines
def Convert(text):
    li = list(text.split("\n"))
    return li
li = Convert(text)
filelines = []
for line in li:
    filelines.append(line)
print(filelines)
#extract data from text and put into dictionary
full_data = []
test_data = {"Sample":[], "Timepoint":[],"Phosphat (mmol/l)":[], "Bilirubin, total (µmol/l)":[],
             "Bilirubin, direkt (µmol/l)":[], "Protein (g/l)":[], "Albumin (g/l)":[],
             "AST (U/l)":[], "ALT (U/l)":[], "ALP (U/l)":[], "GGT (U/l)":[], "IL-6 (ng/l)":[]}
for line2 in filelines:
    # For each data item, extract it from the line and strip whitespace
    if line2.startswith("Phosphat"):
        test_data["Phosphat (mmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,total"):
        test_data["Bilirubin, total (µmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,direkt"):
        test_data["Bilirubin, direkt (µmol/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Protein "):
        test_data["Protein (g/l)"].append( line2.split(" ")[-2].strip())
    if line2.startswith("Albumin"):
        test_data["Albumin (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("AST"):
        test_data["AST (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("ALT"):
        test_data["ALT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Alk."):
        test_data["ALP (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("GGT"):
        test_data["GGT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Interleukin-6"):
        test_data["IL-6 (ng/l)"].append(line2.split(" ")[-4].strip())
    for sampnum in range(100):
        num = str(sampnum)
        sampletype = "T" and "H"
        if line2.startswith(sampletype+num):
            sample = sampletype+num
            test_data["Sample"]=sample
    for time in range(0,360):
        timepoint = str(time) + "h"
        word_list = list(line2.split(" "))
        for word in word_list:
            if word == timepoint:
                test_data["Timepoint"].append(word)
full_data.append(test_data)
import pandas as pd
df = pd.DataFrame(full_data)
df.to_excel("IKC4.xlsx", sheet_name="IKC", index=False)
print(df)
The issue is that I don't know how to put the individual items in each list into their own cells in Excel next to the proper timepoint, since list positions don't necessarily correspond to the right timepoint. For example, timepoints 1 and 3 can have protein measurements while timepoint 2 is missing them; the timepoint 3 measurement then sits at position 2 in the list and will likely end up in the wrong row of the Excel table.
I figured maybe I need to make an alternative dictionary for the timepoints and attach the corresponding measurements to the proper timepoint. I'm starting to get confused about how to do all this, though, and am now asking for help!
Thanks in advance :)
I tried adding an "else" branch after every "if" to append a "-" when a measurement wasn't present for that timepoint, but I got far too many dashes, since the loop iterates through every line of the entire PDF.
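One way to keep measurements attached to the right timepoint is to build one dictionary per (sample, timepoint) instead of one list per measurement. The following is only a minimal sketch under assumptions that are not in the question: it assumes each sample/timepoint header line (e.g. "H12 ... 24h") appears before its measurement lines, and the names parse_records, FIELDS, SAMPLE_RE and TIME_RE are hypothetical. Only a few of the measurements are listed, and the token positions must be adapted to the real PDF layout.

```python
import re

# Hypothetical mapping: line prefix -> (column name, index of the value token).
# The indices follow the question's code and depend on the real text layout.
FIELDS = {
    "Phosphat": ("Phosphat (mmol/l)", -2),
    "Protein ": ("Protein (g/l)", -2),
    "Albumin": ("Albumin (g/l)", -2),
}
SAMPLE_RE = re.compile(r"^([TH]\d+)\b")   # assumed sample IDs such as T3 or H12
TIME_RE = re.compile(r"\b(\d+h)\b")       # assumed timepoints such as 0h or 24h


def parse_records(lines):
    """Return one dict per sample/timepoint header line found in lines."""
    records = []
    current = None
    for line in lines:
        sample = SAMPLE_RE.match(line)
        if sample:
            # A new header line starts a new record (a new table row).
            time = TIME_RE.search(line)
            current = {"Sample": sample.group(1),
                       "Timepoint": time.group(1) if time else None}
            records.append(current)
            continue
        if current is None:
            continue  # measurement lines before any header are ignored
        tokens = line.split(" ")
        for prefix, (column, index) in FIELDS.items():
            if line.startswith(prefix) and len(tokens) >= -index:
                current[column] = tokens[index].strip()
    return records
```

Because each record only contains the measurements that were actually present, pd.DataFrame(records) should leave the missing ones empty (NaN) in the right row, and the result can be written with to_excel as in the code above.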

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file then write the word and write 'present' next to it.
If the word is not present then write the word and write not_present next to it.
So far I can do this fine by printing to the console, as shown below:
import sys
filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'
# change to lower case
with open(filein,'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein,'w') as fopen:
    fopen.write(string)
# search and list
with open(compare) as f:
    searcher = f.read()
    if not searcher:
        sys.exit("Could not read data :-(")
#search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with 2 columns: one for the word and one for its state. I can't seem to get my head around doing this.
I am making several assumptions here:
compare.txt is a text file consisting of single words, one word per line.
source.txt is a free-flowing text file with multiple words per line, each word separated by a space.
A compare word counts as found in source only if no punctuation marks (", ', comma, period, ?, etc.) are attached to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict
compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)
def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist
cmpList = getCompareTxt(compare)
if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
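If you would rather keep the direction of the original loop (one output row per source word), a possible sketch is to collect the rows in a list instead of printing them. The function name compare_words is hypothetical, and the sketch assumes whole-word matching against the compare file rather than the substring test used in the question.

```python
def compare_words(source_words, compare_text):
    """Return one {'word', 'state'} row per source word."""
    # Whole-word matching: split the compare text into a set of lowercase words.
    known = set(compare_text.lower().split())
    rows = []
    for word in source_words:
        state = "present" if word.lower() in known else "not_present"
        rows.append({"word": word, "state": state})
    return rows
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows, columns=["word", "state"]) to get the two-column dataframe.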

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from the lines in a txt file that contain a certain keyword, add them up per line, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid total numbers.
I managed to extract the lines with the "valid" keyword in them, but now I want to get the numbers from these lines, add up the numbers of each line, compare these totals with those of the other lines that have the same keyword, and print the highest and the lowest valid totals.
My code so far:
#get file object reference to the file
file = open("shelfs.txt", "r")
#read content of file to string
data = file.read()
#close file
closefile = file.close()
#get number of occurrences of the substring in the string
totalshelfs = data.count("SHELF")
#count "|VALID|" with its delimiters, because "VALID" alone also matches "INVALID"
totalvalid = data.count("|VALID|")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
txt file
HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|
What you need is to use the pandas library.
https://pandas.pydata.org/
You can read a delimited file like this:
import pandas as pd
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object, which makes it easy to select or sort your data. It uses the first row as the header, so you can select a specific column like a dictionary key:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER']=='SHELF']
which keeps only the rows where the header column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have 1 header but 2 or 5 columns.
Maybe you should first create one header per column in your file, and add separators so that every row has the same size.
Edit (no external libraries and no change to the txt file):
# Split by row
data = data.split('<br>\n')
# Split by col
data = [d.split('|') for d in data]
# Fill empty cells
n_cols = max([len(d) for d in data])
for i in range(len(data)):
    while len(data[i])<n_cols:
        data[i].append('')
# List VALID rows
valid_rows = [d for d in data if d[4]=='VALID']
# Get sum, min and max (convert to float first, because the fields are strings)
valid_sum = [float(d[1])+float(d[2])+float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do but it solves a part of your problem. I didn't test the code.
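For reference, here is a small self-contained sketch of the same idea that can be run against the sample data from the question. It assumes, like the snippet above, that the "total" of a line is the sum of its three numeric fields; the function name valid_totals and the SAMPLE constant are made up for the example.

```python
SAMPLE = """HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|"""


def valid_totals(text):
    """Return the per-line totals of all BOOK lines marked VALID."""
    totals = []
    for raw in text.splitlines():
        # Drop the trailing "<br>" marker, then split into fields.
        fields = raw.replace("<br>", "").strip().split("|")
        # Exact field comparison, so "INVALID" is not mistaken for "VALID".
        if fields[0] == "BOOK" and len(fields) > 4 and fields[4] == "VALID":
            totals.append(sum(float(value) for value in fields[1:4]))
    return totals


totals = valid_totals(SAMPLE)
print("Highest valid total:", max(totals))
print("Lowest valid total:", min(totals))
```

If only one of the fields (for example the third one) is the number you actually want to compare, replace fields[1:4] with that single field.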

CSV manipulation problem. A little complex and would like the solution to not be using pandas

CSV file:
Acct,phn_1,phn_2,phn_3,Name,Consent,zipcode
1234,45678,78906,,abc,NCN,10010
3456,48678,,78976,def,NNC,10010
Problem:
Each letter of the Consent value corresponds to one of the phones (in the 1st row, the 1st N is for phn_1, the C is for phn_2, and so on). Based on it, I need to retain only the phn column that has consent and move the remaining columns to the end of the file.
Below is what I have. I feel my approach isn't that great. I'm trying to get the index of each individual N and C and map it to the corresponding phone, but I'm unable to iterate through the phn headers and compare them with the indexes of the Ns and Cs.
import csv
with open('file.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            data.setdefault(header, list()).append(value)
# print(data)
Consent = data['Consent']
for i in range(len(Consent)):
    # print(list(Consent[i]))
    for idx, val in enumerate(list(Consent[i])):
        # print(idx, val)
        if val == 'C':
            #print("C")
            print(idx)
        else:
            print("N")
Could someone provide me with the solution for this?
Please Note: Do not want the solution to be by using pandas.
You’ll find my answer in the comments of the code below.
import csv
def parse_csv(file_name):
    """Return the reordered CSV content as a string."""
    # Prepare the output. Note that all rows of a CSV file must have the same structure.
    # So it is actually not possible to put the phone numbers with no consent at the end
    # of the file, but what you can do is to put them at the end of the row.
    # To ensure that the structure is the same on all rows, you need to put all phone numbers
    # at the end of the row. That means the phone number with consent is duplicated, and that
    # is not very efficient.
    # I chose to put the result in a string, but you can use other types.
    output = "Acct,phn,Name,Consent,zipcode,phn_1,phn_2,phn_3\n"
    with open(file_name, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Search the letter “C” in “Consent” and get the position of the first match.
            # Add one to the result because the “phn_×” keys are 1-based and not 0-based.
            first_c_pos = row["Consent"].find("C") + 1
            # If there is no “C”, then the “phn” key is empty.
            if first_c_pos == 0:
                row["phn"] = ""
            # If there is at least one “C”, create a key string that will take the values
            # phn_1, phn_2 or phn_3.
            else:
                key = f"phn_{first_c_pos}"
                row["phn"] = row[key]
            # Add the current row to the result string.
            output += ",".join([
                row["Acct"], row["phn"], row["Name"], row["Consent"],
                row["zipcode"], row["phn_1"], row["phn_2"], row["phn_3"]
            ])
            output += "\n"
    # Return the string.
    return output
if __name__ == "__main__":
    output = parse_csv("file.csv")
    print(output)
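As a variation on the same approach, the rows could be written with csv.DictWriter instead of joining strings by hand, which also takes care of quoting if a field ever contains a comma. This is only a sketch: the names reorder_rows and OUT_FIELDS are made up, and it assumes the Consent string never has more letters than there are phn_ columns.

```python
import csv
import io

# Output column order: the consented phone first, all phones again at the end.
OUT_FIELDS = ["Acct", "phn", "Name", "Consent", "zipcode", "phn_1", "phn_2", "phn_3"]


def reorder_rows(infile, outfile):
    """Copy CSV rows from infile to outfile, adding a 'phn' column."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=OUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Position of the first "C" in Consent; -1 means no consent at all.
        pos = row["Consent"].find("C")
        row["phn"] = row[f"phn_{pos + 1}"] if pos != -1 else ""
        writer.writerow(row)


if __name__ == "__main__":
    sample = io.StringIO(
        "Acct,phn_1,phn_2,phn_3,Name,Consent,zipcode\n"
        "1234,45678,78906,,abc,NCN,10010\n"
        "3456,48678,,78976,def,NNC,10010\n"
    )
    result = io.StringIO()
    reorder_rows(sample, result)
    print(result.getvalue())
```

With real files, open the input and output with newline="" and pass the file objects instead of the StringIO buffers.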

Skip lines with strange characters when I read a file

I am trying to read some '.txt' data files, and some of them contain strange random characters and even extra columns in random rows, as in the following example, where the second row is a correct row:
CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542
CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482
I read the description of np.loadtxt and I have not found a solution for my problem. Is there a systematic way to skip rows like these?
The code that I use to read the files is:
import numpy as np
from io import StringIO
#Function to read a datafile
def Read(filename):
    #Change delimiters for spaces
    s = open(filename).read().replace(':',' ')
    s = s.replace(',',' ')
    s = s.replace('/',' ')
    #Take the columns that we need
    data=np.loadtxt(StringIO(s),usecols=(4,5,6,8,9,10,11,12))
    return data
This works without using csv (unlike the other answer); it just reads the file line by line and checks whether each line is ASCII:
data = []
def isascii(s):
    return len(s) == len(s.encode())
with open("test.txt", "r") as fil:
    for line in fil:
        res = map(isascii, line)
        if all(res):
            data.append(line)
print(data)
You could use the csv module to read the file one line at a time and apply your desired filter.
import csv
def isascii(s):
    return len(s) == len(s.encode())
# expected_length: the number of fields a well-formed row should have (set this yourself)
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        if len(row)==expected_length and all((isascii(x) for x in row)):
            'write row onto numpy array'
I got the ASCII check from this thread: How to check if a string in Python is in ASCII?
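Both checks (ASCII only, and the expected number of columns) can be combined with the delimiter replacement from the question. Here is a minimal sketch; keep_clean_lines is a made-up name, and the default of 13 fields is an assumption based on counting the fields of the correct sample row after the delimiters are replaced by spaces. It relies on str.isascii(), which is available from Python 3.7.

```python
def keep_clean_lines(lines, expected_fields=13):
    """Return the cleaned lines that are pure ASCII and have the expected field count."""
    kept = []
    for line in lines:
        # Same delimiter replacement as in the question.
        cleaned = line.replace(":", " ").replace(",", " ").replace("/", " ")
        if line.isascii() and len(cleaned.split()) == expected_fields:
            kept.append(cleaned)
    return kept
```

The kept lines can then be joined with "\n" and passed to np.loadtxt through StringIO with the same usecols as in the question.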

No space between words while reading and extracting the text from a pdf file in python?

Hello Community Members,
I want to extract all the text from an e-book in .pdf format. I learned that Python has a package, PyPDF2, for this. I have tried it and am able to extract text, but the spacing between the extracted words is wrong: sometimes a result consists of 2-3 words merged together.
Further, I want to extract the text from page 3 onward, as the initial pages deals with the cover page and preface. Also, I don't want to include the last 5 pages as it contains the glossary and index.
Does there exist any other way to read a .pdf binary file with NO ENCRYPTION?
The code snippet, whatever I have tried up to now is as follows.
import PyPDF2
def Read():
    pdfFileObj = open('book1.pdf','rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    #discerning the number of pages will allow us to parse through all the pages
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while(count < num_pages):
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText().split()
    print(text)
Read()
This is a possible solution:
import PyPDF2
def Read(startPage, endPage):
    global text
    text = []
    cleanText = ""
    pdfFileObj = open('myTest2.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    print(text)
Read(0,0)
Read() parameters: Read(first page to read, last page to read)
Note: page numbering starts from 0, not from 1 (as with array indices), so the first page is page 0.
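To skip the first two pages and the last five, as asked in the question, the start and end values can be derived from the page count. A small sketch follows; the helper name body_page_indices and its parameters are hypothetical, and the result is meant to be used with the 0-based page numbers described above.

```python
def body_page_indices(num_pages, skip_front=2, skip_back=5):
    """Return the 0-based page indices left after skipping front and back pages."""
    start = min(skip_front, num_pages)
    # Never let the end fall before the start (e.g. for very short documents).
    stop = max(start, num_pages - skip_back)
    return range(start, stop)
```

For example, with pdfReader.numPages as num_pages, the first and last values of the returned range can be passed to Read() as startPage and endPage, provided the range is not empty.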
