No space between words while reading and extracting the text from a pdf file in python? - python-3.x

Hello Community Members,
I want to extract all the text from an e-book with .pdf as the file extension. I came to know that Python has a package, PyPDF2, to do the necessary action. I have tried it and I am able to extract text, but the spacing between the extracted words comes out wrong; sometimes the result is 2-3 words merged into one.
Further, I want to extract the text from page 3 onward, as the initial pages deal with the cover page and preface. Also, I don't want to include the last 5 pages, as they contain the glossary and index.
Is there any other way to read a .pdf binary file with NO ENCRYPTION?
The code snippet I have tried so far is as follows.
import PyPDF2

def Read():
    pdfFileObj = open('book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # discerning the number of pages will allow us to parse through all the pages
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText().split()
    print(text)

Read()

This is a possible solution:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = ""
    pdfFileObj = open('myTest2.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    print(text)

Read(0, 0)
Read() parameters --> Read(first page to read, last page to read)
Note: page numbering starts from 0, not from 1 (as in an array), so pass 0 to read the first page.
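If the word spacing stays broken with PyPDF2, trying a different extraction library may help. Below is a minimal sketch using pdfminer.six (my suggestion, not part of the original answer); its extract_text function takes zero-indexed page_numbers, which also covers skipping the cover pages and the trailing glossary/index:

import PyPDF2
from pdfminer.high_level import extract_text

# a sketch, assuming pdfminer.six is installed (pip install pdfminer.six);
# its layout analysis usually preserves word spacing better than extractText()
num_pages = PyPDF2.PdfFileReader(open('book1.pdf', 'rb')).numPages
wanted = range(2, num_pages - 5)  # page 3 onward, excluding the last 5 pages
text = extract_text('book1.pdf', page_numbers=wanted)
print(text.split())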

Related

extracting data from multiple pdfs and putting that data into an excel table

I am taking data extracted from multiple pdfs that were merged into one pdf.
The data is based on clinical measurements taken from a sample at different time points. Some time points have certain measurement values while others are missing.
So far, I've been able to merge the pdfs, extract the text and specific data from the text, but I want to put it all into a corresponding excel table.
Below is my current code:
import PyPDF2
from PyPDF2 import PdfFileMerger
from glob import glob

# merge all pdf files in current directory
def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs1.pdf", "wb") as new_file:
        merger.write(new_file)

if __name__ == "__main__":
    pdf_merge()

# scan pdf
text = ""
with open("Merged_pdfs1.pdf", "rb") as pdf_file, open("sample.txt", "w") as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(0, number_of_pages):
        page = read_pdf.getPage(page_number)
        text += page.extractText()
    text_file.write(text)

# turn text script into list, separated by newlines
def Convert(text):
    li = list(text.split("\n"))
    return li

li = Convert(text)
filelines = []
for line in li:
    filelines.append(line)
print(filelines)

# extract data from text and put into dictionary
full_data = []
test_data = {"Sample": [], "Timepoint": [], "Phosphat (mmol/l)": [], "Bilirubin, total (µmol/l)": [],
             "Bilirubin, direkt (µmol/l)": [], "Protein (g/l)": [], "Albumin (g/l)": [],
             "AST (U/l)": [], "ALT (U/l)": [], "ALP (U/l)": [], "GGT (U/l)": [], "IL-6 (ng/l)": []}
for line2 in filelines:
    # For each data item, extract it from the line and strip whitespace
    if line2.startswith("Phosphat"):
        test_data["Phosphat (mmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,total"):
        test_data["Bilirubin, total (µmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,direkt"):
        test_data["Bilirubin, direkt (µmol/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Protein "):
        test_data["Protein (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Albumin"):
        test_data["Albumin (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("AST"):
        test_data["AST (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("ALT"):
        test_data["ALT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Alk."):
        test_data["ALP (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("GGT"):
        test_data["GGT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Interleukin-6"):
        test_data["IL-6 (ng/l)"].append(line2.split(" ")[-4].strip())
    for sampnum in range(100):
        num = str(sampnum)
        sampletype = "T" and "H"
        if line2.startswith(sampletype + num):
            sample = sampletype + num
            test_data["Sample"] = sample
    for time in range(0, 360):
        timepoint = str(time) + "h"
        word_list = list(line2.split(" "))
        for word in word_list:
            if word == timepoint:
                test_data["Timepoint"].append(word)
full_data.append(test_data)

import pandas as pd
df = pd.DataFrame(full_data)
df.to_excel("IKC4.xlsx", sheet_name="IKC", index=False)
print(df)
The issue is how to move the individual items in the list to their own cells in Excel, with the proper timepoint, since they don't necessarily correspond to the right timepoint. For example, timepoints 1 and 3 can have protein measurements while timepoint 2 is missing this info, so timepoint 3's measurement sits at position 2 in the list and will likely end up in the wrong row of the Excel table.
I figured maybe I need to make an alternative dictionary for the timepoints and attach the corresponding measurements to the proper timepoint. I'm starting to get confused about how to do all this and am now asking for help!
Thanks in advance :)
I tried adding an "else" branch after every if to append a "-" when a measurement wasn't present for that timepoint, but I got far too many dashes since it iterates through the lines of the entire pdf.
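One way to realize the dictionary idea (a sketch of my own with hypothetical names, not from the original thread): key each row by timepoint, so a missing measurement simply stays empty (NaN) instead of shifting later values into the wrong row.

import pandas as pd

rows = {}  # timepoint -> {column name: value}

def record(timepoint, column, value):
    # create the row for this timepoint on first sight, then fill in the value
    rows.setdefault(timepoint, {"Timepoint": timepoint})[column] = value

# example: timepoint "2h" has no protein measurement
record("1h", "Protein (g/l)", "64")
record("1h", "Albumin (g/l)", "38")
record("2h", "Albumin (g/l)", "37")
record("3h", "Protein (g/l)", "61")

df = pd.DataFrame(list(rows.values()))
print(df)  # the 2h row shows NaN for protein instead of borrowing 3h's value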

Creating a python spellchecker using tkinter

For school, I need to create a spell checker using Python. I decided to do it using a GUI created with tkinter. I need to be able to input a text (.txt) file that will be checked, and a dictionary file, also a text file. The program needs to open both files, check the check file against the dictionary file, and then display any words that are misspelled.
Here's my code:
import tkinter as tk
from tkinter.filedialog import askopenfilename

def checkFile():
    # get the sequence of words from a file
    text = open(file_ent.get())
    dictDoc = open(dict_ent.get())
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w, 0) + 1
    for k in dictDict:
        dictDoc.pop(k, None)
    misspell_lbl["text"] = dictDoc

# Set-up the window
window = tk.Tk()
window.title("Temperature Converter")
window.resizable(width=False, height=False)

# Setup Layout
frame_a = tk.Frame(master=window)
file_lbl = tk.Label(master=frame_a, text="File Name")
space_lbl = tk.Label(master=frame_a, width=6)
dict_lbl = tk.Label(master=frame_a, text="Dictionary File")
file_lbl.pack(side=tk.LEFT)
space_lbl.pack(side=tk.LEFT)
dict_lbl.pack(side=tk.LEFT)

frame_b = tk.Frame(master=window)
file_ent = tk.Entry(master=frame_b, width=20)
dict_ent = tk.Entry(master=frame_b, width=20)
file_ent.pack(side=tk.LEFT)
dict_ent.pack(side=tk.LEFT)

check_btn = tk.Button(master=window, text="Spellcheck", command=checkFile)

frame_c = tk.Frame(master=window)
message_lbl = tk.Label(master=frame_c, text="Misspelled Words:")
misspell_lbl = tk.Label(master=frame_c, text="")
message_lbl.pack()
misspell_lbl.pack()

frame_a.pack()
frame_b.pack()
check_btn.pack()
frame_c.pack()

# Run the application
window.mainloop()
I want the file to check against the dictionary and display the misspelled words in the misspell_lbl.
The test files I'm using to make it work, and to submit with the assignment are here:
check file
dictionary file
I preloaded the files to the site that I'm submitting this on, so it should just be a matter of entering the file name and extension, not the entire path.
I'm pretty sure the problem is with my function to read and check the file, I've been beating my head on a wall trying to solve this, and I'm stuck. Any help would be greatly appreciated.
Thanks.
The first problem is with how you try to read the files. open(...) will return a _io.TextIOWrapper object, not a string, and this is what causes your error. To get the text from the file, you need to use .read(), like this:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
The with open(...) as f part gives you a file object called f and automatically closes the file when it's done. It is a more concise version of:
f = open(...)
text = f.read()
f.close()
f.read() will get the text from the file. For the dictionary I also added .splitlines() to turn the newline-separated text into a list.
I couldn't really see where you'd tried to check for misspelled words, but you can do it with a list comprehension.
misspelled = [x for x in words if x not in dictDoc]
This gets every word which is not in the dictionary file and adds it to a list called misspelled. Altogether, the checkFile function now looks like this, and works as expected:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w, 0) + 1
    misspelled = [x for x in words if x not in dictDoc]
    misspell_lbl["text"] = misspelled

How to prepare text in all TXT files in folder for python via terminal?

I have a folder with a lot of TXT files (books) which have many special symbols (multiple spaces, paragraphs, #, -, '.', etc.) at the beginning. This causes a great variety of problems when reading the files in Python (pandas). The errors usually look like:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2
or
Found 0 texts.
Can I use some terminal script for text preconditioning? Your assistance will be much appreciated!
Here is the code:
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                args = {} if sys.version_info < (3,) else {'encoding': 'utf-8'}
                with open(fpath, **args) as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))
You can try the unicodedata module:
text = unicodedata.normalize('NFKD', text)
It replaces Unicode characters with their normalized representations.
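Building on that, a small preconditioning script could clean every TXT file in one pass (a sketch, assuming the books live in a folder named books and that normalizing Unicode plus collapsing whitespace runs is enough for your parser; both folder names are hypothetical):

import os
import re
import unicodedata

SRC, DST = "books", "books_clean"  # assumed source and destination folders
os.makedirs(DST, exist_ok=True)

for name in os.listdir(SRC):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(SRC, name), encoding="utf-8", errors="replace") as f:
        text = f.read()
    text = unicodedata.normalize("NFKD", text)  # normalize special characters
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse long runs of blank lines
    with open(os.path.join(DST, name), "w", encoding="utf-8") as f:
        f.write(text.strip())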

Extract specific pages into PDF file

I need to extract sets of pages from a pdf that contains several sets. The sets are distinguished by shipments. Inside, the pdf has the following information...
1 - Set of 3 shipments
Page: 1/continued
Page: 2/continued
Page: 3/last
2 - Set of 2 shipments
Page: 1/continued
Page: 2/last
3 - Set of 1 shipment
Page 1/1
This is to speed up my service, since I have to separate these sets manually.
from PyPDF2 import PdfFileWriter, PdfFileReader
import re

output = PdfFileWriter()
input1 = PdfFileReader(open("pdf_teste.PDF", "rb"))
totalPages = input1.getNumPages()
print("total pages to process: " + str(totalPages))
for i in range(totalPages):
    p = i
    print("processing page %s" % str(i))
    output.addPage(input1.getPage(p))
    p = input1.getPage(p).extractText()  # extract text to search for identifier
    pr = re.search("Diretor", p)  # search for the identifier; to be replaced with a list
    # if there's a match, do work
    if pr:
        outputStream = open("test" + str(i) + ".pdf", "wb")
        output.write(outputStream)
        outputStream.close()
        print('match on page %s' % str(i))
        print('\n')
This code almost does what I want.
It splits off the first set correctly, but from the second onward each output file repeats the earlier sets along with the new one. I want a separate pdf for each set.
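A likely fix (a sketch based on that reading of the code, not a verified answer): the snippet keeps adding pages to the same PdfFileWriter, so every output accumulates all previous sets. Starting a fresh writer after each match gives one pdf per set:

from PyPDF2 import PdfFileWriter, PdfFileReader
import re

input1 = PdfFileReader(open("pdf_teste.PDF", "rb"))
output = PdfFileWriter()
set_number = 0
for i in range(input1.getNumPages()):
    page = input1.getPage(i)
    output.addPage(page)
    if re.search("Diretor", page.extractText()):  # identifier marking the last page of a set
        with open("set" + str(set_number) + ".pdf", "wb") as outputStream:
            output.write(outputStream)
        set_number += 1
        output = PdfFileWriter()  # reset so the next file holds only the next set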

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

Is it possible to exclude the contents of the footers and headers of a page in a pdf file while extracting the text from it? These contents are the least important and are almost redundant.
Note: for extracting the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.
How can the contents of the footers and headers be excluded in PyPDF2? Any help is appreciated.
The code snippet is as follows:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)
As PyPDF2 does not officially provide this feature, I've written a function of my own to exclude the headers and footers on a pdf page, and it works fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I check only the first and last elements of my text list.
You can run this function for each page.
import re

def remove_header_footer(pdf_extracted_text):
    page_format_pattern = r'([page]+[\d]+)'
    pdf_extracted_text = pdf_extracted_text.lower().split("\n")
    header = pdf_extracted_text[0].strip()
    footer = pdf_extracted_text[-1].strip()
    # drop the first line if it looks like a page header
    if re.search(page_format_pattern, header) or header.isnumeric():
        pdf_extracted_text = pdf_extracted_text[1:]
    # drop the last line if it looks like a page footer
    if re.search(page_format_pattern, footer) or footer.isnumeric():
        pdf_extracted_text = pdf_extracted_text[:-1]
    pdf_extracted_text = "\n".join(pdf_extracted_text)
    return pdf_extracted_text
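For example (a hypothetical usage sketch, assuming pdfReader is an open PyPDF2.PdfFileReader as in the question's snippet):

for page_number in range(pdfReader.numPages):
    raw_text = pdfReader.getPage(page_number).extractText()
    print(remove_header_footer(raw_text))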
Hope you find this helpful.
At the moment, PyPDF2 does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the pdf.
As a heuristic, you could search for duplicates at the top / bottom of the extracted text of pages. That would likely work well for long documents and not work at all for 1-page documents.
You need to consider that the first few pages might have no header, or a different header than the rest. Also, there can be differences between chapters and even/odd pages.
(side note: I'm the maintainer of PyPDF2 and I think this would be awesome to have)
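A sketch of that duplicate heuristic (my own illustration with hypothetical names; page_texts is assumed to be a list with one extracted-text string per page):

from collections import Counter

def strip_repeated_lines(page_texts, threshold=0.6):
    # count how often each first/last line recurs across pages
    pages = [t.splitlines() for t in page_texts]
    firsts = Counter(p[0].strip() for p in pages if p)
    lasts = Counter(p[-1].strip() for p in pages if p)
    cutoff = threshold * len(pages)
    cleaned = []
    for lines in pages:
        if lines and firsts[lines[0].strip()] >= cutoff:
            lines = lines[1:]  # first line repeats on most pages: likely a header
        if lines and lasts[lines[-1].strip()] >= cutoff:
            lines = lines[:-1]  # last line repeats on most pages: likely a footer
        cleaned.append("\n".join(lines))
    return cleaned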
