In Python, how can I get part of docx document? - python-3.x

I would like to get part of docx document ( for example, 10% of all content) with Python 3. How I can do this?
Thanks.

A good way to interact with .docx files in python is the docx2txt module.
If you have pip installed you can open your terminal and run:
pip install docx2txt
Once you have the docx module you can run:
import docx2txt
You can then return the text in the document and filter only the parts you want. The contents of filename.docx is stored as a string in the variable text.
text = docx2txt.process("filename.docx")
print(text)
It is now possible to manipulate that string using some basic built-functions. The code snippet below prints the results of text, returns the length using the len() function, and slices the string to about 10% by creating a substring.
len(text)
print(len(text)) # returns 1000 for my sample document
text = text[1:100]
print(text) # returns 10% of the string
My full code for this example is below. I hope this is helpful!
import docx2txt
text = docx2txt.process("/home/jared/test.docx")
print(text)
len(text)
print(len(text)) # returns 1000 for my sample document
text = text[1:100]
print(text) # returns 10% of the string

I would try something line this:
from math import floor
def docx(file, percent):
text = []
lines = sum(1 for line in open(file))
#print("File has {0} lines".format(lines))
no = floor((lines * percent / 100))
#print('Rounded to ', no)
limit = 0
with open(file) as f:
for l in f:
text.append(l)
limit += 1
if limit == no:
break
return text
To test it try:
print(docx('example.docx', 10))

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file then write the word and write 'present' next to it.
If the word is not present then write the word and write not_present next to it.
so far I can do this fine by printing to the console output as shown below:
import sys
filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'
# change to lower case
with open(filein,'r+') as fopen:
string = ""
for line in fopen.readlines():
string = string + line.lower()
with open(filein,'w') as fopen:
fopen.write(string)
# search and list
with open(compare) as f:
searcher = f.read()
if not searcher:
sys.exit("Could not read data :-(")
#search and output the results
with open(source) as f:
for item in (line.strip() for line in f):
if item in searcher:
print(item, ',present')
else:
print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
what I would like is to put this into a pandas dataframe, preferably 2 columns, one for the word and the second for its state . I cant seem to get my head around doing this.
I am making several assumptions here to include:
Compare.txt is a text file consisting of a list of single words 1 word per line.
Source.txt is a free flowing text file, which includes multiple words per line and each word is separated by a space.
When comparing to determine if a compare word is in source, is is found if and only if, no punctuation marks (i.e. " ' , . ?, etc) are appended to the word in source .
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict
compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)
def getCompareTxt(fid: str) -> list:
clist = []
with open(fid, 'r') as cmpFile:
for line in cmpFile.readlines():
clist.append(line.lower().strip('\n'))
return clist
cmpList = getCompareTxt(compare)
if cmpList:
with open(source, 'r') as fsrc:
items = []
for item in (line.strip().split(' ') for line in fsrc):
items.extend(item)
print(items)
for cmpItm in cmpList:
rslt['Name'].append(cmpItm)
if cmpItm in items:
rslt['State'].append('Present')
else:
rslt['State'].append('Not Present')
df = pd.DataFrame(rslt, index=range(len(cmpList)))
print(df)
else:
print('No compare data present')

Creating a python spellchecker using tkinter

For school, I need to create a spell checker, using python. I decided to do it using a GUI created with tkinter. I need to be able to input a text (.txt) file that will be checked, and a dictionary file, also a text file. The program needs to open both files, check the check file against the dictionary file, and then display any words that are misspelled.
Here's my code:
import tkinter as tk
from tkinter.filedialog import askopenfilename
def checkFile():
# get the sequence of words from a file
text = open(file_ent.get())
dictDoc = open(dict_ent.get())
for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()
# make a dictionary of the word counts
wordDict = {}
for w in words:
wordDict[w] = wordDict.get(w,0) + 1
for k in dictDict:
dictDoc.pop(k, None)
misspell_lbl["text"] = dictDoc
# Set-up the window
window = tk.Tk()
window.title("Temperature Converter")
window.resizable(width=False, height=False)
# Setup Layout
frame_a = tk.Frame(master=window)
file_lbl = tk.Label(master=frame_a, text="File Name")
space_lbl = tk.Label(master=frame_a, width = 6)
dict_lbl =tk.Label(master=frame_a, text="Dictionary File")
file_lbl.pack(side=tk.LEFT)
space_lbl.pack(side=tk.LEFT)
dict_lbl.pack(side=tk.LEFT)
frame_b = tk.Frame(master=window)
file_ent = tk.Entry(master=frame_b, width=20)
dict_ent = tk.Entry(master=frame_b, width=20)
file_ent.pack(side=tk.LEFT)
dict_ent.pack(side=tk.LEFT)
check_btn = tk.Button(master=window, text="Spellcheck", command=checkFile)
frame_c = tk.Frame(master=window)
message_lbl = tk.Label(master=frame_c, text="Misspelled Words:")
misspell_lbl = tk.Label(master=frame_c, text="")
message_lbl.pack()
misspell_lbl.pack()
frame_a.pack()
frame_b.pack()
check_btn.pack()
frame_c.pack()
# Run the application
window.mainloop()
I want the file to check against the dictionary and display the misspelled words in the misspell_lbl.
The test files I'm using to make it work, and to submit with the assignment are here:
check file
dictionary file
I preloaded the files to the site that I'm submitting this on, so it should just be a matter of entering the file name and extension, not the entire path.
I'm pretty sure the problem is with my function to read and check the file, I've been beating my head on a wall trying to solve this, and I'm stuck. Any help would be greatly appreciated.
Thanks.
The first problem is with how you try to read the files. open(...) will return a _io.TextIOWrapper object, not a string and this is what causes your error. To get the text from the file, you need to use .read(), like this:
def checkFile():
# get the sequence of words from a file
with open(file_ent.get()) as f:
text = f.read()
with open(dict_ent.get()) as f:
dictDoc = f.read().splitlines()
The with open(...) as f part gives you a file object called f, and automatically closes the file when it's done. This is more concise version of
f = open(...)
text = f.read()
f.close()
f.read() will get the text from the file. For the dictionary I also added .splitlines() to turn the newline separated text into a list.
I couldn't really see where you'd tried to check for misspelled words, but you can do it with a list comprehension.
misspelled = [x for x in words if x not in dictDoc]
This gets every word which is not in the dictionary file and adds it to a list called misspelled. Altogether, the checkFile function now looks like this, and works as expected:
def checkFile():
# get the sequence of words from a file
with open(file_ent.get()) as f:
text = f.read()
with open(dict_ent.get()) as f:
dictDoc = f.read().splitlines()
for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()
# make a dictionary of the word counts
wordDict = {}
for w in words:
wordDict[w] = wordDict.get(w,0) + 1
misspelled = [x for x in words if x not in dictDoc]
misspell_lbl["text"] = misspelled

How do I perform a regular expression on multiple .txt files in a folder (Python)?

I'm trying to open up 32 .txt files, extract some text from them (using RegEx) and then save them as individual files again(later on in the project I'm hoping to collate them together). I've tested the RegEx on a single file and it seems to work:
import os
import re
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')
with open('1988.txt') as txtfile:
text= txtfile.read()
#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
print('Found the end of %s documents.' % len(docs))
docs = re.findall(end, text)
regex = start+r'(.+?)'+end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to perform the exact same thing on all my .txt files in that folder, but I can't figure out how to. I've been playing around with For loops but with little success. Currently I have this:
import os
import re
finished_years_proper= os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))
if i.endswith(".txt"):
with open(finished_years_proper + i, 'r') as all_years:
for line in all_years:
start = r'Body\n\n\n'
docs = re.findall(start, all_years)
end = r'Load-Date:'
docs = re.findall(end, all_years)
regex = start+r'(.+?)'+end
articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After taking Benedictanjw's into my codes I've ended up with this:
Hi, this is what I ended up with:
all_years= []
for fyp in finished_years_proper: #fyp is each text file in folder
with open(fyp, 'r') as year:
for line in year: #line is each element in each text file in folder
start = r'Body\n\n\n'
docs = re.findall(start, line)
end = r'Load-Date:'
docs = re.findall(end, line)
regex = start+r'(.+?)'+end
articles = re.findall(regex, line, re.S)
all_years.append(articles) #append strings to reflect RegEx
parsed_documents= all_years.append(articles)
print(parsed_documents) #returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file is successful (as in it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without returning None. Many thanks in advance!!
The problem shows because finished_years_proper is a list and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate i with that list. I presume you had accidentally defined i elsewhere as a string. I guess you probably want to do something like:
all_years = []
for fyp in finished_years_proper:
with open(fyp, 'r') as year:
for line in year:
... # your regex search on year
all_years.append(xxx)

how to extract text from PDF file using python , i never did this and not getting the DOM of PDF file

this is my PDF file "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Help me someone to extract this , as i search on SO getting some clue to extract text using these libries PyPDF2, PyPDF2.pdf , PageObject, u_, ContentStream, b_, TextStringObject ,but not getting how to use it.
someone please help me to extract this with some explanation, so i can understand the code and tell me how to read DOM of PDF file.
you need to install some libaries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require t0 parsePDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you’re writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
#write a for-loop to open many files -- leave a comment if you'd #like to learn how
filename = 'enter the name of the file here'
#open allows you to read the file
pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all #the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
#This if statement exists to check if the above library returned #words. It's done because PyPDF2 cannot read scanned files.
if text != "":
text = text
#If the above returns as False, we run the OCR library textract to #convert scanned/image based PDF files into text
else:
text = textract.process(fileurl, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived #from our PDF file. Type print(text) to see what it contains. It #likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
#The word_tokenize() function will break our text phrases into #individual words
tokens = word_tokenize(text)
#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
#We initialize the stopwords variable which is a list of words like #"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
#We create a list comprehension which only returns a list of words #that are NOT IN stop_words and NOT IN punctuations.
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)

No space between words while reading and extracting the text from a pdf file in python?

Hello Community Members,
I want to extract all the text from an e-book with .pdf as the file extension. I came to know that python has a package PyPDF2 to do the necessary action. Somehow, I have tried and able to extract text but it results in inappropriate space between the extracted words, sometimes the results is the result of 2-3 merged words.
Further, I want to extract the text from page 3 onward, as the initial pages deals with the cover page and preface. Also, I don't want to include the last 5 pages as it contains the glossary and index.
Does there exist any other way to read a .pdf binary file with NO ENCRYPTION?
The code snippet, whatever I have tried up to now is as follows.
import PyPDF2
def Read():
pdfFileObj = open('book1.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all #the pages
num_pages = pdfReader.numPages
count = 0
global text
text = []
while(count < num_pages):
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText().split()
print(text)
Read()
This is a possible solution:
import PyPDF2
def Read(startPage, endPage):
global text
text = []
cleanText = ""
pdfFileObj = open('myTest2.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
while startPage <= endPage:
pageObj = pdfReader.getPage(startPage)
text += pageObj.extractText()
startPage += 1
pdfFileObj.close()
for myWord in text:
if myWord != '\n':
cleanText += myWord
text = cleanText.split()
print(text)
Read(0,0)
Read() parameters --> Read(first page to read, last page to read)
Note: To read the first page starts from 0 not from 1 (as for example in an array).

Resources