Read a Word document by pages using the docx2python package - python-3.x

How could I read a Word document by pages (I want to create a dictionary where the keys are the page numbers and the values are the strings corresponding to the pages: {"1": "content 1", "2": "content 2", ...}) with docx2python? If it is not possible with this package, which package could I use?
This is my code so far; it returns the whole Word document as a single string. Thank you.
!pip install docx2python

from docx2python import docx2python

def read_word(file_path):
    """
    Function that reads a Word file and returns a string
    """
    # Extract docx content, ignore images
    doc = docx2python(file_path, extract_image=False)
    # Get all the text in a single string
    output = doc.text
    return output
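As far as I know, the .docx format does not store fixed page boundaries (pagination is computed when the document is rendered), so docx2python cannot hand you the text page by page directly. If you can obtain a list of per-page strings by some other route (for example by converting the document to PDF first and extracting each page), building the dictionary you describe is straightforward; a minimal sketch, where pages is a hypothetical list of per-page strings:
pages = ["content 1", "content 2"]  # hypothetical: one string per page
pages_dict = {str(i): content for i, content in enumerate(pages, start=1)}
print(pages_dict)  # {'1': 'content 1', '2': 'content 2'}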

Related

How to read the data and the associated field name that is in a filled-in PDF form

I am writing a Python script that needs to pull the data filled into a PDF form as part of a larger script. I tried using PyPDF3, but while it can show me the strings in the form, it does not show the filled-in data. I have a form where I have entered the value 'XXX' into a field, and I want the script to return that data and the name of the field, but I can't seem to read the data. The fillpdfs module is very helpful, but as far as I can tell it can return the field names but not the data.
I have this snippet:
from PyPDF3 import PdfFileWriter, PdfFileReader

# Open the PDF file
pdf_file = open('filename.pdf', 'rb')
pdf_reader = PdfFileReader(pdf_file)

# Extract text data from each page
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    print('XXX' in page.extractText())  # check whether the value appears in the page text
There is a function for PDF forms:
dictionary = pdf_reader.getFormTextFields() # returns a python dictionary
print(dictionary)
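Assuming the form stores its values as ordinary AcroForm text fields, that dictionary maps field names to the filled-in values, so a sketch for locating the field that holds 'XXX' could look like this:
fields = pdf_reader.getFormTextFields()
for name, value in fields.items():
    if value == 'XXX':
        print(name, '->', value)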
Documentation

How to remove duplicate sentences from a paragraph using NLTK?

I have a huge document with many repeated sentences (footer text, hyperlinks with alphanumeric characters), and I need to get rid of those repeated hyperlinks and footer text. I have tried the code below but unfortunately couldn't succeed. Please review and help.
corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."
from nltk.tokenize import sent_tokenize

sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplciates found')
Error message for the above code:
AttributeError: 'str' object has no attribute 'sent_tokenize'
Desired Output:
Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']
Cleaned_corpus = {removed duplicates from corpus}
First of all, the example you provided is messed up: many sentences are missing the space between the final period and the next sentence, so I cleaned that up first.
Then you can do:
from nltk.tokenize import sent_tokenize

corpus = "......"
sentences = sent_tokenize(corpus)
duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
cleaned = list(set(sentences))
The above will lose the original order. If you care about the order, you can do the following to preserve it:
duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        if s in duplicates:
            continue
        else:
            duplicates.append(s)
    else:
        cleaned.append(s)
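If you also want the Cleaned_corpus from your desired output, you can join the de-duplicated sentences back together, for example:
cleaned_corpus = " ".join(cleaned)
print(duplicates)
print(cleaned_corpus)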

How to convert a string stored in a list into UTF-8?

I have tokenized the text from the text files stored in a list and stored the tokenized text in a variable, and when I print that variable it shows the wrong result.
import glob

files = glob.glob("D:\Pakistan Constitution\*.txt")
documents = []

for file in files:
    with open(file) as f:
        documents.append(f.read())

stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
I expect the tokenized words, but the result looks like this:
['ÿþp\x00a\x00r\x00t\x00', '\x00v\x00', '\x00', '\x00r\x00e\x00l\x00a\x00t\x00i\x00o\x00n\x00s\x00', '\x00b\x00e\x00t\x00w\x00e\x00e\x00n\x00',
Can anyone help me with this?
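The 'ÿþ' at the start of the output is a UTF-16 byte order mark, so the text files are almost certainly UTF-16 encoded and are being decoded with the wrong codec. A minimal sketch of the fix, assuming the files really are UTF-16 (check a couple of them first), is to pass the encoding explicitly when opening them:
for file in files:
    with open(file, encoding='utf-16') as f:
        documents.append(f.read())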

How to extract text from a PDF file using Python? I have never done this and am not getting the DOM of the PDF file

This is my PDF file: https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS
Please help me extract the text. Searching on SO, I found some clues about extracting text using PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, and TextStringObject, but I am not getting how to use them.
Please help me extract this with some explanation, so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you’re writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
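If NLTK has not been used on the machine before, the tokenizer and stopword list used below also need their data files; a one-time download (my assumption about your setup) looks like this:
import nltk
nltk.download('punkt')      # data used by word_tokenize
nltk.download('stopwords')  # data used by stopwords.words('english')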
Step 2: Read PDF File
# Write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'

# open allows you to read the file
pdfFileObj = open(filename, 'rb')

# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""

# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# This if statement exists to check if the above library returned words.
# It's done because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returns as False, we run the OCR library textract to
# convert scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')

# Now we have a text variable which contains all the text derived from our PDF file.
# Type print(text) to see what it contains. It likely contains a lot of spaces,
# possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)

# We'll create a new list which contains punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',']

# We initialize the stopwords variable, which is a list of words like
# "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')

# We create a list comprehension which only returns a list of words
# that are NOT IN stop_words and NOT IN punctuations
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)
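For example, a quick way to see which keywords dominate the document is to count them with the standard library; a small sketch:
from collections import Counter

counts = Counter(keywords)
print(counts.most_common(10))  # the ten most frequent keywords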

In Python, how can I get part of docx document?

I would like to get part of a docx document (for example, 10% of all content) with Python 3. How can I do this?
Thanks.
A good way to interact with .docx files in Python is the docx2txt module.
If you have pip installed you can open your terminal and run:
pip install docx2txt
Once you have the docx module you can run:
import docx2txt
You can then return the text in the document and filter only the parts you want. The contents of filename.docx are stored as a string in the variable text.
text = docx2txt.process("filename.docx")
print(text)
It is now possible to manipulate that string using some basic built-in functions. The code snippet below prints the results of text, returns the length using the len() function, and slices the string to about 10% by creating a substring.
len(text)
print(len(text))  # returns 1000 for my sample document
text = text[:100]
print(text)  # roughly the first 10% of the string
My full code for this example is below. I hope this is helpful!
import docx2txt

text = docx2txt.process("/home/jared/test.docx")
print(text)

len(text)
print(len(text))  # returns 1000 for my sample document
text = text[:100]
print(text)  # roughly the first 10% of the string
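If you want the slice to track the document length instead of hard-coding 100 characters, you could take the first tenth of the full string, for example:
full_text = docx2txt.process("/home/jared/test.docx")
ten_percent = full_text[:len(full_text) // 10]
print(ten_percent)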
I would try something like this:
from math import floor

def docx(file, percent):
    text = []
    lines = sum(1 for line in open(file))
    # print("File has {0} lines".format(lines))
    no = floor(lines * percent / 100)
    # print('Rounded to ', no)
    limit = 0
    with open(file) as f:
        for l in f:
            text.append(l)
            limit += 1
            if limit == no:
                break
    return text
To test it try:
print(docx('example.docx', 10))
