How to get an image (inlineshape) from paragraph python docx - python-3.x

I want to read a docx document paragraph by paragraph and, if a paragraph contains a picture (InlineShape), process it together with the text around it. The Document.inline_shapes property gives the list of all inline shapes in the document, but I want only the ones that appear in the current paragraph, if any.
An example of code:
from docx import Document
doc = Document("test.docx")
blip = doc.inline_shapes[0]._inline.graphic.graphicData.pic.blipFill.blip
rID = blip.embed
document_part = doc.part
image_part = document_part.related_parts[rID]
fr = open("test.png", "wb")
fr.write(image_part._blob)
fr.close()
(this is how I want to save these pictures)

Assuming your paragraph is par, you can use the following code to find the images in it:
import xml.etree.ElementTree as ET
def hasImage(par):
    """get all of the images in a paragraph
    :param par: a paragraph object from docx
    :return: a list of r:embed
    """
    ids = []
    root = ET.fromstring(par._p.xml)
    namespace = {
        'a': "http://schemas.openxmlformats.org/drawingml/2006/main",
        'r': "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
        'wp': "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
    inlines = root.findall('.//wp:inline', namespace)
    for inline in inlines:
        imgs = inline.findall('.//a:blip', namespace)
        for img in imgs:
            id = img.attrib['{{{0}}}embed'.format(namespace['r'])]
            ids.append(id)
    return ids
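To see what the lookup above does without opening a real document, here is a minimal standard-library sketch that runs the same search on a hand-written paragraph fragment. The function name embed_ids_from_xml and the SAMPLE string are hypothetical, and the sample XML is much simpler than what Word actually produces.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the answer above.
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
}


def embed_ids_from_xml(paragraph_xml):
    """Return the r:embed ids of every a:blip found inside a wp:inline element."""
    root = ET.fromstring(paragraph_xml)
    embed_attr = "{%s}embed" % NS["r"]
    return [
        blip.attrib[embed_attr]
        for inline in root.iter("{%s}inline" % NS["wp"])
        for blip in inline.iter("{%s}blip" % NS["a"])
        if embed_attr in blip.attrib
    ]


# Simplified, hand-written stand-in for the XML of a paragraph with one picture.
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:wp="%(wp)s" xmlns:a="%(a)s" xmlns:r="%(r)s">'
    "<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>"
    '<a:blip r:embed="rId5"/>'
    "</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>"
    "</w:p>"
) % NS

print(embed_ids_from_xml(SAMPLE))
```

With python-docx, the same function could be given par._p.xml for each paragraph in doc.paragraphs, and each returned id looked up with doc.part.related_parts[rID] as in the question.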

Related

Creating a python spellchecker using tkinter

For school, I need to create a spell checker, using python. I decided to do it using a GUI created with tkinter. I need to be able to input a text (.txt) file that will be checked, and a dictionary file, also a text file. The program needs to open both files, check the check file against the dictionary file, and then display any words that are misspelled.
Here's my code:
import tkinter as tk
from tkinter.filedialog import askopenfilename
def checkFile():
    # get the sequence of words from a file
    text = open(file_ent.get())
    dictDoc = open(dict_ent.get())
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w,0) + 1
    for k in dictDict:
        dictDoc.pop(k, None)
    misspell_lbl["text"] = dictDoc
# Set-up the window
window = tk.Tk()
window.title("Temperature Converter")
window.resizable(width=False, height=False)
# Setup Layout
frame_a = tk.Frame(master=window)
file_lbl = tk.Label(master=frame_a, text="File Name")
space_lbl = tk.Label(master=frame_a, width = 6)
dict_lbl =tk.Label(master=frame_a, text="Dictionary File")
file_lbl.pack(side=tk.LEFT)
space_lbl.pack(side=tk.LEFT)
dict_lbl.pack(side=tk.LEFT)
frame_b = tk.Frame(master=window)
file_ent = tk.Entry(master=frame_b, width=20)
dict_ent = tk.Entry(master=frame_b, width=20)
file_ent.pack(side=tk.LEFT)
dict_ent.pack(side=tk.LEFT)
check_btn = tk.Button(master=window, text="Spellcheck", command=checkFile)
frame_c = tk.Frame(master=window)
message_lbl = tk.Label(master=frame_c, text="Misspelled Words:")
misspell_lbl = tk.Label(master=frame_c, text="")
message_lbl.pack()
misspell_lbl.pack()
frame_a.pack()
frame_b.pack()
check_btn.pack()
frame_c.pack()
# Run the application
window.mainloop()
I want the file to check against the dictionary and display the misspelled words in the misspell_lbl.
The test files I'm using to make it work, and to submit with the assignment are here:
check file
dictionary file
I preloaded the files to the site that I'm submitting this on, so it should just be a matter of entering the file name and extension, not the entire path.
I'm pretty sure the problem is with my function to read and check the file. I've been beating my head against a wall trying to solve this, and I'm stuck. Any help would be greatly appreciated.
Thanks.
The first problem is with how you try to read the files. open(...) will return a _io.TextIOWrapper object, not a string and this is what causes your error. To get the text from the file, you need to use .read(), like this:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
The with open(...) as f part gives you a file object called f, and automatically closes the file when the block ends. This is a more concise version of
f = open(...)
text = f.read()
f.close()
f.read() will get the text from the file. For the dictionary I also added .splitlines() to turn the newline separated text into a list.
I couldn't really see where you'd tried to check for misspelled words, but you can do it with a list comprehension.
misspelled = [x for x in words if x not in dictDoc]
This gets every word which is not in the dictionary file and adds it to a list called misspelled. Altogether, the checkFile function now looks like this, and works as expected:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w,0) + 1
    misspelled = [x for x in words if x not in dictDoc]
    misspell_lbl["text"] = misspelled
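The checking step can also be pulled out of the GUI callback so it can be tried on plain strings. The following is only a sketch: the name find_misspelled is hypothetical, and lowercasing both sides, keeping apostrophes, and reporting each word once are assumptions that the original answer does not make.

```python
import string

# Punctuation to blank out; the apostrophe is kept so that words like "don't" survive.
PUNCTUATION = string.punctuation.replace("'", "")


def find_misspelled(text, dictionary_words):
    """Return the words in text that are missing from dictionary_words, in order, once each."""
    known = {word.lower() for word in dictionary_words}  # set gives fast membership tests
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    words = text.translate(table).lower().split()
    seen = set()
    misspelled = []
    for word in words:
        if word not in known and word not in seen:
            seen.add(word)
            misspelled.append(word)
    return misspelled


print(find_misspelled("The quick brwn fox, the lazi dog!",
                      ["the", "quick", "brown", "fox", "dog"]))
```

Inside checkFile, the result could be shown with misspell_lbl["text"] = "\n".join(find_misspelled(text, dictDoc)), which puts one word per line in the label.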

Writing from a list to a Word document

I have written code (shown below) that reads in a Word document and writes each "run", with its formatting, into a list, so that I can reuse some or all of the runs later. I have also written code to write those runs from the list to a new document. The issue is that I keep getting the error "'Run' object is not iterable". I don't know whether the problem is that paragraph information cannot be stored in a list and recalled this way, or whether I am writing it to the document the wrong way.
import tkinter as tk
from tkinter.filedialog import askopenfilename
from docx import Document # Invokes Document command from docx
def get_para_data(output_doc_name, paragraph):
    """
    Write the run to the new file and then set its font, bold, alignment, color etc. data.
    """
    output_run = []
    output_para = output_doc_name.add_paragraph()
    for run in paragraph.runs:
        if paragraph:
            output_run = output_para.add_run(run.text)
            # Run's bold data
            output_run.bold = run.bold
            # Run's italic data
            output_run.italic = run.italic
            # Run's underline data
            output_run.underline = run.underline
            # Run's color data
            output_run.font.color.rgb = run.font.color.rgb
            # Run's font data
            output_run.style.name = run.style.name
            # Paragraph's alignment data
            output_para.paragraph_format.alignment = paragraph.paragraph_format.alignment
        else:
            output_run = []
    return output_run
# IMPORT WORD DOCUMENT
root = tk.Tk()
root.withdraw()
# returns the file path as variable for future use.
doc_path = askopenfilename(title="Choose Word File")
# Imports Word Document to Modify.
document = Document(doc_path)
# Number of paragraphs in document.
t = len(document.paragraphs)
# Preallocation of list.
output_paragraph = [None]*t
result = Document()
# Begin loop to create list of paragraph data using function created above.
i = 0
for para in document.paragraphs:
    output_paragraph[i] = get_para_data(result, document.paragraphs[i])
# Write desired portion of document into a new document.
document_new = Document()
new_line = []
a = 0
for out in output_paragraph:
    # Check to Verify it is not a blank line/return.
    if output_paragraph:
        new_line[a] = document_new.add_paragraph(output_paragraph[a])
        a += 1
    # if it is a blank line/return write blank line return.
    else:
        document_new.add_paragraph(text='\\r', style=None)
        a += 1
I expected the text in the new document to be the same as in the original document, but in an order I choose: similar to copy and paste, but with the ability to choose which portions I "paste" and when.
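The error most likely comes from passing a Run object to add_paragraph(), which expects a string. One possible way around it, sketched below, is to store plain values for each run instead of Run objects, and to recreate the runs only when writing. The function names extract_runs and write_runs are hypothetical; the sketch relies only on the runs, text, bold, italic and underline attributes and the add_paragraph and add_run methods already used in the question, and it leaves out colour, style and alignment for brevity.

```python
def extract_runs(paragraph):
    """Return one plain dict per run, so the data can be kept in a list and reused."""
    return [
        {
            "text": run.text,
            "bold": run.bold,
            "italic": run.italic,
            "underline": run.underline,
        }
        for run in paragraph.runs
    ]


def write_runs(document, run_data):
    """Add a new paragraph to document and rebuild each stored run inside it."""
    paragraph = document.add_paragraph()
    for data in run_data:
        run = paragraph.add_run(data["text"])  # add_run receives a string, not a Run
        run.bold = data["bold"]
        run.italic = data["italic"]
        run.underline = data["underline"]
    return paragraph
```

With this layout, stored = [extract_runs(p) for p in document.paragraphs] gives a list that can be reordered freely, and write_runs(document_new, stored[k]) writes the chosen paragraph; a paragraph with no runs simply produces an empty paragraph.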

Retrieve exif data from thousands of images fast - optimising function

I've written a script that retrieves specific fields of exif data from thousands of images in a directory (including subdirectories) and saves the info to a csv file:
import os
from PIL import Image
from PIL.ExifTags import TAGS
import csv
from os.path import join
####SET THESE!###
imgpath = 'C:/x/y' #Path to folder of images
csvname = 'EXIF_data.csv' #Name of saved csv
###
def get_exif(fn):
    ret = {}
    i = Image.open(fn)
    info = i._getexif()
    for tag, value in info.items():
        decoded = TAGS.get(tag, tag)
        ret[decoded] = value
    return ret
exif_list = []
path_list = []
filename_list = []
DTO_list = []
MN_list = []
for root, dirs, files in os.walk(imgpath, topdown=True):
    for name in files:
        if name.endswith('.JPG'):
            pat = join(root, name)
            pat = pat.replace(os.sep,"/") #str.replace returns a new string, so keep the result
            exif = get_exif(pat)
            path_list.append(pat)
            filename_list.append(name)
            DTO_list.append(exif['DateTimeOriginal'])
            MN_list.append(exif['MakerNote'])
zipped = zip(path_list, filename_list, DTO_list, MN_list)
with open(csvname, "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(('Paths','Filenames','DateAndTime','MakerNotes'))
    for row in zipped:
        writer.writerow(row)
However, it is quite slow. I've attempted to optimise the script for performance and readability by using list and dictionary comprehensions.
import os
from os import walk #Necessary for recursive mode
from PIL import Image #Opens images and retrieves exif
from PIL.ExifTags import TAGS #Convert exif tags from digits to names
import csv #Write to csv
from os.path import join #Join directory and filename for path
####SET THESE!###
imgpath = 'C:/Users/au309263/Documents/imagesorting_testphotos/Finse/FINSE01' #Path to folder of images. The script searches subdirectories as well
csvname = 'PLC_Speedtest2.csv' #Name of saved csv
###
def get_exif(fn): #Defining a function that opens an image, retrieves the exif data, corrects the exif tags from digits to names and puts the data into a dictionary
    i = Image.open(fn)
    info = i._getexif()
    ret = {TAGS.get(tag, tag): value for tag, value in info.items()}
    return ret
Paths = [join(root, f).replace(os.sep,"/") for root, dirs, files in walk(imgpath, topdown=True) for f in files if f.endswith(('.JPG', '.jpg'))] #Creates list of paths for images. endswith takes a tuple of suffixes; '.JPG' or '.jpg' evaluates to just '.JPG'
Filenames = [f for root, dirs, files in walk(imgpath, topdown=True) for f in files if f.endswith(('.JPG', '.jpg'))] #Creates list of filenames for images
ExifData = list(map(get_exif, Paths)) #Runs the get_exif function on each of the images specified in the Paths list. List converts the map-object to a list.
MakerNotes = [i['MakerNote'] for i in ExifData] #Creates list of MakerNotes from exif data for images
DateAndTime = [i['DateTimeOriginal'] for i in ExifData] #Creates list of Date and Time from exif data for images
zipped = zip(Paths, Filenames, DateAndTime, MakerNotes) #Combines the four lists to be written into a csv.
with open(csvname, "w", newline='') as f: #Writes a csv-file with the exif data
    writer = csv.writer(f)
    writer.writerow(('Paths','Filenames','DateAndTime','MakerNotes'))
    for row in zipped:
        writer.writerow(row)
However, this has not changed the performance.
I've timed the specific regions of the code and found that specifically opening each image and getting the exif data from each image in the get_exif function is what takes time.
To make the script faster, I'm wondering:
1) Is it possible to optimise the performance of the function?
2) Is it possible to retrieve exif data without opening the image?
3) Is list(map(fn, x)) the fastest way of applying the function?
If I read the docs correctly, PIL.Image.open() not only extracts the EXIF data from the file but also reads and decodes the entire image, which is probably the bottleneck here. The first thing I would do is switch to a library or routine that works only on the EXIF data and ignores the image content. ExifRead or piexif might be worth a try.
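Separately from the choice of EXIF library, the second version of the script walks the directory tree twice, once for Paths and once for Filenames. A minimal sketch of collecting both in a single pass is shown below; the name find_jpgs is hypothetical, and matching the extension case-insensitively is an assumption. This is a readability tweak and is not expected to remove the bottleneck the question measured in get_exif.

```python
import os


def find_jpgs(imgpath):
    """Return (path, filename) pairs for every .jpg/.JPG file below imgpath, in one walk."""
    pairs = []
    for root, _dirs, files in os.walk(imgpath):
        for name in files:
            if name.lower().endswith(".jpg"):
                # Normalise separators to "/" as the original script does.
                path = os.path.join(root, name).replace(os.sep, "/")
                pairs.append((path, name))
    return pairs
```

The two lists used later in the script could then be rebuilt with Paths = [p for p, _ in pairs] and Filenames = [n for _, n in pairs], where pairs = find_jpgs(imgpath).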

Adding Logo in Header of Word document using python-docx

I want a logo to be added to the Word document every time I run the code.
Ideally the code would look like this:
from docx import Document
document = Document()
logo = open('logo.eps', 'r') #the logo path that is to be attached
document.add_heading('Underground Heating Oil Tank Search Report', 0) #simple heading that will come below the logo in the header.
document.save('report for xyz.docx') #saving the file
Is this possible in python-docx, or should I try some other library? If it is possible, please tell me how.
With the following code, you can create a two-column table in the header: the first cell holds the logo and the second cell holds the header text.
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
document = Document()
header = document.sections[0].header
htable=header.add_table(1, 2, Inches(6))
htab_cells=htable.rows[0].cells
ht0=htab_cells[0].add_paragraph()
kh=ht0.add_run()
kh.add_picture('logo.png', width=Inches(1))
ht1=htab_cells[1].add_paragraph('put your header text here')
ht1.alignment = WD_ALIGN_PARAGRAPH.RIGHT
document.save('yourdoc.docx')
A simpler way to include a logo and a header with some style (Heading 2 Char here):
from docx import Document
from docx.shared import Inches, Pt
doc = Document()
header = doc.sections[0].header
paragraph = header.paragraphs[0]
logo_run = paragraph.add_run()
logo_run.add_picture("logo.png", width=Inches(1))
text_run = paragraph.add_run()
text_run.text = '\t' + "My Awesome Header" # For center align of text
text_run.style = "Heading 2 Char"

No space between words while reading and extracting the text from a pdf file in python?

Hello Community Members,
I want to extract all the text from an e-book in .pdf format. I learned that Python has a package, PyPDF2, for this. I have managed to extract text, but the spacing between the extracted words is wrong; sometimes 2-3 words are merged into one.
Further, I want to extract the text from page 3 onward, as the initial pages deal with the cover page and preface. Also, I don't want to include the last 5 pages, as they contain the glossary and index.
Is there any other way to read an unencrypted .pdf file?
The code I have tried so far is as follows.
import PyPDF2
def Read():
    pdfFileObj = open('book1.pdf','rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    #discerning the number of pages will allow us to parse through all the pages
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while(count < num_pages):
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText().split()
    print(text)
Read()
This is a possible solution:
import PyPDF2
def Read(startPage, endPage):
    global text
    text = []
    cleanText = ""
    pdfFileObj = open('myTest2.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    print(text)
Read(0,0)
Read() parameters --> Read(first page to read, last page to read)
Note: Page numbers start from 0, not from 1 (as with array indices), so the first page is page 0.
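The question also asks to skip the first two pages and the last five. Since pages are zero-indexed, the page numbers to read can be worked out from the page count alone. The helper below is a sketch; the name pages_to_read and its default values are hypothetical and simply mirror the numbers in the question.

```python
def pages_to_read(num_pages, skip_front=2, skip_back=5):
    """Return the zero-based page indices left after dropping pages at both ends."""
    first = min(skip_front, num_pages)
    last = max(first, num_pages - skip_back)  # never let the range run backwards
    return range(first, last)


# A 10-page book: pages 0-1 are skipped at the front, pages 5-9 at the back.
print(list(pages_to_read(10)))
```

With the reader from the answer, the indices could be used as for n in pages_to_read(pdfReader.numPages): pageObj = pdfReader.getPage(n), or the first and last values passed to Read() as Read(2, pdfReader.numPages - 6) for a book with more than seven pages.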
