Writing from a list to a Word document - python-3.x

I have written code (shown below) that reads in a Word document and writes each "run", with its formatting, into a list, so that I can use some or all of the runs later. I have also written code to write the runs from that list to a new document. The issue is that I keep getting the error "'Run' object is not iterable". I don't know whether the problem is that paragraph information cannot be stored in a list and recalled this way, or that I am writing it to the document the wrong way.
import tkinter as tk
from tkinter.filedialog import askopenfilename
from docx import Document # Invokes Document command from docx
def get_para_data(output_doc_name, paragraph):
    """
    Write the run to the new file and then set its font, bold, alignment, color etc. data.
    """
    output_run = []
    output_para = output_doc_name.add_paragraph()
    for run in paragraph.runs:
        if paragraph:
            output_run = output_para.add_run(run.text)
            # Run's bold data
            output_run.bold = run.bold
            # Run's italic data
            output_run.italic = run.italic
            # Run's underline data
            output_run.underline = run.underline
            # Run's color data
            output_run.font.color.rgb = run.font.color.rgb
            # Run's font data
            output_run.style.name = run.style.name
            # Paragraph's alignment data
            output_para.paragraph_format.alignment = paragraph.paragraph_format.alignment
        else:
            output_run = []
    return output_run
# IMPORT WORD DOCUMENT
root = tk.Tk()
root.withdraw()
# returns the file path as variable for future use.
doc_path = askopenfilename(title="Choose Word File")
# Imports Word Document to Modify.
document = Document(doc_path)
# Number of paragraphs in document.
t = len(document.paragraphs)
# Preallocation of list.
output_paragraph = [None]*t
result = Document()
# Begin loop to create list of paragraph data using function created above.
i = 0
for para in document.paragraphs:
    output_paragraph[i] = get_para_data(result, document.paragraphs[i])
# Write desired portion of document into a new document.
document_new = Document()
new_line = []
a = 0
for out in output_paragraph:
    # Check to Verify it is not a blank line/return.
    if output_paragraph:
        new_line[a] = document_new.add_paragraph(output_paragraph[a])
        a += 1
    # if it is a blank line/return write blank line return.
    else:
        document_new.add_paragraph(text='\\r', style=None)
        a += 1
I expected the text in the new document to be the same as in the original document, but in an order I choose. It is similar to copy and paste, except that I want to be able to choose which portions I "paste" and when.

Related

Creating a python spellchecker using tkinter

For school, I need to create a spell checker, using python. I decided to do it using a GUI created with tkinter. I need to be able to input a text (.txt) file that will be checked, and a dictionary file, also a text file. The program needs to open both files, check the check file against the dictionary file, and then display any words that are misspelled.
Here's my code:
import tkinter as tk
from tkinter.filedialog import askopenfilename
def checkFile():
    # get the sequence of words from a file
    text = open(file_ent.get())
    dictDoc = open(dict_ent.get())
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w,0) + 1
    for k in dictDict:
        dictDoc.pop(k, None)
    misspell_lbl["text"] = dictDoc
# Set-up the window
window = tk.Tk()
window.title("Temperature Converter")
window.resizable(width=False, height=False)
# Setup Layout
frame_a = tk.Frame(master=window)
file_lbl = tk.Label(master=frame_a, text="File Name")
space_lbl = tk.Label(master=frame_a, width = 6)
dict_lbl =tk.Label(master=frame_a, text="Dictionary File")
file_lbl.pack(side=tk.LEFT)
space_lbl.pack(side=tk.LEFT)
dict_lbl.pack(side=tk.LEFT)
frame_b = tk.Frame(master=window)
file_ent = tk.Entry(master=frame_b, width=20)
dict_ent = tk.Entry(master=frame_b, width=20)
file_ent.pack(side=tk.LEFT)
dict_ent.pack(side=tk.LEFT)
check_btn = tk.Button(master=window, text="Spellcheck", command=checkFile)
frame_c = tk.Frame(master=window)
message_lbl = tk.Label(master=frame_c, text="Misspelled Words:")
misspell_lbl = tk.Label(master=frame_c, text="")
message_lbl.pack()
misspell_lbl.pack()
frame_a.pack()
frame_b.pack()
check_btn.pack()
frame_c.pack()
# Run the application
window.mainloop()
I want the file to check against the dictionary and display the misspelled words in the misspell_lbl.
The test files I'm using to make it work, and to submit with the assignment are here:
check file
dictionary file
I preloaded the files to the site that I'm submitting this on, so it should just be a matter of entering the file name and extension, not the entire path.
I'm pretty sure the problem is with my function to read and check the file. I've been beating my head against a wall trying to solve this, and I'm stuck. Any help would be greatly appreciated.
Thanks.
The first problem is with how you try to read the files. open(...) returns an _io.TextIOWrapper object, not a string, and this is what causes your error. To get the text from the file, you need to use .read(), like this:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
The with open(...) as f part gives you a file object called f, and automatically closes the file when the block ends. This is a more concise version of
f = open(...)
text = f.read()
f.close()
f.read() will get the text from the file. For the dictionary I also added .splitlines() to turn the newline separated text into a list.
I couldn't really see where you'd tried to check for misspelled words, but you can do it with a list comprehension.
misspelled = [x for x in words if x not in dictDoc]
This gets every word which is not in the dictionary file and adds it to a list called misspelled. Altogether, the checkFile function now looks like this, and works as expected:
def checkFile():
    # get the sequence of words from a file
    with open(file_ent.get()) as f:
        text = f.read()
    with open(dict_ent.get()) as f:
        dictDoc = f.read().splitlines()
    for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    # make a dictionary of the word counts
    wordDict = {}
    for w in words:
        wordDict[w] = wordDict.get(w,0) + 1
    misspelled = [x for x in words if x not in dictDoc]
    misspell_lbl["text"] = misspelled
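The checking step can also be separated from the GUI so it is easier to test on its own. The following is only a sketch under a few assumptions that are not in the original answer: the helper name find_misspelled is made up for illustration, the comparison is case-insensitive, repeated misspellings are reported once, and the dictionary is stored in a set so that each lookup is fast.

```python
import string


def find_misspelled(text, dictionary_words):
    """Return words from text that are not in dictionary_words, in first-seen order.

    Hypothetical helper: the name and the case-insensitive matching are
    assumptions made for this sketch, not part of the original answer.
    """
    # A set gives constant-time membership tests, unlike a list.
    known = {w.strip().lower() for w in dictionary_words if w.strip()}
    # Replace every punctuation character with a space in one pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    misspelled = []
    seen = set()
    for word in text.translate(table).lower().split():
        if word not in known and word not in seen:
            seen.add(word)
            misspelled.append(word)
    return misspelled


dictionary = ["the", "quick", "brown", "fox", "lazy", "dog"]
print(find_misspelled("The quick brwn fox, the lazy dgo!", dictionary))
# ['brwn', 'dgo']
```

Inside checkFile, the result could be joined with ", ".join(...) before it is assigned to misspell_lbl["text"]. Note that apostrophes are treated as punctuation here, so a word such as "don't" would be split into two pieces.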

How to remove lines from a file starting with a specific word python3

I am doing this as an assignment. So, I need to read a file and remove lines that start with a specific word.
fajl = input("File name:")
rec = input("Word:")
def delete_lines(fajl, rec):
    with open(fajl) as file:
        text = file.readlines()
        print(text)
        for word in text:
            words = word.split(' ')
            first_word = words[0]
            for first in word:
                if first[0] == rec:
                    text = text.pop(rec)
                    return text
        print(text)
        return text
delete_lines(fajl, rec)
In the last for loop, I completely lost track of what I am doing. Firstly, I can't use pop. So, once I locate the word, I need to somehow delete the lines that start with that word. There is also a minor problem with my approach: first_word gives me the first word, but with a trailing comma attached if one is present.
Example text from a file(file.txt):
This is some text on one line.
The text is irrelevant.
This would be some specific stuff.
However, it is not.
This is just nonsense.
rec = input("Word:") --- This
Output:
The text is irrelevant.
However, it is not.
You should not modify a list while you are iterating over it, because removing items makes the loop skip elements. But you can iterate over a copy and modify the original:
fajl = input("File name:")
rec = input("Word:")
def delete_lines(fajl, rec):
    with open(fajl) as file:
        text = file.readlines()
    print(text)
    # let's iterate over a copy to modify
    # the original one without restrictions
    for word in text[:]:
        # compare in lowercase to remove both "This" and "this"
        if word.lower().startswith(rec.lower()):
            # Remove the line
            text.remove(word)
    newtext = "".join(text)  # join all the text
    print(newtext)  # to see the results in console
    # we should now save the file to see the results there
    with open(fajl, "w") as file:
        file.write(newtext)
# The function prints its result itself and returns nothing,
# so there is no need to wrap the call in print().
delete_lines(fajl, rec)
Tested with your sample text, removing the lines that start with "this". Because startswith only compares the beginning of the line, it matches "this" and "this," alike (and also any longer word that begins with the same letters). Only the matching lines are deleted; blank lines are left alone. If you don't want them either, you can also compare each line with "\n" and remove those.
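If the match should apply to the whole first word rather than to a prefix, one option is to build a new list instead of removing items from the old one. This is a sketch, not the answerer's method: the function name remove_lines_starting_with is hypothetical, and stripping punctuation from the first word is an assumption made to handle the "This," case mentioned in the question.

```python
import string


def remove_lines_starting_with(lines, word):
    """Return the lines whose first word is not `word` (case-insensitive).

    Hypothetical helper for illustration; it works on a list of lines
    such as the one returned by file.readlines().
    """
    target = word.lower()
    kept = []
    for line in lines:
        parts = line.split()
        # Strip surrounding punctuation so "This," and "This" compare equal.
        first = parts[0].strip(string.punctuation).lower() if parts else ""
        if first != target:
            kept.append(line)
    return kept


sample = [
    "This is some text on one line.\n",
    "The text is irrelevant.\n",
    "This would be some specific stuff.\n",
    "However, it is not.\n",
    "This is just nonsense.\n",
]
print("".join(remove_lines_starting_with(sample, "This")))
```

With the sample text from the question this keeps only "The text is irrelevant." and "However, it is not.". Blank lines are kept, and a line starting with a longer word such as "Thistle" is not removed.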

Python/PyPDF4: How do I specify the /PageLabels in the created PDF?

I am using PyPDF4 to create an offline-readable version of the journal "Nature".
I use PyPDF4's PdfFileReader to read the individual article PDFs and PdfFileWriter to create a single, merged output.
The problem that I am trying to solve is that the page numbers of some issues do not start at 1, for example, issue 7805 starts with page 563.
How do I specify the desired /PageLabels in the document catalog?
for pdf_file in pdf_files:
    input_pdf = PdfFileReader(open(pdf_file, 'rb'))
    page_indices = file_page_dictionary[pdf_file]
    for page_index in page_indices:
        page = input_pdf.getPage(page_index)
        # Specify actual page number here:
        # page.setPageNumber(actual_page_numbers[page_index])
        output.addPage(page)
with open(pdf_output_name, 'wb') as f:
    output.write(f)
After exploring the PDF standard and a bit of hacking, I found that the following function adds a single PageLabels entry that creates page labels starting from offset (i.e. the first page will be labelled offset, the second page offset+1, etc.).
# output_pdf is an instance of PdfFileWriter().
# offset is the desired page offset.
def add_pagelabels(output_pdf, offset):
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"):PDF.NameObject("/D")})
    number_type.update({PDF.NameObject("/St"):PDF.NumberObject(offset)})
    nums_array = PDF.ArrayObject()
    nums_array.append(PDF.NumberObject(0)) # physical page index
    nums_array.append(number_type)
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"):nums_array})
    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    root_obj = output_pdf._root_object
    root_obj.update(page_labels)
Additional page label entries can be created (i.e. with different offsets or different numbering styles).
Note that the first PDF page has an index of 0.
# Use PyPDF to manipulate pages
from PyPDF4 import PdfFileWriter, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
def pdf_pagelabels_roman():
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"):PDF.NameObject("/r")})
    return number_type
def pdf_pagelabels_decimal():
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"):PDF.NameObject("/D")})
    return number_type
def pdf_pagelabels_decimal_with_offset(offset):
    number_type = pdf_pagelabels_decimal()
    number_type.update({PDF.NameObject("/St"):PDF.NumberObject(offset)})
    return number_type
...
nums_array = PDF.ArrayObject()
# Each entry consists of a starting page index followed by a page label...
nums_array.append(PDF.NumberObject(0)) # Page index 0:
nums_array.append(pdf_pagelabels_roman()) # Roman numerals
# Each entry consists of a starting page index followed by a page label...
nums_array.append(PDF.NumberObject(1)) # Page indices 1 to 9:
nums_array.append(pdf_pagelabels_decimal_with_offset(first_offset)) # Decimal numbers, with offset
# Each entry consists of a starting page index followed by a page label...
nums_array.append(PDF.NumberObject(10)) # Page index 10 onwards:
nums_array.append(pdf_pagelabels_decimal_with_offset(second_offset))
page_numbers = PDF.DictionaryObject()
page_numbers.update({PDF.NameObject("/Nums"):nums_array})
page_labels = PDF.DictionaryObject()
page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
root_obj = output._root_object
root_obj.update(page_labels)
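To make the effect of several /Nums entries easier to see, the label that a viewer would display for a given page index can be modelled in plain Python. This is only an illustrative sketch and does not use PyPDF4: the names to_roman and page_label and the tuple layout are hypothetical, the second offset 601 is a made-up value, and only the two styles used above ("D" and "r") are handled.

```python
def to_roman(n):
    """Return the lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(page_index, ranges):
    """Model the label shown for a zero-based page index.

    ranges is a list of (start_index, style, first_number) tuples sorted
    by start_index, mirroring the index / label-dictionary pairs in /Nums.
    style is "D" (decimal) or "r" (lowercase roman), as in the /S entry;
    first_number plays the role of /St.
    """
    current = None
    for entry in ranges:
        if entry[0] <= page_index:
            current = entry  # the last range starting at or before this page
        else:
            break
    if current is None:
        raise ValueError("no label range covers this page index")
    start, style, first_number = current
    number = first_number + (page_index - start)
    return to_roman(number) if style == "r" else str(number)


# Index 0 in roman numerals, indices 1-9 starting at 563, index 10 onwards at 601.
ranges = [(0, "r", 1), (1, "D", 563), (10, "D", 601)]
print([page_label(i, ranges) for i in (0, 1, 2, 9, 10, 11)])
# ['i', '563', '564', '571', '601', '602']
```

Each range applies from its starting index until the next range begins, which is why only three entries are needed to label every page.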

Correctly substitute values for working with ImageGrab

How can I change the format of a string so that ImageGrab accepts it?
My task is to get the coordinates for the container from the file and paste their values into the box.
There is only one line in the file and it has the format: 335,50,467,70.
If I substitute these values directly, and not through a variable, the script works perfectly. But it refuses to take the values from the file.
What do I do?
from PIL import ImageGrab
with open(r"C:\Users\admin\Desktop\area.txt", "r") as file:
    lines = file.readline()
print(lines)
box = (lines)
#box = (335,50,467,70) # If so, then everything works perfectly
print(box)
img = ImageGrab.grab(box)
#img.show()
To expand on my comment a little, you have a string: lines = "335,50,467,70"
You can split the string by separating across the commas to give a list of strings, like so:
box = lines.split(',')
box
>>> ["335", "50", "467", "70"]
Then, you can iterate over the list and cast each item to an int like so:
box = tuple(int(item) for item in lines.split(','))
box
>>> (335, 50, 467, 70)
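The same conversion can be wrapped in a small function that also checks the number of values before the tuple is passed to ImageGrab.grab. This is a sketch; the name parse_box and the check for exactly four values are assumptions added for illustration.

```python
def parse_box(line):
    """Turn a line such as "335,50,467,70" into a 4-tuple of ints.

    Hypothetical helper: raises ValueError if the line does not hold
    exactly four comma-separated integers.
    """
    # strip() drops the trailing newline that readline() leaves in place.
    parts = line.strip().split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    return tuple(int(part) for part in parts)


print(parse_box("335,50,467,70\n"))
# (335, 50, 467, 70)
```

In the script from the question, this would be used as box = parse_box(lines) before the call to ImageGrab.grab(box).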

How to get an image (inlineshape) from paragraph python docx

I want to read the docx document paragraph by paragraph and if there is a picture (InlineShape), then process it with the text around it. The function Document.inline_shapes will give the list of all inline shapes in the document. But I want to get the one, that appears exactly in the current paragraph if exists...
An example of code:
from docx import Document
doc = Document("test.docx")
blip = doc.inline_shapes[0]._inline.graphic.graphicData.pic.blipFill.blip
rID = blip.embed
document_part = doc.part
image_part = document_part.related_parts[rID]
fr = open("test.png", "wb")
fr.write(image_part._blob)
fr.close()
(this is how I want to save these pictures)
Assuming your paragraph is par, you can use the following code to find the images:
import xml.etree.ElementTree as ET
def hasImage(par):
    """get all of the images in a paragraph
    :param par: a paragraph object from docx
    :return: a list of r:embed
    """
    ids = []
    root = ET.fromstring(par._p.xml)
    namespace = {
        'a':"http://schemas.openxmlformats.org/drawingml/2006/main",
        'r':"http://schemas.openxmlformats.org/officeDocument/2006/relationships",
        'wp':"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
    inlines = root.findall('.//wp:inline',namespace)
    for inline in inlines:
        imgs = inline.findall('.//a:blip', namespace)
        for img in imgs:
            id = img.attrib['{{{0}}}embed'.format(namespace['r'])]
            ids.append(id)
    return ids
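The namespace lookup can be tried without a .docx file by running it on an XML string. The sketch below is an illustration only: the function name embed_ids is hypothetical, and SAMPLE is a hand-written, heavily simplified stand-in for the XML of a real paragraph (a real one contains more nested elements around a:blip).

```python
import xml.etree.ElementTree as ET

NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
}


def embed_ids(paragraph_xml):
    """Return the r:embed values of all a:blip elements inside wp:inline elements."""
    root = ET.fromstring(paragraph_xml)
    # ElementTree stores namespaced names as "{namespace-uri}localname".
    embed_key = "{%s}embed" % NS["r"]
    ids = []
    for inline in root.iter("{%s}inline" % NS["wp"]):
        for blip in inline.iter("{%s}blip" % NS["a"]):
            if embed_key in blip.attrib:
                ids.append(blip.attrib[embed_key])
    return ids


# Simplified, hand-written paragraph XML (an assumption for this example).
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:wp="%(wp)s" xmlns:a="%(a)s" xmlns:r="%(r)s">'
    '<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>'
    '<a:blip r:embed="rId4"/>'
    '</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>'
    '</w:p>'
) % NS

print(embed_ids(SAMPLE))
# ['rId4']
```

Each returned id can then be looked up in doc.part.related_parts, as in the code from the question, to get the image data for that paragraph.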
