I'm a noob PyQt5 user following a tutorial and I'm confused how I might extend the sample code below.
The two Qt5 handler methods, canInsertFromMimeData and insertFromMimeData, accept an image dragged and dropped onto the document (that works great). Both take a parameter, source, which is a QMimeData object.
However, if I try to paste an image copied from the Windows clipboard into the document, it just crashes, as there is no handler for this.
Searching the Qt5 documentation just leads me to further confusion, as I'm not a C++ programmer and I'm using Python 3.x and PyQt5 to do this.
How would I write a handler to allow an image copied to the clipboard to be pasted into the document directly?
class TextEdit(QTextEdit):

    def canInsertFromMimeData(self, source):
        if source.hasImage():
            return True
        return super(TextEdit, self).canInsertFromMimeData(source)

    def insertFromMimeData(self, source):
        cursor = self.textCursor()
        document = self.document()
        if source.hasUrls():
            for u in source.urls():
                file_ext = splitext(str(u.toLocalFile()))
                if u.isLocalFile() and file_ext in IMAGE_EXTENSIONS:
                    image = QImage(u.toLocalFile())
                    document.addResource(QTextDocument.ImageResource, u, image)
                else:
                    # If we hit a non-image or non-local URL break the loop and fall out
                    # to the super call & let Qt handle it
                    break
            else:
                # If all were valid images, finish here.
                return
        elif source.hasImage():
            image = source.imageData()
            uuid = hexuuid()
            document.addResource(QTextDocument.ImageResource, uuid, image)
        super(TextEdit, self).insertFromMimeData(source)
I was in exactly the same position as you. I am also new to Python, so there might be mistakes.
Passing the plain uuid string in document.addResource(QTextDocument.ImageResource, uuid, image) does not work. The resource name should be a URL, i.e. QUrl(uuid).
With that change you can insert the image. However, because the name given to an image from the clipboard changes every time, it would be better to use a stable path, for example one in the directory where you are also saving the files.
Also be aware that the user has to select the file type when saving (.html).
For my own project I am going to print the file as a PDF. That way you don't have to worry about paths to images ^-^

I got around this by embedding the images inline as base64, so there are no separate resource files and everything is in one file.
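The base64 approach mentioned above can be sketched with the standard library alone. The helper name image_to_img_tag and the PNG default are assumptions made for illustration, not part of the original code. With a QImage you would first have to serialise it to bytes (for example by saving it into a QBuffer), which is not shown here.

```python
import base64


def image_to_img_tag(image_bytes, mime_type="image/png"):
    """Return an HTML <img> tag with the image embedded as a base64 data URI.

    Hypothetical helper: the name and the default MIME type are illustrative.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return '<img src="data:{};base64,{}" />'.format(mime_type, encoded)


# Placeholder bytes stand in for real image data.
tag = image_to_img_tag(b"\x89PNG")
print(tag)
```

Because the image data travels inside the HTML itself, the saved document does not depend on any resource name or external file path.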


How to keep the share properties of an excel with python openpyxl?

I am having trouble keeping the sharing properties of an Excel file. I tried the approach from "Python and openpyxl is saving my shared workbook as unshared", but the zout part just cancels all the modifications I made with the script.
To explain the problem:
There's an Excel file that is shared, in which people can make modifications
Python reads from and writes to it
When I save the workbook to the Excel file, it either drops the sharing property automatically or, when I try to keep it, none of my modifications are saved
Can someone help me, please?
I'll get a little more precise, as requested.
The sharing mode is the one Microsoft provides. You can see the button below:
Share button Excel
The Excel file is stored on a server. Several users can write to it at the same time, but when I launch my script, it automatically turns off the sharing property, so everyone who is writing to it can no longer make modifications and every modification they made is lost.
First I treated my Excel file normally:
ws = DLT['DLT']
# ... modifications on ws ...
But then I tried this (Python and openpyxl is saving my shared workbook as unshared)
zin = zipfile.ZipFile(myPath, 'r')
buffers = []
for item in zin.infolist():
    buffers.append((item, zin.read(item.filename)))
zin.close()
# ... modifications on ws ...
zout = zipfile.ZipFile(myPath, 'w')
for item, buffer in buffers:
    zout.writestr(item, buffer)
zout.close()
The second one just doesn't save my modifications on ws.
What I would like is to not lose the sharing property: I need to keep it while I write to the file. I'm not sure whether that is possible. My one alternative is to use another file and copy/paste the new data by hand from that file into the DLT one.
Well... after playing with it back and forth, it turns out that zipfile.infolist() also contains the sheet data, so here's my way to fine-tune it, using the shared_pyxl_save example from the previous answer.
Basically, instead of letting the old file's contents override the sheet's data, take the sheet data from the newly saved file.
import zipfile

def shared_pyxl_save(file_path, workbook):
    """
    `file_path`: path to the shared file you want to save
    `workbook`: the object returned by openpyxl.load_workbook()
    """
    zin = zipfile.ZipFile(file_path, 'r')
    buffers = []
    for item in zin.infolist():
        if "sheet1.xml" not in item.filename:
            buffers.append((item, zin.read(item.filename)))
    zin.close()
    # let openpyxl write the modified workbook
    workbook.save(file_path)
    # loop through again to find the sheet1.xmls and put it into buffer, else will show up error
    zin2 = zipfile.ZipFile(file_path, 'r')
    for item in zin2.infolist():
        if "sheet1.xml" in item.filename:
            buffers.append((item, zin2.read(item.filename)))
    zin2.close()
    # finally saves the file
    zout = zipfile.ZipFile(file_path, 'w')
    for item, buffer in buffers:
        zout.writestr(item, buffer)
    zout.close()
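The underlying trick (rewrite the archive, copying every entry unchanged except the one you want to swap) can be tried on a throwaway zip without openpyxl. This is only a sketch: replace_zip_member and the member names in the demo are made up for illustration, and the demo archive is not a real .xlsx file.

```python
import os
import tempfile
import zipfile


def replace_zip_member(zip_path, member_name, new_bytes):
    """Rewrite zip_path so member_name holds new_bytes; other entries are copied as-is."""
    with zipfile.ZipFile(zip_path, "r") as zin:
        entries = [(info, zin.read(info.filename)) for info in zin.infolist()]
    with zipfile.ZipFile(zip_path, "w") as zout:
        for info, data in entries:
            if info.filename == member_name:
                data = new_bytes
            zout.writestr(info, data)


# Demo archive with two entries (names chosen to resemble an .xlsx layout).
demo_zip = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("xl/workbook.xml", "shared settings")
    z.writestr("xl/worksheets/sheet1.xml", "old cells")

replace_zip_member(demo_zip, "xl/worksheets/sheet1.xml", b"new cells")

with zipfile.ZipFile(demo_zip) as z:
    kept = z.read("xl/workbook.xml")
    replaced = z.read("xl/worksheets/sheet1.xml")
    names = z.namelist()
print(kept, replaced)
```

Whether Excel still treats the result as a shared workbook depends on which entries carry the sharing information, which this sketch does not examine.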

How to make PyQT drag event copy a file instead of move it?

I have a little python app that searches the file system for files, and lets you drag them out of a QListWidget and into other software, your desktop, etc.
The problem is, I don't ever want to destroy/remove the source file. If I drag the file onto my desktop, I want to keep the file in the source location as well.
Currently, I'm creating a QDrag object and attaching mime data for the file. How do I tell Windows that this is a copy/paste situation rather than a cut/paste situation?
class MyListWidget(QListWidget):
    def dragLeaveEvent(self, event):
        global window
        drag = QDrag(self)
        data = QMimeData()
        data.setData("text/plain", "")
        files = []
        # Loop through QList (file paths), turn into QUrl, and add to array
        for element in window.searchResults.selectedIndexes():
            url =
As @musicamante suggested, changing drag.exec_() to drag.exec_(QtCore.Qt.CopyAction) did the trick!

filled PDF fields showing up differently in different contexts

I have a python script that creates a number of pdf forms (0 - 10) and then concatenates them into one form. The fields on the compiled PDF show up differently in 4 different contexts. I am developing in debian linux, and the pdf viewer (Okular) does not show any fields within the compiled PDF, whereas on Windows 10, if I open the pdf with chrome, I have to hover over the field to see the field value. It has the correct field data for the first page, however, each subsequent page is just a duplicate of the first page, which is incorrect. If I open the pdf with Microsoft Edge, it correctly displays the form data for each page, however when I go to print with edge, none of the form data shows up.
I am using pdfrw for writing to the PDF, and PyPDF2 for merging. I have tried a number of different things, including attempting to flatten the PDF with Python (for which there is very little support, by the way), reading and writing instead of merging, and attempting to convert the form fields into text, along with many other things that I have since forgotten because they did not work.
def writeToPdf(unfilled, output, data, fields):
'''Function writes the data from data to unfilled, and saves it as output'''
# TODO: Use literal declarations for lists, dicts, etc
checkboxes = [
template_pdf = pdfrw.PdfReader(unfilled)
annotations = template_pdf.pages[0][Annot_Key]
for annotation in annotations:
# TODO: Singly nested if's with no else's suggest a logic problem, find a clearer way to do this.
if annotation[Subtype_Key] == Widget_Subtype_Key:
if annotation[Annot_Field_Key]:
key = annotation[Annot_Field_Key][1:-1]
if key in fields:
if key in checkboxes:
if(key == 'course'):
pdfrw.PdfWriter().write(output, template_pdf)
from PyPDF2.generic import BooleanObject, IndirectObject, NameObject

def set_need_appearances_writer(writer):
    # basically used to ensure there are not
    # overlapping form fields, which makes printing hard
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
    return writer
def mergePDFs(listOfPdfPaths, outputPDf):
    '''Function Merges a list of pdfs into a single one, and saves it to outputPDf'''
    pdf_writer = PdfFileWriter()
    for path in listOfPdfPaths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(outputPDf, 'wb') as fh:
        pdf_writer.write(fh)
As mentioned above, there are different results in different contexts. On Debian Linux, the Okular viewer shows no form fields; on Windows 10, Google Chrome shows duplicated fields after the first page (and I have to hover over or click the field); Microsoft Edge shows the correct data, with each page having its own field data, but its print preview also shows no form data.
If anyone else is having this quite obscure problem: the behavior is unspecified for the use case I was dealing with (a template fillable form, so every copy has the same field names). The only solution available with Python at the moment (at least that I found in my many hours of researching and testing) was to flatten the PDF: create a separate PDF, write the form data to the desired locations (I did this with reportlab), then overlay the template PDF with the created PDF. Overall this is not a good solution for many reasons, so if you have a better one, please post it!

Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue

I am trying to open a pdf to get the number of pages. I am using PyPDF2.
Here is my code:
import PyPDF2

def pdfPageReader(file_name):
    try:
        reader = PyPDF2.PdfReader(file_name, strict=True)
        number_of_pages = len(reader.pages)
        print(f"{file_name} = {number_of_pages}")
        return number_of_pages
    except Exception:
        return "1"
But then I run into this error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. []
I tried both strict=True and strict=False. With True, it displays this message and then nothing more; I waited for 30 minutes, but nothing happened. With False, it displays nothing at all and just hangs. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels that file and continues (I run this over a batch of PDF files). Only one file in the batch has this problem.
My questions are, how do I fix this, or how do I skip this, or how can I cancel this and move on with the other pdf files?
If somebody had a similar problem and it even crashed the program with this error message
File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\", line 1604, in getObject
% (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.
It helped me to add the strict argument equal to False for my pdf reader
pdf_reader = PdfReader(input_file, strict=False)
For anybody else who may be running into this problem and found that strict=False didn't help: I was able to solve it by re-saving a new copy of the file in Adobe Acrobat Reader. I opened the PDF file in an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again using the newly saved copy of my PDF file.
Apparently, the PDF file I was using, which was generated directly by my scanner, was somehow corrupt, even though I could open and view it just fine in Reader. Re-saving a copy of the file in Acrobat Reader seemed to correct whatever was missing.
I had the same problem and looked for a way to skip it. I am not a programmer, but the documentation about warnings has a piece of code that helps you avoid this hindrance.
Although I wouldn't recommend this as a solution, the piece of code that I used for my purpose is the following (copied and pasted from the documentation):
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
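If silencing every warning for the whole program feels too broad, a narrower variant is to ignore warnings only around the one noisy call. The sketch below uses a stand-in function instead of PyPDF2 (noisy_page_count and call_quietly are hypothetical names), so it shows only the standard warnings machinery; it assumes the library reports the problem through the warnings module.

```python
import warnings


def noisy_page_count():
    """Stand-in for a library call that emits a warning (hypothetical, not PyPDF2)."""
    warnings.warn("Xref table not zero-indexed.", UserWarning)
    return 3


def call_quietly(func):
    """Call func() with warnings ignored only for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func()


pages = call_quietly(noisy_page_count)
print(pages)
```

The previous warning filters are restored when the with block exits, so warnings from the rest of the program are still shown.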
This happens to me when the file was created in a printer / scanner combo that generates PDFs. I could read in the PDF with only a warning though so I read it in, and then rewrote it as a new file. I could append that new one.
from PyPDF2 import PdfMerger, PdfReader, PdfWriter

reader = PdfReader("scanner_generated.pdf", strict=False)
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("fixedPDF.pdf", "wb") as fp:
    writer.write(fp)
merger = PdfMerger()
merger.append("fixedPDF.pdf")
I had the exact same problem. The solutions above (setting strict=False and re-saving the document with Acrobat Reader) helped, but did not solve the problem completely.
I still got a stream error, but I was able to fix it with an online PDF repair tool. Please be aware that this means uploading your PDF to some website, so make sure there is nothing sensitive in it.
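For the "how do I skip this and move on" part of the question, one option nobody mentioned is to read each file in a separate interpreter with a time limit, so a PDF that hangs is abandoned instead of blocking the batch. This is only a sketch under assumptions: run_snippet_with_timeout and PAGE_COUNT_SNIPPET are hypothetical names, the 60-second default is arbitrary, and the snippet assumes PyPDF2 is importable by the same Python executable.

```python
import subprocess
import sys

# Code run in the child interpreter: prints the page count of the PDF in argv[1].
# Assumption: PyPDF2 is installed for sys.executable.
PAGE_COUNT_SNIPPET = (
    "import sys, PyPDF2; "
    "print(len(PyPDF2.PdfReader(sys.argv[1], strict=False).pages))"
)


def run_snippet_with_timeout(snippet, arg, timeout_s):
    """Run snippet in a fresh interpreter; return its stdout, or None on timeout or error."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", snippet, arg],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    if done.returncode != 0:
        return None
    return done.stdout.strip()


def count_pages(file_name, timeout_s=60):
    """Return the page count as an int, or None if the file hangs or fails."""
    out = run_snippet_with_timeout(PAGE_COUNT_SNIPPET, file_name, timeout_s)
    return int(out) if out and out.isdigit() else None


# Harmless demo that does not need PyPDF2: the child just measures its argument.
print(run_snippet_with_timeout("import sys; print(len(sys.argv[1]))", "abc", 10))
```

Starting one interpreter per file is slow, so it may be enough to use this only as a fallback for files that are known to misbehave.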

Python3 multiprocessing

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.
I have written a small piece of code in python 3 which:
takes a user input (a folder on their computer)
searches the folder for pdf files
turns each page of the PDF to an image with sequential numbering
iterates through the jpgs in order of numbering, turning them black and white
OCR scans the files and outputs the text into an object
saves the text contents to a .txt file (via pytesseract)
deletes the jpgs, leaving the .txt file
Most time is taken in converting to jpgs and possibly making them black and white.
The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.
My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.
I cannot get it to work and have now hacked the code about repeatedly with various errors. I think the main problem is, I am clueless.
My code begins:
User input block (asks for a folder in the user's directory, checks it is a valid folder etc).
OCR block as a function (parses PDF then outputs contents into single .txt file)
For loop block as a function (is supposed to loop over each PDF in the folder and execute the OCR block on it).
Multiprocessing block (is supposed to feed the list of files in the directory to the loop block).
To avoid writing War and Peace, I set out the last version of the loop and multiprocessing blocks below:
#import necessary modules
home_path = os.path.expanduser('~')
#ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #various lines of code here

def per_file_process (subject_files):
    for pdf in subject_files:
        #decode the whole file name as a string
        pdf_filename = os.fsdecode(pdf)
        #check whether the string ends in .pdf
        if pdf_filename.endswith(".pdf"):
            #call the OCR function on it
            print ('nonsense')

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(per_file_process, os.listdir(pdfDir))
Is anyone willing/able to point out my errors, please?
The relevant bits of the code whilst working:
#import necessary
home_path = os.path.expanduser('~')
#block accepting input
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')
def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #need to think about using generic expanduser or other libraries to allow portability
        #various lines of code to OCR and output .txt file

subject_files = os.listdir(pdfDir)
for pdf in subject_files:
    #decode the whole file name as a string you can see
    pdf_filename = os.fsdecode(pdf)
    #check whether the string ends in .pdf
    if pdf_filename.endswith(".pdf"):
        #print for debugging
Pool.map calls the worker function repeatedly with each name returned by os.listdir. In per_file_process, subject_files is a single filename and for pdf in subject_files: is enumerating the individual characters in the name. Further, listdir only returns the base name, without the directory, so you aren't looking in the right place for the pdf. You can use glob to filter by extension and return a working path to the file.
Your example is confusing... textExtractor() takes no parameters, so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file it is processing. If so, you can parallelize rather easily just by feeding the PDF paths to it via map. Assuming processing time will vary by PDF, I am setting chunksize to 1 so that an early-finishing worker can grab extra files to process.
from glob import glob
import os
from multiprocessing import Pool

def textExtractor(pdf_filename):
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #...various lines of code here

if __name__ == '__main__':
    #pdfDir is the folder inputted by user
    with Pool(2) as pool:
        # assuming call signature: textExtractor(path_to_file)
        pool.map(textExtractor,
            (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
                if os.path.isfile(filename)),
            chunksize=1)
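To see the two pitfalls described above without running any OCR, here is a small standard-library sketch. The temporary folder and file names are made up for illustration; it only compares what os.listdir and glob return and shows what iterating over a single filename does.

```python
import os
import tempfile
from glob import glob

# Throwaway folder with two fake PDFs and one other file.
pdf_dir = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf", "notes.txt"):
    with open(os.path.join(pdf_dir, name), "w") as fh:
        fh.write("placeholder")

# os.listdir returns bare names; they only resolve if pdf_dir is the
# current working directory.
bare_names = sorted(os.listdir(pdf_dir))

# glob on a joined pattern filters by extension and keeps the folder in the path.
full_paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))

# map hands the worker one filename at a time, so looping over that
# argument walks its characters rather than a list of files.
chars = [c for c in "a.pdf"]

print(bare_names)
print([os.path.basename(p) for p in full_paths])
print(chars)
```

This is why the worker passed to map should accept a single path and process just that file.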
