Is it possible to load xlsx with openpyxl from zipfile - excel

I'm trying to openpyxl.load_workbook xlsx files from compressed zip file, but it doesn't work. The following code fails at openpyxl.load_workbook with "BadZipfile: File is not a zip file"
with zipfile.ZipFile(os.path.join(root, raw)) as z:
for file_info in z.infolist():
wb = openpyxl.load_workbook(z.open(file_info), read_only=True)
There is nothing wrong with the archive and the excel file in it, as if i extract it to disk then the following works:
with open('report.xlsx') as f:
wb = openpyxl.load_workbook(f, read_only=True)
I can go with this solution and temporary extract it somewhere and load xslx, but would like to understand if it possible to load it from zipfile.

The problem is that readonly=True does not do quite what you think it does. According to the docs:
Fortunately, there are two modes that enable you to read and write unlimited amounts of data with (near) constant memory consumption.
While not explicitly stated, I would assume that this involves some equivalent to a memory-mapped file (because of "constant memory consumption") and random access (because of the range of allowed operations).
Either way, setting readonly=True is not an indication of the fact that you only intend to read a workbook (that's all load_workbook can do anyway, you have to overwrite the existing one to make any "changes"). It is an indication of the fact that you want to access the file directly on disk, without loading the entire contents.
It seems pretty clear (and intuitively expected) that ZipFile.open does not provide a random-access file:
Note: The file-like object is read-only and provides the following methods: read(), readline(), readlines(), __iter__(), next().
The fact that seek is not mentioned in this list is quite telling (pun only somewhat intended).
You can get more information about the exception by splitting the offending line into two (a useful general debugging technique for nested function calls):
x = z.open(file_info)
wb = openpyxl.load_workbook(x, readonly=True)
You will notice that the error occurs on the second of those two lines. This is because pretty much all the Microsoft open-document formats are actually just fancy zip files. The problem is most likely that openpyxl cannot open your file in random access mode, not that it's actually an invalid zip file.
Either way, this is a bunch of very educated guesswork that leads to a simple, one-keyword-deletion solution:
TL;DR
Get rid of readonly=True when reading non-random-access data like a compressed zip entry:
wb = openpyxl.load_workbook(z.open(file_info))
Appendix
You should get in the habit of writing minimal programs that demonstrate your issue so that people answering your question can focus on doing their job instead of getting irritated and closing down what would otherwise be a perfectly good question. I liked your question enough to do that for you, so here is a minimal program that demonstrates your issue and requires nothing more than copy-and-paste to run:
import openpyxl, zipfile
from openpyxl.workbook.workbook import Workbook
wb = Workbook()
wb.active['A1'] = 12
wb.active['A2'] = 13
wb.save('report.xlsx')
with zipfile.ZipFile('test.zip', 'w') as z:
z.write('report.xlsx')
with open('report.xlsx') as f:
wb = openpyxl.load_workbook(f, read_only=True)
print(wb.active['A1'].value)
print(wb.active['A2'].value)
with zipfile.ZipFile('test.zip', 'r') as z:
for file_info in z.infolist():
x = z.open(file_info, 'r')
wb = openpyxl.load_workbook(x, readonly=True)
print(wb.active['A1'].value)
print(wb.active['A2'].value)

Related

How to keep the share properties of an excel with python openpyxl?

I have trouble trying to keep the sharing properties of an excel. I tried this :Python and openpyxl is saving my shared workbook as unshared but the part of vout just cancels all the modification I made with the script
To explain the problem :
There's an excel file that is shared in which people can do some modification
Python reads and writes on it
When I save the workbook in the excel file, it automatically either drops the sharing property or when I try to keep it, it just doesn't do any modification
Can someone help me please ?
I'll get a little more precise, as requested.
The sharing mode is the one Microsoft provides. You can see the button below:
Share button Excel
The excel is stored on a server. Several users can write on it at the same time but when I launch my script, it stops automatically the sharing property, so everyone that is writing on it just can't do modification anymore and every modif they did is lost.
First I treated my Excel normally :
DLT=openpyxl.load_workbook(myPath)
ws=DLT['DLT']
...my modifications on ws...
DLT.save()
DLT.close()
But then I tried this (Python and openpyxl is saving my shared workbook as unshared)
DLT=openpyxl.load_workbook(myPath)
ws=DLT['DLT']
zin = zipfile.ZipFile(myPath, 'r')
buffers = []
for item in zin.infolist():
buffers.append((item, zin.read(item.filename)))
zin.close()
...my modif on ws...
DLT.save()
zout = zipfile.ZipFile(myPath, 'w')
for item, buffer in buffers:
zout.writestr(item, buffer)
zout.close()
DLT.close()
The second one just doesn't save my modification on ws.
The thing I would like to do, is not to get rid of the sharing property. I would need to keep it while I write on it. Not sure if it is possible. I have one alternative solution that is to use another file, and just copy/paste by hand the new data from this file to the DLT one.
well... after playing with it back and forth, for some weird reason zipfile.infolist() does contains the sheet data as well, so here's my way to fine tune it, using the shared_pyxl_save example the previous gentleman provided
basically instead of letting the old file overriding the sheet's data, use the old one
def shared_pyxl_save(file_path, workbook):
"""
`file_path`: path to the shared file you want to save
`workbook`: the object returned by openpyxl.load_workbook()
"""
zin = zipfile.ZipFile(file_path, 'r')
buffers = []
for item in zin.infolist():
if "sheet1.xml" not in item.filename:
buffers.append((item, zin.read(item.filename)))
zin.close()
workbook.save(file_path)
""" loop through again to find the sheet1.xmls and put it into buffer, else will show up error"""
zin2 = zipfile.ZipFile(file_path, 'r')
for item in zin2.infolist():
if "sheet1.xml" in item.filename:
buffers.append((item, zin2.read(item.filename)))
zin2.close()
#finally saves the file
zout = zipfile.ZipFile(file_path, 'w')
for item, buffer in buffers:
zout.writestr(item, buffer)
zout.close()
workbook.close()

How to make a file non-editable?

I am working on an idea. I need a bit help here, as I don't have in depth knowledge of python modules(I mean I don't know all the python modules). Take a look at the code
file = open('Data.txt', 'w')
a = input('Enter your name in format F|M|L: ')
file.write(a)
file.close()
The above code opens a file which writes my data to it. However I want to edit the document only through python and not through opening it from saved location. Shortly I want to disable editions done by opening the file in text-editors.
If your OS is Windows, the most straightforward option is to make the file read-only when your script is done. And set the read-only flag to false while your script is running. There are some ways to modify file permissions using pywin32 library, but it's complicated and hard to find good examples.
import os
from stat import S_IWRITE, S_IREAD
fname = 'test.txt'
# if file exists, reset read only to false (allow write)
if os.path.isfile(fname):
os.chmod(fname, S_IWRITE)
fid = open(fname, 'w')
fid.write('shoobie doobie')
fid.close()
It's already been pointed out in the comments, this technique won't stop a determined person from changing the read only attribute.
If text-editor is all your concern, use 'wb' instead of 'w'.
Use 'rb' to open those files too.

Python 3 combining file open and read commands - a need to close a file and how?

I am working through "Learn Python 3 the Hard Way" and am making code more concise. Lines 11 to 18 of the program below (line 1 starts at # program: p17.py) are relevant to my question. Opening and reading a file are very easy and it is easy to see how you close the file you open when working with the files. The original section is commented out and I include the concise code on line 16. I commented out the line of code that causes an error (on line 20):
$ python3 p17_aside.py p17_text.txt p17_to_file_3.py
Copying from p17_text.txt to p17_to_file_3.py
This is text.
Traceback (most recent call last):
File "p17_aside.py", line 20, in
indata.close()
AttributeError: 'str' object has no attribute 'close'
Code is below:
# program: p17.py
# This program copies one file to another. It uses the argv function as well
# as exists - from sys and os.path modules respectively
from sys import argv
from os.path import exists
script, from_file, to_file = argv
print(f"Copying from {from_file} to {to_file}")
# we could do these two on one line, how?
#in_file = open(from_file)
#indata = in_file.read()
#print(indata)
# THE ANSWER -
indata = open(from_file).read()
# The next line was used for testing
print(indata)
# indata.close()
So my question is should I just avoid the practice of combining commands as done above or is there a way to properly deal with that situation so files are closed when they should be? Is it necessary to deal with the situation of closing a file at all in this situation?
Context manager and with statement is a comfortable way to make sure your file is closed as needed:
with open(from_file) as fobj:
indata = fobj.read()
Nowadays, you can also use Path-like objects and their read_text and read_bytes methods:
# This assumes Path from pathlib has been imported
indata = Path(from_file).read_text()
The error you were seeing... is because you were not trying to close the file, but str into which you've read its content into. You'd need to assign object returned by open a name, and then read from and close that one:
fobj = open(from_file)
indata = fobj.read()
fobj.close() # This is OK
Strictly speaking, you would not need to close that file as dangling file descriptors would be "clobbered" with the process. Esp. in a short example like this, it would be of relatively little concern.
I hope I got the follow up question in comment correctly to extend on this a bit more.
If you wanted a single command, look at the pathtlib.Path example above.
With open as such, you cannot perform read and close in a single operation and without assigning result of open to a variable. As both read and close would have to be performed on the same object returned by open. If you do:
var = fobj.read()
Now, var refers to content read out of the file (so nothing that you could close, would have a close method).
If you did:
open(from_file).close()
After (but also before; at any point), you would simply open that file (again) and close it immediately. BTW. this returns None, just in case you wanted to get the return value. But it would not affect previously open file handles and file-like objects. It would not serve any practical purpose except for perhaps making sure you can open a file.
But again. It's a good practice to perform the housekeeping, but strictly speaking (and esp. in a short code like this). If you did not close the file and relied on the OS to clean-up after your process. It'd work fine.
How about the following:
# to open the file and read it
indata = open(from_file).read()
print(indata)
# this closes the file - just the opposite of opening and reading
open(from_file).close()

Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue

I am trying to open a pdf to get the number of pages. I am using PyPDF2.
Here is my code:
def pdfPageReader(file_name):
try:
reader = PyPDF2.PdfReader(file_name, strict=True)
number_of_pages = len(reader.pages)
print(f"{file_name} = {number_of_pages}")
return number_of_pages
except:
return "1"
But then i run into this error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
I tried to use strict=True and strict=False, When it is True, it displays this message, and nothing, I waited for 30minutes, but nothing happened. When it is False, it just display nothing, and that's it, just do nothing, if I press ctrl+c on the terminal (cmd, windows 10) then it cancel that open and continues (I run this in a batch of pdf files). Only 1 in the batch got this problem.
My questions are, how do I fix this, or how do I skip this, or how can I cancel this and move on with the other pdf files?
If somebody had a similar problem and it even crashed the program with this error message
File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1604, in getObject
% (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.
It helped me to add the strict argument equal to False for my pdf reader
pdf_reader = PdfReader(input_file, strict=False)
For anybody else who may be running into this problem, and found that strict=False didn't help, I was able to solve the problem by just re-saving a new copy of the file in Adobe Acrobat Reader. I just opened the PDF file inside an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again using the newly saved copy of my PDF file.
Apparently, the PDF file I was using, which were generated directly from my scanner, were somehow corrupt, even though I could open and view it just fine in Reader. Making a duplicate copy of the file via re-saving in Acrobat Reader somehow seemed to correct whatever was missing.
I had the same problem and looked for a way to skip it. I am not a programmer but looking at the documentation about warnings there is a piece of code that helps you avoid such hindrance.
Although I wouldn't recomend this as a solution, the piece of code that I used for my purpose is (just copied and pasted it from doc on link)
import sys
if not sys.warnoptions:
import warnings
warnings.simplefilter("ignore")
This happens to me when the file was created in a printer / scanner combo that generates PDFs. I could read in the PDF with only a warning though so I read it in, and then rewrote it as a new file. I could append that new one.
from PyPDF2 import PdfMerger, PdfReader, PdfWriter
reader = PdfReader("scanner_generated.pdf", strict=False)
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
with open("fixedPDF.pdf", "wb") as fp:
writer.write(fp)
merger = PdfMerger()
merger.append("fixedPDF.pdf")
I had the exact same problem, and the solutions did help but didn't solve the problem completely, at least the one setting strict=False & resaving the document using Acrobat reader.
Anyway, I still got a stream error, but I was able to fix it after using an PDF online repair. I used sejda.com but please be aware that you are uploading your PDF on some website, so make sure there is nothing sensible in there.

Load spydata file

I'm coming from R + Rstudio. In RStudio, you can save objects to an .RData file using save()
save(object_to_save, file = "C:/path/where/RData/file/will/be/saved.RData")
You can then load() the objects :
load(file = "C:/path/where/RData/file/was/saved.RData")
I'm now using Spyder and Python3, and I was wondering if the same thing is possible.
I'm aware everything in the globalenv can be saved to a .spydata using this :
But I'm looking for a way to save to a .spydata file in the code. Basically, just the code under the buttons.
Bonus points if the answer includes a way to save an object (or multiple objects) and not the whole env.
(Please note I'm not looking for an answer using pickle or shelve, but really something similar to R's load() and save().)
(Spyder developer here) There's no way to do what you ask for with a command in Spyder consoles.
If you'd like to see this in a future Spyder release, please open an issue in our issues tracker about it, so we don't forget to consider it.
Considering the comment here, we can
rename the file from .spydata to .tar
extract the file (using file manager, for example). It will deliver a file .pickle (and maybe a .npy)
extract the objects saved from the environment:
import pickle
with open(path, 'rb') as f:
data_temp = pickle.load(f)
that object will be a dictionary with the objects saved.

Resources