Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue - python-3.x

I am trying to open a pdf to get the number of pages. I am using PyPDF2.
Here is my code:
import PyPDF2

def pdfPageReader(file_name):
    try:
        reader = PyPDF2.PdfReader(file_name, strict=True)
        number_of_pages = len(reader.pages)
        print(f"{file_name} = {number_of_pages}")
        return number_of_pages
    except:
        # Bare except: this also catches KeyboardInterrupt (Ctrl+C).
        return "1"
But then I run into this error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
I tried both strict=True and strict=False. With strict=True, it displays this message and then nothing more; I waited for 30 minutes, but nothing happened. With strict=False, it displays nothing at all and just hangs. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels opening that file and continues with the next one (I run this over a batch of PDF files). Only one file in the batch has this problem.
My questions are: how do I fix this, how do I skip this, or how can I cancel this and move on to the other PDF files?

If somebody has a similar problem where the program even crashes with this error message:
File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1604, in getObject
% (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.
It helped me to pass strict=False to my PDF reader:
pdf_reader = PdfReader(input_file, strict=False)

For anybody else who may be running into this problem, and found that strict=False didn't help, I was able to solve the problem by just re-saving a new copy of the file in Adobe Acrobat Reader. I just opened the PDF file inside an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again using the newly saved copy of my PDF file.
Apparently, the PDF file I was using, which was generated directly by my scanner, was somehow corrupt, even though I could open and view it just fine in Reader. Re-saving a copy of the file in Acrobat Reader seemed to correct whatever was missing.

I had the same problem and looked for a way to skip it. I am not a programmer, but the documentation on warnings has a piece of code that helps you avoid this hindrance.
Although I wouldn't recommend this as a solution, the code I used for my purpose is the following (copied and pasted from the linked documentation):
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
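If silencing every warning for the whole process feels too broad, a narrower sketch is to suppress warnings only around the one call that triggers them. This uses the standard library's warnings.catch_warnings(). The names call_quietly and noisy_double are made up for illustration, and it assumes the library emits the message through the warnings module (newer PyPDF2 releases may use logging instead). It only hides the message; it does not repair the file or stop a hang.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Run func with warnings suppressed, restoring the old filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)


def noisy_double(x):
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("Xref table not zero-indexed.", UserWarning)
    return 2 * x


result = call_quietly(noisy_double, 21)
```

Outside of call_quietly the warning filters are unchanged, so warnings from other code still show up.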

This happens to me when the file was created in a printer / scanner combo that generates PDFs. I could read in the PDF with only a warning though so I read it in, and then rewrote it as a new file. I could append that new one.
from PyPDF2 import PdfMerger, PdfReader, PdfWriter

reader = PdfReader("scanner_generated.pdf", strict=False)
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("fixedPDF.pdf", "wb") as fp:
    writer.write(fp)

merger = PdfMerger()
merger.append("fixedPDF.pdf")

I had the exact same problem, and the solutions above (setting strict=False and re-saving the document with Acrobat Reader) helped but didn't solve it completely.
I still got a stream error, but I was able to fix it with an online PDF repair tool. I used sejda.com, but please be aware that you are uploading your PDF to a website, so make sure there is nothing sensitive in it.
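
For the "how can I cancel this and move on" part of the question, one option is to do the page count in a separate Python process and kill it after a timeout. This is only a sketch: run_snippet and count_pages are names made up here, the 30-second default is arbitrary, and it assumes PyPDF2 is importable by the same interpreter.

```python
import subprocess
import sys

# Script run in the child process: prints the page count of the PDF in argv[1].
# Assumes PyPDF2 is installed for this interpreter.
PAGE_COUNT_SNIPPET = (
    "import sys\n"
    "from PyPDF2 import PdfReader\n"
    "print(len(PdfReader(sys.argv[1], strict=False).pages))\n"
)


def run_snippet(code, args, timeout):
    """Run code in a fresh Python process.

    Returns its stripped stdout, or None if it timed out or exited with an error.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # the child was killed after `timeout` seconds
    if done.returncode != 0:
        return None  # the child failed, e.g. with an unhandled exception
    return done.stdout.strip()


def count_pages(file_name, timeout=30):
    """Return the page count, or None if the file hangs or cannot be read."""
    out = run_snippet(PAGE_COUNT_SNIPPET, [file_name], timeout)
    return int(out) if out is not None else None
```

In the batch loop you would call count_pages(file_name) and, when it returns None, log the file name and continue with the next file.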

Related

How to keep the share properties of an excel with python openpyxl?

I am having trouble keeping the sharing properties of an Excel file. I tried the approach from "Python and openpyxl is saving my shared workbook as unshared", but the zout part just cancels all the modifications I made with the script.
To explain the problem:
There's an Excel file that is shared, in which people can make modifications
Python reads from it and writes to it
When I save the workbook to the Excel file, it either drops the sharing property automatically or, when I try to keep it, none of my modifications are saved
Can someone help me, please?
I'll get a little more precise, as requested.
The sharing mode is the one Microsoft provides. You can see the button below:
[Image: Share button in Excel]
The Excel file is stored on a server. Several users can write to it at the same time, but when I launch my script, it automatically turns off the sharing property, so everyone who is writing to it can no longer make modifications, and every modification they made is lost.
First I treated my Excel normally :
import openpyxl

DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']
# ... my modifications on ws ...
DLT.save(myPath)
DLT.close()
But then I tried this (Python and openpyxl is saving my shared workbook as unshared)
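import zipfile
import openpyxl

DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']
# Keep a copy of every part of the original archive.
zin = zipfile.ZipFile(myPath, 'r')
buffers = []
for item in zin.infolist():
    buffers.append((item, zin.read(item.filename)))
zin.close()
# ... my modifications on ws ...
DLT.save(myPath)
# Writing the buffered parts back also restores the old sheet data.
zout = zipfile.ZipFile(myPath, 'w')
for item, buffer in buffers:
    zout.writestr(item, buffer)
zout.close()
DLT.close()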
The second one just doesn't save my modification on ws.
What I would like is not to lose the sharing property: I need to keep it while I write to the file. I'm not sure if that is possible. My alternative is to use another file and copy/paste the new data by hand from that file into the DLT one.
Well... after playing with it back and forth, it turns out that the zipfile infolist() contains the sheet data as well, so here is my way to fine-tune it, using the shared_pyxl_save example from the previous answer.
Basically, instead of letting the old file override the sheet's data, take the sheet data from the newly saved file.
import zipfile

def shared_pyxl_save(file_path, workbook):
    """
    `file_path`: path to the shared file you want to save
    `workbook`: the object returned by openpyxl.load_workbook()
    """
    # Buffer every part of the old archive except the sheet data.
    zin = zipfile.ZipFile(file_path, 'r')
    buffers = []
    for item in zin.infolist():
        if "sheet1.xml" not in item.filename:
            buffers.append((item, zin.read(item.filename)))
    zin.close()
    workbook.save(file_path)
    # Loop through the newly saved file to pick up sheet1.xml;
    # without it the file shows an error.
    zin2 = zipfile.ZipFile(file_path, 'r')
    for item in zin2.infolist():
        if "sheet1.xml" in item.filename:
            buffers.append((item, zin2.read(item.filename)))
    zin2.close()
    # Finally, write everything back to the file.
    zout = zipfile.ZipFile(file_path, 'w')
    for item, buffer in buffers:
        zout.writestr(item, buffer)
    zout.close()
    workbook.close()
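The same idea can be written as a small function that works on archive bytes, which makes it easy to try without touching the real workbook. This is a sketch only: merge_archives and its predicate argument are hypothetical names, and which parts must come from the new file is an assumption. Text cells refer to xl/sharedStrings.xml, so that part may need to come from the new file too. I have not verified it against a real shared workbook.

```python
import io
import zipfile


def merge_archives(original, updated, take_from_updated):
    """Return zip bytes with every member of `original`, except that members
    whose name satisfies `take_from_updated` are copied from `updated`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(original)) as old, \
         zipfile.ZipFile(io.BytesIO(updated)) as new, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as merged:
        # Parts kept from the old (shared) file.
        for item in old.infolist():
            if not take_from_updated(item.filename):
                merged.writestr(item, old.read(item.filename))
        # Parts taken from the newly saved file.
        for item in new.infolist():
            if take_from_updated(item.filename):
                merged.writestr(item, new.read(item.filename))
    return out.getvalue()
```

A predicate such as lambda name: name.startswith("xl/worksheets/") would take all sheets from the new file instead of only sheet1.xml.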

PyPDF2/PDFMiner text extraction issue when PDF is created using InDesign

We have a large number of PDFs which have been created with InDesign and not all the text was being extracted by PyPDF2. Here is the code:-
for pageNum in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    # note text is a bytes object not string.
    text = pageObj.extractText().encode('utf-8')
    search_text = text.lower()
    if search_word in search_text.decode("utf-8"):
        search_word = search_word.strip()
        search_word_count += 1
        print("Pattern Found on Page: " + str(pageNum+1))
search_word_count_list.append(search_word_count)
print("The word:- '{}' was found:- {} times\n".format(search_word, search_word_count))
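The search step in the loop above can be separated from the PDF library, which makes it easier to check on its own. A minimal sketch, assuming the page texts are already available as plain strings; pages_containing and the sample texts are made up for illustration, and unlike the loop above it lower-cases the search word as well as the page text.

```python
def pages_containing(page_texts, search_word):
    """Return the 1-based numbers of the pages whose text contains
    search_word, compared case-insensitively."""
    needle = search_word.strip().lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in (text or "").lower()
    ]


# Hypothetical page texts standing in for the extracted text of each page.
sample_pages = ["Front Cover", "Chapter one: the cover story", ""]
hits = pages_containing(sample_pages, " cover ")
print(hits)  # [1, 2]
```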
I did some testing with PDFMiner and got the same results, i.e. the same bits of text were extracted or not extracted. So I figured there must be something going on with the PDF.
Off the back of this, I did some testing with a typesetter and discovered that when text boxes are locked in InDesign (Ctrl+L), the text in the exported PDF is locked and not extractable. I mean the bits that are locked are not extractable via PyPDF2 or PDFMiner.
Going forward, I can ask the typesetters to unlock text before exporting PDFs. BUT with the thousands of existing PDFs, I want to be able to extract the locked text; asking the typesetters to unlock thousands of files is not an option. Does anyone have experience of this? Any ideas on how to access the locked text?
Edit 1
I did some testing with Adobe Acrobat Pro 11. When saving as plain text, the locked text is not saved to the .txt file, but the unlocked text is.
Checking the Security tab in Acrobat:-
With all tested documents, open in Acrobat I pick File -> Properties, switch to the Security tab of the Document Properties dialog, and there I read "Security Method: No Security", and under the restrictions everything is 'Allowed' (Printing, Changing, Copying ...). So I think these are all valid PDFs, which are unprotected.
Edit 2
I have tried to install pdf2txt but my machine does not meet requirements as I am missing "Microsoft Visual C++ 14.0" and as it's a work machine it is locked down.
Edit 3
Acrobat reports PDF Version 1.7 and PDF Producer: Adobe PDF Library 15.0.
I can copy-paste the locked text so I do not think rasterisation is the problem.
Edit 4 - possible solution
I have tested with https://pdftotext.com/ and it was able to access the locked text. So I will talk to the IT department about getting "Microsoft Visual C++ 14.0" installed so that I can use the pdf2txt library.
Edit 5
Have not had much luck with installing PDFtotext due to problems installing poppler which is nothing short of a nightmare to install.
Off the back of usr2564301's input, I have done some more testing with PDFMiner. Here is the code I am using to test with:-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)
    for page in PDFPage.get_pages(filepath, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    filepath.close()
    device.close()
    retstr.close()
    return text
As for versions I have pdfminer==20191125 installed. On Github it says "Supports PDF-1.7. (well, almost)" so maybe the problem is PDF 1.7?
Edit 6
I just tried PDFMiner (pdf2txt.py) from the Command Prompt, using this command: python C:\Users\my_name\AppData\Local\Programs\Python\Python37-32\Scripts\pdf2txt.py -o output.txt file_name.pdf, but I get the same result, i.e. the locked text does not come through.
Edit 7
I did some testing with a designer and we proved that the text is not accessible to PDFMiner when it is on the master page. The designers have multiple preset master pages so that they can drag the required one onto the front cover. If a designer wrongly works directly on the master page, the content is locked and cannot be accessed via PDFMiner.
Note that there is no problem when text is on the front cover and locked, only when it is on the master page.

Load spydata file

I'm coming from R + RStudio. In RStudio, you can save objects to an .RData file using save():
save(object_to_save, file = "C:/path/where/RData/file/will/be/saved.RData")
You can then load() the objects :
load(file = "C:/path/where/RData/file/was/saved.RData")
I'm now using Spyder and Python3, and I was wondering if the same thing is possible.
I'm aware everything in the globalenv can be saved to a .spydata using this :
But I'm looking for a way to save to a .spydata file in the code. Basically, just the code under the buttons.
Bonus points if the answer includes a way to save an object (or multiple objects) and not the whole env.
(Please note I'm not looking for an answer using pickle or shelve, but really something similar to R's load() and save().)
(Spyder developer here) There's no way to do what you ask for with a command in Spyder consoles.
If you'd like to see this in a future Spyder release, please open an issue in our issues tracker about it, so we don't forget to consider it.
Considering the comment here, we can:
rename the file from .spydata to .tar
extract the file (using a file manager, for example). This yields a .pickle file (and maybe a .npy file)
extract the saved objects from the environment:
import pickle
with open(path, 'rb') as f:
    data_temp = pickle.load(f)
The resulting object will be a dictionary containing the saved objects.
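The manual steps above can also be done in code with the standard library's tarfile module, which does not care about the file extension. A minimal sketch, assuming the .spydata file really is a plain tar archive containing one .pickle member as described; load_spydata_pickle is a made-up name, any .npy members are ignored, and it should only be used on files you trust because unpickling can execute arbitrary code.

```python
import pickle
import tarfile


def load_spydata_pickle(path):
    """Return the dictionary stored in the .pickle member of a .spydata archive."""
    with tarfile.open(path, "r") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".pickle"):
                # extractfile() gives a file-like object, so nothing is written to disk.
                with archive.extractfile(member) as handle:
                    return pickle.load(handle)
    raise ValueError(f"no .pickle member found in {path!r}")
```

A single object can then be picked out of the returned dictionary by its variable name.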

itk ImageFileReader exception reading if I add VTK Imagewriter object creation

That's it:
I read successfully a DICOM file with itk::ImageFileReader.
Now I want to export an image.
I use vtkJPEGWriter.
When I add the line
vtkJPEGWriter* writer = vtkJPEGWriter::New();
even if that code doesn't run at the beginning of execution, I can't read the file. When I comment out the line, I can read the file again.
But the writer is not connected to the file reader. I don't get it. It has nothing to do with the reader at that moment!
I'm wasting so much time, just trying to figure out what's the problem.
The problem is in the file. I don't know why it works with that file without that line. Really weird.
I just don't get it.
I will try with other files.
These lines worked for me:
vtkSmartPointer<vtkJPEGWriter> JPEGWriter = vtkSmartPointer<vtkJPEGWriter>::New();
JPEGWriter->SetFileName("d:\\Tempx\\Pacienttest3\\Sagital.bmp");
JPEGWriter->SetInputConnection(m_pColor->GetOutputPort());
JPEGWriter->Write();
where m_pColor is some kind of vtkImageMapToColors ...

Opening hdf5 file from pandas.HDFStore - get all keys and root.attributes?

This seems a bit silly that I can't figure this out, but I'm really at a loss here.
So let's say I have this:
In[6]: store
Out[6]:
<class 'pandas.io.pytables.HDFStore'>
File path: E:\Users\Dan\Desktop\Cell1-Wash-out-001\Cell1-Wash-out-001.h5
/voltage_recording frame (shape->[3200000,4])
Which is fine, and I can access both store.voltage_recording and store.root.attributes.
But once I close the file, I cannot figure out how to reopen it in a way that lets me get these values back.
I know with pd.read_hdf() I can return, for example, the voltage_recording key. But I can't figure out how to get the whole pandas.io.pytables.HDFStore object back.
Is there a function somewhere I'm missing? I know I can also open the file itself with pytables, but that doesn't seem to be getting me where I want to go either.
quoted from Jeff in the comments:
"you just open like normal store = pd.HDFStore(filename,mode='r')
(mode is append by default, but if you aren't modifying doesn't
matter). to_hdf/read_hdf auto open/close."

Resources