The dictionary does not contain required key: Pages - python-3.x

I am trying to convert a PDF to PDF/A using PDFNetPython3, but I am getting the following error.
Main error message:
The dictionary does not contain required key: Pages
I am following the PDFNetPython3 docs:
from PDFNetPython3 import PDFNet, PDFACompliance
# ... some necessary setup, such as tmp_file_path_in (this is not None and is populated from file_object)
pdf_a = PDFACompliance(True, tmp_file_path_in, None, PDFACompliance.e_Level2B, 0, 0, 10)
I also tried this (same error):
pdf_a = PDFACompliance(True, filename, None, PDFACompliance.e_Level2B, 0, 10)
I would like to know whether this Pages key relates to PDF page numbers or to the total page count. I am merging a blank PDF page with other PDF pages and converting the result to PDF/A.
Reference: https://www.pdftron.com/documentation/python/guides/features/pdfa/convert/
Thanks in advance!!!

The exception indicates that the document you are processing does not contain any pages. Since you are merging a blank PDF, it is likely you missed the PDFDoc.PagePushBack(page) call.
If this does not help, please share your code for creating and merging the PDF.
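A minimal sketch of what the answer describes, assuming PDFNet has already been initialized by the caller. The function name build_doc_with_blank_page and the output path argument are hypothetical; PageCreate, PagePushBack, GetPageCount and Save are the PDFDoc calls the sketch relies on.

```python
def build_doc_with_blank_page(output_path):
    """Create a PDF that holds one blank page and save it (sketch only)."""
    # Imported lazily so the sketch can be read without the SDK installed.
    from PDFNetPython3 import PDFDoc, SDFDoc

    doc = PDFDoc()
    page = doc.PageCreate()   # creates a page object that is not yet in the document
    doc.PagePushBack(page)    # appends the page to the document's page sequence
    # Guard against the situation described in the answer: a document with no pages.
    if doc.GetPageCount() == 0:
        raise ValueError("document has no pages; PDF/A conversion would fail")
    doc.Save(output_path, SDFDoc.e_linearized)
    return output_path
```

Checking GetPageCount() on the merged document before handing it to PDFACompliance is a cheap way to confirm whether the missing call is the cause.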

Related

Problem with .xls file validation on e-commerce platform

You may have noticed that this is a long question. That is because I really made an effort to explain how many WTFs I am facing. Maybe it is still not that clear, but anyway, I appreciate your help!
Context
I'm doing an integration project for a client that handles a bunch of data to generate Excel files in .xls format. Notice that extension!
While developing the project I was using the xlrd and xlwt Python packages because, again, I need to create a .xls file. But at some point I had to download and extract a file that was in .csv format (in reality, the file contains an HTML table :c).
So I decided to use pandas to read the HTML and create a data frame that I can manipulate and return as a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload this file to the e-commerce platform.
The platform does not validate my file.
First, I will briefly explain how the site works: it accepts .xls files and only .xls files, and probably processes them to update the database. I have no access to the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site did not map them correctly, I can map Excel columns to the id or to the values that will be updated in the database.
The 'generico4' field expects 'smallint(5) unsigned' as its type.
An important fact: I sent the file to my client so he could validate the data, and after many conversations we discovered that if he simply downloads my file, opens it, and saves it, the upload works fine (the second image from my slide). Note that he has a MacBook and I use Ubuntu. I tried to do the same thing, but it did not work.
He sent me his file and I compared the two, but found no difference: the numbers have the same type, 'float', and the Excel formula =TYPE(cell) returns 1 for them.
I have already tried many other things, but nothing works :c
The code
The code follows, so you can get an idea of the logic:
import os
from pathlib import Path

import pandas as pd
import xlwt

def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1,2]]
    df.rename(columns={1:'sku', 2:'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (via pandas was not working either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy DataFrame data to the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    # BASE_DIR is defined elsewhere in the project
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, "stock.xls")
    wb_out.save(file_path)
    return file_path

filled PDF fields showing up differently in different contexts

I have a Python script that creates a number of PDF forms (0 - 10) and then concatenates them into one form. The fields in the compiled PDF show up differently in 4 different contexts. I am developing on Debian Linux, and the PDF viewer (Okular) does not show any fields within the compiled PDF, whereas on Windows 10, if I open the PDF with Chrome, I have to hover over a field to see its value. It has the correct field data for the first page, but each subsequent page is just a duplicate of the first page, which is incorrect. If I open the PDF with Microsoft Edge, it correctly displays the form data for each page, but when I go to print with Edge, none of the form data shows up.
I am using pdfrw for writing to the PDF and PyPDF2 for merging. I have tried a number of different things, including attempting to flatten the PDF with Python (for which there is very little support, by the way), reading and writing instead of merging, and attempting to convert the form fields into text, along with many other things that I have since forgotten because they did not work.
def writeToPdf(unfilled, output, data, fields):
    '''Function writes the data from data to unfilled, and saves it as output'''
    # TODO: Use literal declarations for lists, dicts, etc
    checkboxes = [
        'misconduct_complete',
        'misconduct_incomplete',
        'not_final_exam',
        'supervise_exam',
        'not_final_home_exam',
        'not_final_assignment',
        'not_final_oral_exam',
        'not_final_lab_exam',
        'not_final_practical_exam',
        'not_final_other'
    ]
    template_pdf = pdfrw.PdfReader(unfilled)
    annotations = template_pdf.pages[0][Annot_Key]
    for annotation in annotations:
        # TODO: Singly nested if's with no else's suggest a logic problem, find a clearer way to do this.
        if annotation[Subtype_Key] == Widget_Subtype_Key:
            if annotation[Annot_Field_Key]:
                key = annotation[Annot_Field_Key][1:-1]
                if key in fields:
                    if key in checkboxes:
                        annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                    else:
                        if(key == 'course'):
                            annotation.update(pdfrw.PdfDict(V='{}'.format(data[key][0:8])))
                        else:
                            annotation.update(pdfrw.PdfDict(V='{}'.format(data[key])))
    pdfrw.PdfWriter().write(output, template_pdf)

def set_need_appearances_writer(writer):
    # basically used to ensured there are not
    # overlapping form fields, which makes printing hard
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add "/NeedAppearances attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
    return writer

def mergePDFs(listOfPdfPaths, outputPDf):
    '''Function Merges a list of pdfs into a single one, and saves it to outputPDf'''
    pdf_writer = PdfFileWriter()
    set_need_appearances_writer(pdf_writer)
    pdf_writer.setPageMode('/UseOC')
    for path in listOfPdfPaths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(outputPDf, 'wb') as fh:
        pdf_writer.write(fh)
As mentioned above, the results differ by context. On Debian Linux, Okular shows no form fields. On Windows 10, Google Chrome shows the first page's field data duplicated on every later page (and I have to hover over or click a field to see it). Microsoft Edge shows the correct data, with each page having its own field values, but its print preview shows no form data.
If anyone else is having this quite obscure problem: the behavior is unspecified for my use case (a fillable template form merged several times, so the same field names repeat). The only solution I found for Python at the moment (after many hours of researching and testing) was to flatten the PDF: create a separate PDF, write the form data at the desired locations (I did this with reportlab), and then overlay that PDF on the template PDF. Overall this is not a good solution for many reasons, so if you have a better one, please post it!

Use pdfplumber to find text in PDF, return page number, then return table

I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.
I figured that I could use PDFPlumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function through every PDF. However, I keep on getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the PDFPlumber documentation but it didn't help my problem.
Here is one example of code that I tried:
url = "pdfs/example.pdf"
import pdfplumber
pdf = pdfplumber.open(url)
for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
I see this post was from a while ago but maybe this response will still help you or someone else.
The error comes from the way you loop through the pages. range is a built-in type that has to be called with parentheses; subscripting it with square brackets is what raises the "type object is not subscriptable" error. Instead, try enumerate() over the pages. "i" gives you the index (the current count in the loop) and "pg" gives you the page object. I didn't use "pg" below, but you could use it instead of "pages[i]" if you want.
The code below should print the tables from each page, as well as give you access to the tables to manipulate them further.
import pdfplumber
pdf_file = "pdfs/example.pdf"
tables=[]
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i,pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        print(f'{i} --- {tbl}')
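The "not subscriptable" error has nothing to do with pdfplumber.
It should be range(), not range[]. The membership test also needs the page's text rather than the page object itself.
Please try the code below:
url = "pdfs/example.pdf"
import pdfplumber
pdf = pdfplumber.open(url)
for page in range(0, len(pdf.pages)):
    # extract_text() may return None for a page without text
    text = pdf.pages[page].extract_text() or ''
    if 'Total number of physical restraints' in text:
        print(pdf.pages[page].page_number)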
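The two steps the question asks for (find the page containing a heading, then pull the table from that page) can be combined in one helper. This is only a sketch: the function name find_table_on_page is hypothetical, and it assumes the wanted table is the one that pdfplumber's extract_table() returns for that page, which may not hold if the page carries several tables.

```python
def find_table_on_page(pdf, heading):
    """Return (page_number, table) for the first page whose text contains heading.

    `pdf` is expected to behave like an opened pdfplumber PDF: it exposes
    `.pages`, and each page offers extract_text(), extract_table() and
    page_number. Returns None when no page contains the heading.
    """
    for page in pdf.pages:
        text = page.extract_text() or ""  # extract_text() can return None
        if heading in text:
            return page.page_number, page.extract_table()
    return None


# Usage with pdfplumber (not run here):
#   import pdfplumber
#   with pdfplumber.open("pdfs/example.pdf") as pdf:
#       result = find_table_on_page(pdf, "Campus Reported Incidents")
```

Looping this helper over all 42 files and passing each returned table to a DataFrame constructor would give one frame per PDF.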

Wand Image from PDF doesn't apply resizing

I'm using Wand in a Django project to generate thumbnails from different kinds of files, e.g. PDF. The whole thumbnail generation process is done in memory: the source file comes from a request and the thumbnail is saved to a temporary file, then the Django FileField saves the image to the correct path. However, the generated thumbnail keeps the original size. This is my code:
with image.Image(file=self.content.file, format="png") as im: # self.content is a Django model FileField that is not saved yet, so the file inside is still in memory (from the request)
    im.resize(200, 200)
    name = self.content.file.name
    self.temp = tempfile.NamedTemporaryFile()
    im.save(file=self.temp)
    self.thumbnail = InMemoryUploadedFile(self.temp, None, name + ".png", 'image/png', 0, 0, None) # then self.thumbnail, as a FileField, saves the image
Do you have any idea what is happening? Could it be a bug? I have already reported it as an issue on the Wand GitHub page.
The problem comes from the fact that your PDF has more than one page. If you only resize the first page (which is the one you want to display), it works. Try adding the following line after your with statement:
im = image.Image(image=im.sequence[0])
But I agree with you that your version should work as well.
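A minimal sketch of that suggestion as a standalone helper, assuming Wand and ImageMagick (with PDF support) are installed. The function name make_pdf_thumbnail and its parameters are hypothetical; the first page is copied out of the sequence and only that copy is resized and saved.

```python
def make_pdf_thumbnail(pdf_file, out_file, width=200, height=200):
    """Write a PNG thumbnail of the first page of pdf_file to out_file (sketch)."""
    # Imported lazily so the sketch can be read without Wand installed.
    from wand.image import Image

    with Image(file=pdf_file) as pdf:
        # A multi-page PDF is loaded as a sequence; take only the first page.
        with Image(image=pdf.sequence[0]) as first_page:
            first_page.format = "png"
            first_page.resize(width, height)
            first_page.save(file=out_file)
```

Both arguments are open binary file objects, so out_file could be the NamedTemporaryFile from the question.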

How to convert selected pdf page with gm

I am converting different images and PDF files with the "gm" module for Node.js. Image types convert successfully, but I have problems when I want to convert a PDF to an image. I need to convert only one selected page from a PDF file to jpg/png. If I pass the whole PDF file to "gm", it saves only the first page as an image, and I cannot find a way to save another page.
gm(file).toBuffer(format.toUpperCase(), function (err, buffer) {
    // so in buffer now we have converted image
});
Thank you.
You can use gm.selectFrame like this
gm(file).selectFrame(0).toBuffer() // To get first page
gm(file).selectFrame(1).toBuffer() // To get second page
// for only first pdf page use:
gm(file, 'pdf.pdf[0]').toBuffer(...)
// for only second pdf page use:
gm(file, 'pdf.pdf[1]').toBuffer(...)
There is spindrift for manipulating PDFs (this includes image conversion).
You can define your PDF like this (you don't have to use all of the commands):
var pdf = spindrift('in.pdf')
    .pages(7, 24)
    .page(1)
    .even()
    .odd()
    .rotate(90)
    .compress()
    .uncompress()
    .crop(100, 100, 300, 200) // left, bottom, right, top
Later on convert to image:
// Use the 'index' property of an image element to extract an image:
pdf.extractImageStream(0)
If you have to use gm, you can do what @Ben Fortune suggested in his comment and split the PDF first.
