I have a Python script that creates a number of PDF forms (0 to 10) and then concatenates them into one PDF. The fields in the compiled PDF show up differently in four different contexts. I am developing on Debian Linux, where the PDF viewer (Okular) does not show any fields in the compiled PDF. On Windows 10, if I open the PDF with Chrome, I have to hover over a field to see its value; the field data is correct on the first page, but each subsequent page just duplicates the first page, which is incorrect. If I open the PDF with Microsoft Edge, it correctly displays the form data for each page, but when I go to print from Edge, none of the form data shows up.
I am using pdfrw to write to the PDF and PyPDF2 to merge. I have tried a number of different things, including attempting to flatten the PDF with Python (for which there is very little support, by the way), reading and writing instead of merging, and attempting to convert the form fields into text, along with many other things that I have since forgotten because they did not work.
import pdfrw

# Annot_Key, Annot_Field_Key, Subtype_Key and Widget_Subtype_Key are
# module-level constants defined elsewhere in the script.

def writeToPdf(unfilled, output, data, fields):
    '''Function writes the data from data to unfilled, and saves it as output'''
    # TODO: Use literal declarations for lists, dicts, etc
    checkboxes = [
        'misconduct_complete',
        'misconduct_incomplete',
        'not_final_exam',
        'supervise_exam',
        'not_final_home_exam',
        'not_final_assignment',
        'not_final_oral_exam',
        'not_final_lab_exam',
        'not_final_practical_exam',
        'not_final_other'
    ]
    template_pdf = pdfrw.PdfReader(unfilled)
    # Only the annotations (form widgets) of the first page are filled in
    annotations = template_pdf.pages[0][Annot_Key]
    for annotation in annotations:
        # TODO: Singly nested if's with no else's suggest a logic problem, find a clearer way to do this.
        if annotation[Subtype_Key] == Widget_Subtype_Key:
            if annotation[Annot_Field_Key]:
                # Strip the surrounding parentheses from the field name
                key = annotation[Annot_Field_Key][1:-1]
                if key in fields:
                    if key in checkboxes:
                        annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                    else:
                        if(key == 'course'):
                            annotation.update(pdfrw.PdfDict(V='{}'.format(data[key][0:8])))
                        else:
                            annotation.update(pdfrw.PdfDict(V='{}'.format(data[key])))
    pdfrw.PdfWriter().write(output, template_pdf)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, IndirectObject, NameObject

def set_need_appearances_writer(writer):
    # basically used to ensure there are no
    # overlapping form fields, which make printing hard
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
    return writer

def mergePDFs(listOfPdfPaths, outputPDf):
    '''Function Merges a list of pdfs into a single one, and saves it to outputPDf'''
    pdf_writer = PdfFileWriter()
    set_need_appearances_writer(pdf_writer)
    pdf_writer.setPageMode('/UseOC')
    for path in listOfPdfPaths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(outputPDf, 'wb') as fh:
        pdf_writer.write(fh)
As mentioned above, the results differ by context. On Debian Linux, Okular shows no form fields. On Windows 10, Google Chrome shows the first page's field values duplicated on every later page (and I have to hover over or click a field to see them). Microsoft Edge shows the correct data, with each page having its own field values, but its print preview shows no form data.
If anyone else is having this quite obscure problem: the behavior is unspecified for the use case I was dealing with (a fillable template form whose copies all share the same field names). The only solution available in Python at the moment (at least that I found in many hours of researching and testing) was to flatten the PDF, create a separate PDF, write the form data to the desired locations in it (I did this with reportlab), and then overlay the template PDF with the created PDF. Overall this is not a good solution for many reasons, so if you have a better one, please post it!
I am trying to automate the process of creating PNGs (screenshots) of a Tableau Dashboard with different filter values specified.
Ex.
stock_list = ["Microsoft","Apple","Google"]
for i in stock_list:
    param_dict = {"stock_filter": f"vf_Stock={i}"}
    png = conn.query_view_image(view_id=id, parameter_dict=param_dict)
    with open(f"{i}.png","wb") as file:
        file.write(png.content)
In this example, conn is a Tableau connection I've already established (it seems to work). The dashboard filter (Stock) has no values with special characters, so no encoding (e.g., parse.quote()) is required, and the workbook/dashboard id is found earlier in the script using querying.get_views_dataframe.
The script produces output, but it writes the same view (e.g., Stock filter = Microsoft) to all the files. Why am I not able to retrieve the other filtered views? Am I missing something?
You may have noticed that this is a long question. That is because I made a real effort to explain the many WTFs I am facing. It may still not be that clear, but either way, I appreciate your help!
Context
I'm doing an integration project for a client that processes a bunch of data to generate Excel files in .xls format. Note that extension!
While developing the project I was using the xlrd and xlwt Python packages because, again, I need to create a .xls file. But at some point I had to download and extract a file that came in .csv format (but, in reality, the file contains an HTML table :c).
So I decided to use pandas to read the HTML and create a data frame that I can manipulate and then return as a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload this file to the e-commerce platform.
What happened is that the platform doesn't validate my file.
First I will briefly explain how the site works: it accepts .xls files and only .xls files, and probably processes them and uses them to update the database. I have no access to the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site didn't map them correctly, I can map Excel columns to the id or to the values that will be updated in the database.
The 'generico4' field expects the type 'smallint(5) unsigned'.
An important fact is that I sent the file to my client so he could validate the data, and after many conversations between us we discovered that if he just downloads my file, opens it, and saves it, the upload works fine (the second image from my slide). It is important to note that he has a MacBook and I use Ubuntu. I tried to do the same thing but it did not work.
He sent me this file and I tried to find the difference between the two, but I found nothing: the type of the numbers is the same, 'float', and in Excel the formula =TYPE(cell) returned 1.
I already tried many other things but nothing works :c
The code
Here is the code, so you can get an idea of the logic:
import os
from pathlib import Path

import pandas as pd
import xlwt

# BASE_DIR is the project root, defined elsewhere in the project.

def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1,2]]
    df.rename(columns={1:'sku', 2:'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (via pandas was not working either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy DataFrame data to the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, "stock.xls")
    wb_out.save(file_path)
    return file_path
I'm a noob PyQt5 user following a tutorial, and I'm confused about how I might extend the sample code below.
The two Qt5 handler methods, canInsertFromMimeData and insertFromMimeData, accept an image MIME type dragged and dropped onto the document (that works great). They both take a parameter, source, which receives a QMimeData object.
However, if I try to paste an image copied from the Windows clipboard into the document, it just crashes, as there is no handler for this.
Searching the Qt5 documentation at https://doc.qt.io/qt-5/qmimedata.html just leads me to further confusion as I'm not a C++ programmer and I'm using Python 3.x and PyQt5 to do this.
How would I write a handler to allow an image copied to the clipboard to be pasted into the document directly?
from PyQt5.QtGui import QImage, QTextDocument
from PyQt5.QtWidgets import QTextEdit

# splitext, hexuuid and IMAGE_EXTENSIONS are helpers defined elsewhere in the tutorial code.

class TextEdit(QTextEdit):

    def canInsertFromMimeData(self, source):
        if source.hasImage():
            return True
        else:
            return super(TextEdit, self).canInsertFromMimeData(source)

    def insertFromMimeData(self, source):
        cursor = self.textCursor()
        document = self.document()
        if source.hasUrls():
            for u in source.urls():
                file_ext = splitext(str(u.toLocalFile()))
                if u.isLocalFile() and file_ext in IMAGE_EXTENSIONS:
                    image = QImage(u.toLocalFile())
                    document.addResource(QTextDocument.ImageResource, u, image)
                    cursor.insertImage(u.toLocalFile())
                else:
                    # If we hit a non-image or non-local URL break the loop and fall out
                    # to the super call & let Qt handle it
                    break
            else:
                # If all were valid images, finish here.
                return
        elif source.hasImage():
            image = source.imageData()
            uuid = hexuuid()
            document.addResource(QTextDocument.ImageResource, uuid, image)
            cursor.insertImage(uuid)
            return
        super(TextEdit, self).insertFromMimeData(source)
code source: https://www.learnpyqt.com/examples/megasolid-idiom-rich-text-editor/
I was exactly in the same position as you. I am also new to Python, so there might be mistakes.
Passing the variable uuid directly in document.addResource(QTextDocument.ImageResource, uuid, image) does not work. The name has to be wrapped in a QUrl: QUrl(uuid).
Now you can insert the image. However, because the path to an image from the clipboard is changing, it would be better to use a different path, for example to the directory where you are also saving the files.
Also be aware that the user has to select the file type when saving (.html)
For my own project I am going to print the file as a PDF. That way you don't have to worry about paths to images ^-^
I got around this by embedding the images inline as base64, so no resource files are needed because everything is in one file.
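A minimal sketch of the inline base64 idea, using only the standard library. The function name image_to_inline_html is hypothetical, and the bytes passed in are assumed to already be an encoded image (for example PNG data); how the answer's author wired this into the editor is not shown in the post.

```python
import base64

def image_to_inline_html(image_bytes, mime_type="image/png"):
    """Return an <img> tag whose src is a base64 data URI for the given bytes."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}" />'

# Placeholder bytes (just the PNG file signature, not a full image),
# used only to show the shape of the generated tag.
html = image_to_inline_html(b"\x89PNG\r\n\x1a\n")
print(html)
```

In the editor above, the bytes could presumably be obtained by saving the clipboard QImage into a QBuffer, and the resulting tag inserted with cursor.insertHtml(...). Both steps are assumptions about how this approach would be connected to the tutorial code.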
I have a dictionary file called “labels” that contains text objects.
When I display the contents of this file, I get the following:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, '54.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, '62.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, '24.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')}
I want to change the IP address in the text object portion such that the file looks like this:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, 'xxx.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, 'xxx.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, 'xxx.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, 'xxx.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, 'xxx.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, 'xxx.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, 'xxx.63.160.219')}
This file is used for printing labels on a networkx graph.
I have a couple of questions.
Can the contents of a text object be modified?
If so, can it be changed without iterating through the file since the number of changes could range from 3 to 6,000, depending on what I am graphing?
How would I do it?
I did consider changing the IP addresses before I created my node and edge files, but that resulted in separate IP addresses being clustered together incorrectly. For example: 173.6.48.24 and 1.6.48.24 would both be converted to xxx.6.48.24.
Changing the IP address at the time of printing the labels seems like the only sensible method.
I am hoping someone could point me in the right direction. I have never dealt with text objects and I am out of my depth on this one.
Thanks
Additional information
The original data set is a list of IP addresses that have attacked several honeypots I am running. I have taken the data and catalogued it based on certain attack criteria.
The data that I showed was just one of the small attack networks. The label file was generated using the code:
labels = nx.draw_networkx_labels(compX, pos=pos_df)
Where compX is the file containing the data to be graphed and pos_df is the layout of the graph. In this case, I used nx.spring_layout().
I can display the contents of the label file using:
for k,v in labels.items():
    print(v)
However, “v” contains the text object, which I do not seem to be able to work with. The content of “v” is as follows:
Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240')
Text(-1.0, -0.5455880938500245, '54.66.152.105')
Text(0.948676253595717, 0.6530664635187481, '62.97.116.82')
Text(0.849485905682265, -0.778703553136851, '24.73.75.234')
Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23')
Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19')
Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')
This is where I am stuck. I do not seem to be able to come up with any code that does not return some kind of “'Text' object has no attribute xxxx”.
As for replacing the first octet, I have the following code that works on a dataframe and I have just been experimenting to see if I can adapt it but so far, no luck:
df[column_ID] = df[column_ID].apply(lambda x: "xxx."+".".join(x.split('.')[1:4])) # Replace First octet
As I said, I would prefer not to iterate through the file. This cluster has seven entries; others can contain up to 6,000 nodes – granted the graph looks like a hairball with this many nodes, but most are between 3 and 25 nodes. I have a total of 60 clusters and as I collect more information, this number will rise.
I found a solution to replacing text inside a text object:
1) Convert text object to string
2) Find the position to be changed and make the change
3) Use set_text() to make the change to the text object
Example code
# Anonymize Source IP address
for k,v in labels.items():
    a = str(v)
    a = a[a.find(", '"):]
    a = 'xxx' + a[a.find("."):][:-2]
    v.set_text(a)
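As a variation on the steps above, the label text can be read with get_text() instead of being parsed out of str(v). This is only a sketch, assuming the values in labels are matplotlib Text objects (which provide get_text() and set_text()); the helper names mask_first_octet and anonymize_labels are hypothetical.

```python
def mask_first_octet(ip):
    """Replace the first octet of a dotted IPv4 string with 'xxx'."""
    _, _, rest = ip.partition(".")
    return "xxx." + rest

def anonymize_labels(labels):
    """Mask the displayed text of each label in place.

    Only the text drawn on the graph changes; the dictionary keys
    (the real node names) are left untouched.
    """
    for text_obj in labels.values():
        text_obj.set_text(mask_first_octet(text_obj.get_text()))

# Assumed usage with the objects from the question:
#   labels = nx.draw_networkx_labels(compX, pos=pos_df)
#   anonymize_labels(labels)
print(mask_first_octet("175.123.98.240"))
```

Because only the drawn text is changed, nodes such as 173.6.48.24 and 1.6.48.24 remain separate in the graph even though their labels look the same. The loop still visits every label, but each step is a single string operation, so a few thousand labels should not be a practical concern.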
I need to read the records from a mainframe file and apply some filters to the record values.
So I am looking for a solution to convert the mainframe file to CSV, text, or an Excel workbook so that I can easily perform the operations on the file.
I also need to validate the record count.
Who said anything about EBCDIC? The OP didn't.
If it is all text then FTP'ing with EBCDIC to ASCII translation is doable, including within Python.
If not then either:
The extraction and conversion to CSV needs to happen on z/OS. Perhaps with a COBOL program. Then the CSV can be FTP'ed down with
or
The data has to be FTP'ed BINARY and then parsed and bits of it translated.
But, as so often is the case, we need more information.
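For the binary-transfer case, here is a rough sketch of what "parsed and bits of it translated" could look like in Python. Everything about the record layout is hypothetical (an 8-byte EBCDIC text field followed by a 4-byte big-endian binary integer), and code page cp037 is an assumption; the real layout and code page depend on the actual file.

```python
import csv
import io
import struct

# Hypothetical record layout (an assumption for illustration only):
#   bytes 0-7   name, EBCDIC text in code page 037, space padded
#   bytes 8-11  amount, 4-byte big-endian signed binary integer
RECORD = struct.Struct(">8si")

def binary_to_csv(raw):
    """Convert fixed-length binary records to CSV text.

    Returns (csv_text, record_count) so the count can be validated.
    """
    if len(raw) % RECORD.size:
        raise ValueError("data length is not a multiple of the record length")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "amount"])
    count = 0
    for name_bytes, amount in RECORD.iter_unpack(raw):
        # Only the text field is translated from EBCDIC; the integer is used as-is.
        writer.writerow([name_bytes.decode("cp037").rstrip(), amount])
        count += 1
    return out.getvalue(), count

# Two hand-built sample records (0x40 is the EBCDIC space character).
raw = (RECORD.pack("HELLO".encode("cp037").ljust(8, b"\x40"), 100)
       + RECORD.pack("WORLD".encode("cp037").ljust(8, b"\x40"), -5))
csv_text, count = binary_to_csv(raw)
print(csv_text)
print("records:", count)
```

Returning the record count alongside the CSV text is one simple way to cover the record-count validation mentioned in the question.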
I was recently processing the hardcopy log and wanted to break the record apart. I used Python to do this, as the record was effectively a fixed-position record with different data items at fixed locations in the record. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
          1         2         3         4         5         6         7         8         9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]
    self.routing = list(message[2:9])
    self.routingCodes = []  # These are routing codes extracted from the system log.
    self.sysname = message[10:18]
    self.date = message[19:24]
    self.time = message[25:36]
    self.ident = message[37:45]
    self.msgflags = message[46:54]
    self.msg = [ message[56:] ]
You can then format the fields into the form you need for further processing. There are other ways to process mainframe data, and many variations, but based on the question this approach should suit your needs.
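As a sketch of that last step, the same column offsets can be used to turn records into CSV and count them. The names FIELDS, parse_record and records_to_csv are hypothetical, and the spacing of the sample record is an assumption reconstructed from the offsets in the code above.

```python
import csv
import io

# Field name -> (start, end) offsets, taken from the slices used above.
FIELDS = {
    "command": (1, 2),
    "routing": (2, 9),
    "sysname": (10, 18),
    "date": (19, 24),
    "time": (25, 36),
    "ident": (37, 45),
    "msgflags": (46, 54),
    "msg": (56, None),
}

def parse_record(line):
    """Slice one fixed-position record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, (start, end) in FIELDS.items()}

def records_to_csv(lines):
    """Return (csv_text, record_count) for the non-empty lines given."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELDS))
    writer.writeheader()
    count = 0
    for line in lines:
        if not line.strip():
            continue
        writer.writerow(parse_record(line.rstrip("\n")))
        count += 1
    return out.getvalue(), count

# Sample record; column spacing assumed to match the offsets in FIELDS.
sample = "N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  $HASP373 PWUB02#C STARTED - INIT 17"
csv_text, count = records_to_csv([sample])
print(csv_text)
print("records:", count)
```

The resulting CSV text can be written to a file and opened in Excel, and the returned count gives a simple check on the number of records processed.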