Problem with .xls file validation on e-commerce platform - excel

You may have noticed that this is a long question. That is because I made a real effort to explain the many WTFs I am facing. It may still not be clear enough, but I appreciate your help!
Context
I'm working on an integration project for a client that processes a lot of data to generate Excel files in .xls format (note the extension!).
While developing the project I used the xlrd and xlwt Python packages because, again, I need to create a .xls file. At some point, though, I had to download and extract a file that came with a .csv extension but actually contains an HTML table :c
So I decided to use pandas to read the HTML into a DataFrame that I can manipulate and then write out as a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload the file to the e-commerce platform.
The platform does not validate my file.
First, a brief explanation of how the site works: it accepts .xls files and only .xls files, and probably processes them to update its database. I have no access to the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site did not map them correctly, I can map Excel columns to the id or to the values that will be updated in the database.
The 'generico4' field expects the type 'smallint(5) unsigned'.
An important fact: I sent the file to my client so he could validate the data, and after many conversations we discovered that if he simply downloads my file, opens it, and saves it, the upload works fine (the second image from my slide). Note that he uses a MacBook and I use Ubuntu. I tried doing the same thing on my machine, but it did not work.
He sent me his saved file and I compared the two but found no difference: the numbers have the same type, 'float', and the Excel formula =TYPE(cell) returns 1 for them.
I have already tried many other things, but nothing works :c
The code
The code below should give you an idea of the logic:
import os
from pathlib import Path

import pandas as pd
import xlwt

def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1, 2]]
    df.rename(columns={1: 'sku', 2: 'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (writing it via pandas did not work either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy the DataFrame data to the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    # BASE_DIR is defined elsewhere in the project
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, "stock.xls")
    wb_out.save(file_path)
    return file_path

Related

How to read the most recent Excel export into a Pandas dataframe without specifying the file name?

I frequent a real estate website that shows recent transactions, from which I will download data to parse within a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I already have a few older "Generic_File" exports in my Downloads folder, the newly exported file may be named "Generic_File_(3)" or "Generic_File_(21)".
Ideally, I'd like my workflow to look like this: export this Excel file of real estate sales, then run a Python script to read in the most recent export as a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the number appended to the Excel export every time. I want the pd.read_excel method to simply read the "Generic_File" that has the largest number appended (which will obviously correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os module and write a function that reads the file names in the Downloads folder. By parsing the file names you can then find the file that follows your naming format and has the highest copy number. Something like the following might help you get started.
import os
downloads_dir = 'C:/Users/[username here]/Downloads/'
downloads = os.listdir(downloads_dir)
# Keep only regular files (checking for '.' in the name would also match folders that contain a dot)
files = [item for item in downloads if os.path.isfile(os.path.join(downloads_dir, item))]
# ** INSERT CODE HERE TO IDENTIFY THE FILE OF INTEREST **
Regex might be the best way to find matches if you have a diverse listing of files in your Downloads folder.
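As a rough sketch of the missing step, the function below picks the file with the highest copy number using a regex. The name latest_export is hypothetical, and the ".xlsx" extension and the exact "Generic_File_(N)" pattern are assumptions based on the names in the question, so adjust them to what your browser actually produces.

```python
import re
from pathlib import Path


def latest_export(folder, stem="Generic_File", suffix=".xlsx"):
    """Return the export with the highest copy number, or None if there is none.

    Hypothetical helper: the stem, suffix and "_(N)" pattern are assumptions.
    """
    # Matches "Generic_File.xlsx" as well as "Generic_File_(3).xlsx".
    pattern = re.compile(
        rf"^{re.escape(stem)}(?:_\((\d+)\))?{re.escape(suffix)}$"
    )
    best_path, best_number = None, -1
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match and path.is_file():
            # The un-numbered file counts as copy 0.
            number = int(match.group(1) or 0)
            if number > best_number:
                best_path, best_number = path, number
    return best_path
```

The returned path can be passed straight to pd.read_excel. If the copy number is not a reliable indicator (for example after deleting old exports), selecting by modification time with max(..., key=lambda p: p.stat().st_mtime) over the matching files is an alternative.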

How to keep the share properties of an excel with python openpyxl?

I am having trouble keeping the sharing properties of an Excel workbook. I tried the approach from "Python and openpyxl is saving my shared workbook as unshared", but the zout part just undoes all the modifications I made with the script.
To explain the problem:
There's an Excel file that is shared, so several people can modify it
Python reads from it and writes to it
When I save the workbook to the Excel file, it either drops the sharing property or, when I try to keep it, none of my modifications are saved
Can someone help me, please?
I'll get a little more precise, as requested.
The sharing mode is the one Microsoft provides. You can see the button below:
Share button Excel
The Excel file is stored on a server. Several users can write to it at the same time, but when I launch my script it automatically removes the sharing property, so everyone who is writing to it can no longer make modifications and every modification they made is lost.
At first I handled the Excel file the normal way:
import openpyxl
DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']
# ...my modifications on ws...
DLT.save(myPath)
DLT.close()
But then I tried this (from "Python and openpyxl is saving my shared workbook as unshared"):
import zipfile
import openpyxl
DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']
# Read every entry of the original (shared) file into memory
zin = zipfile.ZipFile(myPath, 'r')
buffers = []
for item in zin.infolist():
    buffers.append((item, zin.read(item.filename)))
zin.close()
# ...my modifications on ws...
DLT.save(myPath)
# Write the original entries back over the file that was just saved
zout = zipfile.ZipFile(myPath, 'w')
for item, buffer in buffers:
    zout.writestr(item, buffer)
zout.close()
DLT.close()
The second version just doesn't save my modifications to ws.
What I would like is to not lose the sharing property: I need to keep it while I write to the file. I am not sure that is possible. My one alternative is to use another file and copy/paste the new data by hand from that file into the DLT one.
Well... after playing with it back and forth, it turns out that ZipFile.infolist() also includes the sheet data, so writing the old entries back overwrites your changes. Here is my way to fine-tune it, using the shared_pyxl_save example the previous gentleman provided.
Basically, instead of letting the old file's contents override the sheet's data, take the sheet data from the newly saved file.
import zipfile

def shared_pyxl_save(file_path, workbook):
    """
    `file_path`: path to the shared file you want to save
    `workbook`: the object returned by openpyxl.load_workbook()
    """
    # Keep every entry of the original (shared) file except the sheet data
    zin = zipfile.ZipFile(file_path, 'r')
    buffers = []
    for item in zin.infolist():
        if "sheet1.xml" not in item.filename:
            buffers.append((item, zin.read(item.filename)))
    zin.close()
    workbook.save(file_path)
    # Loop through again to find sheet1.xml in the newly saved file and
    # put it into the buffer, otherwise Excel will show an error
    zin2 = zipfile.ZipFile(file_path, 'r')
    for item in zin2.infolist():
        if "sheet1.xml" in item.filename:
            buffers.append((item, zin2.read(item.filename)))
    zin2.close()
    # Finally, write the combined entries back to the file
    zout = zipfile.ZipFile(file_path, 'w')
    for item, buffer in buffers:
        zout.writestr(item, buffer)
    zout.close()
    workbook.close()
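The same entry-merging idea can be sketched with only the standard library and a predicate instead of the hard-coded "sheet1.xml" check, which may help if more than one sheet is modified. The names merge_zip_entries and take_from_new are hypothetical, and the sketch assumes the modified workbook was first saved to a separate path (new_path). Whether Excel accepts the result for a given workbook is not something this sketch verifies.

```python
import zipfile


def merge_zip_entries(old_path, new_path, out_path, take_from_new):
    """Build out_path from the entries of old_path, except entries whose name
    satisfies take_from_new(name), which are copied from new_path instead."""
    with zipfile.ZipFile(old_path) as old_zip, zipfile.ZipFile(new_path) as new_zip:
        # Everything is read into memory first, so out_path may equal old_path.
        buffers = [(item, old_zip.read(item.filename))
                   for item in old_zip.infolist()
                   if not take_from_new(item.filename)]
        buffers += [(item, new_zip.read(item.filename))
                    for item in new_zip.infolist()
                    if take_from_new(item.filename)]
    with zipfile.ZipFile(out_path, "w") as out_zip:
        for item, data in buffers:
            out_zip.writestr(item, data)
```

For example, passing lambda name: name.startswith("xl/worksheets/") as take_from_new would take all worksheet entries from the newly saved file and everything else from the original shared file.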

Import Excel file into ngx-datatable - Angular 8

I have seen multiple posts on exporting ngx-datatable to csv/xlsx. However, I did not come across any post which says Import Excel file into ngx-datatable which is basically what I need. I need to read an excel file that user uploads and display into ngx-datatable (so basically excel file acting as source for ngx-datatable)
Any guidelines / help links to proceed will be a great help.
If you can convert the file to CSV, there is a library called ngx-csv-parser (https://www.npmjs.com/package/ngx-csv-parser) that parses the data into an array of objects, which is the shape ngx-datatable expects. It says it is designed for Angular 13 but is compatible with previous versions. I've tested it in Angular 10 and it does work.
It has a setting for whether to use a header row. If you do, you can give the prop of each entry in your ngx-datatable columns the same name as the corresponding header.
Example:
Let's say you have a csv file like this:
ColumnA,ColumnB,ColumnC
a,b,c
The output of using this lib (as described in its readme) with header = true will be:
csvRecords = [{ ColumnA: 'a', ColumnB: 'b', ColumnC: 'c' }]
Let's say you also have an array of columns:
columns = [
  { name: 'A', prop: 'ColumnA' },
  { name: 'B', prop: 'ColumnB' },
  { name: 'C', prop: 'ColumnC' }
]
Then use columns and csvRecords in your html.
<ngx-datatable
class="material"
[rows]="csvRecords"
[columns]="columns"
>
</ngx-datatable>
Your table will be filled with data from your csv.

Avoid overwriting of files with "for" loop

I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe within the df_cleaned list as a separate csv file locally.
I have the following code done which saves the file with its original title. But I see that it overwrites and manages to save a copy of only the last dataframe.
How can I fix it? According to my very basic knowledge perhaps I could use a break-continue statement in the loop? But I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
Every iteration writes to the same path because name never changes inside the loop. You can create a different name for each file. As an example, in the following I append the index to name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name, i))
print('Saving of files as csv is complete.')
This creates files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
There is a very easy way of solving this. I just figured out the answer myself and am posting it to help someone else.
fileNames is a list I created at the start of the code to save the names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
This saves a separate copy of each file in the defined directory.
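One remaining edge case: if two of the chosen files share the same name (for example, from different folders), the later one would still overwrite the earlier one. A minimal sketch of one way to guard against that is below. The helper unique_csv_paths is hypothetical, and it assumes the names are given without the .csv extension.

```python
from pathlib import Path


def unique_csv_paths(names, out_dir):
    """Return one .csv path per name, appending _1, _2, ... to repeated names."""
    used = set()
    paths = []
    for name in names:
        stem, n = name, 0
        # Keep bumping the counter until the stem has not been used yet.
        while stem in used:
            n += 1
            stem = f"{name}_{n}"
        used.add(stem)
        paths.append(Path(out_dir) / f"{stem}.csv")
    return paths


# Intended use with the lists from the question (not run here):
# for df, path in zip(df_cleaned, unique_csv_paths(fileNames, out_dir)):
#     df.to_csv(path)
```

Pairing the dataframes and paths with zip also avoids indexing by range(len(...)).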

filled PDF fields showing up differently in different contexts

I have a Python script that creates a number of PDF forms (0 - 10) and then concatenates them into one form. The fields on the compiled PDF show up differently in 4 different contexts. I am developing on Debian Linux, and the PDF viewer (Okular) does not show any fields within the compiled PDF, whereas on Windows 10, if I open the PDF with Chrome, I have to hover over a field to see its value. It has the correct field data for the first page; however, each subsequent page is just a duplicate of the first page, which is incorrect. If I open the PDF with Microsoft Edge, it correctly displays the form data for each page; however, when I go to print with Edge, none of the form data shows up.
I am using pdfrw for writing to the PDF and PyPDF2 for merging. I have tried a number of different things, including attempting to flatten the PDF with Python (for which there is very little support, by the way), reading and writing instead of merging, and attempting to convert the form fields into text, along with many other things that I have since forgotten because they did not work.
import pdfrw

# Annot_Key, Subtype_Key, Widget_Subtype_Key and Annot_Field_Key are
# constants defined elsewhere in the script (not shown here).
def writeToPdf(unfilled, output, data, fields):
    '''Function writes the data from data to unfilled, and saves it as output'''
    # TODO: Use literal declarations for lists, dicts, etc
    checkboxes = [
        'misconduct_complete',
        'misconduct_incomplete',
        'not_final_exam',
        'supervise_exam',
        'not_final_home_exam',
        'not_final_assignment',
        'not_final_oral_exam',
        'not_final_lab_exam',
        'not_final_practical_exam',
        'not_final_other'
    ]
    template_pdf = pdfrw.PdfReader(unfilled)
    annotations = template_pdf.pages[0][Annot_Key]
    for annotation in annotations:
        # TODO: Singly nested if's with no else's suggest a logic problem, find a clearer way to do this.
        if annotation[Subtype_Key] == Widget_Subtype_Key:
            if annotation[Annot_Field_Key]:
                # Strip the surrounding parentheses from the field name
                key = annotation[Annot_Field_Key][1:-1]
                if key in fields:
                    if key in checkboxes:
                        annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                    elif key == 'course':
                        annotation.update(pdfrw.PdfDict(V='{}'.format(data[key][0:8])))
                    else:
                        annotation.update(pdfrw.PdfDict(V='{}'.format(data[key])))
    pdfrw.PdfWriter().write(output, template_pdf)
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, IndirectObject, NameObject

def set_need_appearances_writer(writer):
    # basically used to ensure there are no
    # overlapping form fields, which makes printing hard
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
    return writer

def mergePDFs(listOfPdfPaths, outputPDf):
    '''Function Merges a list of pdfs into a single one, and saves it to outputPDf'''
    pdf_writer = PdfFileWriter()
    set_need_appearances_writer(pdf_writer)
    pdf_writer.setPageMode('/UseOC')
    for path in listOfPdfPaths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(outputPDf, 'wb') as fh:
        pdf_writer.write(fh)
As mentioned above, the results differ between contexts. On Debian Linux, Okular shows no form fields. On Windows 10, Google Chrome shows duplicated fields after the first page (and I have to hover over or click a field to see its value). Microsoft Edge displays each page correctly with its own field data, but its print preview shows no form data.
If anyone else is having this quite obscure problem: the behavior is unspecified for the use case I was dealing with (a fillable template form whose copies all share the same field names). The only solution available with Python at the moment (at least that I found in my many hours of researching and testing) was to flatten the PDF: create a separate PDF, write the form data at the desired locations (I did this with reportlab), then overlay the template PDF with the created PDF. Overall this is not a good solution for many reasons, so if you have a better one, please post it!

Resources