Unit test for reading an Excel file with pandas - excel

I need to write a unit test case for the code below:
def read_data(self, data):
    """Read data from excel file.

    :param data: str, data in file
    :return: str, data after reading excel file
    """
    try:
        read_data = pd.read_excel(data)
        return read_data
    except Exception as e:
        logger.info("Not able to read data. Error :- {}".format(e))
        raise e
I am reading an Excel file in the above code, which gives me data like this:
refer screenshot.
So, how do I store the above data, after reading it from the Excel sheet, as dummy data so that I can assert it against my original data?
Thanks

Necroposting this because I had the same need.
This answer can point you in the right direction:
See also Saving the Dataframe output to a string in the XlsxWriter docs.
From the example you can build something like this:
import pandas as pd
import io
# Create a Pandas dataframe from the data.
df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})
output = io.BytesIO()
# Use the BytesIO object as the filehandle.
writer = pd.ExcelWriter(output, engine='xlsxwriter')
# Write the data frame to the BytesIO object.
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
# Read the BytesIO object back to a data frame - here you should use your method
xlsx_data = pd.read_excel(output)
# Assert that the data frame is the same as the original
pd.testing.assert_frame_equal(xlsx_data, df)
Basically you flip the problem around: you build a data frame with some data in it, save it in a temporary file-like object, pass that object to your method, and then assert that the data is the same as the one you created.
NOTE: It needs pandas 0.17+
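Putting the pieces together, a minimal sketch of what a unit test for the read_data method above could look like (the class name ExcelReader is a placeholder for whichever class actually owns the method):
import io
import unittest
import pandas as pd
class TestReadData(unittest.TestCase):
    def test_read_data_returns_original_frame(self):
        # Build the dummy data and write it to an in-memory Excel file.
        expected = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})
        buffer = io.BytesIO()
        expected.to_excel(buffer, index=False)
        buffer.seek(0)
        # ExcelReader is a placeholder for the class that owns read_data.
        result = ExcelReader().read_data(buffer)
        pd.testing.assert_frame_equal(result, expected)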

Related

How to get specific data from excel

Any idea on how I can access or get the boxed data (see image) under the TI_Binning tab of an Excel file using Python? What module or similar code can you recommend? I just need those specific data and to append them to another file such as a .txt file.
Getting the data you circled:
import pandas as pd
df = pd.read_excel('yourfilepath', 'TI_Binning', skiprows=2)
df = df[['Number', 'Name']]
To append to an existing text file:
import numpy as np
with open("filetoappenddata.txt", "ab") as f:
    np.savetxt(f, df.values)
More info here on np.savetxt for formats to fit your output needs.
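For example, a sketch of using the fmt parameter to control the layout (the %s format and tab delimiter are assumptions about the desired output):
import numpy as np
with open("filetoappenddata.txt", "ab") as f:
    # %s renders both the numeric 'Number' column and the text 'Name' column;
    # the tab delimiter is just one possible choice.
    np.savetxt(f, df.values, fmt="%s", delimiter="\t")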

How to format a Spark dataframe into an Excel file with colours based on data and then write it to Azure storage

Problem statement: the data is a structured table in Spark; you need to query it, convert it, and write it to an xlsx file with colour coding, such as mandatory columns orange, optional columns yellow, and rows where values are missing red.
There are different approaches, but they didn't work, as the styling gets lost when you try to write.
I tried converting the Spark dataframe, did conditional formatting, and tried writing using BlockBlobService's create_blob_from_text, but it didn't work.
from io import BytesIO
import pandas as pd
from azure.storage.blob import BlockBlobService
blobService = BlockBlobService(account_name="storageaccountname", account_key="Storage Key", protocol='https')
# sample = pd.DataFrame(sample_dict)
sample = pd_data_df
# Create a Pandas Excel writer using XlsxWriter as the engine.
output = BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
sample.to_excel(writer, sheet_name='Sheet1')
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add a format.
format1 = workbook.add_format({'bg_color': 'red'})
# Get the dimensions of the dataframe.
(max_row, max_col) = sample.shape
# Apply a conditional format to the required cell range.
worksheet.conditional_format(1, 1, max_row, max_col,
                             {'type': 'blanks', 'format': format1})
# Close the Pandas Excel writer and output the Excel file.
writer.save()
xlsx_data = output.getvalue()
## Need to write xlsx_data to blob storage from here
blobService.create_blob_from_bytes(container_name, frolder_path_with_file_name, xlsx_data)
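The snippet above assumes pd_data_df already holds the data as a pandas DataFrame. As a sketch only (not from the original post), one way to get there from the Spark table is toPandas(); the spark session variable and the query below are placeholders:
# Sketch: pull the Spark query result down to pandas before styling with XlsxWriter.
# The table name and query are illustrative placeholders.
spark_df = spark.sql("SELECT * FROM structured_table")
pd_data_df = spark_df.toPandas()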

How to import an image into a sheet in an excel file using StyleFrame library?

I am trying to create an excel file which will have two sheets in it. I would like to import an image file into the first sheet followed by a dataframe/table in the other. There are no issues while creating the table in the second sheet but I am having issues while importing the image file into the first sheet.
From what I have read online, the insert_image() method works only if ExcelWriter() has the 'engine=xlsxwriter' parameter specified. However, StyleFrame explicitly mentions that it doesn't support the engine parameter. Is there a way around this?
Here's my code:
import pandas as pd
import PIL.Image
from styleframe import StyleFrame, Styler, utils
imageFile = 'test.jpeg'
writer = StyleFrame.ExcelWriter('file.xlsx')
resDf = pd.DataFrame(data={'ColA': [1, 2], 'ColB': [3, 4]})
tempDf = StyleFrame(resDf)
tempDf.apply_headers_style(styler_obj=Styler(bold=True, font_size=11, font_color=utils.colors.black, font=utils.fonts.calibri, protection=True))
tempDf.apply_column_style(cols_to_style=['ColA'], width=14, styler_obj=Styler(bold=False, font=utils.fonts.calibri, font_size=11, horizontal_alignment=utils.horizontal_alignments.center))
tempDf.apply_column_style(cols_to_style=['ColB'], width=14, styler_obj=Styler(bold=False, font=utils.fonts.calibri, font_size=11, horizontal_alignment=utils.horizontal_alignments.center))
#Insert the above styleframe object into the second sheet in the excel
tempDf.to_excel(writer, startrow=0, sheet_name='Second')
img = PIL.Image.open(imageFile)
newImg = img.resize((650,500))
newImg.save(imageFile)
#I would like to insert the new image file into the first sheet of the excel. How can I do it here?
writer.save()
writer.book.close()
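The thread has no accepted answer, but since StyleFrame writes through openpyxl, one possible workaround (a sketch based on that assumption, not something StyleFrame documents) is to add the image to the underlying openpyxl workbook exposed by writer.book; the sheet name 'First' and the anchor cell 'A1' are illustrative:
from openpyxl.drawing.image import Image as OpenpyxlImage
# Assumption: writer.book is the openpyxl Workbook behind StyleFrame's ExcelWriter.
# Create the image sheet at index 0 so it sits before the 'Second' data sheet,
# then anchor the resized image at a cell. Do this before calling writer.save().
img_sheet = writer.book.create_sheet('First', 0)
img_sheet.add_image(OpenpyxlImage('test.jpeg'), 'A1')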

How to ignore corrupted rows in pandas to_sql

I save a pandas dataframe to a Postgres DB table using to_sql(). If one of the rows is in an incorrect format, psycopg2.DataError is raised and the whole dataframe does not get saved. I tried to catch the error and save the rows one by one with chunksize = 1, but the result is the same. How can I ignore the corrupted rows? This is the code I use:
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://postgres@localhost/db_name')
df = pd.read_csv(filename, chunksize=CHUNKSIZE, error_bad_lines=False)
for chunk in df:
    try:
        chunk.to_sql(TABLE_NAME, con=engine)
    except:
        chunk.to_sql(TABLE_NAME, con=engine, chunksize=1)
I expect to_sql to ignore only the corrupted rows and save all the others. Is that achievable? The current workaround is to split the dataframe into smaller parts and save them one by one, but that is expensive.
I would suggest a similar approach to yours, but with exception handling added:
def insert_data(df, table_name):
    try:
        df.to_sql(table_name, con=engine, if_exists='append', index=False)
    except Exception as error:
        print(error)
        df.to_sql(table_name, con=engine, if_exists='append', index=False, chunksize=1000)
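If the goal is specifically to skip only the corrupted rows rather than retry the whole chunk, a minimal sketch (an extension of the idea above, not the answer's exact code) is to fall back to inserting row by row and swallowing the per-row error; engine and the table are assumed to exist as before:
def insert_data_skipping_bad_rows(df, table_name):
    # Try the fast path first; on failure, insert row by row and skip offenders.
    try:
        df.to_sql(table_name, con=engine, if_exists='append', index=False)
    except Exception:
        for i in range(len(df)):
            try:
                # iloc[[i]] keeps the single row as a DataFrame, which to_sql expects.
                df.iloc[[i]].to_sql(table_name, con=engine, if_exists='append', index=False)
            except Exception as row_error:
                print("Skipping row {}: {}".format(i, row_error))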

Compress a CSV file written to a StringIO Buffer in Python3

I'm parsing text from PDF files into rows of ordered char metadata; I need to serialize these files to cloud storage, which is all working fine. However, due to their size I'd also like to gzip these files, but I've run into some issues there.
Here is my code:
import io
import csv
import zlib
# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
stored_format = zlib.compress(output_buffer)
This reads each row into the io.StringIO buffer successfully, but gzip/zlib seem to work only with bytes-like objects like io.BytesIO, so the last line errors; I can't write the CSV into a BytesIO buffer either, because DictWriter/writer raise an error unless io.StringIO() is used.
Thank you for your help!
I figured this out and wanted to show my answer for anyone who runs into this:
The issue is that zlib.compress expects a bytes-like object; this actually doesn't mean either StringIO or BytesIO, as both of these are "file-like" objects which implement read(), like your normal unix file handles.
All you have to do to fix this is write the CSV to a StringIO(), then get the string from the StringIO() object and encode it into a bytestring; it can then be compressed by zlib.
import io
import csv
import zlib
# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
encoded = output_buffer.getvalue().encode()
stored_format = zlib.compress(encoded)
I have an alternative answer for anyone interested, which should use less intermediate space; it needs Python 3.3 and over to use the getbuffer() method:
from io import BytesIO, TextIOWrapper
import csv
import zlib
def compress_csv(series):
    byte_buf = BytesIO()
    fp = TextIOWrapper(byte_buf, newline='', encoding='utf-8')
    writer = csv.writer(fp)
    for row in series:
        writer.writerow(row)
    # Flush the text wrapper so all buffered rows reach the underlying bytes buffer.
    fp.flush()
    compressed = zlib.compress(byte_buf.getbuffer())
    fp.close()
    byte_buf.close()
    return compressed
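As a quick sanity check (not part of either answer), the compressed bytes, whether stored_format from the first snippet or the value returned by compress_csv, can be round-tripped with zlib.decompress and decoded back to CSV text:
import zlib
# Round-trip the compressed payload back into readable CSV text.
restored_csv = zlib.decompress(stored_format).decode()
print(restored_csv.splitlines()[0])  # the header row: char,position,page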
