Is there an equivalent of ExcelWriter in PySpark?

I am exporting my DataFrame to an Excel file and conditionally formatting it with colors (so PyExcelerate is not an option for me). By far the slowest step is the toPandas conversion. Is there a way to write the file directly from the Spark DataFrame? The code is this:
excel_writer = pd.ExcelWriter("excel_output.xlsx", engine='xlsxwriter')
# Convert the Spark dataframe to a Pandas dataframe (the slow step).
print_seconds_since_start("To pandas")
pd_df_a_escribir = df_a_escribir.toPandas()
print_seconds_since_start("Fin to pandas")
# Convert the dataframe to an XlsxWriter Excel object.
pd_df_a_escribir.to_excel(excel_writer, sheet_name=name_hoja)
# Get the xlsxwriter workbook and worksheet objects.
workbook = excel_writer.book
worksheet = excel_writer.sheets[name_hoja]
Any alternative would have to be faster, since the conversion time is the problem right now.
Thanks a lot in advance!

Related

pandas dataframe to a single sheet in multisheet excel file

Sometimes we open a multi-sheet Excel file, do some operations on one sheet, and then save it back to the same file or to a new file. Given that the operations are done on a pandas dataframe, how can I copy the result back to the target sheet?
import openpyxl as op
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd
wbk=op.load_workbook("fileName.xlsx")
wsht=wbk['verbList']
#create dataframe with sheet data and operate
df = pd.read_excel("fileName.xlsx", sheet_name="verbList")
df.insert(0,"newCol2","") #sample operation
dataframe_to_rows(df, index=False, header=True) #dataframe converted to rows
#for loop from dataframe_to_rows moves back rows to excel file
#trying to avoid loops here
wsht["B1"].value="verbs"
wbk.save(basePath + "fileName-update.xlsx")
Any ideas, anyone?
If any other Python Excel library does the job, please let me know.

Issue when exporting dataframe to csv

I'm working on a mechanical engineering project. For the following code, the user enters the number of cylinders that their compressor has. A dataframe is then created with the correct number of columns and is exported to Excel as a CSV file.
The outputted dataframe looks exactly like I want it to as shown in the first link, but when opened in Excel it looks like the image in the second link:
1.my dataframe
2.Excel Table
Why is my dataframe not exporting properly to Excel and what can I do to get the same dataframe in Excel?
import pandas as pd
CylinderNo=int(input('Enter CylinderNo: '))
new_number=CylinderNo*3
list1=[]
for i in range(1,CylinderNo+1):
    for j in range(0,3):
        Cylinder_name=str('CylinderNo ')+str(i)
        list1.append(Cylinder_name)
df = pd.DataFrame(list1,columns =['Kurbel/Zylinder'])
list2=['Triebwerk', 'Packung','Ventile']*CylinderNo
Bauteil = {'Bauteil': list2}
df2 = pd.DataFrame (Bauteil, columns = ['Bauteil'])
new=pd.concat([df, df2], axis=1)
list3=['Nan','Nan','Nan']*CylinderNo
Bewertung={'Bewertung': list3}
df3 = pd.DataFrame (Bewertung, columns = ['Bewertung'])
new2=pd.concat([new, df3], axis=1)
Empfehlung={'Empfehlung': list3}
df4 = pd.DataFrame (Empfehlung, columns = ['Empfehlung'])
new3=pd.concat([new2, df4], axis=1)
new3.set_index('Kurbel/Zylinder')
new3 = new3.set_index('Kurbel/Zylinder', append=True).swaplevel(0,1)
#export dataframe to csv
new3.to_csv('new3.csv')
To be clear, a comma-separated values (CSV) file is not an Excel format type or table. It is a delimited text file that Excel, like other applications, can open.
What you are comparing is simply presentation. Both data frames are exactly the same. For MultiIndex data frames, the pandas print output does not repeat index values, for readability on the console or in an IDE like Jupyter. Those values are not removed from the underlying data frame, only from its display. If you reorder the index levels, you will see the presentation change. The complete data frame is what is exported to CSV. And for data integrity, you want the full data set exported with to_csv to be importable back into pandas with read_csv (which can set indexes) or into other languages and applications.
Essentially, CSV is an industry format for storing and transferring data. Consider using Excel spreadsheets, HTML or Markdown, or other reporting formats for your presentation needs; to_csv may not be the best method for that. You can try to build the text file manually with Python I/O write methods (with open('new.csv', 'w') as f), but that will be an extensive workaround. See also @Jeff's answer here, but do note that the latter part of that solution does remove data.
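The point about display versus stored data can be checked on a small frame. This is only a sketch: the two-cylinder data and the choice of 'Kurbel/Zylinder' plus 'Bauteil' as index levels are assumptions made for illustration, not the asker's exact index.

```python
import io
import pandas as pd

# Hypothetical stand-in for the question's frame: two cylinders, three parts each.
df = pd.DataFrame({
    "Kurbel/Zylinder": ["CylinderNo 1"] * 3 + ["CylinderNo 2"] * 3,
    "Bauteil": ["Triebwerk", "Packung", "Ventile"] * 2,
    "Bewertung": [0] * 6,
})
indexed = df.set_index(["Kurbel/Zylinder", "Bauteil"])

# Console view: repeated outer index labels are left blank for readability.
print(indexed)

# CSV text: the outer label is written on every row.
csv_text = indexed.to_csv()
print(csv_text)

# Reading the text back with the same two index columns rebuilds the MultiIndex.
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored)
```

If the blank cells are what you want in a spreadsheet, writing with to_excel instead of to_csv is the route the answer suggests, since that is a presentation format.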

Read excel into pandas dataframe without modifying the values of excel?

I am reading an xlsx file using Python's Pandas pd.read_excel(myfile.xlsx,sheet_name="my_sheet",header=2) and writing the df to a csv file using df.to_csv.
The Excel file contains several columns with percentage values (e.g. 27.44 %). In the dataframe the values are converted to 0.2744. I don't want any modification of the data. How can I achieve this?
I already tried:
Using a lambda function to convert the value back from 0.2744 to 27.44 %. I don't want this, because the column names/index are not fixed: any column can contain % values.
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':str}) - Didn't work
df = pd.read_excel(myexcel.xlsx,sheet_name="my_sheet",header=5,dtype={'column_name':object}) - Didn't work
Tried xlrd module also, but that too converted % values to float.
df = pd.read_excel("myexcel.xlsx",sheet_name="my_sheet")
df.to_csv("mycsv.csv",sep=",",index=False)
From Excel, save your xlsx file directly in CSV format.
To import the CSV file, use the pandas library as follows:
import pandas as pd
df=pd.read_csv('my_sheet.csv') #in case your file located in the same directory
more information on pandas.read_csv

Pandas read_csv adding some very small values to the dataframe

When I use pandas read_csv, pandas adds a tiny amount to some values in the dataframe: -0.079257 becomes -0.07925700000000001. Why is this happening, and how can I fix it? It only happens to some specific values, while others seem fine.
I've tried using float_precision, but it doesn't seem to do anything. I'm new to pandas.
df = pd.read_csv('filepath')
print(df.iat[0,0])
I changed the dataset file type from txt to csv manually using notepad.
This is because your original data has np.float32 precision.
import numpy as np
import pandas as pd
df = pd.read_csv('./avila/avila-ts.txt')
print(df.iat[0,0]) # 0.13029200000000002
# stored as np.float32
df.to_csv('./my.csv',float_format=np.float32, index_label=False)
df_1 = pd.read_csv('./my.csv')
print(df_1.iat[0,0]) # 0.13029200000000002
# stored as np.float16
df.to_csv('./my.csv',float_format=np.float16, index_label=False)
df_1 = pd.read_csv('./my.csv')
print(df_1.iat[0,0]) # 0.1302
I don't know how your data is structured. Could you open the data and check, or better still, post a screenshot?
data = pd.read_csv('filepath')
data.head()
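Separately from the float32 explanation above, it may help to see that the two numbers in the question are two different binary doubles, neither of which is exactly the decimal -0.079257. The sketch below uses only the standard library; reading it as "the parser landed on a neighbouring double" is an interpretation, not something the thread confirms.

```python
from decimal import Decimal

parsed = float("-0.079257")                 # Python's own correctly rounded parse
neighbour = float("-0.07925700000000001")   # the value printed in the question

# Decimal(float) shows the exact binary value stored for each double.
print(Decimal(parsed))
print(Decimal(neighbour))

# The two doubles are distinct, and repr() prints the shortest text
# that reads back to each one.
print(parsed == neighbour)
print(repr(parsed), repr(neighbour))

# Rounding to the six decimals present in the file recovers the expected value.
print(round(neighbour, 6) == parsed)
```

The read_csv documentation lists 'round_trip' as one of the float_precision choices, which may be worth trying if it was not the setting already tested. For output only, rounding on display or on export avoids the long tail.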

Pandas: Generating a data frame from each spreadsheet in a large excel file

I have a large excel file which I have imported into pandas, made up of 92 sheets.
I want to use a loop or some tool to generate dataframes from the data in each spreadsheet (one dataframe from each spreadsheet), which also automatically names each dataframe.
I have only just started using pandas and jupyter so I am not very experienced at all.
This is the code I have so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
%matplotlib inline
concdata = pd.ExcelFile('Documents/Research Project/Data-Ana/11July-27Dec.xlsx')
I also have a list of all the spreadsheet names:
#concdata.sheet_names
Thanks!
Instead of making each DataFrame its own variable you can assign each sheet a name in a Python dictionary like so:
dfs = {}
for sheet in concdata.sheet_names:
    dfs[sheet] = concdata.parse(sheet)
And then access each DataFrame with the sheet name:
dfs['sheet_name_here']
Doing it this way allows you to have amortised O(1) lookup of sheets.
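The same dictionary can be built in one expression. This is a minimal sketch: load_sheets is a hypothetical helper name, and it assumes an already opened pd.ExcelFile like concdata in the question.

```python
import pandas as pd

def load_sheets(book):
    """Return {sheet_name: DataFrame} for every sheet of an open pd.ExcelFile.

    `load_sheets` is an illustrative helper, not part of pandas.
    """
    return {name: book.parse(name) for name in book.sheet_names}

# Usage with the workbook from the question:
# concdata = pd.ExcelFile('Documents/Research Project/Data-Ana/11July-27Dec.xlsx')
# dfs = load_sheets(concdata)
# dfs['sheet_name_here'].head()
```

pd.read_excel also accepts sheet_name=None, which returns a dictionary of all sheets keyed by sheet name without an explicit loop.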
