Iteratively read excel sheet names, split and save them as new columns for each sheet in Python - python-3.x

Let's say we have many Excel files, each with multiple sheets, such as the following file data1.xlsx:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I need to get the name of each sheet, split it on _, and store the first part as year, the second part as quarter, and the last part as city.
Finally, I will save the result back to an Excel file with multiple sheets.
i.e., for the first sheet:
a b c d year quarter city
0 1 2 23 2 2021 q1 bj
1 2 3 45 5 2021 q1 bj
2 1 2 23 6 2021 q1 bj
3 2 3 45 7 2021 q1 bj
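How could I achieve this in Python? Thanks.
To loop over all the Excel files:
import os
import pandas as pd
base_dir = './'
file_list = os.listdir(base_dir)
for file in file_list:
    if file.endswith('.xlsx'):
        file_path = os.path.join(base_dir, file)
        # sheet_name=None reads every sheet into a dict of DataFrames
        dfs = pd.read_excel(file_path, sheet_name=None)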

You can use f = pd.ExcelFile('data1.xlsx') to read the Excel file in as an object, then loop through the sheet names by iterating over f.sheet_names. Split each sheet name, such as the string "2019_q1_sh", into year, quarter and city, and set these as the values of new columns in the DataFrame read from that sheet.
Then create a dictionary with sheet names as keys and the corresponding modified DataFrames as values. You can create a custom save_xls function that takes in such a dictionary and saves it, as described in this helpful answer.
Update: since you want to loop through all Excel files in your current directory, you can use the glob library to get all of the files with the extension .xlsx, loop through each of these files, read them in, and save a new file with the string new_ in front of the file name.
import glob
import pandas as pd
def save_xls(dict_df, path):
    """
    Save a dictionary of dataframes to an excel file, with each dataframe as a separate page
    Reference: https://stackoverflow.com/questions/14225676/save-list-of-dataframes-to-multisheet-excel-spreadsheet
    """
    # the with block saves and closes the file on exit (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter(path) as writer:
        for key in dict_df:
            dict_df[key].to_excel(writer, sheet_name=key)
## loop through all excel files
for filename in glob.glob("*.xlsx"):
    f = pd.ExcelFile(filename)
    dict_dfs = {}
    for sheet_name in f.sheet_names:
        df_new = f.parse(sheet_name = sheet_name)
        ## get the year, quarter and city from the sheet name
        year, quarter, city = sheet_name.split("_")
        df_new["year"] = year
        df_new["quarter"] = quarter
        df_new["city"] = city
        ## populate dictionary
        dict_dfs[sheet_name] = df_new
    ## write one new file per input file, after all of its sheets are processed
    save_xls(dict_df = dict_dfs, path = "new_" + filename)
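The tuple unpacking above raises a ValueError as soon as a workbook contains a sheet whose name does not have exactly three parts. A minimal sketch of a more defensive split, assuming such sheets should simply be skipped (the helper name split_sheet_name is hypothetical, not part of the original answer):

```python
def split_sheet_name(sheet_name, sep="_"):
    """Return year/quarter/city fields, or None if the name does not fit."""
    parts = sheet_name.split(sep)
    if len(parts) != 3:
        # Not a "<year>_<quarter>_<city>" name, e.g. a summary sheet
        return None
    year, quarter, city = parts
    return {"year": year, "quarter": quarter, "city": city}


for name in ["2021_q1_bj", "2019_q2_sh", "Summary"]:
    print(name, "->", split_sheet_name(name))
```

Inside the sheet loop, a None result could trigger a continue, and the returned dictionary could be applied in one step with df_new.assign(**fields).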

Related

Is there a way to export multiple pandas Dataframes in different sheet names using "to_csv" [duplicate]

I need to export (save) multiple pandas DataFrames to a single Excel file, each in a different tab.
Let's suppose my DataFrames are:
df1:
Id Name Rank
1 Scott 4
2 Jennie 8
3 Murphy 1
df2:
Id Name Rank
1 John 14
2 Brown 18
3 Claire 11
df3:
Id Name Rank
1 Shenzen 84
2 Dass 58
3 Ghouse 31
df4:
Id Name Rank
1 Zen 104
2 Ben 458
3 Susuie 198
These are my four DataFrames, and I need to export them as one Excel file with 4 tabs, i.e. df1, df2, df3, df4.
A simple method would be to hold your items in a collection and use the pd.ExcelWriter class.
Let's use a dictionary.
#1 Create a dictionary with your tab name and dataframe.
dfs = {'df1' : df1, 'df2' : df2...}
#2 Create an Excel writer object.
writer = pd.ExcelWriter('excel_file_name.xlsx')
#3 Loop over your dictionary, write each dataframe, and save your Excel file.
for name, dataframe in dfs.items():
    dataframe.to_excel(writer, sheet_name=name, index=False)
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
Adding a path:
from pathlib import Path
trg_path = Path('your_target_path')
writer = pd.ExcelWriter(trg_path.joinpath('excel_file.xlsx'))
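When the tab names come from data rather than from hard-coded keys, they may break Excel's naming rules: a worksheet name is limited to 31 characters and may not contain any of the characters [ ] : * ? / \. A rough sketch of a clean-up step, where the helper name safe_sheet_name and the underscore replacement are assumptions made for illustration:

```python
import re

MAX_SHEET_NAME_LENGTH = 31  # Excel's limit for worksheet names
INVALID_CHARS = re.compile(r"[\[\]:*?/\\]")  # characters Excel rejects


def safe_sheet_name(name, replacement="_"):
    """Replace forbidden characters and truncate to Excel's length limit."""
    cleaned = INVALID_CHARS.sub(replacement, str(name))
    return cleaned[:MAX_SHEET_NAME_LENGTH]


print(safe_sheet_name("sales/2021: north"))
print(len(safe_sheet_name("x" * 40)))
```

The dictionary keys can be passed through such a helper before calling to_excel. Truncation can make two long names collide, so distinct keys are not guaranteed to stay distinct.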
Using xlsxwriter, you could do something like the following:
import xlsxwriter
import pandas as pd
### Create df's here ###
writer = pd.ExcelWriter('C:/yourFilePath/example.xlsx', engine='xlsxwriter')
workbook = writer.book
### First df tab
worksheet1 = workbook.add_worksheet('{}'.format('df1')) # The value in the parentheses is the tab name, so you can make that dynamic or hard code it
row = 0
col = 0
# iterate over the rows, not over the dataframe itself (that would yield column names)
for name, rank in df1[['Name', 'Rank']].values.tolist():
    worksheet1.write(row, col, name)
    worksheet1.write(row, col + 1, rank)
    row += 1
### Second df tab
worksheet2 = workbook.add_worksheet('{}'.format('df2'))
row = 0
col = 0
for name, rank in df2[['Name', 'Rank']].values.tolist():
    worksheet2.write(row, col, name)
    worksheet2.write(row, col + 1, rank)
    row += 1
### and so on for as many tabs as you want to create
writer.close() # closes the underlying workbook and writes the file
xlsxwriter allows you to do a lot of formatting as well. If you want to do that, check out the docs.

how to apply functions on multiple excel sheets in a loop in Python?

I have an Excel file with data like this on 57 sheets:
Cate asso_num
1 "a" 33
2 "a" 67
3 "b" 97
4 "b" 60
I want to group by category and get the mean of each category:
def grouping( excel_file_location):
    # should read all the excel sheets i.e 57 sheets currently in a loop (i dont know how to do it)
    fil = pd.read_excel(...)
    fil = fil.groupby("Cate").agg({"asso_num":"mean"})
    # and should write in that same excel sheet
I want to do it by writing a function only.
You can do the following:
def grouping(excel_file_location):
    # sheet_name=None returns a dict of {sheet name: DataFrame} for all sheets
    sheets_to_df = pd.read_excel(excel_file_location, sheet_name=None)
    df = pd.concat(sheets_to_df, ignore_index=True)
    df = df.groupby("Cate").agg({"asso_num":"mean"})
    return df
In my example I created an Excel file with the data you provided, made three sheets with exact copies of it, and set:
path = r"C:\....\SDEGOSSONDEVARENNE\Sheets.xlsx"
Calling grouping(path) returned:
      asso_num
Cate
a         50.0
b         78.5
You can also reset the index:
grouping(path).reset_index()
which gives:
  Cate  asso_num
0    a      50.0
1    b      78.5
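The function above pools all sheets into one table before grouping. If a separate mean per sheet is wanted instead, one possible variation is to group each sheet on its own. This is only a sketch: the in-memory dictionary stands in for the result of pd.read_excel(path, sheet_name=None), and the second sheet's values are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns
# a dict mapping each sheet name to its DataFrame.
sheets_to_df = {
    "sheet1": pd.DataFrame({"Cate": ["a", "a", "b", "b"],
                            "asso_num": [33, 67, 97, 60]}),
    "sheet2": pd.DataFrame({"Cate": ["a", "a", "b"],
                            "asso_num": [10, 20, 30]}),  # made-up values
}

# Group each sheet separately and keep the results keyed by sheet name.
per_sheet_means = {
    name: df.groupby("Cate").agg({"asso_num": "mean"})
    for name, df in sheets_to_df.items()
}

for name, means in per_sheet_means.items():
    print(name)
    print(means)
```

The resulting dictionary has the same shape as the input, so it could be written back one sheet at a time with pd.ExcelWriter.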

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes,
BUT I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Make it a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns = your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# get new names based on the values of a previous named column
new_column_names = []
counter = 0
for col_name in df.columns:
    if (col_name[:7].strip()=="Unnamed"):
        # prefix the label found in the first row with the last named column
        new_column_names.append(base_name+"_"+df.iloc[0,counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter +=1
# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(),new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop first row (it only held the labels)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6
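The renaming rule can also be checked without reading a file. A small sketch of the same logic on plain lists, assuming pandas' default "Unnamed: N" labels for empty header cells (the function name build_column_names is hypothetical):

```python
def build_column_names(columns, first_row):
    """Prefix each 'Unnamed' column with the last named column before it."""
    new_names = []
    base_name = None
    for col, label in zip(columns, first_row):
        if str(col).startswith("Unnamed") and base_name is not None:
            # Sub-column: combine the parent name with the first-row label
            new_names.append(base_name + "_" + str(label).strip().replace(" ", "_"))
        else:
            # Named column: it becomes the parent for the columns that follow
            base_name = col
            new_names.append(col)
    return new_names


columns = ["Sales", "Unnamed: 2", "Unnamed: 3",
           "Validation", "Unnamed: 5", "Unnamed: 6"]
first_row = [None, "Commented", "No comment", None, "Commented", "No comment"]
print(build_column_names(columns, first_row))
```

With a DataFrame, the two arguments would be df.columns and df.iloc[0], and the result can be assigned straight to df.columns.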

How to write content of a list into an Excel sheet using openpyxl

I have the following list:
d_list = ["No., Start Name, Destination, Distance (miles)",
"1,ALBANY,NY CRAFT,28",
"2,GRACO,PIONEER,39",
"3,FONDA,ROME,41",
"4,NICCE,MARRINERS,132",
"5,TOUCAN,SUBVERSIVE,100",
"6,POLL,CONVERGENCE,28",
"7,STONE HOUSE,HUDSON VALLEY,9",
"8,GLOUCESTER GRAIN,BLACK MUDD POND,75",
"9,ARMY LEAGUE,MUMURA,190",
"10,MURRAY,FARMINGDALE,123"]
So, basically, the list consists of thousands of elements (just showed here a sample of 10), each is a string of comma separated elements. I'd like to write this into a new worksheet in a workbook.
Note: the workbook already exists and contains other sheets, I'm just adding a new sheet with this data.
My code:
import openpyxl
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')
for i in range(len(d_list)):
    sheet.append(list(d_list[i]))
I'm expecting (in this example) 11 rows of data, each with 4 columns. I do get 11 rows, but with each character of each string written to its own cell! I think I am almost there ... what am I missing? (Note: I've read through all the available posts related to this topic, but couldn't find any that answers this specific type of question, hence I'm asking.)
Many thanks!
You can use pandas to solve this:
1.) Convert your list into a dataframe:
In [231]: l
Out[231]:
['No., Start Name, Destination, Distance (miles)',
'1,ALBANY,NY CRAFT,28',
'2,GRACO,PIONEER,39',
'3,FONDA,ROME,41',
'4,NICCE,MARRINERS,132',
'5,TOUCAN,SUBVERSIVE,100',
'6,POLL,CONVERGENCE,28',
'7,STONE HOUSE,HUDSON VALLEY,9',
'8,GLOUCESTER GRAIN,BLACK MUDD POND,75',
'9,ARMY LEAGUE,MUMURA,190',
'10,MURRAY,FARMINGDALE,123']
In [228]: df = pd.DataFrame([i.split(",") for i in l])
In [229]: df
Out[229]:
0 1 2 3
0 No. Start Name Destination Distance (miles)
1 1 ALBANY NY CRAFT 28
2 2 GRACO PIONEER 39
3 3 FONDA ROME 41
4 4 NICCE MARRINERS 132
5 5 TOUCAN SUBVERSIVE 100
6 6 POLL CONVERGENCE 28
7 7 STONE HOUSE HUDSON VALLEY 9
8 8 GLOUCESTER GRAIN BLACK MUDD POND 75
9 9 ARMY LEAGUE MUMURA 190
10 10 MURRAY FARMINGDALE 123
2.) Write the above DataFrame to Excel as a new sheet with 4 columns:
import pandas as pd
path = "data.xlsx"
# mode='a' appends a sheet to the existing workbook instead of overwriting it
with pd.ExcelWriter(path, engine = 'openpyxl', mode = 'a') as writer:
    # the header is already row 0 of df, so skip pandas' own index and header
    df.to_excel(writer, sheet_name = 'distance', index = False, header = False)
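The original attempt fails because list() applied to a string returns its individual characters. As an alternative to the pandas route, each string can be split into fields first. A minimal sketch using the standard csv module on a shortened copy of the list:

```python
import csv

d_list = ["No., Start Name, Destination, Distance (miles)",
          "1,ALBANY,NY CRAFT,28",
          "2,GRACO,PIONEER,39"]

# list() on a string yields one element per character, which is why
# every character ended up in its own cell.
print(list(d_list[1])[:3])

# csv.reader splits each string on commas; skipinitialspace drops the
# blank that follows each comma in the header line.
rows = list(csv.reader(d_list, skipinitialspace=True))
for row in rows:
    print(row)
```

Each row can then be passed to sheet.append(row) in the question's loop. All values stay strings here, so numeric columns such as the distance would need an explicit int() conversion if Excel should treat them as numbers.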

How can I input values from a list or dataframe into each cell in existing excel file?

So basically, I want to update a worksheet with new data, overwriting existing cells in excel. Both files have the same column names (I do not want to create a new workbook nor add a new column).
Here I am retrieving the data that I want:
import pandas as pd
df1 = pd.read_csv
print(df1)
Output (I just copied and pasted the first rows; there are about 500 rows in total):
Index Type Stage CDID Period Index Value
0 812008000 6 2 JTV9 201706 121.570
1 812008000 6 2 JTV9 201707 121.913
2 812008000 6 2 JTV9 201708 121.686
3 812008000 6 2 JTV9 201709 119.809
4 812008000 6 2 JTV9 201710 119.841
5 812128000 6 1 K2VA 201706 122.030
The existing excel file has the same columns (and row total) as df1, but I just want to have the 'Index' column repopulated with the new values. Let's just say it looks like this (i.e. so I want the previous values for Index to go into the corresponding column):
Index Type Stage CDID Period Index Value
0 512901100 6 2 JTV9 201706 121.570
1 412602034 6 2 JTV9 201707 121.913
2 612307802 6 2 JTV9 201708 121.686
3 112808360 6 2 JTV9 201709 119.809
4 912233066 6 2 JTV9 201710 119.841
5 312128003 6 1 K2VA 201706 122.030
Here I am retrieving the excel file, and attempting to overwrite it:
from win32com.client import Dispatch
import os
xl = Dispatch("Excel.Application")
xl.Visible = True
wbs_path = ('folder path')
for wbname in os.listdir(wbs_path):
    if not wbname.endswith("file name.xlsx"):
        continue
    wb = xl.Workbooks.Open(wbs_path + '\\' + wbname)
    sh = wb.Worksheets("sheet name")
    sh.Range("A1:A456").Value = df1[["Index"]]
    wb.Save()
    wb.Close()
xl.Quit()
But this doesn't do anything.
If I type in strings, such as:
sh.Range("A1:A456").Value = 'o', 'x', 'c'
This repeats o in cells A1 through A456 (it updates the spreadsheet), but ignores x and c. I have tried converting df1 into a list and a numpy array, but this doesn't work.
Does anyone know a solution or alternative workaround?
If the index of the dataframe is the same, you can update columns by using update(). It could work like this:
df1.update(df2['Index'].to_frame())
Note: the to_frame() is probably not needed.
EDIT:
Since you are trying to update an Excel file and not a dataframe, my answer is probably not enough.
For this part I would suggest loading the file into a dataframe, updating the data and saving it.
df1 = pd.read_excel('file.xlsx', sheet_name='sheet_name')
# do the update
# note: this rewrites file.xlsx, so any other sheets in it are not kept
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df1.to_excel(writer, sheet_name='sheet_name')
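To make the update step concrete, here is a small sketch with two in-memory frames built from the first rows shown in the question. The variable names existing and new are assumptions: existing stands for the sheet as loaded with pd.read_excel, new for the frame holding the fresh Index values.

```python
import pandas as pd

# What the Excel sheet currently holds (first rows from the question)
existing = pd.DataFrame({
    "Index": [512901100, 412602034, 612307802],
    "CDID": ["JTV9", "JTV9", "JTV9"],
    "Period": [201706, 201707, 201708],
})

# The frame holding the values that should replace the Index column
new = pd.DataFrame({
    "Index": [812008000, 812008000, 812008000],
    "CDID": ["JTV9", "JTV9", "JTV9"],
    "Period": [201706, 201707, 201708],
})

# update() aligns on the row index and overwrites in place; passing only
# the Index column leaves every other column untouched.
existing.update(new[["Index"]])
print(existing)
```

Because update() aligns rows by index label, both frames need the same row index for the values to land in the right rows.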
