How to apply functions on multiple Excel sheets in a loop in Python? - python-3.x

I have an Excel file with data like this on each of 57 sheets:
Cate asso_num
1 "a" 33
2 "a" 67
3 "b" 97
4 "b" 60
I want to group by category and get the mean of each category:
def grouping(excel_file_location):
    # should read all the Excel sheets (currently 57) in a loop (I don't know how to do it)
    fil = pd.read_excel(...)
    fil = fil.groupby("Cate").agg({"asso_num": "mean"})
    # and should write the result back to that same Excel sheet
I want to do this by writing a function only.

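You can do the following:
import pandas as pd

def grouping(excel_file_location):
    # sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}
    sheets_to_df = pd.read_excel(excel_file_location, sheet_name=None)
    # stack all sheets into one DataFrame, then average per category
    df = pd.concat(sheets_to_df, ignore_index=True)
    df = df.groupby("Cate").agg({"asso_num": "mean"})
    return df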
In my example, I created an Excel file with the data you provided, made three sheets that are exact copies of it, and set:
path = r"C:\....\SDEGOSSONDEVARENNE\Sheets.xlsx"
Calling grouping(path)
returned:
      asso_num
Cate
a         50.0
b         78.5
You can also reset the index:
grouping(path).reset_index()
which gives:
  Cate  asso_num
0    a      50.0
1    b      78.5
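The function above pools all sheets into one result. If a separate mean per sheet is wanted instead (the question mentions writing back to each sheet), a minimal sketch could look like the following. The helper name mean_per_sheet and the in-memory sample data are hypothetical, and the commented-out write-back step assumes that overwriting each sheet is acceptable and that openpyxl is installed.

```python
import pandas as pd


def mean_per_sheet(sheets):
    """Return {sheet name: per-category mean of asso_num} for a dict of DataFrames."""
    return {
        name: df.groupby("Cate", as_index=False)["asso_num"].mean()
        for name, df in sheets.items()
    }


# Stand-in for pd.read_excel(path, sheet_name=None), which returns such a dict.
sample = pd.DataFrame({"Cate": ["a", "a", "b", "b"], "asso_num": [33, 67, 97, 60]})
sheets = {"Sheet1": sample, "Sheet2": sample.assign(asso_num=sample["asso_num"] * 2)}

results = mean_per_sheet(sheets)
print(results["Sheet1"])

# Writing back (assumption: replacing each sheet's contents is acceptable):
# with pd.ExcelWriter(path, engine="openpyxl", mode="a",
#                     if_sheet_exists="replace") as writer:
#     for name, df in results.items():
#         df.to_excel(writer, sheet_name=name, index=False)
```

Keeping the grouping step separate from file I/O makes it easy to check the numbers before any sheet is overwritten.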

Related

Iteratively read excel sheet names, split and save them as new columns for each sheet in Python

Let's say we have many excel files with the multiple sheets as the following file data1.xlsx:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I need to obtain the sheet name of each sheet, split it by _, and store the first part as year, the second part as quarter, and the last part as city.
Finally, I will save them back to an Excel file with multiple sheets.
i.e., for the first sheet:
a b c d year quarter city
0 1 2 23 2 2021 q1 bj
1 2 3 45 5 2021 q1 bj
2 1 2 23 6 2021 q1 bj
3 2 3 45 7 2021 q1 bj
How could I achieve this in Python? Thanks.
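To loop over all the Excel files:
import os
import pandas as pd

base_dir = './'
file_list = os.listdir(base_dir)
for file in file_list:
    if file.endswith('.xlsx'):
        file_path = os.path.join(base_dir, file)
        # sheet_name=None reads all sheets into a dict of DataFrames
        dfs = pd.read_excel(file_path, sheet_name=None)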
You can use f = pd.ExcelFile('data1.xlsx') to read the Excel file in as an object, then loop through the sheet names by iterating over f.sheet_names. Split each sheet name, such as the string "2019_q1_sh", into year, quarter, and city, and set these as the values of new columns in the DataFrame read from that sheet.
Then create a dictionary with the sheet names as keys and the corresponding modified DataFrames as values. You can create a custom save_xls function that takes such a dictionary and saves it, as described in this helpful answer.
Update: since you want to loop through all Excel files in your current directory, you can use the glob library to get all files with the extension .xlsx, loop through them, read each one in, and save a new file with the string new_ in front of the file name:
import glob

import pandas as pd
from pandas import ExcelWriter


def save_xls(dict_df, path):
    """
    Save a dictionary of dataframes to an excel file, with each dataframe as a separate page
    Reference: https://stackoverflow.com/questions/14225676/save-list-of-dataframes-to-multisheet-excel-spreadsheet
    """
    # the context manager saves and closes the file on exit
    with ExcelWriter(path) as writer:
        for key in dict_df:
            dict_df[key].to_excel(writer, sheet_name=key)


## loop through all excel files
for filename in glob.glob("*.xlsx"):
    f = pd.ExcelFile(filename)
    dict_dfs = {}
    for sheet_name in f.sheet_names:
        df_new = f.parse(sheet_name=sheet_name)
        ## get the year, quarter and city from the sheet name
        year, quarter, city = sheet_name.split("_")
        df_new["year"] = year
        df_new["quarter"] = quarter
        df_new["city"] = city
        ## populate dictionary
        dict_dfs[sheet_name] = df_new
    save_xls(dict_df=dict_dfs, path="new_" + filename)
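The three-way unpacking of sheet_name.split("_") assumes every sheet name has exactly three parts. As a hedged sketch, a small helper can make that assumption explicit and report the offending name when it does not hold. The names SheetInfo and parse_sheet_name are hypothetical and not part of the answer above.

```python
from typing import NamedTuple


class SheetInfo(NamedTuple):
    year: str
    quarter: str
    city: str


def parse_sheet_name(sheet_name):
    """Split a name like '2021_q1_bj' into its three parts, or raise ValueError."""
    parts = sheet_name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected 'year_quarter_city', got {sheet_name!r}")
    return SheetInfo(*parts)


info = parse_sheet_name("2019_q2_sh")
print(info.year, info.quarter, info.city)  # prints: 2019 q2 sh
```

Inside the loop, the result could be unpacked the same way as before: year, quarter, city = parse_sheet_name(sheet_name).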

Merge data frames based on column with different rows

I have multiple CSV files that I read into individual data frames based on their name in the directory, like so:
# ask user for path
path = input('Enter the path for the csv files: ')
os.chdir(path)
# loop over filenames and read into individual dataframes
for fname in os.listdir(path):
    if fname.endswith('Demo.csv'):
        demoRaw = pd.read_csv(fname, encoding = 'utf-8')
    if fname.endswith('Key2.csv'):
        keyRaw = pd.read_csv(fname, encoding = 'utf-8')
Then I filter to keep only certain columns:
# filter to keep desired columns only
demo = demoRaw.filter(['Key', 'Sex', 'Race', 'Age'], axis=1)
key = keyRaw.filter(['Key', 'Key2', 'Age'], axis=1)
Then I create a list of the above dataframes and use reduce to merge them on Key
# create list of data frames for combined sheet
dfs = [demo, key]
# merge the list of data frames on the Key
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
Then I drop the auto-generated column, create an Excel writer, and write to an Excel file:
# drop the auto generated index column
combined.set_index('RecordKey', inplace=True)
# create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('final.xlsx', engine='xlsxwriter')
# write to Excel
combined.to_excel(writer, sheet_name='Combined')
meds.to_excel(writer, sheet_name='Meds')
# Close the Pandas Excel writer and output the Excel file.
writer.close()
The problem is some files have keys that aren't in others. For example
Demo file
Key Sex Race Age
1 M W 52
2 F B 25
3 M L 78
Key file
Key Key2 Age
1 7325 52
2 4783 25
3 1367 78
4 9435 21
5 7247 65
Right now, it will only include rows if there is a matching key in each (in other words it just leaves out the rows with keys not in the other files). How can I combine all rows from all files, even if keys don't match? So the end result will look like this
Key  Sex  Race  Age  Key2  Age
1    M    W     52   7325  52
2    F    B     25   4783  25
3    M    L     78   1367  78
4                    9435  21
5                    7247  65
I don't care if the empty cells are blanks, NaN, #N/A, etc. Just as long as I can identify them.
Replace
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
with:
combined = pd.merge(demo, key, how='outer', on='Key')
You have to specify how='outer' so that the result keeps every row from both the Key and Demo tables, not only the rows whose keys match.
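As an illustration, here is a minimal sketch of the outer merge using the sample rows from the question. The suffix names _demo and _key are an assumption added here to tell the two Age columns apart. The reduce variant at the end shows how the same option could be kept when more than two frames are merged.

```python
from functools import reduce

import pandas as pd

demo = pd.DataFrame({"Key": [1, 2, 3], "Sex": ["M", "F", "M"],
                     "Race": ["W", "B", "L"], "Age": [52, 25, 78]})
key = pd.DataFrame({"Key": [1, 2, 3, 4, 5],
                    "Key2": [7325, 4783, 1367, 9435, 7247],
                    "Age": [52, 25, 78, 21, 65]})

# how="outer" keeps keys found in either frame; cells with no match become NaN.
# Both frames have an "Age" column, so the suffixes below keep them distinct.
combined = pd.merge(demo, key, how="outer", on="Key", suffixes=("_demo", "_key"))
print(combined)

# The reduce pattern from the question also works once how="outer" is passed.
dfs = [demo, key]
combined_all = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="Key"), dfs
)
```

Rows 4 and 5 appear with NaN in the Sex, Race, and Age_demo columns, which satisfies the requirement that missing cells be identifiable.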

How to write content of a list into an Excel sheet using openpyxl

I have the following list:
d_list = ["No., Start Name, Destination, Distance (miles)",
"1,ALBANY,NY CRAFT,28",
"2,GRACO,PIONEER,39",
"3,FONDA,ROME,41",
"4,NICCE,MARRINERS,132",
"5,TOUCAN,SUBVERSIVE,100",
"6,POLL,CONVERGENCE,28",
"7,STONE HOUSE,HUDSON VALLEY,9",
"8,GLOUCESTER GRAIN,BLACK MUDD POND,75",
"9,ARMY LEAGUE,MUMURA,190",
"10,MURRAY,FARMINGDALE,123"]
So, basically, the list consists of thousands of elements (just showed here a sample of 10), each is a string of comma separated elements. I'd like to write this into a new worksheet in a workbook.
Note: the workbook already exists and contains other sheets, I'm just adding a new sheet with this data.
My code:
import openpyxl
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')
for i in range(len(d_list)):
    sheet.append(list(d_list[i]))
I'm expecting (in this example) 11 rows of data, each with 4 columns. I do get 11 rows, but every character of each string is written into its own cell! I think I'm almost there ... what am I missing? (Note: I've read through all the available posts related to this topic, but couldn't find any that answers this specific type of question, hence I'm asking.)
Many thanks!
You can use pandas to solve this:
1.) Convert your list into a dataframe:
In [231]: l
Out[231]:
['No., Start Name, Destination, Distance (miles)',
'1,ALBANY,NY CRAFT,28',
'2,GRACO,PIONEER,39',
'3,FONDA,ROME,41',
'4,NICCE,MARRINERS,132',
'5,TOUCAN,SUBVERSIVE,100',
'6,POLL,CONVERGENCE,28',
'7,STONE HOUSE,HUDSON VALLEY,9',
'8,GLOUCESTER GRAIN,BLACK MUDD POND,75',
'9,ARMY LEAGUE,MUMURA,190',
'10,MURRAY,FARMINGDALE,123']
In [228]: df = pd.DataFrame([i.split(",") for i in l])
In [229]: df
Out[229]:
0 1 2 3
0 No. Start Name Destination Distance (miles)
1 1 ALBANY NY CRAFT 28
2 2 GRACO PIONEER 39
3 3 FONDA ROME 41
4 4 NICCE MARRINERS 132
5 5 TOUCAN SUBVERSIVE 100
6 6 POLL CONVERGENCE 28
7 7 STONE HOUSE HUDSON VALLEY 9
8 8 GLOUCESTER GRAIN BLACK MUDD POND 75
9 9 ARMY LEAGUE MUMURA 190
10 10 MURRAY FARMINGDALE 123
2.) Write the above DataFrame to Excel in a new sheet:
import pandas as pd
path = "data.xlsx"
# mode="a" appends a new sheet to the existing workbook instead of overwriting the file
with pd.ExcelWriter(path, engine="openpyxl", mode="a") as writer:
    df.to_excel(writer, sheet_name="distance")
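For reference, the original openpyxl loop writes one character per cell because list() applied to a string returns its individual characters. A minimal sketch of the fix, using only the standard library to split each string into fields first, is shown below with a shortened copy of the list. The openpyxl calls are left as comments and assume the wb and sheet objects from the question.

```python
import csv

d_list = ["No., Start Name, Destination, Distance (miles)",
          "1,ALBANY,NY CRAFT,28",
          "2,GRACO,PIONEER,39",
          "3,FONDA,ROME,41"]

# csv.reader accepts any iterable of strings; skipinitialspace drops the
# blank that follows each comma in the header line.
rows = list(csv.reader(d_list, skipinitialspace=True))

print(list(d_list[1])[:4])  # what list() does: ['1', ',', 'A', 'L']
print(rows[1])              # ['1', 'ALBANY', 'NY CRAFT', '28']

# With the workbook from the question, each row can then be appended as is:
# for row in rows:
#     sheet.append(row)
# wb.save('data.xlsx')
```

Note that the fields stay strings; converting the numeric columns with int() before appending would store them as numbers in the sheet.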

Assigning variables to cells in a Pandas table (Python)

I'm working on a script that takes test data from a website, assigns the data to a variable, then creates a pie chart of the responses for later analysis. I'm able to pull the data without a problem and format the information into a table, but I can't figure out how to assign a specific variable to a cell in the table.
For example, say question 1 had 20% of students answer A, 20% answer B, 30% answer C, and 30% answer D. I would like to take this information and assign it to the variables 1A for A, 1B, for B, etc.
I think the answer lies in this code. I've tried splitting columns and rows, but it looks like the column header doesn't correlate to the data below it. I'm also attaching the results of 'print(df)' below.
header = table.find_all('tr')[2]
cols = header.find_all('td')
cols = [ele.text.strip() for ele in cols]
cols = cols[0:3] + cols[4:8] + cols[9:]
df = pd.DataFrame(data, columns = cols)
print(df)
A/1 B/2 C/3 D/4 CORRECT MC ANSWER
0 6 84 1 9 B
1 6 1 91 2 C
2 12 1 14 72 D
3 77 3 11 9 A
4 82 7 8 2 A
Do you want to try something like this with 'autopct'?
# transpose so each question becomes a column, keep only the four answer rows, and make them numeric
df1 = df.T.set_axis(['Question '+str(i+1) for i in df.T.columns.values], axis=1).iloc[:4].astype(float)
ax = df1.plot.pie(subplots=True,autopct='%1.1f%%',layout=(5,1),figsize=(3,15),legend=False)
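On the original request for variables such as 1A and 1B: Python identifiers cannot start with a digit, so those exact names are not possible as variables. One common alternative is a dictionary keyed by such labels. The sketch below copies the numbers from the printed DataFrame in the question; the names responses, choices, and by_label are hypothetical.

```python
# Rows copied from the printed DataFrame in the question (one row per question).
responses = [
    [6, 84, 1, 9],
    [6, 1, 91, 2],
    [12, 1, 14, 72],
    [77, 3, 11, 9],
    [82, 7, 8, 2],
]
choices = ["A", "B", "C", "D"]

# Key each value by question number plus choice letter, e.g. "1A", "3D".
by_label = {
    f"{q}{choice}": value
    for q, row in enumerate(responses, start=1)
    for choice, value in zip(choices, row)
}

print(by_label["1B"])  # 84
print(by_label["3D"])  # 72
```

With the DataFrame itself, a single cell can also be read directly, for example df.at[0, 'B/2'] for question 1, answer B.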

How to delete an entire row that has an empty cell in any column of an Excel sheet containing both text and numbers

This is my data in Excel. How can I delete every row that has an empty cell in any column, using MATLAB? The sheet contains both text and numbers:
(col1 col2 col3)
OAS01 0 74
OAS02 0 55
OAS03 0.5 73
OAS04 24
OAS05 21
OAS06 20
OAS07 0 74
OAS08 0 52
OAS09 1 30
OAS01 81
I want to get output like this, with every row that has any empty cell deleted:
(col1 col2 col3)
OAS01 0 74
OAS02 0 55
OAS03 0.5 73
OAS07 0 74
OAS08 0 52
OAS09 1 30
I have tried this, but it is not working well:
[num, text, a] = xlsread('data.xlsx');
for i = 1:size(data,1)
    if isnan(num(i,1))
        a(i,:) =[];
        num(i,:) =[];
    else
        a(i,:) =a(i,:);
        num(i,:) =num(i,:);
    end
end
xlswrite('newfile.xlsx', a);
A much more elegant way:
T = {'a','b';'','c';'d','e'}
>> T =
    'a'    'b'
    ''     'c'
    'd'    'e'
T(any(cellfun(@isempty,T),2),:) = []
>> T =
    'a'    'b'
    'd'    'e'
------EDIT-----
OP said it is not working, so I checked. The reason is that empty cells are loaded as NaNs by the xlsread function, so this line should fix it:
[num, text, a] = xlsread('data.xlsx');
a(any(cellfun(@(x) any(isnan(x)),a),2),:) = [];
where a is the cell array that the OP loaded in.
Explanation: cellfun applies a function to every cell of a cell array. Here we want to delete the rows that contain a NaN, so MATLAB's isnan is applied to each cell, and the inner any collapses the result to a single logical value per cell: 1 if the cell contains a NaN, 0 if it does not. The outer any(..., 2) then works across each row and produces the logical index used to delete rows: 1 marks a row with at least one NaN, 0 marks a row with none.
