How can I input values from a list or dataframe into each cell in existing excel file? - excel

So basically, I want to update a worksheet with new data, overwriting existing cells in excel. Both files have the same column names (I do not want to create a new workbook nor add a new column).
Here I am retrieving the data that I want:
import pandas as pd
df1 = pd.read_csv
print(df1)
Output (I just copied and pasted the first few rows; there are about 500 rows in total):
Index Type Stage CDID Period Index Value
0 812008000 6 2 JTV9 201706 121.570
1 812008000 6 2 JTV9 201707 121.913
2 812008000 6 2 JTV9 201708 121.686
3 812008000 6 2 JTV9 201709 119.809
4 812008000 6 2 JTV9 201710 119.841
5 812128000 6 1 K2VA 201706 122.030
The existing excel file has the same columns (and row total) as df1, but I just want to have the 'Index' column repopulated with the new values. Let's just say it looks like this (i.e. so I want the previous values for Index to go into the corresponding column):
Index Type Stage CDID Period Index Value
0 512901100 6 2 JTV9 201706 121.570
1 412602034 6 2 JTV9 201707 121.913
2 612307802 6 2 JTV9 201708 121.686
3 112808360 6 2 JTV9 201709 119.809
4 912233066 6 2 JTV9 201710 119.841
5 312128003 6 1 K2VA 201706 122.030
Here I am retrieving the excel file, and attempting to overwrite it:
from win32com.client import Dispatch
import os
xl = Dispatch("Excel.Application")
xl.Visible = True
wbs_path = ('folder path')
for wbname in os.listdir(wbs_path):
    if not wbname.endswith("file name.xlsx"):
        continue
    wb = xl.Workbooks.Open(wbs_path + '\\' + wbname)
    sh = wb.Worksheets("sheet name")
    sh.Range("A1:A456").Value = df1[["Index"]]
    wb.Save()
    wb.Close()
xl.Quit()
But this doesn't do anything.
If I type in strings, such as:
sh.Range("A1:A456").Value = 'o', 'x', 'c'
This repeats 'o' in every cell from A1 to A456 (so it does update the spreadsheet), but ignores 'x' and 'c'. I have tried converting df1 into a list and into a numpy array, but that doesn't work either.
Does anyone know a solution or alternative workaround?

If the index of the dataframe is the same you can update columns by using update(). It could work like this:
df1.update(df2['Index'].to_frame())
Note: the to_frame() call is probably not needed.
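To make the update() behaviour concrete, here is a minimal sketch with two small frames standing in for the tables in the question. The names old and new and the three sample rows are made up for illustration; the only assumption is that both frames share the same row index.

```python
import pandas as pd

# Hypothetical stand-ins for the existing sheet and the new data.
old = pd.DataFrame({"Index": [512901100, 412602034, 612307802],
                    "CDID": ["JTV9", "JTV9", "JTV9"]})
new = pd.DataFrame({"Index": [812008000, 812008000, 812008000],
                    "CDID": ["JTV9", "JTV9", "JTV9"]})

# update() works in place and aligns on the row index, so only the
# 'Index' column of `old` is overwritten; 'CDID' is left untouched.
old.update(new[["Index"]])

print(old)
```

Passing new[["Index"]] (double brackets) already gives a one-column DataFrame, which is why to_frame() is not needed in this form.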
EDIT:
Since you are trying to update an Excel file and not a dataframe, my answer is probably not enough.
For this part I would suggest loading the file into a dataframe, updating the data and saving it.
df1 = pd.read_excel('file.xlsx', sheet_name='sheet_name')
# do the update
# the engine is an argument of ExcelWriter; the with block saves the file on exit
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    # index=False avoids writing the dataframe index as an extra column
    df1.to_excel(writer, sheet_name='sheet_name', index=False)
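On the win32com side of the question: as far as I know, Excel treats the Value of a multi-cell Range as a two-dimensional array of rows and columns, which would explain why a flat tuple only ever writes its first element. A minimal sketch of reshaping the values first follows. The helpers as_column and column_address are hypothetical names, and the assumption that row 1 holds the header is mine, not something stated in the question.

```python
def as_column(values):
    """Wrap each value in its own row: [a, b, c] -> [[a], [b], [c]]."""
    return [[value] for value in values]


def column_address(column, first_row, count):
    """Build an A1-style address covering `count` cells of one column."""
    return f"{column}{first_row}:{column}{first_row + count - 1}"


new_index = [812008000, 812008000, 812128000]  # e.g. df1["Index"].tolist()
cells = as_column(new_index)
address = column_address("A", 2, len(cells))   # assumes row 1 holds the header

print(address)  # A2:A4
print(cells)    # [[812008000], [812008000], [812128000]]

# With the worksheet object from the question (not run here):
# sh.Range(address).Value = cells
```

Building the address from the number of values keeps the range and the data the same size, so nothing is silently repeated or dropped.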

Related

Iteratively read excel sheet names, split and save them as new columns for each sheet in Python

Let's say we have many excel files with the multiple sheets as the following file data1.xlsx:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I need to obtain sheet name for each sheet, then split them by _, store the first part as year, the second part as quarter, and the last part as city.
Finally, I will save them back to an Excel file with multiple sheets.
i.e., for the first sheet:
a b c d year quarter city
0 1 2 23 2 2021 q1 bj
1 2 3 45 5 2021 q1 bj
2 1 2 23 6 2021 q1 bj
3 2 3 45 7 2021 q1 bj
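How could I achieve this in Python? Thanks.
To loop over all the Excel files:
import os
import pandas as pd
base_dir = './'
file_list = os.listdir(base_dir)
for file in file_list:
    if file.endswith('.xlsx'):
        file_path = os.path.join(base_dir, file)
        # sheet_name=None reads every sheet into a dict of DataFrames
        dfs = pd.read_excel(file_path, sheet_name=None)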
You can use f = pd.ExcelFile('data1.xlsx') to read the Excel file in as an object, then loop through the sheet names by iterating over f.sheet_names. Split each sheet name, such as "2019_q1_sh", into year, quarter and city, and set these as the values of new columns in the DataFrame you read in from that sheet.
Then create a dictionary with sheet names as keys, and the corresponding modified DataFrame as the values. You can create a custom save_xls function that takes in such a dictionary and saves it, as described in this helpful answer.
Update: since you want to loop through all Excel files in your current directory, you can use the glob library to get all of the files with the extension .xlsx, loop through them, read each one in, and save a new file with the string new_ in front of the file name.
import pandas as pd
import glob
def save_xls(dict_df, path):
    """
    Save a dictionary of dataframes to an excel file, with each dataframe as a separate page
    Reference: https://stackoverflow.com/questions/14225676/save-list-of-dataframes-to-multisheet-excel-spreadsheet
    """
    # the with block writes and closes the file on exit (writer.save() is gone in recent pandas)
    with pd.ExcelWriter(path) as writer:
        for key in dict_df:
            dict_df[key].to_excel(writer, sheet_name=key)
## loop through all excel files
for filename in glob.glob("*.xlsx"):
    f = pd.ExcelFile(filename)
    dict_dfs = {}
    for sheet_name in f.sheet_names:
        df_new = f.parse(sheet_name=sheet_name)
        ## get the year, quarter and city from the sheet name
        year, quarter, city = sheet_name.split("_")
        df_new["year"] = year
        df_new["quarter"] = quarter
        df_new["city"] = city
        ## populate dictionary
        dict_dfs[sheet_name] = df_new
    save_xls(dict_df=dict_dfs, path="new_" + filename)
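The sheet-name step can be tried on its own without any files. Below is a minimal sketch, assuming every sheet name has exactly three parts separated by underscores; tag_with_sheet_name is a hypothetical helper name, and the sample frame copies the first sheet from the question.

```python
import pandas as pd


def tag_with_sheet_name(df, sheet_name):
    """Return a copy of df with year, quarter and city columns taken from sheet_name."""
    parts = sheet_name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected 'year_quarter_city', got {sheet_name!r}")
    year, quarter, city = parts
    # assign() returns a new frame, so the original sheet data is not modified
    return df.assign(year=year, quarter=quarter, city=city)


sheet = pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": [23, 45], "d": [2, 5]})
tagged = tag_with_sheet_name(sheet, "2021_q1_bj")
print(tagged)
```

Raising on an unexpected name makes a stray sheet (say, one called "summary") fail loudly instead of breaking the tuple unpacking halfway through a file.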

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4, respectively but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8, and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of data is erratic with regards to each individual file, it would be useful to be able to put a single order on every spreadsheet without having to manually shift columns on Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed column order
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names
# sorted() returns a new list (col_list.sort() would sort in place); use reverse=True to invert the order
sorted_col_list = sorted(col_list, reverse=False)
def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on the above and apply it to all of your files/sheets, provided that all column names exist in each one (NB they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle, you should check the dimensions of your data (e.g. df.shape); based on the shape, you can either call df.transpose() first and then your function to reorder the column names, or apply your function directly.
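If the goal is to match another dataframe's layout rather than a sorted order, the reference frame's own columns can serve as the target order. A minimal sketch under that assumption; match_layout is a hypothetical helper name and the river values are invented.

```python
import pandas as pd


def match_layout(df, reference):
    """Return df with its columns in the same order as reference's columns."""
    missing = [col for col in reference.columns if col not in df.columns]
    if missing:
        raise KeyError(f"columns missing from df: {missing}")
    # selecting with a list of labels reorders the columns; extras in df are dropped
    return df[list(reference.columns)]


river_x = pd.DataFrame({"A": [1.0], "B": [2.0], "C": [3.0], "D": [4.0]})
river_y = pd.DataFrame({"D": [40.0], "B": [20.0], "A": [10.0], "C": [30.0]})

aligned = match_layout(river_y, river_x)
print(aligned.columns.tolist())  # ['A', 'B', 'C', 'D']
```

Checking for missing columns up front gives a clearer error than the KeyError pandas would raise from the selection itself.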

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes.
BUT, I am not sure how to apply this iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Turn it into a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns=your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# get new names based on the values of a previous named column
new_column_names = []
for counter, col_name in enumerate(df.columns):
    if col_name.startswith("Unnamed"):
        # prefix the label found in the first data row with the last named column
        new_column_names.append(base_name + "_" + df.iloc[0, counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(), new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop first row (it only held the sub-headers)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6
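Since the question asks for something that works across files with different numbers of columns, the same idea can be wrapped in a function. This is only a sketch: flatten_two_row_header is a hypothetical name, the sample frame is invented to mimic the layout in the question, and it assumes the first data row holds the sub-headers.

```python
import pandas as pd


def flatten_two_row_header(df):
    """Prefix each 'Unnamed' column with the nearest named column to its left."""
    new_names = []
    parent = None
    for position, name in enumerate(df.columns):
        if str(name).startswith("Unnamed") and parent is not None:
            # the sub-header lives in the first data row of this column
            label = str(df.iloc[0, position]).strip().replace(" ", "_")
            new_names.append(f"{parent}_{label}")
        else:
            parent = name
            new_names.append(name)
    # drop the sub-header row and apply the combined names
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = new_names
    return out


raw = pd.DataFrame(
    [[None, "Commented", "No comment", None, "Commented", "No comment"],
     ["x", None, "x", None, "x", None]],
    columns=["Sales", "Unnamed: 2", "Unnamed: 3",
             "Validation", "Unnamed: 5", "Unnamed: 6"],
)

renamed = flatten_two_row_header(raw)
print(renamed.columns.tolist())
```

The parent is not None check keeps a leading unnamed column (one with no named column to its left) from raising an error; it is simply left as it is.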

Removing repetitive/duplicate occurrences in Excel using Python

I am trying to remove the repetitive/duplicate names that appear under the NAME column. I just want to keep the 1st occurrence of each repeated name, using a Python script.
This is my input excel:
And need output like this:
This isn't removing duplicates per se; you're just blanking out repeated keys in one column. I would handle this as follows:
create a mask that returns a True/False boolean depending on whether the row differs from the row above.
assuming your dataframe is called df:
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
explanation :
what we are doing above is the following,
first selecting a single column, or in pandas terminology a series, we then apply a .ne (not equal to) which in effect is !=
let's see this in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','forth','fifth','fourth','third','fifth','fourth']
now, let's create a dataframe similar to yours.
df = pd.DataFrame({'NAME' : names,
                   'DEFAULT' : defaults,
                   'CLASS' : classes,
                   'AGE' : [np.random.randint(1,5) for _ in names],
                   'GROUP' : [np.random.randint(1,5) for _ in names]}) # being lazy with your age and group variables.
so, if we did df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
so, with that out of the way, we want to see whether the name in row 1 (remember Python is 0-indexed, so row 1 is actually the 2nd physical row) is equal to the name in the row above.
we do this by calling .shift (see the pandas documentation for more info).
what this basically does is shift the rows along the index by a given number of positions, let's call this n.
if we called df['NAME'].shift(1) we would get:
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
we can see here that Rekha has moved down one row.
so putting that all together,
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
we assign this to a self-defined variable called mask; you could call this whatever you want.
we then use .loc, which lets you access your dataframe by labels or a boolean array, in this instance an array.
however, we only want to access the rows where the mask is False, so we use a ~ which inverts the logic of our array.
    NAME DEFAULT   CLASS  AGE  GROUP
1  Rekha           third    1      4
3   Jaya           fifth    1      1
6   Nita           fifth    1      2
7   Nita          fourth    1      4
all we need to do now is change these rows to blanks as per your initial requirement, and we are left with:
     NAME    DEFAULT   CLASS  AGE  GROUP
0   Rekha              forth    2      2
1                      third    1      4
2    Jaya  c-default   forth    3      3
3                      fifth    1      1
4  Sushma             fourth    3      1
5    Nita  c-default   third    4      2
6                      fifth    1      2
7                     fourth    1      4
hope that helps!
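Putting the steps above together, a minimal end-to-end sketch. The AGE and GROUP columns are left out so the result is deterministic, and first_of_run is just a more descriptive name for the mask.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["Rekha", "Rekha", "Jaya", "Jaya", "Sushma", "Nita", "Nita", "Nita"],
    "CLASS": ["forth", "third", "forth", "fifth", "fourth", "third", "fifth", "fourth"],
})

# True where the name differs from the row above, i.e. the first row of each run.
first_of_run = df["NAME"].ne(df["NAME"].shift())

# Blank out every row that repeats the name directly above it.
df.loc[~first_of_run, "NAME"] = ""

print(df["NAME"].tolist())
# ['Rekha', '', 'Jaya', '', 'Sushma', 'Nita', '', '']
```

This only catches repeats that sit directly under each other. If the same name can reappear further down and should still be blanked, df["NAME"].duplicated() gives a mask of every occurrence after the first and can be used in place of ~first_of_run.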

Pandas - How to skip the first row of a csv file to be made the header with combining multiple csv files [duplicate]

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)
You want header=None; False gets type-promoted to the int 0. See the docs (emphasis mine):
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that from version 0.19.1 onwards, this raises a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
I think you need the parameter header=None in read_csv:
Sample:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in recent pandas
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1
If you're using pd.ExcelFile to read all the sheets of an Excel file, then:
xls = pd.ExcelFile("path_to_file.xlsx")
xls.sheet_names # lists the sheet names in the Excel file
df = xls.parse(2, header=None) # parse the sheet at position 2 (positions start at 0) with header=None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2
You can set custom column names to prevent this.
Let's say you have two columns in your dataset:
df = pd.read_csv(your_file_path, names = ['first column', 'second column'])
You can also generate the column names programmatically if you have more columns than that, and pass the resulting list to the names parameter.
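A minimal sketch of generating the names, assuming a simple comma-separated file without quoted fields, so that counting the commas in the first line is enough. The col_ prefix and the sample text are made up for illustration.

```python
import io
import pandas as pd

csv_text = "a,b,c\n0,1,2\n3,4,5"

# Peek at the first line to count the fields, then build one name per column.
n_cols = len(csv_text.splitlines()[0].split(","))
names = [f"col_{i}" for i in range(n_cols)]

# Because names is given, the first line is kept as data instead of becoming the header.
df = pd.read_csv(io.StringIO(csv_text), names=names)
print(df)
```

For a real file you would read the first line from disk instead of from a string. If the fields can contain quoted commas, header=None on its own is the safer choice, since pandas then numbers the columns for you.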
