Compare column names of multiple pandas dataframes - python-3.x

In the code below, I created a list of dataframes. Now I want to check whether all the dataframes in the list have the same column names (I just want to compare the headers, not the values), and if the condition is not met, it should error out.
dataframes = []
list_of_files = os.listdir(os.path.join(folder_location, quarter, "inputs"))
for files in list_of_files:
    df = pd.read_excel(os.path.join(folder_location, quarter, "inputs", files),
                       header=[0, 1], sheet_name="Ratings Inputs",
                       usecols="B:AC", index_col=None).reset_index()
    df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)]
                                           + list(df.columns)[1:])
    dataframes.append(df)

Not the most elegant solution, but it will get you there:
np.all([sorted(dataframes[0].columns) == sorted(i.columns) for i in dataframes])
sorted serves two purposes here: it converts the column Index objects into lists, and it makes sure the comparison doesn't fail just because the columns are in a different order.
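If you want the loading loop to error out rather than just return a boolean, here is a minimal sketch, assuming dataframes is non-empty and the column labels are sortable:
expected = sorted(dataframes[0].columns)
for i, frame in enumerate(dataframes[1:], start=1):
    # Compare sorted headers so column order doesn't matter
    if sorted(frame.columns) != expected:
        raise ValueError(f"DataFrame {i} has different column names: "
                         f"{sorted(frame.columns)}")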

Comparing two lists of DataFrames and outputting the difference only (using loops)

UPDATE 2:
Following Ouyang Ze's advice in the comments, I added float as the data type when reading in the list of Excel files. I also used df.info() to confirm that the data types did indeed match.
# Read in data from list of XLSX files and create a list of DataFrames
list_banks2 = [pd.read_excel(filename, sheet_name='Total ALL', skiprows=16,
                             nrows=360, usecols='E:AR', dtype='float')
               for filename in all_filenames2]
Spyder IDE variable explorer for DataFrame 2:
Same DataFrame values copied and pasted into Excel:
Here the blue colored values should be rounded so they would match. The yellow colored values are what is confusing me: even when I copy them into Excel, there is no difference in value, index, or data type. The red colored values are the ones that should show. So rather than a (26, 2) DataFrame being output, there should be only a (3, 2) DataFrame with the three genuinely different values from the comparison of the 2 DataFrames. Any ideas?
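One approach worth trying, sketched under the assumption that all compared values are numeric: instead of rounding, treat values as equal when they differ by less than a tolerance, using np.isclose. The atol value below is an assumption you would tune to your data.
import numpy as np
import pandas as pd

def diff_pd_tol(df1, df2, atol=1e-6):
    """Like diff_pd, but floats within atol of each other count as equal."""
    # Treat a pair as equal when numerically close, or when both are NaN
    close = np.isclose(df1.values.astype(float), df2.values.astype(float),
                       atol=atol, equal_nan=True)
    diff_mask = pd.DataFrame(~close, index=df1.index, columns=df1.columns)
    changed = diff_mask.stack()
    changed = changed[changed]
    changed.index.names = ['id', 'col']
    locs = np.where(diff_mask)
    return pd.DataFrame({'from': df1.values[locs], 'to': df2.values[locs]},
                        index=changed.index)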
UPDATE 1:
I believe the problem lies in the fact that the values in my 2 dataframes differ by amounts between 0.00000001 and 0.000000000000000001, which causes them to show as different when I use my formula to compare the 2 dataframes. Since I am using a loop to manipulate my dataframes, I cannot seem to round my values to 2 or 3 decimal places. I have tried the following methods:
# Method 1:
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.round(2)
    banks.index += 1
# Method 2:
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.apply(lambda x: round(x, 2))
    banks.index += 1
# Method 3:
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.round(2).astype(int)
    banks.index += 1
# Method 4:
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.round(2).astype(float)
    banks.index += 1
However, when I explore my variables in the Spyder IDE, I find that my dataframes have not actually been rounded to 2 decimal places. While the IDE displays 2 decimal places, when I copy and paste the values into an Excel cell, it still shows the original value. I thought I should instead round the values in my list of dataframes directly, but I could not find a way to do that without receiving an error.
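A likely reason none of the four methods stick: DataFrame.round returns a new, rounded DataFrame rather than modifying the original, and rebinding the loop variable banks does not change what the list holds. Two sketches that do persist the rounding:
# Option 1: rebuild the list from rounded copies
list_banks1 = [banks.round(2) for banks in list_banks1]

# Option 2: write the rounded values back into each existing DataFrame
for banks in list_banks1:
    banks[:] = banks.round(2)
Even then, a rounded value such as 0.1 has no exact binary representation, so pasting into Excel can still reveal long decimals; comparing with a tolerance (as sketched under UPDATE 2 above) sidesteps rounding entirely.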
QUESTION:
I am working on a small task/project where I loop through two directories (one being a parent directory) and create a list of DataFrames from the list of Excel (.xlsx) files in each.
## Find all files in the named directory and create a list of files with the XLSX extension
os.chdir(input_drc1)
extension = 'xlsx'
all_filenames = [i for i in glob.glob('0*.{}'.format(extension))]
list_banks = ["000","002","004","005","015","017","019","021","024","029","030","032","033","034","038","039"]
## Find all files in the named directory and create a list of files with the XLSX extension
os.chdir(input_drc2)
extension = 'xlsx'
all_filenames2 = [i for i in glob.glob('0*.{}'.format(extension))]
I then looped through the list of DataFrames to manipulate some data (dropped a column, rounded the values to 1 decimal place, and increased the index by 1).
## Read in data from list of XLSX files
list_banks1 = [pd.read_excel(filename, sheet_name='Total (except non-resident)',
                             skiprows=16, nrows=360, usecols='E:AR')
               for filename in all_filenames]
## Drop column 14 from all DataFrames in list of DataFrames
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.round(1)
    banks.index += 1
    # banks.apply(lambda x: pd.Series.round(x, 1))
## Drop column 14 from all DataFrames in list of DataFrames
for banks in list_banks2:
    banks.drop([14], axis=1, inplace=True)
    banks.round(1)
    banks.index += 1
    # banks.apply(lambda x: pd.Series.round(x, 1))
Ideally, my next step was to loop through the list of DataFrames from one directory and compare it to the list of DataFrames from the other directory. Unfortunately, I could not figure out how to write a loop for this situation, so I created a loop to name each DataFrame uniquely for comparison purposes.
## Create unique DataFrames for each DataFrame in list
for i in range(0, len(list_banks)):
    globals()[f'df{list_banks[i]}LL'] = list_banks1[i]
## Create unique DataFrames for each DataFrame in list
for i in range(0, len(list_banks)):
    globals()[f'df{list_banks[i]}LC'] = list_banks2[i]
However, when I use the function diff_pd, I still get an output containing equal values, as seen in the picture below. The id column represents my row number and the col column represents the column number used in the original Excel files.
## Compare LL and LC dataframes and output differences only.
check_df000 = diff_pd(df000LC, df000LL)
check_df002 = diff_pd(df002LC, df002LL)
check_df002 = check_df002.sort_values(by=["col"])
I believe my function is working, since it outputs fewer values than the total values in the Excel files, but it shows some pairs that I believe differ only in the far decimal places. I tried rounding to no avail. I also tried the pandas df.compare method, but it did not give the result I desired. I just want the differences between each pair of DataFrames from the two separate directories, together with their column and row numbers. Ideally, without creating 24 unique DataFrames, but just using the two lists of DataFrames instead.
Any and all help would be greatly appreciated. Additionally, since I am a novice at this and trying to learn, any guidance on cleaning up my code and streamlining it would be amazing.
## Import necessary modules
import os
import pandas as pd
import numpy as np
import glob
import xlwings as xw
from datetime import date, datetime
from dateutil.relativedelta import relativedelta
## Function to compare two dataframes for differences (accounts for NaN values)
def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        print("Data types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)
## Variables
## Subtracted a month to match report release month.
input_date = datetime.now()
output_date1 = input_date-relativedelta(months=+1)
output_date2 = input_date-relativedelta(months=+2)
output_date3 = input_date-relativedelta(months=+13)
reportingCurrMonth = str(int(output_date1.strftime("%m"))).zfill(2)
reportingLastMonth = str(int(output_date2.strftime("%m"))).zfill(2)
reportingCurrYear = str(int(output_date1.strftime("%Y")))
reportingCurrYear2 = str(int(output_date1.strftime("%y")))
reportingLastYear = str(int(output_date3.strftime("%Y")))
reportingLastYear2 = str(int(output_date3.strftime("%y")))
## File Directory & Files
input_drc = r"D:/BOM/Work Files/3. Monthly Rural-Area Credit Report/"
input_drc1 = input_drc+reportingCurrYear+"/zeelt"+reportingCurrYear2+"-m"+reportingCurrMonth
input_drc2 = input_drc+reportingCurrYear+"/zeelt"+reportingCurrYear2+"-m"+reportingLastMonth
# Initialize xlwings App and hide processes
app = xw.App(visible=False)
app.display_alerts = False
## Find all files in the named directory and create a list of files with the XLSX extension
os.chdir(input_drc1)
extension = 'xlsx'
all_filenames = [i for i in glob.glob('0*.{}'.format(extension))]
list_banks = ["000","002","004","005","015","017","019","021","024","029","030","032","033","034","038","039"]
## Read in data from list of XLSX files
list_banks1 = [pd.read_excel(filename, sheet_name='Total (except non-resident)',
                             skiprows=16, nrows=360, usecols='E:AR')
               for filename in all_filenames]
## Drop column 14 from all DataFrames in list of DataFrames
for banks in list_banks1:
    banks.drop([14], axis=1, inplace=True)
    banks.round(1)
    banks.index += 1
    # banks.apply(lambda x: pd.Series.round(x, 1))
## Create unique DataFrames for each DataFrame in list
for i in range(0, len(list_banks)):
    globals()[f'df{list_banks[i]}LL'] = list_banks1[i]
## Find all files in the named directory and create a list of files with the XLSX extension
os.chdir(input_drc2)
extension = 'xlsx'
all_filenames2 = [i for i in glob.glob('0*.{}'.format(extension))]
## Read in data from list of XLSX files and create a list of DataFrames
list_banks2 = [pd.read_excel(filename, sheet_name='Total ALL',
                             skiprows=16, nrows=360, usecols='E:AR')
               for filename in all_filenames2]
## Drop column 14 from all DataFrames in list of DataFrames
for banks in list_banks2:
    banks.drop([14], axis=1, inplace=True)
    banks.round(1)
    banks.index += 1
    # banks.apply(lambda x: pd.Series.round(x, 1))
## Create unique DataFrames for each DataFrame in list
for i in range(0, len(list_banks)):
    globals()[f'df{list_banks[i]}LC'] = list_banks2[i]
## Compare LL and LC dataframes and output differences only.
check_df000 = diff_pd(df000LC, df000LL)
check_df002 = diff_pd(df002LC, df002LL)
check_df002 = check_df002.sort_values(by=["col"])
# Kill xlwings processes to close all Excel files
app.kill()
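To avoid creating 24 uniquely named DataFrames via globals(), here is a sketch of a pairwise loop, assuming both file lists line up one-to-one with the codes in list_banks:
# Compare each LC/LL pair and keep only the non-empty diffs, keyed by bank code
checks = {}
for code, df_lc, df_ll in zip(list_banks, list_banks2, list_banks1):
    diff = diff_pd(df_lc, df_ll)
    if diff is not None:
        checks[code] = diff.sort_values(by=["col"])

for code, diff in checks.items():
    print(f"Bank {code}:")
    print(diff)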

Splitting Multiple values inside a Pandas Column into Separate Columns

I have a dataframe with a column in which each cell contains two different values together with their names, as follows:
How do I transform it into separate columns?
So far, I have tried the following:
Using df[col].apply(pd.Series) - it didn't work, since the data in the column is not in dictionary format.
Separating the columns on the semi-colon (";") sign - but hardcoding the resulting names is not a good idea, since the given dataframe might have any number of columns depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and choose less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
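If the cells reliably follow a key:value pattern, another option (a sketch, not the only way) is to parse each cell into a dict and let pandas build the columns, which copes with any number of pairs per cell:
# Split each cell on ';' into key:value pairs, then expand the dicts to columns
parsed = df["ClusterName"].str.split(";").apply(
    lambda pairs: dict(p.split(":", 1) for p in pairs))
df2 = pd.DataFrame(parsed.tolist(), index=df.index)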
So... maybe like this:
Set up the data frame:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Iterate over the dataframe, breaking each cell apart on the colon and semi-colon:
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1], splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1], splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns=['Date', 'Bucket'])

How to split pandas dataframe into multiple dataframes based on unique string value without aggregating

I have a df with multiple country codes in a column (US, CA, MX, AU...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df, and it was aggregated with groupby().
I gave up trying to figure it out, so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as the code below? If it could also write a csv file for each new df, that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
...
We can write a for loop which takes each code and uses query to get the correct part of the data. Then we write it to csv with to_csv, also using an f-string:
codes = ['US', 'MX', 'CA', 'AU']
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
note: f-strings only work on Python >= 3.6
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them with the index, for example: print(dfs[0]) or print(dfs[1]).
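As an aside, groupby itself splits without aggregating: iterating over the groups yields one sub-DataFrame per unique code, so a hand-maintained codes list isn't needed. A sketch:
# Each iteration yields (code, sub-DataFrame); nothing is aggregated
for code, group in df.groupby('country_code'):
    group.to_csv(f'df_{code}.csv')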

Transfer cell values from different columns and sheets from multiple excel files with same structure into a single dataframe

I have a reporting sheet in excel that contains a set of datapoints that I want to compile from multiple files with the same format into a master dataset.
The initial step I undertook was to extract the data points I need from the multiple sheets into one pandas dataframe. See the steps below.
I initially imported the excel file and parsed it:
import pandas as pd
xl = pd.ExcelFile(r"C:\Users\Nicola\Desktop\ISP 2016-20 Ops-Technical Form.xlsm")
df = xl.parse("FSL, WASH, DRM") #name of sheet #1
Then I located the data points needed for synthesis
a = df.iloc[5:20, 3:5]
a1 = df.iloc[6:9, 10:12]
b = df.iloc[31:35, 3:5]
b1 = df.iloc[31:35, 10:12]
Then I concatenated them and aligned the column positions, so that the whole list of values stays within the same columns:
dfcon = pd.concat([a, b])
dfcon2 = pd.concat([a1, b1])
new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
And lastly, I created a dataframe with the string of values I need:
master = pd.DataFrame(dfcont2)
finalmaster = master.transpose()
The next two steps I wish to pursue are:
1) Replicate the same code for 50 excel files
2) Compile all the strings of values from this set of excel files into one single pandas dataframe, without running this code over again for each file and compiling manually by exporting to excel.
Any support would be greatly appreciated. Thanks
I believe you need to loop over the file names created by glob and, at the end, concat everything together (all files have the same structure):
import glob
dfs = []
for f in glob.glob('*.xlsm'):
    df = pd.read_excel(io=f, sheet_name=1)
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[6:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    dfcon = pd.concat([a, b])
    dfcon2 = pd.concat([a1, b1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
out = pd.concat(dfs, ignore_index=True)
Found the solution that works for me, thank you for the input, jezrael.
To further explain:
1) Imported the files with the same structure from my Desktop directory, parsed them, and selected the Excel sheet from which data can be extracted at different locations (iloc):
import glob
dfs = []
for f in glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm'):
    xl = pd.ExcelFile(f)
    df = xl.parse("FSL, WASH, DRM")
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[7:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    c = df.iloc[50:56, 3:5]
    c1 = df.iloc[38:39, 10:12]
    d = df.iloc[57:61, 3:5]
    e = df.iloc[63:71, 3:5]
2) Concatenated and repositioned the column order to compose the first version of the dataframe (output), still inside the loop:
    dfcon = pd.concat([a, b, c, d, e])
    dfcon2 = pd.concat([a1, b1, c1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
3) The output presented the same strings of values but repeated twice [same label and form-specific entry], from the recursive data pull-outs linked to the iloc locations:
output = pd.concat(dfs, ignore_index=True)
4) This last snippet simply allowed me to extract the label only once and to select all the entries in odd-numbered positions. With the last concatenation, I generated the dataframe I sought, ready to be processed analytically:
a = output[2:3]
b = output[1::2]
pd.concat([a, b], axis=0, ignore_index=True)

List iterations and regex, what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header (column names) row contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200), or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is a faster or better way to do this?
Is the output reliable in terms of sort order? As of now, it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# the comprehension below is to flatten it
# append(m) gives me a list of lists (each list has one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need, and replace them (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit, selecting the last values with .str[-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
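And to the sort-order question: both approaches above transform each label one-to-one, so the original column order is preserved. If you prefer to stay with the re module from the question, the same result can be had with rename and a function over the labels (a sketch, assuming every column name contains a G-code):
import re

df = df.rename(columns=lambda c: re.search(r'G\d+', c).group(0))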
