How to compare / update two different DataFrames? - python-3.x

I'm trying to write code that checks whether data needs updating, and updates it if necessary.
The problem is efficiency. The only idea I had was a nested loop, and there should be a better way to do this.
There are two DataFrames: df_new and df_old.
I want to update df_old's data with the newer one.
Also, I want to write a changelog entry if there's a change (and if not, just a timestamp).
Here's my sample code:
import pandas as pd
from datetime import datetime
df_new = pd.DataFrame({"id":[11,22,33,44,55], "a2":[2,3,8,9,99], "a3":[2,4,2,5,99]})
df_old = pd.DataFrame({"id":[11,22,33,44], "a2":[2,3,4,7], "a3":[2,2,2,7],"CHANGELOG":["","","",""]})
for row in df_new.itertuples():
    flag = 0
    for row2 in df_old.itertuples():
        if row[1] == row2[1]:
            p = str(datetime.now().date()) + "\n"
            if row[2] != row2[2]:
                p += "a2 : " + str(row[2]) + " -> " + str(row2[2]) + "\n"
                df_old.at[row2[0], "a2"] = str(row[2])
            if row[3] != row2[3]:
                p += "a3 : " + str(row[3]) + " -> " + str(row2[3]) + "\n"
                df_old.at[row2[0], "a3"] = str(row[3])
            df_old.at[row2[0], "CHANGELOG"] = p
            flag = 1
            break
    if flag == 0:
        df_old = df_old.append(pd.DataFrame([row], columns=row._fields), ignore_index=True)
        df_old.at[len(df_old) - 1, "CHANGELOG"] = str(datetime.now().date()) + "\n" + "Created"
The code actually works, but only with small datasets. If I run it with tens of thousands of rows (each), as you've probably guessed, it takes far too long.
I've found that there's DataFrame.compare, but it seems to only work with two DataFrames that have the same rows and columns. And... I'm stuck now.
Are there any functions or references that I can use?
Thank you in advance.

Indeed, as stated in the docs, DataFrame.compare "[c]an only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames". So, let's first achieve this:
import pandas as pd
from datetime import date
df_old = pd.DataFrame({"id": [11, 22, 33, 44],
                       "a2": [2, 3, 4, 7],
                       "a3": [2, 2, 2, 7],
                       "CHANGELOG": ["", "", "", ""]})
df_new = pd.DataFrame({"id": [11, 22, 33, 44, 55],
                       "a2": [2, 3, 8, 9, 99],
                       "a3": [2, 4, 2, 5, 99]})
# get slice of `df_old` with just columns that need comparing
df_slice = df_old.iloc[:,:3]
# get missing indices from `df_new`, build empty df and append to `df_slice`
missing_indices = set(df_new.index).difference(set(df_slice.index))
df_missing = pd.DataFrame(columns = df_slice.columns, index=missing_indices)
df_slice = pd.concat([df_slice,df_missing],axis=0)
print(df_slice)
id a2 a3
0 11 2 2
1 22 3 2
2 33 4 2
3 44 7 7
4 NaN NaN NaN
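(As an aside, and not part of the original approach: the same alignment can probably be done in one step with reindex, assuming df_old's index is a subset of df_new's.)
# one-step alternative: align df_old's comparison columns to df_new's index
df_slice = df_old.iloc[:, :3].reindex(df_new.index)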
Now, we can use pd.compare:
# compare the two dfs, keep_shape=True: same rows remain in result
comp = df_slice.compare(df_new, keep_shape=True)
print(comp)
     id          a2          a3
   self other  self other  self other
0   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN   NaN     2   4.0
2   NaN   NaN     4   8.0   NaN   NaN
3   NaN   NaN     7   9.0     7   5.0
4   NaN  55.0   NaN  99.0   NaN  99.0
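As a side note, DataFrame.compare also accepts keep_equal=True if you'd rather see the unchanged values instead of NaN (standard pandas, just not used here):
# keep_equal=True shows equal values instead of NaN in the comparison
comp_full = df_slice.compare(df_new, keep_shape=True, keep_equal=True)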
Finally, let's apply a custom function to the comp df to generate the strings for the column CHANGELOG. Something like below:
# create func to build changelog strings per row
def create_change_log(row: pd.Series) -> str:
    """
    Parameters
    ----------
    row : pd.Series
        e.g. comp.iloc[1] with ('id','self') etc. as index

    Returns
    -------
    str
        change_log_string per row.
    """
    # start string for each row
    string = str(date.today()) + "\n"
    # get length pd.Series
    length = len(row)
    # get range for loop over 'self', so index 0,2,4 if len == 6
    self_iloc = [*range(0, length, 2)]
    # get level 0 from index to retrieve orig col names: ['id'] etc.
    cols = row.index.get_level_values(0)
    # for loop, check scenarios
    for i in self_iloc:
        temp = str()
        # both 'self' and 'other' vals are NaN, nothing changed
        if row.isna().all():
            break
        # all of 'self' == NaN, entire row is new
        if row.iloc[self_iloc].isna().all():
            temp = 'Created\n'
            string += temp
            break
        # set up comp for specific cols: comp 'self.1' == 'other.1' etc.
        self_val, other_val = row[i], row[i+1]
        # add `pd.notna()`, since np.nan == np.nan is actually `False`!
        if self_val != other_val and pd.notna(self_val):
            temp = f'{cols[i]} : {self_val} -> {other_val}\n'
            string += temp
    return string
Applied to comp:
change_log = comp.apply(lambda row: create_change_log(row), axis=1)
change_log.name = 'CHANGELOG'
# result
df_old_adj = pd.concat([df_new,change_log],axis=1)
print(df_old_adj)
   id  a2  a3                                   CHANGELOG
0  11   2   2                                2022-08-29\n
1  22   3   4                 2022-08-29\na3 : 2 -> 4.0\n
2  33   8   2                 2022-08-29\na2 : 4 -> 8.0\n
3  44   9   5  2022-08-29\na2 : 7 -> 9.0\na3 : 7 -> 5.0\n
4  55  99  99                       2022-08-29\nCreated\n
PS.1: my result has e.g. 2022-08-29\na3 : 2 -> 4.0\n where you generate 2022-08-29\na3 : 4 -> 2\n. The former seems to me correct; you want to convey: value 2 in column a3 has become (->) 4, no? Anyway, you can just switch the vars in {self_val} -> {other_val}, of course.
PS.2: comp is turning ints into floats automatically for other (= df_new). Hence, we end up with 2 -> 4.0 rather than 2 -> 4. I'd say the best solution to 'fix' this depends on the type of values you are expecting.
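If the trailing .0 bothers you, one possibility is to format values before building the string; a minimal sketch with a hypothetical helper fmt, assuming the compared columns only ever hold whole numbers:
def fmt(val):
    # render whole-number floats as ints, e.g. 4.0 -> "4" (hypothetical helper)
    if pd.notna(val) and float(val).is_integer():
        return str(int(val))
    return str(val)

# inside create_change_log one could then write:
# temp = f'{cols[i]} : {fmt(self_val)} -> {fmt(other_val)}\n'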

Related

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect which unique values of a column are present in the same column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The part after that is where it gets complicated, because I would like to return which row and in which table my value is missing.
If you have any other leads or approaches, I'm also interested.
Here is my code :
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[df3['test1'].isna() | df3['test2'].isna(), :]
    df3.info()
    for _, row in df3.iterrows():
        if pd.isna(row['test1']):
            print(row[col], "is not in df1")
        else:
            print(row[col], 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join, removing duplicates with Series.drop_duplicates and using Series.reset_index to avoid losing the original indices:
df1 = pd.DataFrame({'a': [1, 2, 5, 5]})
df2 = pd.DataFrame({'a': [2, 20, 5, 8]})

col = 'a'
df = (df1[col].drop_duplicates().reset_index()
         .merge(df2[col].drop_duplicates().reset_index(),
                indicator=True,
                how='outer',
                on=col))
print(df)
index_x a index_y _merge
0 0.0 1 NaN left_only
1 1.0 2 0.0 both
2 2.0 5 2.0 both
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
index_x a index_y _merge
0 0.0 1 NaN left_only
print (df[df['_merge'].eq('right_only')])
index_x a index_y _merge
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
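To get the printout asked for in the question (which value is missing from which table), the above could be wrapped up roughly as follows; a sketch using a hypothetical helper name report_missing:
def report_missing(df1, df2, col):
    # outer-merge the deduplicated key columns and keep the merge indicator
    df = (df1[col].drop_duplicates().reset_index()
             .merge(df2[col].drop_duplicates().reset_index(),
                    indicator=True, how='outer', on=col))
    for _, row in df.iterrows():
        if row['_merge'] == 'left_only':
            print(row[col], 'is not in df2')
        elif row['_merge'] == 'right_only':
            print(row[col], 'is not in df1')

report_missing(df1, df2, 'a')
# 1 is not in df2
# 20 is not in df1
# 8 is not in df1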

In Python, how to locate the position of the empty row in the middle of the file and dynamically skip rows at the beginning

The data in an Excel file looks like this:
A B C
1 1 1
1 1 1
D E F G H
1 1 1 1 1
1 1 1 1 1
The file is separated into two parts by one empty row in the middle. The two parts have different column names and different numbers of columns. I only need the second part of the file, and I want to read it as a pandas DataFrame. The number of rows in the first part is not fixed; different files will have different numbers of rows, so using skiprows=4 will not work.
I actually already have a solution for that. But I want to know whether there is a better solution.
import pandas as pd

path = r'C:\Users'
file = 'test-file.xlsx'

# Read the whole file without skipping
df_temp = pd.read_excel(path + '/' + file)
The data looks like this in pandas. The empty row has null values in all the columns.
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
I find all empty rows and take the index of the first one:
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
del df_temp
Then I read the file again, skipping rows based on the number found above:
df= pd.read_excel(path + '/' + file, skiprows=first_empty_row+2)
print(df)
The drawback of this solution is that I need to read the file twice. If the file has a lot of rows in the first part, it might take a long time to read these useless rows. I could also loop over lines with readline until reaching an empty row, but that would be inefficient.
Does anyone have a better solution? Thanks
Find the position of the first empty row:
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]
Then select everything after that position:
df = df_temp.iloc[pos+1:]
df.columns = df.iloc[0]
df.columns.name = ''
df = df.iloc[1:]
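One caveat, as an aside (this depends on your actual data, which the question doesn't show in full): after the combined read, every column has object dtype, so you may want to let pandas re-infer the numeric types:
# the sliced frame keeps object dtypes from the combined read;
# infer proper (e.g. integer) dtypes again
df = df.infer_objects()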
Your first line looks across the entire row for all null. Would it be possible to just look for the first null in the first column?
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
How does this compare in performance?
import pandas as pd
import numpy as np

data1 = {'A': [1, 1, np.NaN, 'D', 1, 1],
         'B': [1, 1, np.NaN, 'E', 1, 1],
         'C': [1, 1, np.NaN, 'F', 1, 1],
         'Unnamed: 3': [np.NaN, np.NaN, np.NaN, 'G', 1, 1],
         'Unnamed: 4': [np.NaN, np.NaN, np.NaN, 'H', 1, 1]}
df1 = pd.DataFrame(data1)
print(df1)
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
# create empty list to append the rows that need to be deleted
list1 = []

# loop through the first column of the dataframe and append the index to a list until the row is null
for index, row in df1.iterrows():
    if pd.isnull(row[0]):
        list1.append(index)
        break
    else:
        list1.append(index)

# drop the rows based on list created from for loop
df1 = df1.drop(df1.index[list1])

# reset index so you can replace the old column names
# with the secondary column names easier
df1 = df1.reset_index(drop=True)

# create empty list to append the new column names to
temp = []

# loop through dataframe and append the new column names
for label in df1.columns:
    temp.append(df1[label][0])

# replace column names with the desired names
df1.columns = temp

# drop the old column names which are always going to be at row 0
df1 = df1.drop(df1.index[0])

# reset index so it doesn't start at 1
df1 = df1.reset_index(drop=True)
print(df1)
D E F G H
0 1 1 1 1 1
1 1 1 1 1 1
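As an aside, the position of the first null in the first column can also be found without a Python loop; a small sketch, applied to the df_temp from the question and assuming the separator row really is the first row whose first column is NaN:
# index label of the first row whose first column is NaN (vectorized)
pos = df_temp.iloc[:, 0].isnull().idxmax()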

Counting NaNs in a 'for' loop in Python

I am trying to iterate over the rows in a df and count consecutive rows where a certain value is NaN or 0, restarting the count when the value changes from NaN or 0. I would like to get something like this:
Value Period
0 1
0 2
0 3
NaN 4
21 NaN
4 NaN
0 1
0 2
NaN 3
I wrote a function which takes a dataframe as an argument and returns it with an additional column that holds the count:
def calc_period(df):
    period_x = []
    sum_x = 0
    for i in range(1, df.shape[0]):
        if df.iloc[i, 0] == np.nan or df.iloc[i, 0] == 0:
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    period_x.append(sum_x)
    df['period_x'] = period_x
    return df
The function works well when the value is 0. But when the value is NaN the count is also NaN and I get the following result:
Value Period
0 1
0 2
0 3
NaN NaN
NaN NaN
Here is a revised version of your code:
import pandas as pd
import numpy as np
import math


def is_nan_or_zero(val):
    return math.isnan(val) or val == 0


def calc_period(df):
    is_first_nan_or_zero = is_nan_or_zero(df.iloc[0, 0])
    period_x = [1 if is_first_nan_or_zero else np.nan]
    sum_x = 1 if is_first_nan_or_zero else 0
    for i in range(1, df.shape[0]):
        val = df.iloc[i, 0]
        if is_nan_or_zero(val):
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    df['period_x'] = period_x
    return df
There were 2 fixes:
Replacing df.iloc[i,0] == np.nan with math.isnan(val)
Remove period_x.append(sum_x) at the end, and add the first period value instead (since we start iterating from the second value)
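As an aside (not part of the fix above), the whole count can also be computed without an explicit loop, using a cumulative-sum grouping trick; a sketch, assuming the values sit in the first column as in the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Value': [0, 0, 0, np.nan, 21, 4, 0, 0, np.nan]})

s = df.iloc[:, 0]
mask = s.isna() | s.eq(0)          # rows whose value is NaN or 0
groups = (~mask).cumsum()          # each non-matching row starts a new block
# running count within each block of consecutive NaN/0 rows, NaN elsewhere
df['period_x'] = np.where(mask, mask.groupby(groups).cumsum(), np.nan)
print(df)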

Concatenating two columns next to each other

I've written a function that collects some data via input(); that part is unimportant to the question at hand. However, at the end I need to concat two columns.
So far I've got it working to an extent, but it's not perfect.
def visualise_country():
    data = pd.read_csv('tas_pr_1991_2015_AC.csv')
    target_frame = get_info()
    df1 = pd.DataFrame(data.loc[data['country'] == target_frame[0]])
    df1 = pd.DataFrame(df1.loc[df1['year'] == int(target_frame[2])])
    df1 = df1[target_frame[4]]
    df2 = pd.DataFrame(data.loc[data['country'] == target_frame[1]])
    df2 = pd.DataFrame(df2.loc[df2['year'] == int(target_frame[3])])
    df2 = df2[target_frame[4]]
    frame_list = [df1, df2]
    df = pd.concat(frame_list, axis=1)
    print("Data for {} in comparison with {}. Comparison years for {}: {} and {}: {}".format(
        target_frame[0], target_frame[1], target_frame[0], target_frame[2], target_frame[1], target_frame[3]))
    return df
target_frame is just a tuple containing the collected information necessary to select the columns.
Output:
1 - NaN
2 - NaN
3 - NaN
4 - NaN
NaN - 5
NaN - 6
NaN - 7
NaN - 8
Desired output:
1 - 5
2 - 6
3 - 7
4 - 8
You need the same index values in all DataFrames:
frame_list = [x.reset_index(drop=True) for x in [df1,df2]]
Or:
df1.index = df2.index
frame_list = [df1,df2]
df = pd.concat(frame_list, axis=1)
Or:
df1 = df1[target_frame[4]].reset_index(drop=True)
df2 = df2[target_frame[4]].reset_index(drop=True)
frame_list = [df1,df2]
df = pd.concat(frame_list, axis=1)
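The reason: pd.concat with axis=1 aligns on index labels, so two slices taken from different parts of the CSV have disjoint labels and produce the staggered NaN pattern. A tiny sketch illustrating the effect:
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=[0, 1, 2, 3])
s2 = pd.Series([5, 6, 7, 8], index=[10, 11, 12, 13])

# concat aligns on index labels -> staggered NaNs
print(pd.concat([s1, s2], axis=1))

# after resetting the indices the labels match -> values line up side by side
print(pd.concat([s.reset_index(drop=True) for s in (s1, s2)], axis=1))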

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from an existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind'] == 'N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind'] == 'Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition, MarketValue is calculated correctly, but the elif condition doesn't give the correct value.
Can anyone point out what I am doing wrong in this code?
I think you need numpy.select; apply can be removed, and the columns can be multiplied with mul:
import numpy as np

m1 = df['Sell_Ind'] == 'N'
m2 = df['Sell_Ind'] == 'Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a, b], default=0)
Sample:
df = pd.DataFrame({'Sold_price': [7, 8, 9, 4, 2, 3],
                   'SharesUnits': [1, 3, 5, 7, 1, 0],
                   'CurrentPrice': [5, 3, 6, 9, 2, 4],
                   'Sell_Ind': list('NNYYTT')})
#print (df)

m1 = df['Sell_Ind'] == 'N'
m2 = df['Sell_Ind'] == 'Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a, b], default=0)
print(df)
CurrentPrice Sell_Ind SharesUnits Sold_price MarketValue
0 5 N 1 7 5.0
1 3 N 3 8 9.0
2 6 Y 5 9 45.0
3 9 Y 7 4 28.0
4 2 T 1 2 0.0
5 4 T 0 3 0.0
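As an aside (an alternative, not part of the answer above): with only two conditions, nested numpy.where gives the same result:
# equivalent for exactly two conditions
df['MarketValue'] = np.where(m1, a, np.where(m2, b, 0))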
