I've written a function that collects some data via input(); the details are unimportant to the question at hand. However, at the end I need to concatenate two columns with pd.concat.
So far I've got it working to an extent, but it's not perfect.
def visualise_country():
    data = pd.read_csv('tas_pr_1991_2015_AC.csv')
    target_frame = get_info()
    df1 = pd.DataFrame(data.loc[data['country'] == target_frame[0]])
    df1 = pd.DataFrame(df1.loc[df1['year'] == int(target_frame[2])])
    df1 = df1[target_frame[4]]
    df2 = pd.DataFrame(data.loc[data['country'] == target_frame[1]])
    df2 = pd.DataFrame(df2.loc[df2['year'] == int(target_frame[3])])
    df2 = df2[target_frame[4]]
    frame_list = [df1,df2]
    df = pd.concat(frame_list, axis=1)
    # six placeholders for six arguments: both countries, then each country with its year
    print("Data for {} in comparison with {}. Comparison years for {}: {} and {}: {}".format(target_frame[0],target_frame[1],target_frame[0],target_frame[2],target_frame[1],target_frame[3]))
    return df
target_frame is just a tuple containing the information needed to select the columns.
Output:
1 - NaN
2 - NaN
3 - NaN
4 - NaN
NaN - 5
NaN - 6
NaN - 7
NaN - 8
Desired output:
1 - 5
2 - 6
3 - 7
4 - 8
pd.concat with axis=1 aligns on the index, so all the objects need the same index values:
frame_list = [x.reset_index(drop=True) for x in [df1,df2]]
Or:
df1.index = df2.index
frame_list = [df1,df2]
df = pd.concat(frame_list, axis=1)
Or:
df1 = df1[target_frame[4]].reset_index(drop=True)
df2 = df2[target_frame[4]].reset_index(drop=True)
frame_list = [df1,df2]
df = pd.concat(frame_list, axis=1)
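A minimal, self-contained sketch of the effect, using made-up values and index labels (s1, s2 and their names are only for illustration): the two Series come from different rows, so their labels never overlap, and concat pads with NaN until the index is reset.

```python
import pandas as pd

# Two Series whose row labels do not overlap, as happens after filtering
# different rows out of the same source frame (values are made up).
s1 = pd.Series([1, 2, 3, 4], index=[10, 11, 12, 13], name="left")
s2 = pd.Series([5, 6, 7, 8], index=[20, 21, 22, 23], name="right")

# axis=1 aligns on the index: no shared labels, so 8 rows padded with NaN.
misaligned = pd.concat([s1, s2], axis=1)

# Dropping the old labels gives both Series the index 0..3, so rows pair up.
aligned = pd.concat([s.reset_index(drop=True) for s in (s1, s2)], axis=1)

print(misaligned.shape)
print(aligned)
```

Note that reset_index(drop=True) pairs rows purely by position, which is only appropriate when both selections have the same length and order.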
I'm trying to write code that checks whether the data has been updated, and updates it if needed.
The problem is efficiency. The only idea I have is a nested loop, and there should be a better way to do this.
There are two DataFrames: df_new and df_old.
I want to update df_old's data with the newer data.
I also want to write a CHANGELOG entry if there is a change (and if not, just a timestamp).
Here's my sample code:
import pandas as pd
from datetime import datetime
df_new = pd.DataFrame({"id":[11,22,33,44,55], "a2":[2,3,8,9,99], "a3":[2,4,2,5,99]})
df_old = pd.DataFrame({"id":[11,22,33,44], "a2":[2,3,4,7], "a3":[2,2,2,7],"CHANGELOG":["","","",""]})
for row in df_new.itertuples():
    flag = 0
    for row2 in df_old.itertuples():
        if row[1] == row2[1]:
            p = str(datetime.now().date()) + "\n"
            if row[2] != row2[2]:
                p += "a2 : " + str(row[2]) + " -> " + str(row2[2]) + "\n"
                df_old.at[row2[0],"a2"] = str(row[2])
            if row[3] != row2[3]:
                p += "a3 : " + str(row[3]) + " -> " + str(row2[3]) + "\n"
                df_old.at[row2[0],"a3"] = str(row[3])
            df_old.at[row2[0],"CHANGELOG"] = p
            flag = 1
            break
    if flag == 0:
        df_old = df_old.append(pd.DataFrame([row],columns = row._fields),ignore_index=True)
        df_old.at[len(df_old)-1,"CHANGELOG"] = str(datetime.now().date()) + "\n" + "Created"
The code actually works, but only with small datasets. If I run it with tens of thousands of rows (in each DataFrame), as you've probably guessed, it takes too much time.
I found DataFrame.compare, but it seems to work only with two DataFrames that have the same rows and columns. And... I'm stuck now.
Are there any functions or references that I can use?
Thank you in advance.
Indeed, as stated in the docs, DataFrame.compare "[c]an only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames". So, let's first achieve this:
import pandas as pd
from datetime import date
df_old = pd.DataFrame({"id":[11,22,33,44],
"a2":[2,3,4,7],
"a3":[2,2,2,7],
"CHANGELOG":["","","",""]})
df_new = pd.DataFrame({"id":[11,22,33,44,55],
"a2":[2,3,8,9,99],
"a3":[2,4,2,5,99]})
# get slice of `df_old` with just columns that need comparing
df_slice = df_old.iloc[:,:3]
# get row labels present in `df_new` but missing from `df_slice`, build empty df and append to `df_slice`
# (Index.difference returns a sorted Index, so the result does not depend on set ordering)
missing_indices = df_new.index.difference(df_slice.index)
df_missing = pd.DataFrame(columns=df_slice.columns, index=missing_indices)
df_slice = pd.concat([df_slice,df_missing],axis=0)
print(df_slice)
    id   a2   a3
0   11    2    2
1   22    3    2
2   33    4    2
3   44    7    7
4  NaN  NaN  NaN
Now we can use DataFrame.compare:
# compare the two dfs, keep_shape=True: same rows remain in result
comp = df_slice.compare(df_new, keep_shape=True)
print(comp)
    id         a2         a3
  self other self other self other
0  NaN   NaN  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN    2   4.0
2  NaN   NaN    4   8.0  NaN   NaN
3  NaN   NaN    7   9.0    7   5.0
4  NaN  55.0  NaN  99.0  NaN  99.0
Finally, let's apply a custom function to the comp df to generate the strings for the column CHANGELOG. Something like below:
# create func to build changelog strings per row
def create_change_log(row: pd.Series) -> str:
    """
    Parameters
    ----------
    row : pd.Series
        e.g. comp.iloc[1] with ('id','self') etc. as index
    Returns
    -------
    str
        change_log_string per row.
    """
    # start string for each row
    string = str(date.today()) + "\n"
    # get length pd.Series
    length = len(row)
    # get range for loop over 'self', so index 0,2,4 if len == 6
    self_iloc = [*range(0,length,2)]
    # get level 0 from index to retrieve orig col names: ['id'] etc.
    cols = row.index.get_level_values(0)
    # for loop, check scenarios
    for i in self_iloc:
        temp = str()
        # both 'self' and 'other' vals are NaN, nothing changed
        if row.isna().all():
            break
        # all of 'self' == NaN, entire row is new
        if row.iloc[self_iloc].isna().all():
            temp = 'Created\n'
            string += temp
            break
        # set up comp for specific cols: comp 'self.1' == 'other.1' etc.
        # (.iloc makes the positional lookup explicit; row[i] relies on a deprecated fallback)
        self_val, other_val = row.iloc[i], row.iloc[i+1]
        # add `pd.notna()`, since np.nan == np.nan is actually `False`!
        if self_val != other_val and pd.notna(self_val):
            temp = f'{cols[i]} : {self_val} -> {other_val}\n'
            string += temp
    return string
Applied to comp:
change_log = comp.apply(lambda row: create_change_log(row), axis=1)
change_log.name = 'CHANGELOG'
# result
df_old_adj = pd.concat([df_new,change_log],axis=1)
print(df_old_adj)
id a2 a3 CHANGELOG
0 11 2 2 2022-08-29\n
1 22 3 4 2022-08-29\na3 : 2 -> 4.0\n
2 33 8 2 2022-08-29\na2 : 4 -> 8.0\n
3 44 9 5 2022-08-29\na2 : 7 -> 9.0\na3 : 7 -> 5.0\n
4 55 99 99 2022-08-29\nCreated\n
PS.1: my result has e.g. 2022-08-29\na3 : 2 -> 4.0\n where you generate 2022-08-29\na3 : 4 -> 2\n. The former seems to me correct; you want to convey: value 2 in column a3 has become (->) 4, no? Anyway, you can just switch the vars in {self_val} -> {other_val}, of course.
PS.2: comp is turning ints into floats automatically for other (= df_new). Hence, we end up with 2 -> 4.0 rather than 2 -> 4. I'd say the best solution to 'fix' this depends on the type of values you are expecting.
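The comparison above lines rows up by their positional index, which matches the sample data only because both frames list their ids in the same order. A hedged sketch of the same idea keyed on id instead, assuming id uniquely identifies a row (the names old, new, created, changed and the shuffled sample rows are illustrative, not from the question):

```python
import pandas as pd

df_old = pd.DataFrame({"id": [11, 22, 33, 44],
                       "a2": [2, 3, 4, 7],
                       "a3": [2, 2, 2, 7]})
# Same records as the question's df_new, deliberately in a different order.
df_new = pd.DataFrame({"id": [55, 44, 33, 22, 11],
                       "a2": [99, 9, 8, 3, 2],
                       "a3": [99, 5, 2, 4, 2]})

# Use `id` as the row label so alignment no longer depends on row order.
old = df_old.set_index("id")
new = df_new.set_index("id")

# ids that exist only in the new data are newly created records.
created = new.index.difference(old.index)

# Give `old` exactly the row labels of `new` (missing ids become NaN rows),
# so that the two frames are identically labelled for compare().
old_aligned = old.reindex(new.index)
diff = old_aligned.compare(new, keep_shape=True)

# Rows with at least one difference, excluding the newly created ids.
changed = diff.dropna(how="all").index.difference(created)

print(list(created))
print(list(changed))
print(diff)
```

The resulting diff has the same ('column', 'self'/'other') layout as comp, so a row-wise function such as create_change_log should be applicable to it in the same way.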
I am trying to calculate additional metrics from existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind']=='N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind']=='Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition, MarketValue is calculated correctly, but for the elif condition it's not giving the correct value.
Can anyone point out what I am doing wrong in this code?
I think you need numpy.select; the apply calls can be removed, and the columns multiplied with mul:
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
Sample:
df = pd.DataFrame({'Sold_price':[7,8,9,4,2,3],
'SharesUnits':[1,3,5,7,1,0],
'CurrentPrice':[5,3,6,9,2,4],
'Sell_Ind':list('NNYYTT')})
#print (df)
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
print (df)
CurrentPrice Sell_Ind SharesUnits Sold_price MarketValue
0 5 N 1 7 5.0
1 3 N 3 8 9.0
2 6 Y 5 9 45.0
3 9 Y 7 4 28.0
4 2 T 1 2 0.0
5 4 T 0 3 0.0
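Why the original if/elif misbehaves can be seen in a small sketch with made-up numbers: .any() reduces the whole column to a single boolean, so the first branch whose condition holds anywhere in the column is applied to every row. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sell_Ind": list("NYT"),
                   "SharesUnits": [2, 3, 4],
                   "CurrentPrice": [10.0, 20.0, 30.0],
                   "Sold_price": [1.0, 2.0, 3.0]})

is_n = df["Sell_Ind"] == "N"
is_y = df["Sell_Ind"] == "Y"

# if/elif on .any(): one decision for the whole frame. At least one row is 'N',
# so every row (including the 'Y' and 'T' rows) is priced with CurrentPrice.
if is_n.any():
    whole_frame = df["SharesUnits"] * df["CurrentPrice"]
elif is_y.any():
    whole_frame = df["SharesUnits"] * df["Sold_price"]
else:
    whole_frame = pd.Series(0.0, index=df.index)

# np.select: one decision per row, based on the boolean masks.
per_row = np.select([is_n, is_y],
                    [df["SharesUnits"] * df["CurrentPrice"],
                     df["SharesUnits"] * df["Sold_price"]],
                    default=0)

print(whole_frame.tolist())
print(per_row.tolist())
```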
I was wondering how I can remove rows which have a negative value but keep the NaNs. At the moment I am using:
DF = DF.ix[DF['RAF01Time'] >= 0]
But this removes the NaNs.
Thanks in advance.
You need boolean indexing with another condition with isnull:
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
Sample:
DF = pd.DataFrame({'RAF01Time':[-1,2,3,np.nan]})
print (DF)
RAF01Time
0 -1.0
1 2.0
2 3.0
3 NaN
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
print (DF)
RAF01Time
1 2.0
2 3.0
3 NaN
Another solution with query:
DF = DF.query("~(RAF01Time < 0)")
print (DF)
RAF01Time
1 2.0
2 3.0
3 NaN
You can just use < 0 and then take the inverse of the condition.
DF = DF[~(DF['RAF01Time'] < 0)]
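Both variants rely on the fact that any ordering comparison with NaN evaluates to False. A short sketch with made-up values shows why >= 0 drops the NaN row while the negated < 0 keeps it:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 2.0, np.nan])

ge_zero = s >= 0      # NaN >= 0 is False, so the NaN row would be dropped
lt_zero = s < 0       # NaN < 0 is also False ...
keep = ~lt_zero       # ... so negating it keeps the NaN row

print(ge_zero.tolist())
print(keep.tolist())
print(s[keep])
```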
Given the following data frame:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':['a','b','c','d'],
'B':['d',np.nan,'c','f']})
df1
A B
0 a d
1 b NaN
2 c c
3 d f
I'd like to insert blank rows before each row.
The desired result is:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
In reality, I have many rows.
Thanks in advance!
I think you could change your index like @bananafish did and then use reindex:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = df1.reindex(index=range(2*len(df1)))
In [29]: df2
Out[29]:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
Use numpy and pd.DataFrame
def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
pir(df1)
Testing and Comparison
Code
def banana(df):
    df1 = df.set_index(np.arange(1, 2*len(df)+1, 2))
    df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
    return pd.concat([df1, df2]).sort_index()

def anton(df):
    df = df.set_index(np.arange(1, 2*len(df)+1, 2))
    return df.reindex(index=range(2*len(df)))

def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
Results
pd.concat([f(df1) for f in [banana, anton, pir]],
axis=1, keys=['banana', 'anton', 'pir'])
A bit roundabout but this works:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
df3 = pd.concat([df1, df2]).sort_index()
I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1,df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
# max(level=0) was removed in pandas 2.0; grouping on the index level is the equivalent
pd.concat([A, B]).groupby(level=0).max()
#
# 0 1 2
# 0 3.0 2.0 3.0
#
You can use where to test your df against another df: where the condition is True, the values from df are returned; where it is False, the values from df1 are returned. Where df1 contains NaN values, an additional call to fillna(df) fills those NaNs with the values from df and returns the desired result:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.nan
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.nan
print(df1)
0 1 2
0 2.671118 1.412880 1.666041
1 -0.281660 1.187589 NaN
2 -0.067425 0.850808 1.461418
3 -0.447670 0.307405 1.038676
4 -0.130232 -0.171420 1.192321
0 1 2
0 NaN -0.244273 -1.963712
1 -0.043011 -1.588891 0.784695
2 1.094911 0.894044 -0.320710
3 -1.537153 0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
0 1 2
0 2.671118 1.412880 1.666041
1 -0.043011 1.187589 0.784695
2 1.094911 0.894044 1.461418
3 -0.447670 0.558547 1.038676
4 -0.130232 -0.171420 1.192321
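Another option, not taken from the answers above, is NumPy's fmax, which returns the element-wise maximum and prefers the number when exactly one operand is NaN. A minimal sketch, assuming both frames have the same shape and the same row and column order as stated in the question (the frames a and b are made up):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1.0, np.nan, 3.0],
                  [np.nan, np.nan, 0.0]])
b = pd.DataFrame([[3.0, 2.0, 1.0],
                  [4.0, np.nan, -1.0]])

# np.fmax ignores a NaN when the other operand is a number; the result is
# NaN only where both inputs are NaN. Work on the raw arrays and re-attach
# the labels, which assumes `a` and `b` are identically labelled.
result = pd.DataFrame(np.fmax(a.to_numpy(), b.to_numpy()),
                      index=a.index, columns=a.columns)
print(result)
```

If the two frames might be labelled differently, they would need to be aligned first (for example with reindex_like) before taking the arrays.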