Update dataframe cells according to matching cells within another dataframe in pandas [duplicate] - python-3.x

I have two dataframes in Python. I want to update rows in the first dataframe using matching values from the second dataframe, which serves as an override.
Here is an example with sample data and code:
DataFrame 1:
   Code      Name  Value
0     1  Company1    200
1     2  Company2    300
2     3  Company3    400
DataFrame 2:
   Code      Name  Value
0     2  Company2   1000
I want to update dataframe 1 based on matching Code and Name. In this example, dataframe 1 should be updated as below:
   Code      Name  Value
0     1  Company1    200
1     2  Company2   1000
2     3  Company3    400
Note: the row with Code = 2 and Name = Company2 is updated with Value 1000 (coming from dataframe 2).
import pandas as pd

data1 = {
    'Code': [1, 2, 3],
    'Name': ['Company1', 'Company2', 'Company3'],
    'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])

data2 = {
    'Code': [2],
    'Name': ['Company2'],
    'Value': [1000],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Name', 'Value'])
Any pointers or hints?

Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
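One caveat: update only copies non-NaN values from the other frame. A minimal sketch of that behavior, starting from a fresh df1 and using a hypothetical override df2b that carries a NaN:
import numpy as np
df2b = pd.DataFrame({'Code': [2], 'Name': ['Company2'], 'Value': [np.nan]})
tmp = df1.set_index('Code')
tmp.update(df2b.set_index('Code'))
# Value for Code 2 remains 300: NaNs in the overriding frame are ignored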

You can use concat + drop_duplicates, which updates the common rows and adds any rows that are only in df2:
pd.concat([df1, df2]).drop_duplicates(['Code', 'Name'], keep='last').sort_values('Code')
Out[1280]:
   Code      Name  Value
0     1  Company1    200
0     2  Company2   1000
2     3  Company3    400
Update, based on the comments below:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)

You can merge the data first and then use numpy.where to take the new value where one exists:
import numpy as np
updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0

There is an update function available; note that it aligns on the index (see the sketch below).
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
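Because update aligns on the index, calling df1.update(df2) on the question's frames as-is would positionally overwrite row 0 of df1 rather than the Company2 row. A minimal sketch for this question, assuming Code and Name uniquely identify a row:
df1 = df1.set_index(['Code', 'Name'])
df1.update(df2.set_index(['Code', 'Name']))
df1 = df1.reset_index()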

You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
         .combine_first(df1.set_index(['Code', 'Name']))\
         .reset_index()
print(res)
#    Code      Name   Value
# 0     1  Company1   200.0
# 1     2  Company2  1000.0
# 2     3  Company3   400.0
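Note that combine_first takes the union of the two indexes, so rows that exist only in the overriding frame are kept as well. A small sketch with a hypothetical extra row:
extra = pd.DataFrame({'Code': [4], 'Name': ['Company4'], 'Value': [50]})
res = pd.concat([df2, extra]).set_index(['Code', 'Name'])\
        .combine_first(df1.set_index(['Code', 'Name']))\
        .reset_index()
# res now also contains the Code 4 / Company4 row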

Assuming Name and Code are redundant identifiers (so matching on Name alone is enough), you can also do:
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
#    Code      Name  Value
# 0     1  Company1    200
# 1     2  Company2   1000
# 2     3  Company3    400
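A variant of the same idea without materializing a dict, since Series.map also accepts a Series (again assuming Name alone identifies the row and is unique in df2):
mapper = df2.set_index('Name')['Value']
mask = df1['Name'].isin(mapper.index)
df1.loc[mask, 'Value'] = df1.loc[mask, 'Name'].map(mapper)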

You can use pd.Series.where on the result of left-joining df1 and df2:
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
to make the column an integer again.
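Equivalently, a small sketch with Series.fillna expresses the same fallback, since merged and df1 share the same row index here:
df1.Value = merged.Value_y.fillna(df1.Value).astype(int)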

There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how='left', on='Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe).
You want the ones that came from the right, so just remove any columns with '_x' and rename the '_y' ones:
for col in df_merged.columns:
    if '_x' in col:
        df_merged.drop(columns=col, inplace=True)
    if '_y' in col:
        new_name = col.removesuffix('_y')  # str.strip('_y') would strip characters, not the suffix
        df_merged.rename(columns={col: new_name}, inplace=True)
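A sketch of the same cleanup using merge's suffixes parameter instead, so the kept columns never need renaming (the '_drop' suffix is an arbitrary choice):
df_merged = pd.merge(df1, df2, how='left', on='Code', suffixes=('_drop', ''))
df_merged = df_merged.drop(columns=[c for c in df_merged.columns if c.endswith('_drop')])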

Append the dataset
Drop duplicates by Code
Sort the values
combined_df = df1.append(df2).drop_duplicates(['Code'], keep='last').sort_values('Code')
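DataFrame.append was deprecated and removed in pandas 2.0; a concat-based sketch of the same three steps:
combined_df = pd.concat([df1, df2]).drop_duplicates(['Code'], keep='last').sort_values('Code')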

None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
# .loc rather than .at, since .at only accepts a single label;
# this also assumes df2's rows line up, in order, with the matching rows of df1
df1.loc[indexes, 'Value'] = df2['Value'].values

Related

Concatenate dataframes in Pandas using an iteration but it doesn't work

I have several dataframes indexed more or less by the same MultiIndex (a few values may be missing in each dataframe, but the total number of rows exceeds 70K and the number of missing values is always less than 10). I want to attach/merge/concatenate a given dataframe (with the same indexation) to all of them. I tried doing this using a for iteration with a tuple, as in the example here. However, at the end, my data frames do not merge. I provide a simple example where this happens. Why do they not merge?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), index=["A", "B", "C", "D"], columns=["1st", "2nd", "3rd"])
df2 = df1 + 2
df3 = df1 - 2
for df in (df1, df2):
    df = pd.merge(df, df3, left_index=True, right_index=True, how="inner")
df1, df2
What is your expected result?
In the for loop, df is the loop variable and also the name on the left-hand side of the assignment statement, so each merge result is bound to df and then discarded on the next iteration; df1 and df2 themselves are never modified. Here is the same loop with print statements to provide additional information.
for df in (df1, df2):
    print(df)
    print('-----')
    df = pd.merge(df, df3, left_index=True, right_index=True, how="inner")
    print(df)
    print('==========', end='\n\n')
print(df)
You could combine df1, df2 and df3 like this.
print(pd.concat([df1, df2, df3], axis=1))
   1st  2nd  3rd  1st  2nd  3rd  1st  2nd  3rd
A    0    1    2    2    3    4   -2   -1    0
B    3    4    5    5    6    7    1    2    3
C    6    7    8    8    9   10    4    5    6
D    9   10   11   11   12   13    7    8    9
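To keep the three sets of column names distinguishable, a sketch using concat's keys parameter (the 'df1'/'df2'/'df3' labels are arbitrary), which produces a column MultiIndex:
print(pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3']))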
UPDATE
Here is an idiomatic way to import and concatenate several CSV files, possibly in multiple directories. In short: read each file into a separate data frame; add each data frame to a list; concatenate once at the end.
Reference: https://pandas.pydata.org/docs/user_guide/cookbook.html#reading-multiple-files-to-create-a-single-dataframe
import pandas as pd
from pathlib import Path
df = list()
for filename in Path.cwd().rglob('*.csv'):
    with open(filename, 'rt') as handle:
        t = pd.read_csv(handle)
        df.append(t)
        print(filename.name, t.shape)
df = pd.concat(df)
print('\nfinal: ', df.shape)
penny.csv (62, 8)
penny-2020-06-24.csv (144, 9)
...etc
final: (474, 20)
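The same pattern as a one-line sketch, since read_csv accepts Path objects directly:
df = pd.concat([pd.read_csv(f) for f in Path.cwd().rglob('*.csv')])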

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
     a     b
0  500     $
1  200     €
2   13  .586
3   47   .02
I want to merge those two columns without affecting the data.
Expected output:
df
Out:
        a
0    500$
1    200€
2  13.586
3   47.02
Please help me with this...
I tried the solutions below, but they do not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution converts both columns to strings and joins them with +, then converts the Series to a one-column DataFrame; but it only works if the numbers in column b are less than 1:
df1 = df.astype(str)
df = (df1.a + df1.b.str.replace(r'^0', '', regex=True)).to_frame('a')
print(df)
        a
0    500$
1    200€
2  13.586
3   47.02
Or, if you want mixed values (numeric for the last 2 rows, strings for the first 2 rows), use:
m = df.b.apply(lambda x: isinstance(x, str))
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b
df.loc[~m, 'a'] += df.pop('b')
print(df)
        a
0    500$
1    200€
2  13.586
3   47.02

Merge Pandas dataframes to create a list for duplicate matches

I have two dataframes:
df1 = pd.DataFrame([['ida', 1], ['idb', 2], ['idc', 3]], columns=['A','B'])
df2 = pd.DataFrame([['idb', 20], ['ida', 10], ['idb', 21], ['idb', 22]], columns=['A', 'C'])
and I would like to append the data from df2 to df1 into a list:
df3 =
     A    B             C
0  ida    1          [10]
1  idb    2  [20, 21, 22]
2  idc    3           NaN
I can merge both frames:
df1.merge(df2, how='left')
     A  B     C
0  ida  1  10.0
1  idb  2  20.0
2  idb  2  21.0
3  idb  2  22.0
4  idc  3   NaN
But then how do I "merge" the matching rows? Also, in reality df2 is a lot larger, and I only want to copy column "C", not columns "D", "E", "F"...
Alternatively, I can create a new column in df1 and then iterate over df2 to fill it:
for n, row in df2.iterrows():
    idx = df1.index[row['A'] == df1['A']]
    for i in idx:  # hopefully only 1 or 0 values in idx
        <assign value> df1.at[i, 'A'] = ???
The reason I want to have lists is that there is a flexible number of 'C'-values and I later want to calculate the average, standard deviation, ...
Edit: Typo
With pandas version 0.24.x and upwards you can use:
import numpy as np
import pandas as pd
df3 = (df1.merge(
    df2.groupby('A')['C'].apply(np.array),
    how='left',
    left_on='A',
    right_index=True))
And for your summary statistics:
df3['C'].apply(lambda x: np.std(x))
df3['C'].apply(lambda x: np.mean(x))
This is a perfect example of a merge followed by a groupby that applies the list function, like the following:
# Merge on key columns A
df3 = pd.merge(df1, df2, on='A', how='outer')
# Output1
     A  B     C
0  ida  1  10.0
1  idb  2  20.0
2  idb  2  21.0
3  idb  2  22.0
4  idc  3   NaN
# Groupby and apply list to keep values
df_final = df3.groupby('A').C.apply(list).reset_index()
     A                   C
0  ida              [10.0]
1  idb  [20.0, 21.0, 22.0]
2  idc               [nan]
EDIT:
If you want to only bring certain columns after a merge you can do the following:
df3 = pd.merge(df1, df2[['A', 'C']], on='A', how='outer')

How to sum columns in python based on column with not empty string

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3],
    'Sum over columns': [1, 10, 8, 5, 10]})
Hi everybody, could you please help me with the following issue?
I'm trying to sum over columns to get the sum of data1 and data2.
If the string column key1 is not NaN, data1 should be included in the sum; if key2 is not NaN, data2 should be included. The result I want is shown in the 'Sum over columns' column. Thank you for your help!
Try using the .apply method of df on axis=1 and numpy's array multiplication function to get your desired output:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})
# explicit column labels instead of positional slices, so the result
# does not depend on column order
df['Sum over columns'] = df.apply(
    lambda x: np.multiply(x[['data1', 'data2']].to_numpy(),
                          ~x[['key1', 'key2']].isnull().to_numpy()).sum(),
    axis=1)
Or:
df['Sum over columns'] = np.multiply(df[['data1', 'data2']].to_numpy(), ~df[['key1', 'key2']].isnull().to_numpy()).sum(axis=1)
Either one of them should yield:
#   key1  data1 key2  data2  Sum over columns
# 0  NaN      2   ab      1                 1
# 1    a      5   aa      5                10
# 2    b      8  NaN      9                 8
# 3    b      5  NaN      6                 5
# 4    a      7  one      3                10
I hope this helps.

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1, df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1 == df1) & (df2 == df2) & cond] = df1[(df1 == df1) & (df2 == df2) & cond]
    res[(df1 == df1) & (df2 == df2) & (~cond)] = df2[(df1 == df1) & (df2 == df2) & (~cond)]
    res[(df1 == df1) & (df2 != df2) & (~cond)] = df1[(df1 == df1) & (df2 != df2)]
    res[(df1 != df1) & (df2 == df2) & (~cond)] = df2[(df1 != df1) & (df2 == df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd

A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).groupby(level=0).max()
#
#      0    1    2
# 0  3.0  2.0  3.0
#
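If the two frames share the same index and columns, numpy's fmax is a concat-free sketch of the same thing; where exactly one operand is NaN, fmax returns the other value:
res = np.fmax(A, B)  # elementwise maximum, ignoring a NaN on either side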
You can use where to test your df against another df: where the condition is True, the values from df are returned; where it is False, the values from df1 are returned. Additionally, in the case where NaN values are in df1, a follow-up call to fillna(df) will use the values from df to fill those NaNs and return the desired result:
In [178]:
df = pd.DataFrame(np.random.randn(5, 3))
df.iloc[1, 2] = np.nan
print(df)
df1 = pd.DataFrame(np.random.randn(5, 3))
df1.iloc[0, 0] = np.nan
print(df1)
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.281660  1.187589       NaN
2 -0.067425  0.850808  1.461418
3 -0.447670  0.307405  1.038676
4 -0.130232 -0.171420  1.192321
          0         1         2
0       NaN -0.244273 -1.963712
1 -0.043011 -1.588891  0.784695
2  1.094911  0.894044 -0.320710
3 -1.537153  0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.043011  1.187589  0.784695
2  1.094911  0.894044  1.461418
3 -0.447670  0.558547  1.038676
4 -0.130232 -0.171420  1.192321
