How to copy values from another dataframe based on a condition (matching values in a specific column)? - python-3.x

I have two dataframes (df1 and df2) and they look like this:
import numpy as np
import pandas as pd

data1 = {'col1': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4], 'col2': np.arange(1, 13) * 2}
df1 = pd.DataFrame(data1)
data2 = {'x': [1, 2, 3, 4], 'y': [10, 20, 40, 5]}
df2 = pd.DataFrame(data2)
I would like to add a new column 'col3' to df1 with the values of df2['y'] wherever df1['col1'] equals df2['x']. So my df1 would look like:
col1  col2  col3
   1     2    10
   2     4    20
   3     6    40
   4     8     5
   1    10    10
   2    12    20
   3    14    40
   4    16     5
   1    18    10
   2    20    20
   3    22    40
   4    24     5
Could anyone help me?

Use map with a dictionary created from df2:
df1['col3'] = df1.col1.map(dict(df2[['x', 'y']].values))
or
df1['col3'] = df1.col1.map(dict(zip(df2.x, df2.y)))
Out[886]:
    col1  col2  col3
0      1     2    10
1      2     4    20
2      3     6    40
3      4     8     5
4      1    10    10
5      2    12    20
6      3    14    40
7      4    16     5
8      1    18    10
9      2    20    20
10     3    22    40
11     4    24     5
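Note that map also accepts a Series directly, so the intermediate dict can be skipped by indexing df2 on x (a minor variation, not from the original answer):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4] * 3, 'col2': np.arange(1, 13) * 2})
df2 = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 40, 5]})

# map accepts a Series as well as a dict; the Series index (df2['x']) acts as the lookup key
df1['col3'] = df1['col1'].map(df2.set_index('x')['y'])
```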

Use a merge:
(df1.merge(df2, how='left', left_on='col1', right_on='x')
    [['col1', 'col2', 'y']]
    .rename(columns={'y': 'col3'}))
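A slight variation on the merge, as a sketch: renaming df2 up front lets the keys align on 'col1' directly, so there is no leftover 'x' column and no post-merge rename.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4] * 3, 'col2': np.arange(1, 13) * 2})
df2 = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 40, 5]})

# rename first, then merge on the shared key; a left merge preserves df1's row order
out = df1.merge(df2.rename(columns={'x': 'col1', 'y': 'col3'}), on='col1', how='left')
```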

Reassigning multiple columns with same array

I've been breaking my head over this simple thing. I know we can assign a single value to multiple columns using .loc, but how do we assign the same array to multiple columns?
I know I can do this. Let's say we have a dataframe df in which I wish to replace some columns with the array arr:
import random
import pandas as pd

df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
>>> df
    a   b  c
0  14   8  5
1  10  25  9
2  14  14  8
3  10   6  7
4   4  18  2
arr = [i for i in range(5)]
# Suppose I wish to replace columns `a` and `b` with the array `arr`
df['a'], df['b'] = [arr for j in range(2)]
Desired output:
   a  b  c
0  0  0  5
1  1  1  9
2  2  2  8
3  3  3  7
4  4  4  2
Or I could do this with a loop-wise assignment. But is there a more efficient way, without repetition or loops?
Let's try with assign:
cols = ['a', 'b']
df.assign(**dict.fromkeys(cols, arr))
   a  b  c
0  0  0  5
1  1  1  9
2  2  2  8
3  3  3  7
4  4  4  2
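One caveat worth spelling out: assign returns a new DataFrame rather than modifying df in place, so the result has to be bound back to a name to persist (a small sketch; the sample values are taken from the question's first df):

```python
import pandas as pd

df = pd.DataFrame({'a': [14, 10, 14, 10, 4],
                   'b': [8, 25, 14, 6, 18],
                   'c': [5, 9, 8, 7, 2]})
arr = list(range(5))
cols = ['a', 'b']

# assign returns a new DataFrame; rebind it to keep the result
df = df.assign(**dict.fromkeys(cols, arr))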
I used a chained assignment: df.a = df.b = arr
import random
import pandas as pd

df = pd.DataFrame({'a': [random.randint(1, 25) for i in range(5)],
                   'b': [random.randint(1, 25) for i in range(5)],
                   'c': [random.randint(1, 25) for i in range(5)]})
arr = [i for i in range(5)]
df
    a   b   c
0   2   8  18
1  17  15  25
2   6   5  17
3  12  15  25
4  10  10   6
df.a = df.b = arr
df
   a  b   c
0  0  0  18
1  1  1  25
2  2  2  17
3  3  3  25
4  4  4   6
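A loop-free variant for an arbitrary list of columns is to assign a 2-D array whose shape matches (rows, len(cols)); np.tile is one way to build it (my own sketch, not from the original answers; sample values taken from the answer's df above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, 17, 6, 12, 10],
                   'b': [8, 15, 5, 15, 10],
                   'c': [18, 25, 17, 25, 6]})
arr = np.arange(5)
cols = ['a', 'b']

# build a (rows, len(cols)) array by tiling arr, then assign all columns at once
df[cols] = np.tile(arr, (len(cols), 1)).T
```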

Creating aggregate columns in a pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER': ["A", "A", "B", "B"], 'var1': [2, 3, 1, 5],
                   'a1_bal': [1, 2, 3, 4], 'a1c_bal': [10, 22, 36, 41],
                   'b1_bal': [1, 2, 33, 4], 'b1c_bal': [11, 22, 3, 4],
                   'm1_bal': [15, 2, 35, 4]})
df
  ORDER  var1  a1_bal  a1c_bal  b1_bal  b1c_bal  m1_bal
0     A     2       1       10       1       11      15
1     A     3       2       22       2       22       2
2     B     1       3       36      33        3      35
3     B     5       4       41       4        4       4
I want to create new columns as below:
a1_final_bal = a1_bal + a1c_bal
b1_final_bal = b1_bal + b1c_bal
m1_final_bal = m1_bal (since there is only an m1_bal field and no m1c_bal, it remains as it is)
I don't want to hardcode this step, because there might be more such columns, e.g. "c_bal", "m2_bal", "m2c_bal", etc.
My final data should look something like below:
  ORDER  var1  a1_bal  a1c_bal  b1_bal  b1c_bal  m1_bal  a1_final_bal  b1_final_bal  m1_final_bal
0     A     2       1       10       1       11      15            11            12            15
1     A     3       2       22       2       22       2            24            24             2
2     B     1       3       36      33        3      35            39            36            35
3     B     5       4       41       4        4       4            45             8             4
You could try something like this. I am not sure if it's exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER', 'var1'])     # creates a MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2]  # keep the first two letters of the remaining columns
df2 = (dfforgroup.groupby(dfforgroup.columns, axis=1).sum()  # group columns by their first two letters and sum them
       .reset_index()
       .drop(columns=['ORDER', 'var1'])
       .add_suffix('_final_bal'))
df = pd.concat([df, df2], axis=1)                # concatenate the new columns to the original df
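As an aside, groupby(..., axis=1) is deprecated in recent pandas; an equivalent that groups on the index after transposing (a sketch under the same first-two-letters assumption as the answer above):

```python
import pandas as pd

df = pd.DataFrame({'ORDER': ["A", "A", "B", "B"], 'var1': [2, 3, 1, 5],
                   'a1_bal': [1, 2, 3, 4], 'a1c_bal': [10, 22, 36, 41],
                   'b1_bal': [1, 2, 33, 4], 'b1c_bal': [11, 22, 3, 4],
                   'm1_bal': [15, 2, 35, 4]})

bal_cols = [c for c in df.columns if c.endswith('_bal')]
# transpose so the balance columns become the index, group that index
# by its first two letters, sum, and transpose back
sums = df[bal_cols].T.groupby(lambda c: c[:2]).sum().T.add_suffix('_final_bal')
df = pd.concat([df, sums], axis=1)
```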

Create an aggregate column based on other columns in a pandas dataframe

I have a dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data of lists
data = {'group': ["A", "A", "B", "B", "B"],
        'A1_val': [4, 5, 7, 6, 5],
        'A1M_val': [10, 100, 100, 10, 1],
        'AB_val': [4, 5, 7, 6, 5],
        'ABM_val': [10, 100, 100, 10, 1],
        'AM_VAL': [4, 5, 7, 6, 5]}
# Create DataFrame
df1 = pd.DataFrame(data)
df1
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL
0     A       4       10       4       10       4
1     A       5      100       5      100       5
2     B       7      100       7      100       7
3     B       6       10       6       10       6
4     B       5        1       5        1       5
Step 1: I want to create columns as below:
A1_agg_val = A1_val + A1M_val (strip the M out of the column name, and if the stripped name matches another column, sum them)
Similarly, AB_agg_val = AB_val + ABM_val
Since there is no matching column for 'AM_VAL', AM_agg_val = AM_VAL
My expected output:
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL  A1_AGG_val  AB_AGG_val  A_AGG_val
0     A       4       10       4       10       4          14          14          4
1     A       5      100       5      100       5         105         105          5
2     B       7      100       7      100       7         107         107          7
3     B       6       10       6       10       6          16          16          6
4     B       5        1       5        1       5           6           6          5
You can use groupby on axis=1:
out = df1.assign(**df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
                     .groupby(lambda x: x[:2], axis=1).sum()
                     .add_suffix('_agg_value'))
print(out)
  group  A1_val  A1M_val  AB_val  ABM_val  AM_VAL  A1_agg_value  AB_agg_value  AM_agg_value
0     A       4       10       4       10       4            14            14             4
1     A       5      100       5      100       5           105           105             5
2     B       7      100       7      100       7           107           107             7
3     B       6       10       6       10       6            16            16             6
4     B       5        1       5        1       5             6             6             5

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
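The same transform can also be spelled with the string alias 'sum', which dispatches to pandas' built-in aggregation path (a minor variation on the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

# 'sum' is the built-in aggregation alias; the result matches transform(np.sum)
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')
```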
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Does anyone have a simpler way to do it?
This is not so direct, but I found it very intuitive (using map to create a new column from another column), and it can be applied to many other cases:
gb = df.groupby('A').sum()['values']

def getvalue(x):
    return gb[x]

df['sum'] = df['A'].map(getvalue)
df
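Since map also accepts a Series, the helper function isn't strictly needed; the grouped sums can be mapped directly (a small simplification of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 1, 2],
                   'values': np.arange(10, 30, 5)})

gb = df.groupby('A')['values'].sum()  # Series indexed by the values of A
df['sum'] = df['A'].map(gb)           # index lookup replaces the getvalue helper
```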
In [15]: def sum_col(df, col, new_col):
....: df[new_col] = df[col].sum()
....: return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

Expanding/Duplicating dataframe rows based on condition

I am an R user who has recently started using Python 3 for data management. I am struggling to find a way to expand/duplicate data frame rows based on a condition, and I also need to be able to expand rows by a variable amount. I'll illustrate with this example.
I have this data:
df = pd.DataFrame([[1, 10], [1,15], [2,10], [2, 15], [2, 20], [3, 10], [3, 15]], columns = ['id', 'var'])
df
Out[6]:
id var
0 1 10
1 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
I would like to expand the rows for both ID == 1 and ID == 3: each ID == 1 row by one duplicate, and each ID == 3 row by two duplicates. The result would look like this:
df2
Out[8]:
id var
0 1 10
1 1 10
2 1 15
3 1 15
4 2 10
5 2 15
6 2 20
7 3 10
8 3 10
9 3 10
10 3 15
11 3 15
12 3 15
13 3 15
I've been trying to use np.repeat, but I'm failing to think of a way to use both the ID condition and per-ID duplication counts at the same time. Index ordering does not matter here, only that the rows are duplicated appropriately. I apologize in advance if this is an easy question. Thanks for any help, and feel free to ask clarifying questions.
This should do it:
dup = {1: 1, 3: 2}  # which id to duplicate, and how many extra copies to add
res = df.copy()
for k, v in dup.items():
    for i in range(v):
        # df.append was removed in pandas 2.0; pd.concat is the equivalent
        res = pd.concat([res, df.loc[df['id'] == k]], ignore_index=True)
res.sort_values(['id', 'var'], inplace=True)
res.reset_index(inplace=True, drop=True)
res
# id var
#0 1 10
#1 1 10
#2 1 15
#3 1 15
#4 2 10
#5 2 15
#6 2 20
#7 3 10
#8 3 10
#9 3 10
#10 3 15
#11 3 15
#12 3 15
P.S. Your desired solution has 7 values for id 3, while your description implies 6 values.
I think the code below gets the job done:
df_1 = df.loc[df.id == 1]
df_3 = df.loc[df.id == 3]
# pd.concat replaces the removed df.append: one extra copy of the id == 1 rows,
# then two extra copies of the id == 3 rows
df1 = pd.concat([df] + [df_1] * 1, ignore_index=True)
pd.concat([df1] + [df_3] * 2, ignore_index=True).sort_values(by='id')
id var
0 1 10
1 1 15
7 1 10
8 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
9 3 10
10 3 15
11 3 10
12 3 15
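The np.repeat idea the question mentions also works once a per-row repeat count is built; a sketch (the count mapping below is my own, chosen to match the example: id 1 kept twice, id 3 kept three times, everything else once):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 15], [2, 10], [2, 15], [2, 20],
                   [3, 10], [3, 15]], columns=['id', 'var'])

# total copies per row: ids not listed in the mapping default to 1 (kept once)
counts = df['id'].map({1: 2, 3: 3}).fillna(1).astype(int)
df2 = df.loc[df.index.repeat(counts)].reset_index(drop=True)
```

This yields the 13-row version (6 rows for id 3, per the description rather than the 7-row desired table).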
