How to replenish a data frame based on another one? - python-3.x

I have two data frames. The first contains a column of repeated values (a, in this case). The second records what each value corresponds to (in this example, a "d" value). How do I efficiently add a new column to the first data frame whose values are looked up from an existing column, according to the mapping recorded in the second data frame? Here is example code that works, but is very slow:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i,'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is the original d1:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is the resulting d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11

You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
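As a minimal sketch of the merge approach on the data from the question: the validate argument is an optional addition here, not part of the answer above. It asks pandas to raise an error if d2 ever contains duplicate keys in a.

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2, 3], [2, 4, 5], [3, 4, 5], [2, 1, 4], [3, 4, 5]],
                  columns=['a', 'b', 'c'])
d2 = pd.DataFrame([[1, 7], [2, 8], [3, 11]], columns=['a', 'd'])

# Left join keeps every row of d1, in its original order, and pulls in 'd'.
# validate='many_to_one' checks that each key appears at most once in d2.
merged = d1.merge(d2, on='a', how='left', validate='many_to_one')
print(merged)
```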

Another way is to use map, which looks up only the column you need:
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
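One thing to watch with map, sketched below on a slightly modified d1: a key with no match in d2 yields NaN, and the new column becomes float. The extra row with a = 4 and the fill value -1 are assumptions made purely for illustration.

```python
import pandas as pd

d1 = pd.DataFrame({'a': [1, 2, 3, 2, 4]})   # 4 has no entry in d2 (hypothetical)
d2 = pd.DataFrame({'a': [1, 2, 3], 'd': [7, 8, 11]})

lookup = d2.set_index('a')['d']             # Series: index = key, values = 'd'
d1['d'] = d1['a'].map(lookup)               # unmatched keys become NaN
d1['d_filled'] = d1['d'].fillna(-1).astype(int)  # -1 is an arbitrary placeholder
print(d1)
```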

Related

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL':[1,3,8,5,8,10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" holding the sum of 'VAL' over all rows whose GROUP name is the same except for an H character at the end. For example, 'VAL' for the first two rows will be added together, because the only difference between their 'GROUP' values is that the second row ends in H. Row 3 will remain as it is, rows 4 and 6 will be added together, and row 5 will remain the same.
My expected output:
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try replace followed by transform:
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
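Note that replace('H','') removes every H in the name, not only a trailing one. If group names could contain an H elsewhere, a sketch that strips only a final H might look like the following. The anchored regex and the variable name key are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'ORDER': ["A", "A", "A", "B", "B", "B"],
    'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1",
              "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"],
    'VAL': [1, 3, 8, 5, 8, 10],
})

# Strip a single trailing 'H' only; 'H$' is anchored to the end of the string.
key = df['GROUP'].str.replace(r'H$', '', regex=True)
df['CAL'] = df.groupby(key)['VAL'].transform('sum')
print(df)
```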

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C']].min()
but I get errors about hashing lists. I also think this would give the minimum of each whole column, whereas I only want the minimum across those specific columns within each row.
How can I best accomplish this?
Use min along the columns with axis=1
Inline solution that produces a copy and doesn't alter the original:
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
The same answer, put differently:
Add the column to the existing dataframe
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)
You have to specify the axis along which the min function is applied. Here that is 1 (columns), which gives one minimum per row.
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)
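To make the axis difference concrete, here is a small self-contained sketch using the frame from the question. The variable names col_min and row_min are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2], 'B': [3, 1, 3], 'C': [4, 2, 4],
                   'D': [5, 3, 5], 'E': [6, 2, 6]})

col_min = df[['A', 'B', 'C']].min()         # axis=0 (default): one value per column
row_min = df[['A', 'B', 'C']].min(axis=1)   # axis=1: one value per row
df['Goal'] = row_min
print(col_min)
print(df)
```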

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
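I'm going to use np.argpartition to separate each column into the 5 smallest and the 10 - 5 (also 5) largest values. The result holds the index labels of the 5 largest values per column, in no particular order:
v = df.values
i = df.index.values
n = 5  # number of largest values to keep per column
k = len(v) - n  # partition point: positions k onward hold the n largest
pd.DataFrame(
    i[v.argpartition(k, 0)[-n:]],
    np.arange(n), df.columns
)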
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
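Alternatively, for a single column you can sort it in descending order and take the first five entries:
print(your_dataframe['A'].sort_values(ascending=False).head(5))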
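If the [value, index, column] layout from the question is what you need, a sketch using Series.nlargest could look like this. The helper name top_n is hypothetical, and the frame is written out literally from the table above so the example does not depend on the random seed.

```python
import pandas as pd

# The frame from the question, typed in from the printed table.
df = pd.DataFrame(
    {'A': [0, 8, 8, 0, 3, 3, 4, 5, 6, 2],
     'B': [2, 7, 6, 4, 2, 6, 5, 9, 4, 6],
     'C': [7, 0, 0, 9, 4, 7, 3, 8, 7, 6],
     'D': [3, 6, 2, 7, 3, 7, 7, 7, 6, 5]},
    index=list('abcdefghij'))

def top_n(frame, n=5):
    """Hypothetical helper: {column: [[value, index label, column], ...]}."""
    return {
        col: [[int(val), idx, col]
              for idx, val in frame[col].nlargest(n).items()]
        for col in frame.columns
    }

top = top_n(df)
print(top['A'])
```

Unlike the argpartition approach, nlargest returns the values sorted in descending order, at the cost of doing one pass per column.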

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':[1,1,2,1,1,1],
'C':[2,4,6,3,5,7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows per columns 'A' and 'B', I'd like to combine those rows and sum the value under column 'C' like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
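A possible variant, sketched below, passes as_index=False so that the grouping keys stay as ordinary columns and the reset_index() call is not needed. Selecting ['C'] explicitly is an extra assumption here; it simply makes clear which column is being summed.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': [1, 1, 2, 1, 1, 1],
                   'C': [2, 4, 6, 3, 5, 7]})

# as_index=False keeps 'A' and 'B' as ordinary columns in the result.
out = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print(out)
```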

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'
I think you need to add column F to the list:
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print(df[cols + ['F'] + cols2])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
You need to pass a single flat list of column names:
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Also consider using df.loc[:, cols+['F']+cols2] for slicing. (The older df.ix indexer was deprecated and then removed in pandas 1.0.)
Python 3.5+ solution, using unpacking inside the list:
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
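Putting the pieces together, here is a self-contained sketch of the flat-list approach combined with .loc. The names wanted and subset are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
cols = ['A', 'B']
cols2 = ['C', 'D']

# df[cols, 'F', cols2] passes a tuple that contains lists as a single key,
# which is why the lookup fails. One flat list of names avoids that.
wanted = [*cols, 'F', *cols2]
subset = df.loc[:, wanted]
print(subset)
```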
