How to replenish a data frame based on another one?

How to replenish a data frame based on another one? - python-3.x

Given two data frames. One contains a column of repeated values (a, in this case). The other contains what this value corresponds to (in this example, it corresponds to some "d" values). How do I efficiently replenish the first data frame with a new column, values in which correspond to some existent column, according to a rule recorded in the other data frame. Here is an example code that works really slow:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
temp = d2.loc[d2['a'] == d1.at[i,'a']]
d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11

You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.

Another way is to use map and make only the values you need.
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11

Related

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL':[1,3,8,5,8,10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as sum of 'VAL' where GROUP name is same for all the rows expect H character in the end. So, for example, 'VAL' column for 1st two rows will be added because the only difference between the 'GROUP' is 2nd row has H in the last. Row 3 will remain as it is, Row 4 and 6 will get added and Row 5 will remain same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15

Try with replace then transform
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C]].min()
but I get errors about hashing lists and also I think this will be the min of the whole column I only want the min of the row for those specific columns.
How can I best accomplish this?

Use min along the columns with axis=1
Inline solution that produces copy that doesn't alter the original
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Same answer put different
Add column to existing dataframe
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2

Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)

you have to define an axis across which you are applying the min function, which would be 1 (columns).
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].

consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
i[v.argpartition(k, 0)[-k:]],
np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g

print(your_dataframe.sort_values(ascending=False)[0:4])

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':[1,1,2,1,1,1],
'C':[2,4,6,3,5,7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows per columns 'A' and 'B', I'd like to combine those rows and sum the value under column 'C' like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!

use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'

I think you need add column F to list:
allcols = cols + ['F'] + cols2
print df[allcols]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print df[cols + ['F'] +cols2]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5

Need give a list with columns for reference.
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
and, consider using df.loc[:, cols+['F']+cols2], df.ix[:, cols+['F']+cols2] for slicing.

Python 3 solution:
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to replenish a data frame based on another one? - python-3.x

You're probably looking for pd.merge. In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.

Another way is to use map and make only the values you need. d1['d'] = d1['a'].map(d2.set_index('a')['d']) d1 Output: a b c d 0 1 2 3 7 1 2 4 5 8 2 3 4 5 11 3 2 1 4 8 4 3 4 5 11

Related

Python create a column based on the values of each row of another column

Create a new column with the minimum of other columns on same row

Column name and index of max value

Pandas Conditionally Combine (and sum) Rows

Pandas use variable for column names part 2

Categories

Resources