I have a dataframe like the following:
Col1 Col2 Col3
A 1 10
A 2 20
A 3 30
B 1 10
B 2 20
C 4 40
C 5 70
I want the output to look like the following:
Col1 Col2 Col3
A 3 30
B 2 20
C 5 70
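In your case, use sort_values + drop_duplicates with keep='last', so that the row with the largest Col2 in each Col1 group is the one kept:
out = df.sort_values('Col2').drop_duplicates('Col1', keep='last')
Output:
Col1 Col2 Col3
4 B 2 20
2 A 3 30
6 C 5 70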
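Another way to pick the row with the largest Col2 in each Col1 group is groupby + idxmax. The following is a minimal sketch, assuming the dataframe from the question; the variable names idx and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "A", "A", "B", "B", "C", "C"],
    "Col2": [1, 2, 3, 1, 2, 4, 5],
    "Col3": [10, 20, 30, 10, 20, 40, 70],
})

# idxmax returns, for each group, the index label of the row with the largest Col2
idx = df.groupby("Col1")["Col2"].idxmax()

# .loc with those labels pulls out the full rows; reset_index tidies the labels
out = df.loc[idx].reset_index(drop=True)
print(out)
```

If several rows in a group share the maximum, idxmax keeps only the first of them, so this sketch returns exactly one row per group.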
I have the following pandas dataframe:
col1 col2 col3 .... colN
5 2 4 .... 9
1 2 3 .... 9
7 1 4 .... 0
1 4 7 .... 8
What I need is a way to determine the order between several columns:
col1 col2 col3 .... colN
5 2 4 .... 9 ----> colN >= ... >= col1 >= col3 >= col2
1 2 3 .... 9 ----> colN >= ... >= col3 >= col2 >= col1
7 1 4 .... 0 ----> col1 >= ... >= col3 >= col2 >= colN
1 4 7 .... 8 ----> colN >= ... >= col3 >= col2 >= col1
And give them a numeric alias. For example:
colN >= ... >= col1 >= col3 >= col2 = X
colN >= ... >= col3 >= col2 >= col1 = Y
col1 >= ... >= col3 >= col2 >= colN = Z
:
:
col1 col2 col3 .... colN order
5 2 4 .... 9 X
1 2 3 .... 9 Y
7 1 4 .... 0 Z
1 4 7 .... 8 Y
:
:
The number of columns may change, and the alias has to follow a pattern. Example with 3 columns:
col1 >= col2 >= col3 = 1
col1 >= col3 >= col2 = 2
col2 >= col1 >= col3 = 3
col2 >= col3 >= col1 = 4
col3 >= col1 >= col2 = 5
col3 >= col2 >= col1 = 6
Thanks and regards
You can use:
df['order'] = df.apply(lambda x: '>='.join(x.sort_values(ascending=False).index), axis=1)
df['alias'] = df.groupby('order').ngroup() + 1
Input
col1 col2 col3
0 5 2 4
1 1 2 3
2 7 1 4
3 1 4 7
Output:
col1 col2 col3 order alias
0 5 2 4 col1>=col3>=col2 1
1 1 2 3 col3>=col2>=col1 2
2 7 1 4 col1>=col3>=col2 1
3 1 4 7 col3>=col2>=col1 2
Or, for a specific pattern, map each order string to the alias you want:
alias_pattern = {'col1>=col3>=col2' : 2, 'col3>=col2>=col1' : 5}
df['alias'] = df['order'].map(alias_pattern)
Output:
col1 col2 col3 order alias
0 5 2 4 col1>=col3>=col2 2
1 1 2 3 col3>=col2>=col1 5
2 7 1 4 col1>=col3>=col2 2
3 1 4 7 col3>=col2>=col1 5
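If the alias has to follow the numbering from the question for any number of columns, the mapping can be generated instead of typed by hand. This is a sketch, assuming the intended pattern is the lexicographic order of column permutations (which matches the 3-column example in the question); the sample data is the input shown above.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({"col1": [5, 1, 7, 1],
                   "col2": [2, 2, 1, 4],
                   "col3": [4, 3, 4, 7]})
cols = list(df.columns)

# permutations() yields orderings in lexicographic order of the input list,
# so col1>=col2>=col3 gets alias 1 and col3>=col2>=col1 gets alias 6.
alias_pattern = {">=".join(p): i
                 for i, p in enumerate(permutations(cols), start=1)}

# Row-wise ordering string, largest value first.
df["order"] = df[cols].apply(
    lambda x: ">=".join(x.sort_values(ascending=False).index), axis=1)
df["alias"] = df["order"].map(alias_pattern)
print(df)
```

The number of permutations grows factorially with the number of columns, so this is only practical for a handful of columns. Rows with tied values are a separate concern: the order of tied columns in the string depends on the sort, so the alias for such rows may need an explicit tie-breaking rule.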
I would like to know how to split a dataframe with MultiIndex columns like this:
A B
col1 col2 col1 col2
1 2 12 21
3 1 2 0
into two separate DataFrames. df_A:
col1 col2
1 2
3 1
df_B:
col1 col2
12 21
2 0
Thank you for the help
I think it is better here to use DataFrame.xs to select by the first level:
print (df.xs('A', axis=1, level=0))
col1 col2
0 1 2
1 3 1
What you are asking for (separate variables) is not recommended, but it is possible to create the DataFrames by group:
for i, g in df.groupby(level=0, axis=1):
    globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)
print (df_A)
col1 col2
0 1 2
1 3 1
It is better to create a dictionary of DataFrames:
d = {i: g.droplevel(level=0, axis=1) for i, g in df.groupby(level=0, axis=1)}
print (d['A'])
col1 col2
0 1 2
1 3 1
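In recent pandas releases, groupby with axis=1 emits a deprecation warning, so the same dictionary can also be built by plain column selection. The sketch below assumes the two-level column layout from the question; the construction of df is only there to make the example self-contained.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B"], ["col1", "col2"]])
df = pd.DataFrame([[1, 2, 12, 21], [3, 1, 2, 0]], columns=columns)

# Selecting a first-level key on MultiIndex columns returns the sub-frame
# with that level already dropped.
d = {key: df[key] for key in df.columns.get_level_values(0).unique()}
print(d["B"])
```

Each value in d is an ordinary DataFrame with columns col1 and col2, so d['A'] and d['B'] play the role of df_A and df_B.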
I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on the unique values in the foo column (col1, col2, col3, col4):
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
If foo = col1, then col1 contains a random number between 75 and 100, and the other columns (col2, col3, col4) contain random numbers such that each row sums to 100. The same applies when foo names one of the other columns.
I can manually create a new column and assign a random number, but I'm unsure how to include the logic of sum for each row of 100.
Appreciate any help!
My two cents
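import numpy as np
import pandas as pd

d = []
# one large value per row, drawn from [75, 100)
s = np.random.randint(75, 100, size=6)
for x in 100 - s:
    # split the remainder x into three non-negative integers
    a = np.random.randint(100, size=3)
    b = np.random.multinomial(x, a / a.sum())
    d.append(b.tolist())
# shuffle the four values within each row (the large value lands in a random column)
s = [np.random.choice(x, 4, replace=False) for x in np.column_stack((s, np.array(d)))]
df = pd.concat([df, pd.DataFrame(s, index=df.index)], axis=1)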
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75,100), np.random.randint(0,100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0,100-x['col1'],1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0,100-x[['col1','col2']].sum(),1)[0], axis=1)
df['col4'] = 100 - df[['col1','col2','col3']].sum(1).astype(int)
df[['col1','col2','col3','col4']].sum(1)
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np
import pandas as pd

def weird(lower, upper, k, col, cols):
    # value for the column named in foo, drawn from [lower, upper)
    first_num = np.random.randint(lower, upper)
    delta = upper - first_num
    # split the remainder into k - 1 non-negative integers that sum to delta
    the_rest = np.random.rand(k - 1)
    the_rest = the_rest / the_rest.sum() * delta
    the_rest = the_rest.astype(int)
    the_rest[-1] = delta - the_rest[:-1].sum()
    # the stable sort moves col to the front so that it pairs with first_num
    key = lambda x: x != col
    return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))

def f(c):
    return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])

df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract 75 from the numbers between 75 and 100, the problem becomes generating a table of random numbers between 0 and 25 in which each row sums to 25. That can be solved with a reverse cumsum: draw sorted cut points, then take the differences between neighbours.
num_cols = 4
# draw num_cols - 1 random cut points per row and sort them
a = np.sort(np.random.randint(0, 25, (len(df), num_cols - 1)), axis=1)
# create a dataframe and attach a last column with the value 25
new_df = pd.DataFrame(a)
new_df[num_cols] = 25
# row-wise differences between neighbouring cut points; the first one is measured from 0
diffs = new_df.diff(axis=1)
diffs[0] = new_df[0]
# add the differences to the dummies, which hold 75 in the column named by foo
dummies = pd.get_dummies(df.foo) * 75
dummies += diffs.values
And dummies is (values vary from run to run, but each row sums to 100):
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 15.0 79.0 2.0 4.0
2 78.0 5.0 8.0 9.0
3 8.0 3.0 79.0 10.0
4 9.0 2.0 1.0 88.0
5 10.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.
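The cut-point idea can also be written with NumPy alone, using np.diff with its prepend and append arguments. This is a sketch under a few assumptions: the frame is the one from the question, the seed is arbitrary and only there for reproducibility, and the names cuts, parts, base and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": list("AABBDC"),
                   "foo": ["col1", "col2", "col1", "col3", "col4", "col2"]})
cols = ["col1", "col2", "col3", "col4"]
rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Three sorted cut points in [0, 25] per row split 25 into four parts.
cuts = np.sort(rng.integers(0, 26, size=(len(df), len(cols) - 1)), axis=1)
parts = np.diff(cuts, axis=1, prepend=0, append=25)  # each row sums to 25

# 75 in the column named by foo, 0 elsewhere.
base = pd.get_dummies(df["foo"]).reindex(columns=cols, fill_value=0).astype(int) * 75

result = df.join(base + parts)
print(result)
```

By construction the column named in foo ends up in the range 75 to 100, the other columns stay within 0 to 25, and every row sums to 100.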
I am trying to understand how to perform arithmetic operations on a dataframe in Python.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':[2,38,7,5],'col2':[1,3,2,4]})
print(df.sum())
This is what I'm getting (in terms of the output), but I want to have more control over which sum I am getting.
col1 52
col2 10
dtype: int64
Just wondering how I would add individual elements in the dataframe together.
Your question is not very clear, but I will try to cover the likely scenarios. Each example below starts from the original input.
Input:
df
col1 col2
0 2 1
1 38 3
2 7 2
3 5 4
If you want the sum of columns,
df.sum(axis = 0)
Output:
col1 52
col2 10
dtype: int64
If you want the sum of rows,
df.sum(axis = 1)
Output:
0 3
1 41
2 9
3 9
dtype: int64
If you want to add a list of numbers to a column, element by element,
num = [1, 2, 3, 4]
df['col1'] = df['col1'] + num
df
Output:
col1 col2
0 3 1
1 40 3
2 10 2
3 9 4
If you want to add a list of numbers to a row, element by element,
num = [1, 2]
df.loc[0] = df.loc[0] + num
df
Output:
col1 col2
0 3 3
1 38 3
2 7 2
3 5 4
If you want to add a single number to a column,
df['col1'] = df['col1'] + 2
df
Output:
col1 col2
0 4 1
1 40 3
2 9 2
3 7 4
If you want to add a single number to a row,
df.loc[0] = df.loc[0] + 2
df
Output:
col1 col2
0 4 3
1 38 3
2 7 2
3 5 4
If you want to add a number to a single element (row i, column j),
df.iloc[1,1] = df.iloc[1,1] + 5
df
Output:
col1 col2
0 2 1
1 38 8
2 7 2
3 5 4
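If the goal is to add individual elements of the dataframe to each other rather than to a constant, label-based access covers it. The following is a small sketch, assuming the df from the question; the names cell_sum, row_sum and block_sum are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 38, 7, 5], "col2": [1, 3, 2, 4]})

# Add two individual cells: row 0 of col1 plus row 1 of col2
cell_sum = df.at[0, "col1"] + df.at[1, "col2"]

# Add the two columns element by element (same result as df.sum(axis=1) here)
row_sum = df["col1"] + df["col2"]

# Sum only a chosen block: rows 1 to 2 (inclusive with .loc) of both columns
block_sum = df.loc[1:2, ["col1", "col2"]].to_numpy().sum()

print(cell_sum, row_sum.tolist(), block_sum)
```

df.at is meant for single-cell access by label, while .loc with a slice and a column list selects a sub-table whose sum can be taken as a whole.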