pandas fill column with random numbers with a total for each row - python-3.x

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on the unique values in the foo column: col1, col2, col3, col4.
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo = col1, then col1 contains a random number between 75 and 100, and the other columns (col2, col3, col4) contain random numbers such that the total for each row is 100.
I can manually create a new column and assign a random number, but I'm unsure how to enforce the constraint that each row sums to 100.
Appreciate any help!

My two cents
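import numpy as np
import pandas as pd

d = []
# one number per row in [75, 100)
s = np.random.randint(75, 100, size=6)
for x in 100 - s:
    # split the remainder x across three columns with random weights
    a = np.random.randint(100, size=3)
    b = np.random.multinomial(x, a / a.sum())
    d.append(b.tolist())
# shuffle the four values within each row
s = [np.random.choice(x, 4, replace=False) for x in np.column_stack((s, np.array(d)))]
df = pd.concat([df, pd.DataFrame(s, index=df.index)], axis=1)
df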
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87

IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75,100), np.random.randint(0,100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0,100-x['col1'],1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0,100-x[['col1','col2']].sum(),1)[0], axis=1)
df['col4'] = 100 - df[['col1','col2','col3']].sum(1).astype(int)
df[['col1','col2','col3','col4']].sum(1)
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4

My Approach
import numpy as np
import pandas as pd

def weird(lower, upper, k, col, cols):
    # the main value lands in [lower, upper)
    first_num = np.random.randint(lower, upper)
    delta = upper - first_num
    # split the remainder across the other k - 1 columns
    the_rest = np.random.rand(k - 1)
    the_rest = the_rest / the_rest.sum() * (delta)
    the_rest = the_rest.astype(int)
    the_rest[-1] = delta - the_rest[:-1].sum()
    # sort so that `col` comes first and receives first_num
    key = lambda x: x != col
    return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))

def f(c): return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])

df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8

If we subtract 75 from the number in the 75-100 range, the problem becomes generating a table of random numbers between 0 and 25 in which each row sums to 25. That can be solved with a reverse cumsum:
num_cols = 4
# generate num_cols - 1 random cut points per row and sort them within each row
a = np.sort(np.random.randint(0, 25, (len(df), num_cols - 1)), axis=1)
# create a dataframe and attach a last column with the value 25
new_df = pd.DataFrame(a)
new_df[num_cols - 1] = 25
# the row-wise differences are our numbers (the first column keeps its own value); add them to the dummies:
dummies = pd.get_dummies(df.foo) * 75
dummies += new_df.diff(axis=1).fillna(new_df).values
And dummies is
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 4.0
2 76.0 5.0 8.0 9.0
3 1.0 3.0 79.0 10.0
4 1.0 2.0 1.0 88.0
5 1.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.
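The cut-point idea above can also be sketched end to end. This is a minimal version under a few assumptions: the frame is rebuilt from the question's sample, `rng`, `cuts`, `parts`, `base` and `filled` are illustrative names, the seed is arbitrary, and `np.diff` with `prepend`/`append` stands in for the dataframe `diff`.

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "id": ["A", "A", "B", "B", "D", "C"],
    "foo": ["col1", "col2", "col1", "col3", "col4", "col2"],
})

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
cols = sorted(df["foo"].unique())
num_cols = len(cols)

# num_cols - 1 sorted cut points in [0, 25] for every row
cuts = np.sort(rng.integers(0, 26, size=(len(df), num_cols - 1)), axis=1)
# gaps between 0, the cut points and 25: num_cols parts that sum to 25
parts = np.diff(cuts, axis=1, prepend=0, append=25)

# 75 in the column named by foo, 0 everywhere else
base = pd.get_dummies(df["foo"]).reindex(columns=cols).astype(int) * 75
filled = df.join(base + parts)
print(filled)
```

Every row of the four new columns sums to 100, and the column named in foo always holds at least 75, because the 75 offset is added only there.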

Related

Group by and drop duplicates in pandas dataframe

I have a pandas dataframe as below. I want to group by based on all the three columns and retain the group with the max of Col1.
import pandas as pd
df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'B', 'B'], 'col2':['1', '1', '1', '1', '2', '3'], 'col3':['5', '5', '2', '2', '2', '3']})
df
col1 col2 col3
0 A 1 5
1 A 1 5
2 A 1 2
3 A 1 2
4 B 2 2
5 B 3 3
My expected output
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
I tried the code below, but it returns the last row of each group. Instead, I want to sort by col3 and keep the group with the max col3.
df.drop_duplicates(keep='last', subset=['col1','col2','col3'])
col1 col2 col3
1 A 1 5
3 A 1 2
4 B 2 2
5 B 3 3
For Example: Here I want to drop 1st group because 2 < 5, so I want to keep the group with col3 as 5
df.sort_values(by=['col1', 'col2', 'col3'], ascending=False)
a_group = df.groupby(['col1', 'col2', 'col3'])
for name, group in a_group:
    group = group.reset_index(drop=True)
    print(group)
col1 col2 col3
0 A 1 2
1 A 1 2
col1 col2 col3
0 A 1 5
1 A 1 5
col1 col2 col3
0 B 2 2
col1 col2 col3
0 B 3 3
You can't group on all columns, since the column whose max you want to keep has different values within a group. Instead, leave that column out of the grouping and group on the others:
col_to_max = 'col3'
i = df.columns.difference([col_to_max])
out = df[df[col_to_max] == df.groupby(list(i))[col_to_max].transform('max')]
print(out)
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
So we can do
out = df[df.col3==df.groupby(['col1','col2'])['col3'].transform('max')]
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
I believe you can use groupby with nlargest(2). Also make sure that your 'col3' is a numerical one.
>>> df['col3'] = df['col3'].astype(int)
>>> df.groupby(['col1','col2'])['col3'].nlargest(2).reset_index().drop('level_2',axis=1)
col1 col2 col3
0 A 1 5
1 A 1 5
2 B 2 2
3 B 3 3
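The remark about col3 being numeric also matters for the `transform('max')` comparison, because strings compare lexicographically (for example, `'10' < '5'`). A minimal sketch, assuming the sample frame from the question and using `pd.to_numeric` for the conversion; `group_max` is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "A", "A", "A", "B", "B"],
    "col2": ["1", "1", "1", "1", "2", "3"],
    "col3": ["5", "5", "2", "2", "2", "3"],
})

# convert first so that max() compares numbers, not strings
df["col3"] = pd.to_numeric(df["col3"])

# broadcast each (col1, col2) group's max back onto its rows
group_max = df.groupby(["col1", "col2"])["col3"].transform("max")
out = df[df["col3"] == group_max]
print(out)
```

Unlike the `nlargest` variant, this keeps the original row index and every row that ties for the max.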
You can get the index of the rows whose col3 is not the group max, get the index of the duplicated rows, and drop the intersection:
ind = df.assign(max = df.groupby("col1")["col3"].transform("max")).query("max != col3").index
ind2 = df[df.duplicated(keep=False)].index
df.drop(set(ind).intersection(ind2))
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3

Shuffle pandas columns

I have the following data frame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
2 7 8 9 2
and I would like to have a shuffled output like :
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
How to achieve this?
Use DataFrame.sample with axis=1:
df = df.sample(frac=1, axis=1)
If the last column should keep its position:
a = df.columns[:-1].to_numpy()
np.random.shuffle(a)
print (a)
['Col3' 'Col1' 'Col2']
df = df[np.append(a, ['Type'])]
print (df)
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
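For a reproducible shuffle that leaves the last column in place, something like the following may work. It is a sketch that assumes NumPy's `Generator` API; the seed and the names `rng`, `shuffled` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 4, 7],
    "Col2": [2, 5, 8],
    "Col3": [3, 6, 9],
    "Type": [1, 1, 2],
})

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability
# permute every column label except the last one
shuffled = rng.permutation(df.columns[:-1].to_numpy())
# reassemble with "Type" pinned at the end
result = df[[*shuffled, "Type"]]
print(result)
```

Only the column order changes; each column keeps its own values.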

Fetch column value based on dynamic input

I have a dataframe in which one column contains, for each row, the name of the column that satisfies certain conditions.
For example, the columns of the dataframe are Index, Col1, Col2, Col3, Col_Name, where Col_Name holds either Col1, Col2 or Col3 for each row.
Now, in a new column, say Col_New, I want the value from the column that Col_Name points to: if Col_Name in the 5th row says Col1, then Col_New should hold the value of Col1 in the 5th row.
I am sorry I cannot post the code I am working on, hence gave this hypothetical example.
Obliged for any help, thanks.
IIUC you could use:
df['col_new'] = df.reset_index().apply(lambda x: df.at[x['index'], x['col_name']], axis=1)
Example:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
Col1 Col2 Col3 Col_Name
0 0.833988 0.939254 0.256450 Col2
1 0.675909 0.609494 0.641944 Col3
2 0.877474 0.971299 0.218273 Col3
3 0.201189 0.265742 0.800580 Col2
4 0.397945 0.135153 0.941313 Col2
5 0.666252 0.697983 0.164768 Col2
6 0.863377 0.839421 0.601316 Col2
7 0.138975 0.731359 0.379258 Col3
8 0.412148 0.541033 0.197861 Col2
9 0.980040 0.506752 0.823274 Col3
df['Col_New'] = df.reset_index().apply(lambda x: df.at[x['index'], x['Col_Name']], axis=1)
[out]
Col1 Col2 Col3 Col_Name Col_New
0 0.833988 0.939254 0.256450 Col2 0.939254
1 0.675909 0.609494 0.641944 Col3 0.641944
2 0.877474 0.971299 0.218273 Col3 0.218273
3 0.201189 0.265742 0.800580 Col2 0.265742
4 0.397945 0.135153 0.941313 Col2 0.135153
5 0.666252 0.697983 0.164768 Col2 0.697983
6 0.863377 0.839421 0.601316 Col2 0.839421
7 0.138975 0.731359 0.379258 Col3 0.379258
8 0.412148 0.541033 0.197861 Col2 0.541033
9 0.980040 0.506752 0.823274 Col3 0.823274
Example 2 (based on integer col references)
cols = [1, 2, 3]
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df[13] = np.random.choice(cols, 10)
print(df)
1 2 3 13
0 0.548814 0.715189 0.602763 3
1 0.544883 0.423655 0.645894 3
2 0.437587 0.891773 0.963663 1
3 0.383442 0.791725 0.528895 3
4 0.568045 0.925597 0.071036 1
5 0.087129 0.020218 0.832620 1
6 0.778157 0.870012 0.978618 1
7 0.799159 0.461479 0.780529 2
8 0.118274 0.639921 0.143353 2
9 0.944669 0.521848 0.414662 3
Instead use:
df['Col_New'] = df.reset_index().apply(lambda x: df.at[int(x['index']), int(x[13])], axis=1)
1 2 3 13 Col_New
0 0.548814 0.715189 0.602763 3 0.602763
1 0.544883 0.423655 0.645894 3 0.645894
2 0.437587 0.891773 0.963663 1 0.437587
3 0.383442 0.791725 0.528895 3 0.528895
4 0.568045 0.925597 0.071036 1 0.568045
5 0.087129 0.020218 0.832620 1 0.087129
6 0.778157 0.870012 0.978618 1 0.778157
7 0.799159 0.461479 0.780529 2 0.461479
8 0.118274 0.639921 0.143353 2 0.639921
9 0.944669 0.521848 0.414662 3 0.414662
Using the example DataFrame from Chris A.
You could do it like this:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
df['Col_New'] = [df.loc[df.index[i], j] for i, j in enumerate(df.Col_Name)]
print(df)
pandas has DataFrame.lookup for this (deprecated in pandas 1.2 and removed in 2.0). It seems to need the same types for the column labels and the lookup column, so one option is to convert both to strings:
np.random.seed(123)
cols = [1, 2, 3]
df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=cols).rename(columns=str)
df['Col_Name'] = np.random.choice(cols, 5)
df['Col_New'] = df.lookup(df.index, df['Col_Name'].astype(str))
print(df)
1 2 3 Col_Name Col_New
0 2 2 6 3 6
1 1 3 9 2 3
2 6 1 0 1 6
3 1 9 0 1 1
4 0 9 3 1 0
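On pandas versions without `DataFrame.lookup`, the same per-row pick can be sketched with `pd.factorize` and NumPy integer indexing. The small frame below is made up for illustration, and `codes`, `labels` and `values` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 4, 7],
    "Col2": [2, 5, 8],
    "Col3": [3, 6, 9],
    "Col_Name": ["Col2", "Col3", "Col1"],
})

# factorize maps each label to an integer code and returns the unique labels
codes, labels = pd.factorize(df["Col_Name"])
# put the value columns in the order of the unique labels
values = df.reindex(labels, axis=1).to_numpy()
# pick, for every row, the cell in the column its code points to
df["Col_New"] = values[np.arange(len(df)), codes]
print(df)
```

This avoids a Python-level loop over the rows, which `apply(..., axis=1)` performs internally.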

How to perform arithmetic operations with specific elements of a dataframe?

I am trying to understand how to perform arithmetic operations on a dataframe in python.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':[2,38,7,5],'col2':[1,3,2,4]})
print(df.sum())
This is what I'm getting (in terms of the output), but I want to have more control over which sum I am getting.
col1 52
col2 10
dtype: int64
Just wondering how I would add individual elements in the dataframe together.
Your question is not very clear but still I will try to cover all possible scenarios,
Input:
df
col1 col2
0 2 1
1 38 3
2 7 2
3 5 4
If you want the sum of columns,
df.sum(axis = 0)
Output:
col1 52
col2 10
dtype: int64
If you want the sum of rows,
df.sum(axis = 1)
0 3
1 41
2 9
3 9
dtype: int64
If you want to add a list of numbers into a column,
num = [1, 2, 3, 4]
df['col1'] = df['col1'] + num
df
Output:
col1 col2
0 3 1
1 40 3
2 10 2
3 9 4
If you want to add a list of numbers into a row,
num = [1, 2]
df.loc[0] = df.loc[0] + num
df
Output:
col1 col2
0 3 3
1 38 3
2 7 2
3 5 4
If you want to add a single number to a column,
df['col1'] = df['col1'] + 2
df
Output:
col1 col2
0 4 1
1 40 3
2 9 2
3 7 4
If you want to add a single number to a row,
df.loc[0] = df.loc[0] + 2
df
Output:
col1 col2
0 4 3
1 38 3
2 7 2
3 5 4
If you want to add a number to a single element (row i, column j),
df.iloc[1,1] = df.iloc[1,1] + 5
df
Output:
col1 col2
0 2 1
1 38 8
2 7 2
3 5 4
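To add individual elements to each other rather than to a constant, the cells can be addressed by label. A small sketch on the question's frame; `cell_sum` and `block_sum` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 38, 7, 5], "col2": [1, 3, 2, 4]})

# add two single cells, addressed by row label and column name
cell_sum = df.at[0, "col1"] + df.at[1, "col2"]  # 2 + 3

# sum a chosen block of rows and columns
block_sum = df.loc[[0, 2], ["col1", "col2"]].to_numpy().sum()  # 2 + 1 + 7 + 2

print(cell_sum, block_sum)
```

`.at` is meant for scalar access, while `.loc` with lists selects a sub-frame that can then be reduced.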

Pandas: Calculate Median of Group over Columns

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', 'A','A','A','B','B'],
'COL2' : ['AA','AA','BB','BB','BB','BB'],
'COL3' : [2,3,4,5,4,2],
'COL4' : [0,1,2,3,4,2]})
df
COL1 COL2 COL3 COL4
0 A AA 2 0
1 A AA 3 1
2 A BB 4 2
3 A BB 5 3
4 B BB 4 4
5 B BB 2 2
I would like, as efficiently as possible (i.e. via groupby and lambda x or better), to find the median of columns 3 and 4 for each distinct group of columns 1 and 2.
The desired result is as follows:
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3
5 B BB 2 2 3
Thanks in advance!
You already had the idea -- groupby COL1 and COL2 and calculate median.
import numpy as np
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(np.median)
m.name = 'MEDIAN'
print(df.join(m, on=['COL1', 'COL2']))
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3.0
5 B BB 2 2 3.0
df.groupby(['COL1', 'COL2']).median()[['COL3','COL4']]
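Another way to pool COL3 and COL4 into one median per group is to reshape to long form first. This is a sketch, assuming `melt` is acceptable for the data size; `medians` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "COL1": ["A", "A", "A", "A", "B", "B"],
    "COL2": ["AA", "AA", "BB", "BB", "BB", "BB"],
    "COL3": [2, 3, 4, 5, 4, 2],
    "COL4": [0, 1, 2, 3, 4, 2],
})

# stack COL3 and COL4 into a single "value" column, then take one median per group
medians = (
    df.melt(id_vars=["COL1", "COL2"], value_vars=["COL3", "COL4"])
      .groupby(["COL1", "COL2"])["value"]
      .median()
      .rename("MEDIAN")
)
# attach the group median to every original row
result = df.join(medians, on=["COL1", "COL2"])
print(result)
```

For the sample data this gives 1.5 for (A, AA), 3.5 for (A, BB) and 3.0 for (B, BB), matching the desired result.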

Resources