I have data like below:
col1 col2
A 0
A 1
B 0
A 0
C 1
B 1
C 0
I like it as below:
col1 col2 col3 col4
A 0 .33 .33
A 1 .33 .33
B 0 .5 .33
A 0 .33 .33
C 1 .5 .33
B 1 .5 .33
C 0 .5 .33
Col 1 contains categories of values. Col 2 contains events, i.e. 0 = no, 1 = yes.
Col 3 should be the event rate of the category, i.e.
(number of times the category has value 1) / (total number of occurrences of that category).
Col 4 should be the event share of the category, i.e.
(number of times the category has value 1) / (total number of 1s across all categories);
e.g. col 4 for A should be the number of 1s in A divided by the total number of 1s across categories A, B & C together.
Can anyone please help?
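For reference, a minimal sketch constructing the sample data shown above:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'A', 'C', 'B', 'C'],
                   'col2': [0, 1, 0, 0, 1, 1, 0]})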
Use GroupBy.transform with 'mean' and 'sum'; for the second column, divide by the sum of all 1s in df['col2']:
df['col3'] = df.groupby('col1')['col2'].transform('mean')
df['col4'] = df.groupby('col1')['col2'].transform('sum').div(df['col2'].sum())
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
Another solution aggregates with GroupBy.agg and joins the result back on col1:
df = df.join(df.groupby('col1')['col2'].agg([('col3','mean'),('col4','sum')])
.assign(col4 = lambda x: x['col4'].div(df['col2'].sum())), on='col1')
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
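As a quick sanity check on those numbers (an illustrative snippet, not part of either answer above): category A has one 1 across three rows, so col3 = 1/3, and one of the three 1s overall, so col4 = 1/3:
ones = df.loc[df['col2'].eq(1), 'col1'].value_counts()  # count of 1s per category
total = df['col1'].value_counts()                       # rows per category
print(ones / total)       # event rate  -> col3
print(ones / ones.sum())  # event share -> col4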
I'd like to reshape a dataframe so that its first column is used to group the other columns under an additional header row.
Initial dataframe
df = pd.DataFrame(
{
'col1':['A','A','A','B','B','B'],
'col2':[1,2,3,4,5,6],
'col3':[1,2,3,4,5,6],
'col4':[1,2,3,4,5,6],
'colx':[1,2,3,4,5,6]
}
)
Trial:
Using pd.pivot() I can create something close, but it does not match my expected output; the grouping seems flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the MultiIndex columns with DataFrame.swaplevel, sort them, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
.pivot(index='g', columns='col1')
.swaplevel(0,1,axis=1)
.sort_index(axis=1)
.rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
for k,d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Given a dataset df as follows:
type module item value input
0 A a item1 2 1
1 A a item2 3 0
2 A aa item3 4 1
3 A aa item4 3 0
4 A aa item5 1 -1
5 B b item1 5 0
6 B b item2 1 -1
7 B bb item3 3 0
8 B bb item4 3 1
9 B bb item5 4 0
I need to calculate a pct column based on the following logic: first, only rows whose input is 0 or 1 count as valid values. Then, grouping by type and module, I calculate each row's share of the group sum. For example, the pct of the first row, A-a-item1, is 2/(2 + 3) = 0.4; A-aa-item1 is 4/(4 + 3) = 0.57, not divided by 8, since the input for A-aa-item3 is -1 and it is therefore excluded. The sum column in df2 is the group-wise sum of pct after grouping by type and module.
df1:
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item1 4 1 0.571429
3 A aa item2 3 0 0.000000
4 A aa item3 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item1 3 0 0.000000
8 B bb item2 3 1 0.300000
9 B bb item3 4 0 0.000000
df2:
type module sum
0 A a 0.40
1 A aa 0.57
2 B b 0.00
3 B bb 0.30
How could I get similar results based on the given dataset? Thanks.
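A minimal sketch reconstructing the sample dataset, assuming the columns shown above:
import pandas as pd

df = pd.DataFrame({'type': ['A'] * 5 + ['B'] * 5,
                   'module': ['a', 'a', 'aa', 'aa', 'aa', 'b', 'b', 'bb', 'bb', 'bb'],
                   'item': ['item1', 'item2', 'item3', 'item4', 'item5'] * 2,
                   'value': [2, 3, 4, 3, 1, 5, 1, 3, 3, 4],
                   'input': [1, 0, 1, 0, -1, 0, -1, 0, 1, 0]})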
Replace values that do not match the conditions: use Series.eq to keep value only where input equals 1 (else 0) for the numerator, and Series.isin to keep value only where input is 0 or 1 (else 0) for the denominator. Instead of aggregating, GroupBy.transform with 'sum' fills a column with the group sums, which is then used as the divisor via Series.div:
s1 = df['value'].where(df['input'].eq(1), 0)
s2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0))
.groupby(['type','module'])['value'].transform('sum'))
df['pct'] = s1.div(s2)
print (df)
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item3 4 1 0.571429
3 A aa item4 3 0 0.000000
4 A aa item5 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item3 3 0 0.000000
8 B bb item4 3 1 0.300000
9 B bb item5 4 0 0.000000
For the second DataFrame, add two helper columns with DataFrame.assign, aggregate with sum, and finally divide, using DataFrame.pop to consume and drop the value column in one step:
df2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0),
pct = df['value'].where(df['input'].eq(1), 0))
.groupby(['type','module'])[['value','pct']]
.sum()
.assign(pct = lambda x: x['pct'].div(x.pop('value')))
.reset_index())
print (df2)
type module pct
0 A a 0.400000
1 A aa 0.571429
2 B b 0.000000
3 B bb 0.300000
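The x.pop('value') inside assign both returns the value column for the division and removes it from the result. A more explicit, step-by-step sketch of the same computation:
g = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0),
               pct = df['value'].where(df['input'].eq(1), 0))
       .groupby(['type','module'])[['value','pct']]
       .sum())
g['pct'] = g['pct'].div(g['value'])   # divide the group sums
df2 = g.drop(columns='value').reset_index()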
I have a dataframe like so:
id date status value
1 2009-06-17 1 NaN
1 2009-07-17 B NaN
1 2009-08-17 A NaN
1 2009-09-17 5 NaN
1 2009-10-17 0 0.55
2 2010-07-17 B NaN
2 2010-08-17 A NaN
2 2010-09-17 0 0.00
Now I want to group by id and then check if value becomes non-zero after status changes to A. So for the group with id=1, status does change to A and after that (in terms of date) value also becomes non-zero. But for the group with id=2, even after status changes to A, value does not become non-zero. Note that if status never changes to A, I don't need to check value at all.
So finally I want a new dataframe like this:
id check
1 True
2 False
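A sketch reconstructing the sample; the answer below extends it with an id=3 group whose status never reaches A, to show that such ids are kept with check=False (status is assumed to be stored as strings):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
                   'date': ['2009-06-17', '2009-07-17', '2009-08-17', '2009-09-17',
                            '2009-10-17', '2010-07-17', '2010-08-17', '2010-09-17',
                            '2010-08-17', '2010-09-17'],
                   'status': ['1', 'B', 'A', '5', '0', 'B', 'A', '0', 'R', '0'],
                   'value': [np.nan, np.nan, np.nan, np.nan, 0.55,
                             np.nan, np.nan, 0.00, np.nan, 0.00]})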
Use:
print (df)
id date status value
0 1 2009-06-17 1 NaN
1 1 2009-07-17 B NaN
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
5 2 2010-07-17 B NaN
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
8 3 2010-08-17 R NaN
9 3 2010-09-17 0 0.00
idx = df['id'].unique()
#mask rows where status is 'A'
m = df['status'].eq('A')
#keep all rows from the first 'A' onwards, per group
df1 = df[m.groupby(df['id']).cumsum().gt(0)]
print (df1)
id date status value
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
#compare with 0, test that no value in the group equals 0, then add back all possible ids with reindex
df2 = (df1['value'].ne(0)
       .groupby(df1['id'])
       .all()
       .reindex(idx, fill_value=False)
       .reset_index(name='check'))
print (df2)
id check
0 1 True
1 2 False
2 3 False
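If readability matters more than speed, a sketch of the same logic with GroupBy.apply (one Python call per group; note that NaN != 0 counts as non-zero here, exactly as in the vectorized version):
def check(g):
    # values at and after the first 'A' status; empty if 'A' never occurs
    after_a = g.loc[g['status'].eq('A').cumsum().gt(0), 'value']
    return len(after_a) > 0 and bool(after_a.ne(0).all())

df2 = df.groupby('id').apply(check).reset_index(name='check')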
How can I replace the first row's pct value with NaN for each city and district group? Thank you.
city district date pct
0 a b 2019/8/1 0.15
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 0.03
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
I can only get the first row's pct value of the whole dataframe with df['pct'].iloc[0].
My desired output will like this:
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Use Series.where + DataFrame.duplicated:
df['pct'] = df['pct'].where(df.duplicated(subset=['city','district']))
print(df)
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Detail:
df.duplicated(subset = ['city','district'])
0 False
1 True
2 True
3 False
4 True
5 True
dtype: bool
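An equivalent mask can be built with GroupBy.cumcount, which is 0 exactly on each group's first row (a sketch):
mask = df.groupby(['city','district']).cumcount().eq(0)  # True on each group's first row
df['pct'] = df['pct'].mask(mask)  # Series.mask replaces True positions with NaN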
I have a data frame as below:
df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A', 'A'],
"Col2" : [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
"Col3" : [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
"y" : [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the data frame (a 60:40 split) such that, within each Col1 group, the first 60% of rows go to train and the last 40% to test?
Train :
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I feel like you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # position of each row within its group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 marks the first 60% of each group
Train = df.loc[s == 0].copy()
Test = df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
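The same per-group split can also be written with GroupBy.apply and head (a sketch; sort_index restores the original row order):
train = (df.groupby('Col1', group_keys=False)
           .apply(lambda g: g.head(int(len(g) * 0.6)))
           .sort_index())
test = df.drop(train.index)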
If you need a split without regard to groups:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
EDIT: If you need a split per group, create a threshold and filter with GroupBy.cumcount:
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print (thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
print(test)
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
IIUC, you can use numpy.split:
import numpy as np
train, test = np.split(df, [int(len(df) * 0.6)])
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
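If scikit-learn is available, train_test_split with shuffle=False produces the same positional 60:40 split (a sketch):
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.4, shuffle=False)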