data manipulation in pandas - python-3.x

I have data like below:
col1 col2
A 0
A 1
B 0
A 0
C 1
B 1
C 0
I would like it as below:
col1 col2 col3 col4
A 0 .33 .33
A 1 .33 .33
B 0 .5 .33
A 0 .33 .33
C 1 .5 .33
B 1 .5 .33
C 0 .5 .33
col1 contains categories of values. col2 contains events, i.e. 0 = no, 1 = yes.
col3 should be the event rate of the category, i.e.
(number of times the category has value 1) / (total number of occurrences of that category).
col4 should be the event share of the category, i.e.
(number of times the category has value 1) / (total number of 1s across all categories);
e.g. col4 for A should be the number of 1s in A divided by the total number of 1s across categories A, B and C together.
Can anyone please help?
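For anyone who wants to reproduce this, the sample input can be built roughly like this (a minimal sketch using the column names and values from the question):
import pandas as pd

# sample data rebuilt from the table in the question
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'A', 'C', 'B', 'C'],
                   'col2': [0, 1, 0, 0, 1, 1, 0]})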

Use GroupBy.transform with 'mean' and 'sum'; for the second column, divide the per-group sums by the total count of 1s in df['col2']:
df['col3'] = df.groupby('col1')['col2'].transform('mean')
df['col4'] = df.groupby('col1')['col2'].transform('sum').div(df['col2'].sum())
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
Another solution aggregates with GroupBy.agg and joins the result back to the original:
df = df.join(df.groupby('col1')['col2'].agg([('col3','mean'),('col4','sum')])
               .assign(col4 = lambda x: x['col4'].div(df['col2'].sum())), on='col1')
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333

Related

How to reshape dataframe by creating a grouping multiheader from specific column?

I would like to reshape a dataframe so that its first column is used to group the other columns under an additional header row.
Initial dataframe:
df = pd.DataFrame(
    {
        'col1': ['A','A','A','B','B','B'],
        'col2': [1,2,3,4,5,6],
        'col3': [1,2,3,4,5,6],
        'col4': [1,2,3,4,5,6],
        'colx': [1,2,3,4,5,6]
    }
)
Trial:
Using pd.pivot() I can create something close, but it does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the column MultiIndex with DataFrame.swaplevel, sort it, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
        .pivot(index='g', columns='col1')
        .swaplevel(0,1,axis=1)
        .sort_index(axis=1)
        .rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6

Groupby multiple columns and calculate percentage of sums in Pandas

Given a dataset df as follows:
type module item value input
0 A a item1 2 1
1 A a item2 3 0
2 A aa item3 4 1
3 A aa item4 3 0
4 A aa item5 1 -1
5 B b item1 5 0
6 B b item2 1 -1
7 B bb item3 3 0
8 B bb item4 3 1
9 B bb item5 4 0
I need to calculate pct based on the following logic: first, only rows whose input is 0 or 1 count as valid values. Then I need to group by type and module and calculate the percentage of the sum; for example, the pct of the first row A-a-item1 is calculated as 2 / (2 + 3) = 0.4, and A-aa-item3 as 4 / (4 + 3) = 0.57, not divided by 8, because the input value for A-aa-item5 is -1 and is therefore excluded. The sum column in df2 is calculated by grouping by type and module and taking the sum of pct.
df1:
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item3 4 1 0.571429
3 A aa item4 3 0 0.000000
4 A aa item5 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item3 3 0 0.000000
8 B bb item4 3 1 0.300000
9 B bb item5 4 0 0.000000
df2:
type module sum
0 A a 0.40
1 A aa 0.57
2 B b 0.00
3 B bb 0.30
How could I get similar results based on the given dataset? Thanks.
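Assuming the dataset is constructed as below (a sketch built from the table in the question):
import pandas as pd

# sample data rebuilt from the table in the question
df = pd.DataFrame({'type':   ['A','A','A','A','A','B','B','B','B','B'],
                   'module': ['a','a','aa','aa','aa','b','b','bb','bb','bb'],
                   'item':   ['item1','item2','item3','item4','item5',
                              'item1','item2','item3','item4','item5'],
                   'value':  [2, 3, 4, 3, 1, 5, 1, 3, 3, 4],
                   'input':  [1, 0, 1, 0, -1, 0, -1, 0, 1, 0]})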
Replace values that do not match the conditions: use Series.where with Series.eq to keep value only where input equals 1, and Series.isin to keep value only where input is 0 or 1. Instead of a plain aggregation, use GroupBy.transform with 'sum' so the group sums are aligned with the original rows, and then divide with Series.div:
s1 = df['value'].where(df['input'].eq(1), 0)
s2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0))
        .groupby(['type','module'])['value'].transform('sum'))
df['pct'] = s1.div(s2)
print (df)
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item3 4 1 0.571429
3 A aa item4 3 0 0.000000
4 A aa item5 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item3 3 0 0.000000
8 B bb item4 3 1 0.300000
9 B bb item5 4 0 0.000000
For the second DataFrame, add two helper columns with DataFrame.assign, aggregate with sum, and finally divide, using DataFrame.pop to both use and drop the value column:
df2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0),
                 pct = df['value'].where(df['input'].eq(1), 0))
         .groupby(['type','module'])[['value','pct']]
         .sum()
         .assign(pct = lambda x: x['pct'].div(x.pop('value')))
         .reset_index())
print (df2)
type module pct
0 A a 0.400000
1 A aa 0.571429
2 B b 0.000000
3 B bb 0.300000

Pandas groupby and conditional check on multiple columns

I have a dataframe like so:
id date status value
1 2009-06-17 1 NaN
1 2009-07-17 B NaN
1 2009-08-17 A NaN
1 2009-09-17 5 NaN
1 2009-10-17 0 0.55
2 2010-07-17 B NaN
2 2010-08-17 A NaN
2 2010-09-17 0 0.00
Now I want to group by id and then check whether value becomes non-zero after status changes to A. So for the group with id=1, status does change to A and after that (in terms of date) the value also becomes non-zero. But for the group with id=2, even after status changes to A, value does not become non-zero. Please note that if status never changes to A, I don't even need to check value.
So finally I want a new dataframe like this:
id check
1 True
2 False
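A sketch of the question's frame, for anyone who wants to reproduce it (column names and values taken from the table above; status is kept as strings here):
import pandas as pd

# sample data rebuilt from the table in the question
df = pd.DataFrame({'id':     [1, 1, 1, 1, 1, 2, 2, 2],
                   'date':   ['2009-06-17', '2009-07-17', '2009-08-17', '2009-09-17',
                              '2009-10-17', '2010-07-17', '2010-08-17', '2010-09-17'],
                   'status': ['1', 'B', 'A', '5', '0', 'B', 'A', '0'],
                   'value':  [None, None, None, None, 0.55, None, None, 0.00]})  # None becomes NaN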
Use the following; the sample below is extended with an id=3 group that never reaches status A, to show how ids filtered out along the way are restored by reindex:
print (df)
id date status value
0 1 2009-06-17 1 NaN
1 1 2009-07-17 B NaN
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
5 2 2010-07-17 B NaN
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
8 3 2010-08-17 R NaN
9 3 2010-09-17 0 0.00
# remember all ids so missing groups can be added back at the end
idx = df['id'].unique()
# mask rows where status is A
m = df['status'].eq('A')
# keep only rows from the first A onwards within each group
df1 = df[m.groupby(df['id']).cumsum().gt(0)]
print (df1)
id date status value
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
# compare with 0, test that no value in the group equals 0,
# and finally add back all possible ids with reindex
df2 = (df1['value'].ne(0)
          .groupby(df1['id'])
          .all()
          .reindex(idx, fill_value=False)
          .reset_index(name='check'))
print (df2)
id check
0 1 True
1 2 False
2 3 False

Replace given column's first row value with NaN for each group in Pandas

How can I replace the first row's value of pct with NaN for each group of city and district? Thank you.
city district date pct
0 a b 2019/8/1 0.15
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 0.03
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
I can only get the first row's pct value of the whole dataframe with df['pct'].iloc[0].
My desired output looks like this:
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
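Assuming the frame is built roughly like this (a sketch of the data shown above, with dates kept as strings):
import pandas as pd

# sample data rebuilt from the table in the question
df = pd.DataFrame({'city':     ['a', 'a', 'a', 'c', 'c', 'c'],
                   'district': ['b', 'b', 'b', 'd', 'd', 'd'],
                   'date':     ['2019/8/1', '2019/9/1', '2019/10/1',
                                '2019/7/1', '2019/8/1', '2019/9/1'],
                   'pct':      [0.15, 0.12, 0.25, 0.03, -0.36, 0.57]})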
Use Series.where + DataFrame.duplicated:
df['pct'] = df['pct'].where(df.duplicated(subset=['city','district']))
print(df)
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Detail: df.duplicated marks the first row of each city/district group as False, and Series.where keeps pct only where the mask is True, so those first rows become NaN:
df.duplicated(subset = ['city','district'])
0 False
1 True
2 True
3 False
4 True
5 True
dtype: bool

Train test split based on a column's values - sequentially

I have a data frame as below:
df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the data frame (a 60:40 split) such that the first 60% of the values of each group in Col1 become the train set and the last 40% become the test set?
Train:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I feel like you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # running counter within each group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 for the first 60% of each group, >= 1 afterwards
Train = df.loc[s == 0].copy()
Test = df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
If you need the split without considering groups:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
EDIT: If you need the split per group, create the threshold and filter with GroupBy.cumcount:
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print (thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
print(test)
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
IIUC, you can use numpy.split:
import numpy as np
train, test = np.split(df, [int(len(df) * 0.6)])
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
