I have a DataFrame as below:
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the DataFrame (a 60:40 split) such that, within each group of Col1, the first 60% of the rows go to train and the last 40% to test?
Train :
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I feel like you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # position of each row within its Col1 group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 marks the first 60% of each group
Train = df.loc[s == 0].copy()
Test = df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
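A compact variant of the same idea, written as a boolean mask (a sketch assuming the df defined above); unlike a single global threshold, it also works when the Col1 groups have unequal sizes:
keep = (df.groupby('Col1')['Col1'].transform('size') * 0.6).astype(int)  # rows to keep per group
mask = df.groupby('Col1').cumcount() < keep
Train = df[mask]
Test = df[~mask]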
If you need the split without groups:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
EDIT: If you need the split per group, create a threshold with GroupBy.cumcount and filter (this assumes the groups are of equal size):
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print (thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
print(test)
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
IIUC, you can use numpy.split:
import numpy as np
train, test = np.split(df, [int(len(df) * 0.6)])
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
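If scikit-learn is already in use, the same positional 60:40 split can be written with train_test_split and shuffling turned off (a sketch assuming scikit-learn is installed; note this does not split per Col1 group):
from sklearn.model_selection import train_test_split

# shuffle=False keeps the original row order, so the first 60% of rows become the train set
train, test = train_test_split(df, train_size=0.6, shuffle=False)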
I have a dataframe that looks like
Date col_1 col_2 col_3
2022-08-20 5 B 1
2022-07-21 6 A 1
2022-07-20 2 A 1
2022-06-15 5 B 1
2022-06-11 3 C 1
2022-06-05 5 C 2
2022-06-01 3 B 2
2022-05-21 6 A 1
2022-05-13 6 A 0
2022-05-10 2 B 3
2022-04-11 2 C 3
2022-03-16 5 A 3
2022-02-20 5 B 1
and I want to add a new column col_new that cumulatively counts the rows with the same elements in col_1 and col_2, excluding the row itself, where the element in col_3 is 1. The desired output would look like
Date col_1 col_2 col_3 col_new
2022-08-20 5 B 1 3
2022-07-21 6 A 1 2
2022-07-20 2 A 1 1
2022-06-15 5 B 1 2
2022-06-11 3 C 1 1
2022-06-05 5 C 2 0
2022-06-01 3 B 2 0
2022-05-21 6 A 1 1
2022-05-13 6 A 0 0
2022-05-10 2 B 3 0
2022-04-11 2 C 3 0
2022-03-16 5 A 3 0
2022-02-20 5 B 1 1
And here's what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
list_col_3_is_1 = (df
                   .assign(Date=Date)
                   .sort_values('Date', ascending=True)
                   ['col_3'].eq(1))
df['col_new'] = (list_col_3_is_1.groupby(df[['col_1','col_2']])
                                .apply(lambda g: g.shift(1, fill_value=0).cumsum()))
But then I got the following error: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Thanks in advance.
Your solution should be changed to pass the grouping keys as a list of Series rather than a DataFrame:
df['col_new'] = list_col_3_is_1.groupby([df['col_1'],df['col_2']]).cumsum()
print (df)
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
Assuming you already have the rows sorted in the desired order, you can use:
df['col_new'] = (df[::-1].assign(n=df['col_3'].eq(1))
                         .groupby(['col_1', 'col_2'])['n']
                         .cumsum())
Output:
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
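If the rows are not guaranteed to be in reverse date order, a variant sketch that sorts by the parsed dates first (assuming the Date strings parse with pd.to_datetime) produces the same column:
order = pd.to_datetime(df['Date']).sort_values().index   # oldest to newest
tmp = df.loc[order]                                      # rows in date order
df['col_new'] = (tmp['col_3'].eq(1)
                    .groupby([tmp['col_1'], tmp['col_2']])
                    .cumsum())                           # assignment aligns back to df by index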
I would like to reshape a DataFrame so that its first column is used to group the other columns under an additional header row.
Initial dataframe
df = pd.DataFrame(
{
'col1':['A','A','A','B','B','B'],
'col2':[1,2,3,4,5,6],
'col3':[1,2,3,4,5,6],
'col4':[1,2,3,4,5,6],
'colx':[1,2,3,4,5,6]
}
)
Trial:
Using pd.pivot() I can get close, but it does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the column MultiIndex with DataFrame.swaplevel, sort it, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
        .pivot(index='g', columns='col1')
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1)
        .rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
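For reference, a similar reshaping can also be sketched with set_index plus unstack instead of pivot (assuming the df from the question); up to the leftover axis names, which rename_axis removes as shown above, the result is the same:
out = (df.set_index([df.groupby('col1').cumcount(), 'col1'])
         .unstack('col1')       # columns become (value column, col1 category)
         .swaplevel(axis=1)     # move the category to the outer level
         .sort_index(axis=1))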
Given a dataset df as follows:
type module item value input
0 A a item1 2 1
1 A a item2 3 0
2 A aa item3 4 1
3 A aa item4 3 0
4 A aa item5 1 -1
5 B b item1 5 0
6 B b item2 1 -1
7 B bb item3 3 0
8 B bb item4 3 1
9 B bb item5 4 0
I need to calculate a sum of pct based on the following logic: first, only values whose input is 0 or 1 are treated as valid. Then I group by type and module to calculate each row's percentage of the group sum; for example, the pct of the first row, A-a-item1, is calculated as 2 / (2 + 3) = 0.4, and A-aa-item1 as 4 / (4 + 3) = 0.57, not divided by 8, since the input value of A-aa-item3 is -1 and it is therefore excluded. The sum column in df2 is calculated by grouping by type and module and summing pct.
df1:
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item1 4 1 0.571429
3 A aa item2 3 0 0.000000
4 A aa item3 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item1 3 0 0.000000
8 B bb item2 3 1 0.300000
9 B bb item3 4 0 0.000000
df2:
type module sum
0 A a 0.40
1 A aa 0.57
2 B b 0.00
3 B bb 0.30
How could I get similar results based on the given dataset? Thanks.
You can replace the values that do not match the conditions: compare input with 1 using Series.eq for the numerator, and check for membership in (0, 1) with Series.isin for the denominator. Instead of an aggregation, GroupBy.transform with sum fills a new column with the group totals, and Series.div performs the division:
s1 = df['value'].where(df['input'].eq(1), 0)
s2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0))
        .groupby(['type','module'])['value'].transform('sum'))
df['pct'] = s1.div(s2)
print (df)
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item3 4 1 0.571429
3 A aa item4 3 0 0.000000
4 A aa item5 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item3 3 0 0.000000
8 B bb item4 3 1 0.300000
9 B bb item5 4 0 0.000000
For the second DataFrame, two new columns are added with DataFrame.assign and aggregated with sum; the final division uses DataFrame.pop to consume and drop the value column:
df2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0),
                 pct = df['value'].where(df['input'].eq(1), 0))
         .groupby(['type','module'])[['value','pct']]
         .sum()
         .assign(pct = lambda x: x['pct'].div(x.pop('value')))
         .reset_index())
print (df2)
type module pct
0 A a 0.400000
1 A aa 0.571429
2 B b 0.000000
3 B bb 0.300000
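Once the pct column from the first snippet exists, df2 can also be obtained directly by summing it per group (a sketch reusing that column):
df2 = (df.groupby(['type', 'module'], as_index=False)['pct']
         .sum()
         .round(2))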
Given a small dataset as follows:
value input
0 3 0
1 4 1
2 3 -1
3 2 1
4 3 -1
5 5 0
6 1 0
7 1 1
8 1 1
I have used the following code:
df['pct'] = df['value'] / df['value'].sum()
But I want to calculate pct while excluding rows where input = -1: if the input value is -1, the corresponding value should not be taken into account in the sum, and pct does not need to be calculated for it (rows 2 and 4 in this case).
The expected result will look like this:
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
How could I do that in Pandas? Thanks.
You can exclude the non-matching rows from the sum by converting them to missing values with Series.where, divide only the rows selected by the mask via DataFrame.loc, and finally round with Series.round:
mask = df['input'] != -1
df.loc[mask, 'pct'] = (df.loc[mask, 'value'] / df['value'].where(mask).sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
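The same NaN-preserving result can be written more compactly with Series.mask (a sketch assuming the df above):
s = df['value'].mask(df['input'].eq(-1))   # excluded rows become NaN and are skipped by sum
df['pct'] = (s / s.sum()).round(2)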
EDIT: If you need 0 instead of missing values, pass 0 as the second argument of where; summing this Series gives the same total as the missing-value version above:
s = df['value'].where(df['input'] != -1, 0)
df['pct'] = (s / s.sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 0.00
3 2 1 0.12
4 3 -1 0.00
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
I have data like below:
col1 col2
A 0
A 1
B 0
A 0
C 1
B 1
C 0
I would like it as below:
col1 col2 col3 col4
A 0 .33 .33
A 1 .33 .33
B 0 .5 .33
A 0 .33 .33
C 1 .5 .33
B 1 .5 .33
C 0 .5 .33
Col1 contains categories of values. Col2 contains events, i.e. 0 = no, 1 = yes.
col3 should be the event rate of the category, i.e.
(number of times the category has value 1) / (total number of occurrences of that category).
col4 should be the event share of the category, i.e.
(number of times the category has value 1) / (total number of 1s across all categories);
e.g. col4 for A should be the number of 1s in A divided by the total number of 1s across categories A, B and C together.
Can anyone please help?
Use GroupBy.transform with mean and sum; for the second column, divide by the total number of 1s in df['col2']:
df['col3'] = df.groupby('col1')['col2'].transform('mean')
df['col4'] = df.groupby('col1')['col2'].transform('sum').div(df['col2'].sum())
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
Another solution, aggregating with GroupBy.agg and joining back on col1:
df = df.join(df.groupby('col1')['col2'].agg([('col3','mean'),('col4','sum')])
               .assign(col4 = lambda x: x['col4'].div(df['col2'].sum())), on='col1')
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
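Both rates can also be read off a single contingency table; a sketch with pd.crosstab (assuming the df from the question):
ct = pd.crosstab(df['col1'], df['col2'])              # counts of 0s and 1s per category
df['col3'] = df['col1'].map(ct[1] / ct.sum(axis=1))   # event rate per category
df['col4'] = df['col1'].map(ct[1] / ct[1].sum())      # event share per category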