I have a DataFrame as below:
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the DataFrame (a 60:40 split) such that, within each group of Col1, the first 60% of the rows go to train and the last 40% to test?
Train :
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I feel like you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # position of each row within its Col1 group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 marks the first 60% of each group
Train = df.loc[s == 0].copy()
Test = df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
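A compact variant of the same idea, written as a boolean mask (a sketch assuming the df defined above); unlike a single global threshold, it also works when the Col1 groups have unequal sizes:
keep = (df.groupby('Col1')['Col1'].transform('size') * 0.6).astype(int)  # rows to keep per group
mask = df.groupby('Col1').cumcount() < keep
Train = df[mask]
Test = df[~mask]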
If you need the split without groups:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
EDIT: If you need the split per group, create a threshold with GroupBy.cumcount and filter (this assumes the groups are of equal size):
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print (thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
print(test)
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
IIUC, you can use numpy.split:
import numpy as np
train, test = np.split(df, [int(len(df) * 0.6)])
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
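If scikit-learn is already in use, the same positional 60:40 split can be written with train_test_split and shuffling turned off (a sketch assuming scikit-learn is installed; note this does not split per Col1 group):
from sklearn.model_selection import train_test_split

# shuffle=False keeps the original row order, so the first 60% of rows become the train set
train, test = train_test_split(df, train_size=0.6, shuffle=False)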
I have a dataframe that looks like
Date col_1 col_2 col_3
2022-08-20 5 B 1
2022-07-21 6 A 1
2022-07-20 2 A 1
2022-06-15 5 B 1
2022-06-11 3 C 1
2022-06-05 5 C 2
2022-06-01 3 B 2
2022-05-21 6 A 1
2022-05-13 6 A 0
2022-05-10 2 B 3
2022-04-11 2 C 3
2022-03-16 5 A 3
2022-02-20 5 B 1
and I want to add a new column col_new that cumulatively counts the rows with the same elements in col_1 and col_2, excluding the row itself, where the element in col_3 is 1. The desired output would look like
Date col_1 col_2 col_3 col_new
2022-08-20 5 B 1 3
2022-07-21 6 A 1 2
2022-07-20 2 A 1 1
2022-06-15 5 B 1 2
2022-06-11 3 C 1 1
2022-06-05 5 C 2 0
2022-06-01 3 B 2 0
2022-05-21 6 A 1 1
2022-05-13 6 A 0 0
2022-05-10 2 B 3 0
2022-04-11 2 C 3 0
2022-03-16 5 A 3 0
2022-02-20 5 B 1 1
And here's what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
list_col_3_is_1 = (df
                   .assign(Date=Date)
                   .sort_values('Date', ascending=True)
                   ['col_3'].eq(1))
df['col_new'] = (list_col_3_is_1.groupby(df[['col_1','col_2']])
                                .apply(lambda g: g.shift(1, fill_value=0).cumsum()))
But then I got the following error: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Thanks in advance.
Your solution should be changed to pass the grouping keys as a list of Series rather than a DataFrame:
df['col_new'] = list_col_3_is_1.groupby([df['col_1'],df['col_2']]).cumsum()
print (df)
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
Assuming you already have the rows sorted in the desired order, you can use:
df['col_new'] = (df[::-1].assign(n=df['col_3'].eq(1))
                         .groupby(['col_1', 'col_2'])['n']
                         .cumsum())
Output:
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
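If the rows are not guaranteed to be in reverse date order, a variant sketch that sorts by the parsed dates first (assuming the Date strings parse with pd.to_datetime) produces the same column:
order = pd.to_datetime(df['Date']).sort_values().index   # oldest to newest
tmp = df.loc[order]                                      # rows in date order
df['col_new'] = (tmp['col_3'].eq(1)
                    .groupby([tmp['col_1'], tmp['col_2']])
                    .cumsum())                           # assignment aligns back to df by index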
I would like to reshape a DataFrame so that its first column is used to group the other columns under an additional header row.
Initial dataframe
df = pd.DataFrame(
{
'col1':['A','A','A','B','B','B'],
'col2':[1,2,3,4,5,6],
'col3':[1,2,3,4,5,6],
'col4':[1,2,3,4,5,6],
'colx':[1,2,3,4,5,6]
}
)
Trial:
Using pd.pivot() I can get close, but it does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
col2 col3 col4 colx
col1 A B A B A B A B
0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
1 2.0 NaN 2.0 NaN 2.0 NaN 2.0 NaN
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0 NaN
3 NaN 4.0 NaN 4.0 NaN 4.0 NaN 4.0
4 NaN 5.0 NaN 5.0 NaN 5.0 NaN 5.0
5 NaN 6.0 NaN 6.0 NaN 6.0 NaN 6.0
Expected output:
A B
col1 col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the column MultiIndex with DataFrame.swaplevel, sort it, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g = df.groupby('col1').cumcount())
        .pivot(index='g', columns='col1')
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1)
        .rename_axis(index=None, columns=[None, None]))
print(df)
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)
output:
A B
col2 col3 col4 colx col2 col3 col4 colx
0 1 1 1 1 4 4 4 4
1 2 2 2 2 5 5 5 5
2 3 3 3 3 6 6 6 6
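For reference, a similar reshaping can also be sketched with set_index plus unstack instead of pivot (assuming the df from the question); up to the leftover axis names, which rename_axis removes as shown above, the result is the same:
out = (df.set_index([df.groupby('col1').cumcount(), 'col1'])
         .unstack('col1')       # columns become (value column, col1 category)
         .swaplevel(axis=1)     # move the category to the outer level
         .sort_index(axis=1))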
Given a dataset df as follows:
type module item value input
0 A a item1 2 1
1 A a item2 3 0
2 A aa item3 4 1
3 A aa item4 3 0
4 A aa item5 1 -1
5 B b item1 5 0
6 B b item2 1 -1
7 B bb item3 3 0
8 B bb item4 3 1
9 B bb item5 4 0
I need to calculate a sum of pct based on the following logic: first, only values whose input is 0 or 1 are treated as valid. Then I group by type and module to calculate each row's percentage of the group sum; for example, the pct of the first row, A-a-item1, is calculated as 2 / (2 + 3) = 0.4, and A-aa-item1 as 4 / (4 + 3) = 0.57, not divided by 8, since the input value of A-aa-item3 is -1 and it is therefore excluded. The sum column in df2 is calculated by grouping by type and module and summing pct.
df1:
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item1 4 1 0.571429
3 A aa item2 3 0 0.000000
4 A aa item3 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item1 3 0 0.000000
8 B bb item2 3 1 0.300000
9 B bb item3 4 0 0.000000
df2:
type module sum
0 A a 0.40
1 A aa 0.57
2 B b 0.00
3 B bb 0.30
How could I get similar results based on the given dataset? Thanks.
You can replace the values that do not match the conditions: compare input with 1 using Series.eq for the numerator, and check for membership in (0, 1) with Series.isin for the denominator. Instead of an aggregation, GroupBy.transform with sum fills a new column with the group totals, and Series.div performs the division:
s1 = df['value'].where(df['input'].eq(1), 0)
s2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0))
        .groupby(['type','module'])['value'].transform('sum'))
df['pct'] = s1.div(s2)
print (df)
type module item value input pct
0 A a item1 2 1 0.400000
1 A a item2 3 0 0.000000
2 A aa item3 4 1 0.571429
3 A aa item4 3 0 0.000000
4 A aa item5 1 -1 0.000000
5 B b item1 5 0 0.000000
6 B b item2 1 -1 0.000000
7 B bb item3 3 0 0.000000
8 B bb item4 3 1 0.300000
9 B bb item5 4 0 0.000000
For the second DataFrame, two new columns are added with DataFrame.assign and aggregated with sum; the final division uses DataFrame.pop to consume and drop the value column:
df2 = (df.assign(value = df['value'].where(df['input'].isin([0,1]), 0),
                 pct = df['value'].where(df['input'].eq(1), 0))
         .groupby(['type','module'])[['value','pct']]
         .sum()
         .assign(pct = lambda x: x['pct'].div(x.pop('value')))
         .reset_index())
print (df2)
type module pct
0 A a 0.400000
1 A aa 0.571429
2 B b 0.000000
3 B bb 0.300000
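Once the pct column from the first snippet exists, df2 can also be obtained directly by summing it per group (a sketch reusing that column):
df2 = (df.groupby(['type', 'module'], as_index=False)['pct']
         .sum()
         .round(2))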
Given a small dataset as follows:
value input
0 3 0
1 4 1
2 3 -1
3 2 1
4 3 -1
5 5 0
6 1 0
7 1 1
8 1 1
I have used the following code:
df['pct'] = df['value'] / df['value'].sum()
But I want to calculate pct while excluding rows where input = -1: if the input value is -1, the corresponding value should not be taken into account in the sum, and pct does not need to be calculated for it (rows 2 and 4 in this case).
The expected result will look like this:
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
How could I do that in Pandas? Thanks.
You can exclude the non-matching rows from the sum by converting them to missing values with Series.where, divide only the rows selected by the mask via DataFrame.loc, and finally round with Series.round:
mask = df['input'] != -1
df.loc[mask, 'pct'] = (df.loc[mask, 'value'] / df['value'].where(mask).sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
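The same NaN-preserving result can be written more compactly with Series.mask (a sketch assuming the df above):
s = df['value'].mask(df['input'].eq(-1))   # excluded rows become NaN and are skipped by sum
df['pct'] = (s / s.sum()).round(2)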
EDIT: If you need 0 instead of missing values, pass 0 as the second argument of where; summing this Series gives the same total as the missing-value version above:
s = df['value'].where(df['input'] != -1, 0)
df['pct'] = (s / s.sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 0.00
3 2 1 0.12
4 3 -1 0.00
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
I have data like below:
col1 col2
A 0
A 1
B 0
A 0
C 1
B 1
C 0
I would like it as below:
col1 col2 col3 col4
A 0 .33 .33
A 1 .33 .33
B 0 .5 .33
A 0 .33 .33
C 1 .5 .33
B 1 .5 .33
C 0 .5 .33
Col1 contains categories of values. Col2 contains events, i.e. 0 = no, 1 = yes.
col3 should be the event rate of the category, i.e.
(number of times the category has value 1) / (total number of occurrences of that category).
col4 should be the event share of the category, i.e.
(number of times the category has value 1) / (total number of 1s across all categories);
e.g. col4 for A should be the number of 1s in A divided by the total number of 1s across categories A, B and C together.
Can anyone please help?
Use GroupBy.transform with mean and sum; for the second column, divide by the total number of 1s in df['col2']:
df['col3'] = df.groupby('col1')['col2'].transform('mean')
df['col4'] = df.groupby('col1')['col2'].transform('sum').div(df['col2'].sum())
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
Another solution, aggregating with GroupBy.agg and joining back on col1:
df = df.join(df.groupby('col1')['col2'].agg([('col3','mean'),('col4','sum')])
               .assign(col4 = lambda x: x['col4'].div(df['col2'].sum())), on='col1')
print (df)
col1 col2 col3 col4
0 A 0 0.333333 0.333333
1 A 1 0.333333 0.333333
2 B 0 0.500000 0.333333
3 A 0 0.333333 0.333333
4 C 1 0.500000 0.333333
5 B 1 0.500000 0.333333
6 C 0 0.500000 0.333333
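Both rates can also be read off a single contingency table; a sketch with pd.crosstab (assuming the df from the question):
ct = pd.crosstab(df['col1'], df['col2'])              # counts of 0s and 1s per category
df['col3'] = df['col1'].map(ct[1] / ct.sum(axis=1))   # event rate per category
df['col4'] = df['col1'].map(ct[1] / ct[1].sum())      # event share per category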