How to add a flag column to each group of a pandas groupby object - python-3.x

I have a DataFrame df with three columns X, Y and Z. I want to group the data by X and then add a flag column to each group. The condition is: if more than 30% of the Z values in a group are greater than 1.5, the flag is 1 for the whole group; otherwise the flag is 0 for the whole group.
Here is my example df:
df = pd.DataFrame({'X':['1', '1', '1' ,'1', '1', '2','2','2','2','2','2','3','3','3'],'Y':["34","45","33","45","44", "66",'67','23','34','10','11','13','12','14'],'Z':["1.2","1.3","1.6","1.7","1.8", "0",'0','0','1.8','1.2','1.3','1.6','1.7','1.8']})
X Y Z
0 1 34 1.2
1 1 45 1.3
2 1 33 1.6
3 1 45 1.7
4 1 44 1.8
5 2 66 0
6 2 67 0
7 2 23 0
8 2 34 1.8
9 2 10 1.2
10 2 11 1.3
11 3 13 1.6
12 3 12 1.7
13 3 14 1.8
desired results:
df_result= pd.DataFrame({'X':['1', '1', '1' ,'1', '1', '2','2','2','2','2','2','3','3','3'],'Y':["34","45","33","45","44", "66",'67','23','34','10','11','13','12','14'],'Z':["1.2","1.3","1.6","1.7","1.8", "0",'0','0','1.8','1.2','1.3','1.6','1.7','1.8'],'flag':["1","1","1","1","1", "0",'0','0','0','0','0','1','1','1']})
print(df_result)
X Y Z flag
0 1 34 1.2 1
1 1 45 1.3 1
2 1 33 1.6 1
3 1 45 1.7 1
4 1 44 1.8 1
5 2 66 0 0
6 2 67 0 0
7 2 23 0 0
8 2 34 1.8 0
9 2 10 1.2 0
10 2 11 1.3 0
11 3 13 1.6 1
12 3 12 1.7 1
13 3 14 1.8 1

Use GroupBy.transform with a lambda function and convert the boolean result to integers with Series.astype:
df["Z"]= df["Z"].astype(float)
f = lambda x: (x > 1.5).sum() > len(x) *.3
# if necessary, round the 30% threshold up to an integer with ceil (needs: import numpy as np)
#f = lambda x: (x > 1.5).sum() > np.ceil(len(x) *.3)
df['flag'] = df.groupby("X")["Z"].transform(f).astype(int)
print (df)
X Y Z flag
0 1 34 1.2 1
1 1 45 1.3 1
2 1 33 1.6 1
3 1 45 1.7 1
4 1 44 1.8 1
5 2 66 0.0 0
6 2 67 0.0 0
7 2 23 0.0 0
8 2 34 1.8 0
9 2 10 1.2 0
10 2 11 1.3 0
11 3 13 1.6 1
12 3 12 1.7 1
13 3 14 1.8 1
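The same flag can also be computed without a lambda. The following is a minimal sketch, assuming Z is already numeric and that "30% of values" means the share of rows in each group with Z > 1.5; the variable name share is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["1"] * 5 + ["2"] * 6 + ["3"] * 3,
    "Z": [1.2, 1.3, 1.6, 1.7, 1.8, 0, 0, 0, 1.8, 1.2, 1.3, 1.6, 1.7, 1.8],
})

# The mean of a boolean Series is the fraction of True values,
# so this gives the share of rows with Z > 1.5 inside each X group.
share = df["Z"].gt(1.5).groupby(df["X"]).transform("mean")

# Flag every row of a group whose share exceeds 30%.
df["flag"] = share.gt(0.3).astype(int)
print(df)
```

Because transform broadcasts the group result back to the original rows, no merge or lookup step is needed.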

Try this. Please let me know if there is any issue.
import pandas as pd
import math
df = pd.DataFrame({'X':['1', '1', '1' ,'1', '1', '2','2','2','2','2','2','3','3','3'],'Y':["34","45","33","45","44", "66",'67','23','34','10','11','13','12','14'],'Z':["1.2","1.3","1.6","1.7","1.8", "0",'0','0','1.8','1.2','1.3','1.6','1.7','1.8']})
df["Z"]= pd.to_numeric(df["Z"])
def func(x):
    p = math.ceil(x.shape[0]*3/10)
    if sum(x>1.5) > p:
        return 1
    else:
        return 0
t = df.groupby("X")["Z"].apply(lambda x: func(x)).reset_index(name="flag")
df["flag"] = df["X"].apply(lambda x: t[t["X"]==x]["flag"].values[0])
Output:
X Y Z flag
1 34 1.2 1
1 45 1.3 1
1 33 1.6 1
1 45 1.7 1
1 44 1.8 1
2 66 0.0 0
2 67 0.0 0
2 23 0.0 0
2 34 1.8 0
2 10 1.2 0
2 11 1.3 0
3 13 1.6 1
3 12 1.7 1
3 14 1.8 1
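The row-by-row lookup in the last step can be replaced by Series.map. This is a sketch under the same assumptions as the answer above (threshold rounded up with math.ceil); the names flags and threshold are illustrative only.

```python
import math
import pandas as pd

df = pd.DataFrame({
    "X": ["1"] * 5 + ["2"] * 6 + ["3"] * 3,
    "Z": [1.2, 1.3, 1.6, 1.7, 1.8, 0, 0, 0, 1.8, 1.2, 1.3, 1.6, 1.7, 1.8],
})

def func(z):
    # 30% of the group size, rounded up to a whole number of rows
    threshold = math.ceil(len(z) * 3 / 10)
    return int((z > 1.5).sum() > threshold)

# One flag per group, indexed by the group key X
flags = df.groupby("X")["Z"].apply(func)

# Look up each row's group flag by its X value
df["flag"] = df["X"].map(flags)
print(df)
```

map aligns on the index of flags, so each X value is resolved once per row without filtering the lookup table repeatedly.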

Related

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge and both are not doing this job.
Any help is appreciated.
If column names are duplicated, you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT: a more general version that loops over the unique column names of the original df:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
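For reference, here is a self-contained sketch of the same idea that stacks the values with NumPy instead of melt. It assumes the frame from the question; the name combined is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 2, 0, 3, 3],
     [20, 4, 19, 21, 36],
     [30, 20, 24, 24, 12],
     [40, 10, 39, 23, 46]],
    columns=list("AABBB"),
)

combined = pd.concat(
    {
        # Select every column carrying this label, then flatten column by column
        name: pd.Series(df.loc[:, df.columns == name].to_numpy().ravel(order="F"))
        for name in df.columns.unique()
    },
    axis=1,
)
print(combined)
```

Shorter columns are padded with NaN when the Series are aligned, which is why A becomes a float column.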

How to save rows when value change in column python

I have a DataFrame with two columns, ID and Value1. I want to select rows where the value in column Value1 changes: for each change point, I want to keep the 3 rows before the change, the change-point row itself, and the 3 rows after it.
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
ID Value1
0 1 0
1 3 0
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
16 56 1
output:
ID Value1
0 4 0
1 6 0
2 7 0
3 8 2
4 90 2
5 23 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
IIUC,
import numpy as np
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
df = df.reset_index(drop=True) # index needs to start from zero for the solution; reset_index returns a new frame
ind = list(set([val for i in df[df['Value1'].diff()!=0].index for val in range(i-3, i+4) if i>0 and val>=0]))
# diff() marks the rows where Value1 differs from the previous row; the nested
# comprehension expands each change point to the window i-3..i+3, and
# list(set()) drops duplicate index values from overlapping windows
df[df.index.isin(ind)]
ID Value1
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
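If you want to retain duplicate occurrences from overlapping windows, drop the list(set()) wrapper. Note that df.index.isin(ind) ignores repeated values, so you would also need to select the rows by label (for example with df.loc) for the duplicates to show up.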
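The windowing logic can also be written with a boolean mask, which avoids the set and handles the frame edges explicitly. This is a sketch assuming a window of 3 rows on each side of every change point; change_pos and keep are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 4, 6, 7, 8, 90, 23, 56, 78, 90, 34, 56, 78, 89, 34, 56],
    "Value1": [0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1],
})

values = df["Value1"].to_numpy()

# Positions where the value differs from the previous row
change_pos = np.flatnonzero(values[1:] != values[:-1]) + 1

# Mark 3 rows before, the change row, and 3 rows after each change point
keep = np.zeros(len(df), dtype=bool)
for pos in change_pos:
    keep[max(pos - 3, 0): pos + 4] = True

result = df[keep]
print(result)
```

Overlapping windows simply set the same mask entries twice, so each row appears at most once in the result.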

Rename column index from 0 to last column pandas

I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0,1,0,1,0,1. I want to rename them to consecutive positions 0,1,2,3,4,... and I tried the following:
dat = dat.reset_index(drop=True)
The column labels were not changed. How do I rename the columns in this case? Thanks in advance.
dat.columns = range(dat.shape[1])
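There are quite a few ways. Note that rename maps labels, not positions, so the two rename variants below are only reliable when the column labels are unique; with duplicated labels such as 0,1,0,1,0,1, prefer set_axis or direct assignment to dat.columns.
dat = dat.rename(columns = lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns = dict(zip(dat.columns, range(dat.shape[1]))))
Or
# inplace was removed from set_axis in pandas 2.0; a new frame is returned by default
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
Output: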
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
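A short sketch of why label-based mappings struggle with duplicated column labels, and of the positional assignment that sidesteps the problem. The two-row frame is a reduced version of the one in the question; mapping is an illustrative name.

```python
import pandas as pd

dat = pd.DataFrame(
    [["A", 23, 0.1, 122, 56, 9],
     ["B", 24, 0.45, 564, 36, 3]],
    columns=[0, 1, 0, 1, 0, 1],
)

# A dict built from duplicated labels keeps only the last position per label,
# so it cannot describe six distinct target names.
mapping = dict(zip(dat.columns, range(dat.shape[1])))
print(mapping)

# Positional assignment ignores the old labels entirely.
dat.columns = range(dat.shape[1])
print(dat)
```

Assigning to dat.columns only requires that the new labels have the same length as the number of columns.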

TypeError: strptime() argument 1 must be str, not float

I'm getting a parsing error in my code. Below are the code and a sample of the dataset.
import numpy as np
import pandas as pd
from datetime import datetime as dt
data0 = pd.read_csv('2009-10.csv')
data1 = pd.read_csv('2010-11.csv')
def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%d/%m/%y').date()
data0.Date = data0.Date.apply(parse_date)
data1.Date = data1.Date.apply(parse_date)
TypeError: strptime() argument 1 must be str, not float
Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A
15/08/09 Aston Villa Wigan 0 2 A 0 1 A M Clattenburg 11 14 5 7 15 14 4 6 2 2 0 0 1.67 3.6 5.5
15/08/09 Blackburn Man City 0 2 A 0 1 A M Dean 17 8 9 5 12 9 5 4 2 1 0 0 3.6 3.25 2.1
15/08/09 Bolton Sunderland 0 1 A 0 1 A A Marriner 11 20 3 13 16 10 4 7 2 1 0 0 2.25 3.25 3.25
15/08/09 Chelsea Hull 2 1 H 1 1 D A Wiley 26 7 12 3 13 15 12 4 1 2 0 0 1.17 6.5 21
15/08/09 Everton Arsenal 1 6 A 0 3 A M Halsey 8 15 5 9 11 13 4 9 0 0 0 0 3.2 3.25 2.3
Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A
14/08/10 Aston Villa West Ham 3 0 H 2 0 H M Dean 23 12 11 2 15 15 16 7 1 2 0 0 2 3.3 4
14/08/10 Blackburn Everton 1 0 H 1 0 H P Dowd 7 17 2 12 19 14 1 3 2 1 0 0 2.88 3.25 2.5
14/08/10 Bolton Fulham 0 0 D 0 0 D S Attwell 13 12 9 7 12 13 4 8 1 3 0 0 2.2 3.3 3.4
14/08/10 Chelsea West Brom 6 0 H 2 0 H M Clattenburg 18 10 13 4 10 10 3 1 1 0 0 0 1.17 7 17
14/08/10 Sunderland Birmingham 2 2 D 1 0 H A Taylor 6 13 2 7 13 10 3 6 3 3 1 0 2.1 3.3 3.6
14/08/10 Tottenham Man City 0 0 D 0 0 D A Marriner 22 11 18 7 13 16 10 3 0 2 0 0 2.4 3.3 3
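IIUC, you are converting strings to a datetime dtype. The error most likely comes from empty cells: read_csv loads them as NaN, which is a float, so the date == '' check never matches and strptime receives a float.
You can use pandas to_datetime, which turns missing values into NaT: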
data0['Date'] = pd.to_datetime(data0['Date'], format='%d/%m/%y')
data1['Date'] = pd.to_datetime(data1['Date'], format='%d/%m/%y')
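A small sketch of both options on made-up data, assuming the missing dates arrive as NaN as they would from read_csv. The Series dates stands in for data0['Date']; the original CSV files are not available here.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

# Stand-in for a Date column that contains an empty cell
dates = pd.Series(["15/08/09", np.nan, "14/08/10"])

# Option 1: vectorized parsing; NaN becomes NaT
parsed = pd.to_datetime(dates, format="%d/%m/%y")
print(parsed)

# Option 2: keep the original function but guard against missing values
def parse_date(date):
    if pd.isna(date):
        return None
    return dt.strptime(date, "%d/%m/%y").date()

parsed_dates = dates.apply(parse_date)
print(parsed_dates)
```

Option 1 yields a datetime64 column, while option 2 yields Python date objects in an object column, so the first is usually more convenient for later date arithmetic.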

Multiple boxplots in SAS

I have this data set and I would like to make all boxplots of the 9 input variables to appear on the same plot, despite that they are in different scales. Could you please tell me if there is an easy way to accomplish this?
I am a novice SAS user so I would appreciate some advice. Thank you.
data raw;
input ID$ Family DistRd Cotton Maize Sorg Millet Bull Cattle Goats;
datalines;
FARM1 12 80 1.5 1 3 0.25 2 0 1
FARM2 54 8 6 4 0 1 6 32 5
FARM3 11 13 0.5 1 0 0 0 0 0
FARM4 21 13 2 2.5 1 0 1 0 5
FARM5 61 30 3 5 0 0 4 21 0
FARM6 20 70 0 2 3 0 2 0 3
FARM7 29 35 1.5 2 0 0 0 0 0
FARM8 29 35 2 3 2 0 0 0 0
FARM9 57 9 5 5 0 0 4 5 2
FARM10 23 33 2 2 1 0 2 1 7
FARM11 13 9 0.5 2 2 0 0 0 0
FARM12 15 9 2 2 2 0 0 0 0
FARM13 27 3 1.5 0 2 1 0 0 1
FARM14 28 5 2 0.5 2 2 2 0 5
FARM15 52 5 7 1 7 0 4 11 3
FARM16 12 10 2 2.5 3 0 0 0 0
FARM17 25 30 1 1 4 0 2 0 5
FARM18 5 3 1 0 1 0.5 0 0 3
FARM19 45 30 4.5 1 1 0 6 13 20
FARM20 6 7 1 1 1 1 2 0 5
FARM21 17 8 1.5 0.5 1.5 0.25 0 0 2
FARM22 22 6 3 2 3 1 3 0 2
FARM23 43 40 7 3 3 0.5 6 2 3
FARM24 66 36 0 0.5 5 5 0 0 0
FARM25 15 3 1 0 1.5 0.5 1 0 1
FARM26 26 5 2 1.5 2 2 1 0 0
FARM27 31 5 1.5 1 3 2 2 0 0
FARM28 37 2 3 2 3 5 3 0 5
FARM29 81 2 8 4 4 12 7 8 13
FARM30 14 10 0 0.5 3 1 0 0 0
FARM31 20 7 2 1 4 3 2 0 5
FARM32 26 7 2 1 2 2 2 0 2
FARM33 12 10 0.5 1 3 1 0 0 0
FARM34 18 35 4 3 3 3 4 0 0
FARM35 11 29 1 0.5 3 2 2 0 2
FARM36 50 29 5 3 5 4 4 8 4
FARM37 7 9 0 1 1 0 0 0 0
FARM38 26 9 2 1 3 0 0 0 0
FARM39 19 33 1 1.5 0 4 2 0 0
FARM40 43 33 3 3 4 7 4 3 0
FARM41 18 12 3 0 1 1 2 1 1
FARM42 64 20 3 5 2 2 4 0 6
FARM43 61 25 9 7 3 8 4 17 0
FARM44 18 3 0.5 0.5 2 2 0 0 4
FARM45 11 2 0.5 0 1.5 1.5 1 1 0
FARM46 30 3 4 2 4 0 4 2 0
FARM47 16 1.5 2 0.5 2 2 2 2 0
FARM48 46 1 0.75 1 3 2 0 0 2
FARM49 18 2 1.5 0.5 2 2 2 0 2
FARM50 81 3 12 1.5 10 8 11 14 15
FARM51 15 0 1.5 1.5 2.5 0 1 0 0
FARM52 26 11 3.5 2 4 0 2 2 2
FARM53 10 11 0 0 1.5 0 0 0 0
FARM54 40 12 5 3 6 1 8 17 10
FARM55 82 4 11 7 5 0.5 8 5 0
FARM56 40 5.5 6 4 2.5 1 3 0 2
FARM57 29 8 3 2 4 2 0 0 2
FARM58 23 5 5 4 3 1 1 0 0
FARM59 53 4 0 3 0 3 6 0 0
FARM60 57 3.5 9 8 0 0 10 23 0
FARM61 23 4 2 2 0.5 4 2 0 0
FARM62 9 31 2 2 0 2 1 0 0
FARM63 22 35 3 2 3 0 5 6 1
FARM64 25 35 3 1 2.5 0 4 8 10
FARM65 20 0 1.5 1 3 0 1 6 0
FARM66 27 41 1.1 0.25 1.5 1.5 0 3 1
FARM67 30 19 2 2 4 1 2 0 5
FARM68 77 18 8 4 6 4 6 8 6
FARM69 13 100 0.5 0.5 0 1 0 0 4
FARM70 24 100 2 3 0 0.5 3 14 10
FARM71 29 90 2 1.5 1.5 1.5 2 0 2
FARM72 57 90 10 7 0 1.5 7 8 7
;
run;
You need to transpose the values and use the group= option of the VBOX statement.
Steps
1. Sort by ID
2. Transpose the data
3. Adjust the labels for display
4. Plot with PROC SGPLOT
proc sort data=raw;
by id;
run;
proc transpose data=raw out=raw_t;
by id;
run;
data raw_t;
set raw_t;
label _name_ = "Variable";
label col1 = "Value";
run;
ods html;
title "My Box Plot";
proc sgplot data=raw_t;
vbox col1 / group=_name_ ;
run;
ods html close;