How to group a string in a column in Python?

I have a dataframe:
PROD TYPE QUANTI
0 wood i2 20
1 tv ut1 30
2 tabl il3 50
3 rmt z1 40
4 zet u1 60
5 rm t1 60
6 rt t2 80
7 dud i4 40
I want to group the "TYPE" column into categories by letter prefix (i, u, z, y, etc.).
Expected output:
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet y_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40

Use Series.replace with regex=True to replace the digits with _group:
df['TYPE'] = df['TYPE'].replace(r'\d+', '_group', regex=True)
print (df)
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet u_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40
If some values might have no number, strip the digits and append _group to every value instead:
df['TYPE'] = df['TYPE'].replace(r'\d+', '', regex=True) + '_group'
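A minimal sketch comparing the two variants. The value "x" is a hypothetical entry with no digits, added only to show where the two results differ; the other values come from the question.

```python
import pandas as pd

# "x" is a hypothetical value without a number (not in the original data)
types = pd.Series(["i2", "ut1", "il3", "x"])

# Variant 1: digits are replaced by "_group"; values without digits stay unchanged
replaced = types.replace(r"\d+", "_group", regex=True)

# Variant 2: digits are stripped, then "_group" is appended to every value
suffixed = types.replace(r"\d+", "", regex=True) + "_group"

print(replaced.tolist())
print(suffixed.tolist())
```

The second variant is the safer choice when every row should end up in some group, whether or not it had a number.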

Related

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs through 2020.
I was only able to get the count by month-year. I have not been able to write code that groups the data by month-year and indicator and returns the correct result.
Use Series.dt.to_period to get monthly periods, get the index label of the maximal date per group with SeriesGroupBy.idxmax, and then pass those labels to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('M'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('M'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
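Since the question also mentions grouping by indicator, here is a self-contained sketch that groups on both the month and the Indicator column. The name "Month" for the period Series is my own choice, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2000-01-30", "2000-01-31", "2000-03-30", "2000-02-27",
             "2000-02-28", "2000-03-31", "2000-03-28", "2001-01-30",
             "2001-01-31", "2001-03-30", "2001-02-27", "2001-02-28",
             "2001-03-31", "2001-03-28"],
    "Indicator": list("AACBBCCAACBBCC"),
    "Value": [30, 40, 50, 60, 70, 90, 100, 30, 40, 50, 60, 70, 90, 100],
})
df["Date"] = pd.to_datetime(df["Date"])

# Monthly period per row; renamed so it does not clash with the "Date" column
month = df["Date"].dt.to_period("M").rename("Month")

# Index label of the latest date within each (month, indicator) group
latest_idx = df.groupby([month, "Indicator"])["Date"].idxmax()

# Select those rows from the original frame
latest = df.loc[latest_idx]
print(latest)
```

With this sample, each month holds a single indicator, so the result matches grouping by month alone.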

How to convert multi-indexed datetime index into integer?

I have a multi-indexed dataframe, the result of a groupby on 'id' and 'date':
x y
id date
abc 3/1/1994 100 7
9/1/1994 90 8
3/1/1995 80 9
bka 5/1/1993 50 8
7/1/1993 40 9
I'd like to convert those dates into integer-like labels, such as:
x y
id date
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
I thought it would be simple but I couldn't get there easily. Is there a simple way to work on this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
x y
id
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x,y) for x,y in zip(df.index.get_level_values('id'), lvl1) )
output:
x y
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
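For reference, a runnable sketch of the cumcount approach on the sample data. It assumes the rows are already sorted by date within each id, because cumcount simply numbers rows in their current order. Building the index with MultiIndex.from_arrays is my own variation on the answers above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("abc", "3/1/1994"), ("abc", "9/1/1994"), ("abc", "3/1/1995"),
     ("bka", "5/1/1993"), ("bka", "7/1/1993")],
    names=["id", "date"],
)
df = pd.DataFrame({"x": [100, 90, 80, 50, 40], "y": [7, 8, 9, 8, 9]}, index=index)

# Number the rows within each id (0, 1, 2, ...) and turn them into labels
day = "day " + df.groupby(level="id").cumcount().astype(str)

# Rebuild the index from the existing id level and the new labels
out = df.copy()
out.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values("id"), day.to_numpy()],
    names=["id", "date"],
)
print(out)
```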

How to do cumulative mean and count in a easy way

I have the following dataframe in pandas:
data = {'call_put':['C', 'C', 'P','C', 'P'],'price':[10,20,30,40,50], 'qty':[11,12,11,14,9]}
df=pd.DataFrame(data)
df['amt']=df.price*df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output like the following: a cumulative count and a cumulative mean of amt (labelled "median" below) per call_put value ('C' or 'P'), plus a running sum of amt:
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
Create a groupby object named g and use df.assign to add the columns. Drop only the group level of the expanding result (level=0) so the values stay aligned with the original rows:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True), cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 330.000000 680
3 C 40 14 560 3 303.333333 1240
4 P 50 9 450 2 390.000000 1690
Note: for P, the cummedian in the last row is 390, since (330+450)/2 = 390.
For cum_count, see GroupBy.cumcount().
For cummedian, see how expanding() works.
For cum_sum, see Series.cumsum().
IIUC, this should work:
df['cumcount']=df.groupby('call_put').cumcount()+1
df['cummean']=df.groupby('call_put')['amt'].transform(lambda s: s.expanding().mean())
df['cumsum']=df['amt'].cumsum()
Thanks, the following solution works:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True), cum_sum=df.amt.cumsum())
If I run the following without drop=True:
g['amt'].expanding().mean().reset_index()
why does the output show level_1?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain in more detail?
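A short sketch that may help with the level_1 question. The grouped expanding mean comes back with a two-level index: the group key (call_put) and the original row labels, and that second level has no name. reset_index() turns every index level into a column and names an unnamed level after its position, hence level_1. The variable names below are my own.

```python
import pandas as pd

data = {'call_put': ['C', 'C', 'P', 'C', 'P'],
        'price': [10, 20, 30, 40, 50],
        'qty': [11, 12, 11, 14, 9]}
df = pd.DataFrame(data)
df['amt'] = df['price'] * df['qty']

expanding_mean = df.groupby('call_put')['amt'].expanding().mean()

# Two index levels: the group key and the (unnamed) original row labels
print(expanding_mean.index.names)

# Without drop=True both levels become columns; the unnamed one is called "level_1"
flat = expanding_mean.reset_index()
print(flat.columns.tolist())

# Dropping only the group level keeps the original row labels,
# so the values can be aligned back to df
aligned = expanding_mean.reset_index(level=0, drop=True).sort_index()
print(aligned)
```

Note that reset_index(drop=True) throws away both levels and renumbers the values in group order (all C rows, then all P rows), which is why the row order no longer matches df.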
How do you add one more condition to the groupby clause? This attempt fails:
g=df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
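The TypeError arises because 'price' < 50 compares the string literal 'price' with an integer before groupby is even called. One possible way to group on an extra condition, sketched here as an assumption about what was intended, is to pass a list of keys in which the condition is a boolean Series built from the column. The name price_lt_50 is hypothetical.

```python
import pandas as pd

data = {'call_put': ['C', 'C', 'P', 'C', 'P'],
        'price': [10, 20, 30, 40, 50],
        'qty': [11, 12, 11, 14, 9]}
df = pd.DataFrame(data)

# Boolean Series: True where price is below 50 (the name is only a label)
price_lt_50 = (df['price'] < 50).rename('price_lt_50')

# Group by the column name and the boolean condition together
g = df.groupby(['call_put', price_lt_50])
sizes = g.size()
print(sizes)
```

If the goal is instead to keep only rows with price below 50, filtering first with df[df['price'] < 50] and then grouping by 'call_put' would be the simpler route.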

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to build a boolean mask and apply it with numpy.where:
mask = df['Name'].duplicated()
#more general
#mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the two masks shows up on a changed DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same name appears in several separate consecutive groups (such as the two A groups here), the more general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
The simple mask, applied to the same original DataFrame, also blanks the second A group:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
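A self-contained sketch that applies both masks to the same six-row data, so the difference is visible side by side. The variable names are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "A", "A"],
    "term": [1, 2, 1, 2, 4, 3],
    "Grade": [35, 40, 50, 45, 43, 46],
})

# Simple mask: True for every repeat of a name, wherever it occurs
simple_mask = df["Name"].duplicated()

# General mask: label each consecutive run first, then flag repeats within a run
run_id = df["Name"].ne(df["Name"].shift()).cumsum()
general_mask = run_id.duplicated()

simple = np.where(simple_mask, "", df["Name"])
general = np.where(general_mask, "", df["Name"])

print(list(simple))
print(list(general))
```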

how to add a new column in dataframe which divides multiple columns and finds the maximum value

This may be a really simple question, but I am new to Python 3. I have a dataframe with multiple columns and would like to add a new column to it that does the following calculation:
New Column = Max((Column A/Column B), (Column C/Column D), (Column E/Column F))
I can take a max with the following code, but how can I do the division along with it?
df['Max'] = df[['Column A','Column B','Column C', 'Column D', 'Column E', 'Column F']].max(axis=1)
Column A Column B Column C Column D Column E Column F Max
3600 36000 22 11 3200 3200 36000
2300 2300 13 26 1100 1200 2300
1300 13000 15 33 1000 1000 13000
Thanks
You can divide the df by itself by slicing the columns in steps of two, and then take the max:
In [105]:
df['Max'] = df.iloc[:, ::2].div(df.iloc[:, 1::2].values, axis=1).max(axis=1)
df
Out[105]:
Column A Column B Column C Column D Column E Column F Max
0 3600 36000 22 11 3200 3200 2
1 2300 2300 13 26 1100 1200 1
2 1300 13000 15 33 1000 1000 1
Here are the intermediate values:
In [108]:
df.iloc[:, ::2].div(df.iloc[:, 1::2].values, axis=1)
Out[108]:
Column A Column C Column E
0 0.1 2.000000 1.000000
1 1.0 0.500000 0.916667
2 0.1 0.454545 1.000000
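For anyone on a recent pandas, where .ix is no longer available, here is a runnable sketch of the same slicing idea with .iloc. It assumes the frame holds exactly the six numerator/denominator columns in alternating order and does not yet contain a Max column.

```python
import pandas as pd

df = pd.DataFrame({
    "Column A": [3600, 2300, 1300],
    "Column B": [36000, 2300, 13000],
    "Column C": [22, 13, 15],
    "Column D": [11, 26, 33],
    "Column E": [3200, 1100, 1000],
    "Column F": [3200, 1200, 1000],
})

# Even positions are numerators, odd positions are denominators
numerators = df.iloc[:, ::2]
denominators = df.iloc[:, 1::2]

# .values drops the column labels so the division is purely positional
ratios = numerators.div(denominators.values)

# Row-wise maximum of the three ratios
row_max = ratios.max(axis=1)
print(row_max)
```

Assign row_max to df['Max'] only after computing it, otherwise the extra column would shift the alternating pattern on a second run.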
You can try something like the following:
df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] / v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
Example
In [14]: df
Out[14]:
A B C D E F
0 1 11 1 11 12 98
1 2 22 2 22 67 1
2 3 33 3 33 23 4
3 4 44 4 44 11 10
In [15]: df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] /
v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
In [16]: df
Out[16]:
A B C D E F Max
0 1 11 1 11 12 98 0.122449
1 2 22 2 22 67 1 67.000000
2 3 33 3 33 23 4 5.750000
3 4 44 4 44 11 10 1.100000
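As a possible alternative to apply, the same result can be computed column-wise by naming the numerator/denominator pairs explicitly. This is a sketch on the example data above; the pairs list and the ratios variable are my own names.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [11, 22, 33, 44],
    "C": [1, 2, 3, 4],
    "D": [11, 22, 33, 44],
    "E": [12, 67, 23, 11],
    "F": [98, 1, 4, 10],
})

# Explicit (numerator, denominator) pairs, so column order does not matter
pairs = [("A", "B"), ("C", "D"), ("E", "F")]

# One ratio column per pair; "/" is true division in Python 3
ratios = pd.concat([df[num] / df[den] for num, den in pairs], axis=1)

df["Max"] = ratios.max(axis=1)
print(df)
```

This avoids the per-row Python function call, which tends to matter on larger frames.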
