Taking all duplicate values in a column as a single value in pandas - python-3.x

My current dataframe is:
  Name  term  Grade
0    A     1     35
1    A     2     40
2    B     1     50
3    B     2     45
I want to get a dataframe as:
  Name  term  Grade
0    A     1     35
           2     40
1    B     1     50
           2     45
Is it possible to get my expected output? If yes, how can I do it?

Use duplicated to create a boolean mask and combine it with numpy.where:
import numpy as np

mask = df['Name'].duplicated()
# more general alternative (handles repeated non-consecutive groups):
# mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
The difference between the two masks can be seen with a modified DataFrame:
print (df)
  Name  term  Grade
0    A     1     35
1    A     2     40
2    B     1     50
3    B     2     45
4    A     4     43
5    A     3     46
If the same value forms multiple separate consecutive groups (like the two A groups here), the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4    A     4     43
5          3     46
Whereas the simple duplicated mask also blanks the second A group, because the value A was already seen earlier:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4          4     43
5          3     46
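
For reference, a self-contained version of the general approach (a minimal sketch using the sample data from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                   'term': [1, 2, 1, 2],
                   'Grade': [35, 40, 50, 45]})

# blank every Name that repeats the value of the row above it
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)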

Related

How to take mean of 3 values before flag change 0 to 1 - python

I have a DataFrame with columns A, B and flag. I want to calculate the mean of the 2 values before the flag changes from 0 to 1, and record the value where the flag changes from 0 to 1 and where it changes from 1 to 0.
# Input dataframe
df = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                   'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                   'flag': [0,0,0,0,1,1,1,0,0,0,0,0]})
# Expected output
df_out = pd.DataFrame({'A_mean_before_flag_change': [5.5],
                       'B_mean_before_flag_change': [5],
                       'A_value_before_change_flag': [7],
                       'B_value_before_change_flag': [6]})
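
Before the general solutions below, a minimal sketch for this single-transition case (my own addition, assuming pandas imported as pd) that reproduces the expected output:

import pandas as pd

df = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                   'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                   'flag': [0,0,0,0,1,1,1,0,0,0,0,0]})

# position of the first 0 -> 1 transition
pos = df.index[df['flag'].diff() == 1][0]

df_out = pd.DataFrame({
    'A_mean_before_flag_change': [df.loc[pos-2:pos-1, 'A'].mean()],
    'B_mean_before_flag_change': [df.loc[pos-2:pos-1, 'B'].mean()],
    'A_value_before_change_flag': [df.loc[pos-1, 'A']],
    'B_value_before_change_flag': [df.loc[pos-1, 'B']],
})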
Now let's try to create a more general solution:
df = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                   'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                   'flag': [0,0,0,0,1,1,1,0,0,1,0,1]})
print (df)
     A   B  flag
0    1   1     0
1    3   3     0
2    4   4     0
3    7   6     0
4    8   8     1
5   11  11     1
6    1   1     1
7   15  19     0
8   20  20     0
9   15  15     1
10  16  16     0
11  87  87     1
First, create groups using a mask that marks 0 values whose next flag value is 1:
m1 = df['flag'].eq(0) & df['flag'].shift(-1).eq(1)
# reversed cumulative sum numbers the groups from the end,
# so each group ends at the last 0 before a rising edge
df['g'] = m1.iloc[::-1].cumsum()
print (df)
     A   B  flag  g
0    1   1     0  3
1    3   3     0  3
2    4   4     0  3
3    7   6     0  3
4    8   8     1  2
5   11  11     1  2
6    1   1     1  2
7   15  19     0  2
8   20  20     0  2
9   15  15     1  1
10  16  16     0  1
11  87  87     1  0
Then filter out groups with fewer than N rows:
N = 4
df1 = df[df['g'].map(df['g'].value_counts()).ge(N)].copy()
print (df1)
    A   B  flag  g
0   1   1     0  3
1   3   3     0  3
2   4   4     0  3
3   7   6     0  3
4   8   8     1  2
5  11  11     1  2
6   1   1     1  2
7  15  19     0  2
8  20  20     0  2
Keep only the last N rows of each group:
df2 = df1.groupby('g').tail(N)
And aggregate with last and mean:
d = {'mean': '_mean_before_flag_change', 'last': '_value_before_change_flag'}
df3 = (df2.groupby('g')[['A', 'B']]
          .agg(['mean', 'last'])
          .sort_index(axis=1, level=1)
          .rename(columns=d))
df3.columns = df3.columns.map(''.join)
print (df3)
   A_value_before_change_flag  B_value_before_change_flag  A_mean_before_flag_change  B_mean_before_flag_change
g
2                          20                          20                      11.75                      12.75
3                           7                           6                       3.75                       3.50
I'm assuming that this needs to work for cases with more than one rising edge and that the consecutive values and averages get appended to the output lists:
# the first step is to extract the rising and falling edges using diff(), identify sections and length
df['flag_diff'] = df.flag.diff().fillna(0)
df['flag_sections'] = (df.flag_diff != 0).cumsum()
df['flag_sum'] = df.flag.groupby(df.flag_sections).transform('sum')
# then you can get the relevant indices by checking for the rising edges
rising_edges = df.index[df.flag_diff==1.0]
val_indices = [i-1 for i in rising_edges]
avg_indices = [(i-2,i-1) for i in rising_edges]
# and finally iterate over the relevant sections
df_out = pd.DataFrame()
df_out['A_mean_before_flag_change'] = [df.A.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['B_mean_before_flag_change'] = [df.B.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['A_value_before_change_flag'] = [df.A.loc[idx] for idx in val_indices]
df_out['B_value_before_change_flag'] = [df.B.loc[idx] for idx in val_indices]
df_out['length'] = [df.flag_sum.loc[idx] for idx in rising_edges]
df_out.index = rising_edges
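
A quick sanity check of this approach (my own addition): each rising edge yields one output row, so on the extended sample df_out should have three rows.

# one output row per rising edge (positions 4, 9 and 11 in the extended sample)
assert len(df_out) == len(rising_edges)
print(df_out)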

Pandas - Fill N rows for a specific column with an integer value and increment the integer thereafter

I have a dataframe to which I added, say, a column named col_1. I want to add integer values to that column starting from the first row, incrementing after every 4th row. The new resulting column should have values as such:
col_1
1
1
1
1
2
2
2
2
The current approach I have is a very brute force one:
for x in range(len(df)):
    if x <= 3:
        df['col_1'][x] = 1
    if x > 3 and x <= 7:
        df['col_1'][x] = 2
This might work for something small, but when moving to something larger it will chew up a lot of time.
If there is a default RangeIndex, you can use integer division and add 1:
df['col_1'] = df.index // 4 + 1
Or, for a general solution, use a helper array based on the length of the DataFrame:
df['col_1'] = np.arange(len(df)) // 4 + 1
To repeat a 1, 2 pattern, also take modulo 2:
df = pd.DataFrame({'a':range(20, 40)})
df['col_1'] = (np.arange(len(df)) // 4) % 2 + 1
print (df)
     a  col_1
0   20      1
1   21      1
2   22      1
3   23      1
4   24      2
5   25      2
6   26      2
7   27      2
8   28      1
9   29      1
10  30      1
11  31      1
12  32      2
13  33      2
14  34      2
15  35      2
16  36      1
17  37      1
18  38      1
19  39      1
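
An alternative sketch using numpy.repeat (my own variant, not from the answer), which builds the incrementing column explicitly and trims it to the frame length:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(20, 40)})

n = 4                      # rows per block
blocks = -(-len(df) // n)  # ceiling division so the last partial block is covered
df['col_1'] = np.repeat(np.arange(1, blocks + 1), n)[:len(df)]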

How to select rows in a DataFrame based on every transition for particular values in a particular column?

I have a DataFrame that has an ID column and a Value column containing only the values (0, 1, 2). I want to capture only those rows where there is a transition from 0-1 or 1-2 in the Value column. This process has to be done for each ID separately.
I tried to group by ID and use a difference aggregation function, so that I can take those rows for which the difference of values is 1, but it fails in certain conditions.
df = df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()
Given DataFrame:
Index  UniqID  Value
1      a       1
2      a       0
3      a       1
4      a       0
5      a       1
6      a       2
7      b       0
8      b       2
9      b       1
10     b       2
11     b       0
12     b       1
13     c       0
14     c       1
15     c       2
16     c       2
Expected Output:
2      a       0
3      a       1
4      a       0
5      a       1
6      a       2
9      b       1
10     b       2
11     b       0
12     b       1
13     c       0
14     c       1
15     c       2
Only expecting those rows when there is a transition from either 0-1 or 1-2.
Thank you in advance.
Use this solution, which works per group with tuples of patterns:
np.random.seed(123)
N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)
pat = [(0, 1), (1, 2)]
a = np.array(pat)
# flag the second row of every length-2 window that matches one of the patterns
s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None, :] == a, axis=1).any(), raw=True))
# mask zeros and back-fill with limit=1 to also flag the first row of each pair
mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))
df = df[mask]
print (df)
   UniqID  Value
99      a      1
98      a      2
12      a      1
63      a      2
38      a      0
41      a      1
9       a      1
72      a      2
64      b      1
67      b      2
33      b      0
68      b      1
57      b      1
71      b      2
10      b      0
8       b      1
61      c      1
66      c      2
46      c      0
0       c      1
40      c      2
21      d      0
74      d      1
15      d      1
85      d      2
6       d      1
88      d      2
91      d      0
83      d      1
4       d      1
34      d      2
96      d      0
48      d      1
29      d      0
84      d      1
32      e      0
62      e      1
37      e      1
55      e      2
16      e      0
23      e      1
Assuming the transition is strictly from 1 -> 2 and 0 -> 1. (This assumption is valid as well.)
Similar sample data:
index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2
Load this into a pandas DataFrame, then use the code below:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif==1]['index']-1) + list(x[x.dif==1]['index']))

target_index = df.groupby('id').apply(lambda x: grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id', 'value']])
It gives the desired DataFrame based on the assumption:
    index id  value
1       2  a      0
2       3  a      1
3       4  a      0
4       5  a      1
5       6  a      2
8       9  b      1
9      10  b      2
10     11  b      0
11     12  b      1
12     13  c      0
13     14  c      1
14     15  c      2
Edit: to include the 1->0 transition, here is the updated function:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    index1 = list(x[x.dif==1]['index']-1) + list(x[x.dif==1]['index'])
    index2 = list(x[(x.dif==-1) & (x.value==0)]['index']-1) + list(x[(x.dif==-1) & (x.value==0)]['index'])
    return pd.DataFrame(index1 + index2)
My version uses shift and diff() to delete all rows whose diff value equals 0, 2 or -2:
import numpy as np
import pandas

df = pandas.DataFrame({'index': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],
                       'UniqId': ['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],
                       'Value': [1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
df['diff'] = np.NaN
# per-group diff of Value
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId']==element] = df.loc[df['UniqId']==element]['Value'].diff()
# shift so each row carries the diff to the next row
df['diff'] = df['diff'].shift(-1)
df = df.loc[(df['diff']!=-2) & (df['diff']!=2) & (df['diff']!=0)]
print(df)
Still waiting for clarification about how the 2-1 and 1-2 transitions should be handled.
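
For reference, a compact vectorized sketch of the same idea (my own variant, not from the answers; assumes pandas imported as pd): mark rows whose value is exactly one more than the previous row in the same group, then keep both rows of each such pair. On the sample above it reproduces the expected rows (with a zero-based index):

import pandas as pd

df = pd.DataFrame({'UniqID': list('aaaaaabbbbbbcccc'),
                   'Value': [1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})

prev = df.groupby('UniqID')['Value'].shift()
step = df['Value'].sub(prev).eq(1)  # True on the second row of a 0->1 or 1->2 pair
mask = step | step.groupby(df['UniqID']).shift(-1, fill_value=False)
print(df[mask])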

Excel - Lookup the group based on value range per segment

I have a table like below.
segmentnum  group 1  group 2  group 3  group 4
1                 0       12       33       66
2                 0        3       10       26
3                 0      422     1433     3330
And a table like below.
vol   segmentnum
0     1
58    1
66    1
48    1
9     2
13    2
7     2
10    3
1500  3
I'd like to add a column that tells me which group the vol for a given segmentnum belongs to, such that:
Group 1 = x to < group 2
Group 2 = x to < group 3
Group 3 = x to <= group 4
Desired result:
vol   segmentnum  group
0     1           1
58    1           3
66    1           3
48    1           3
9     2           2
13    2           3
7     2           2
10    3           3
1500  3           3
Per the accompanying image (vol in column G, segmentnum in column H, and the lookup table in A2:E4), put this in I2 and drag down.
=MATCH(G2, INDEX(B$2:E$4, MATCH(H2, A$2:A$4, 0), 0))
While these results differ from yours, I believe they are correct.
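
For comparison, here is a pandas sketch of the same lookup (my own addition, not part of the Excel answer); np.searchsorted with side='right' mirrors MATCH's approximate-match behavior, including the boundary difference noted above:

import numpy as np
import pandas as pd

bounds = pd.DataFrame({'segmentnum': [1, 2, 3],
                       'group 1': [0, 0, 0],
                       'group 2': [12, 3, 422],
                       'group 3': [33, 10, 1433],
                       'group 4': [66, 26, 3330]}).set_index('segmentnum')

vols = pd.DataFrame({'vol': [0, 58, 66, 48, 9, 13, 7, 10, 1500],
                     'segmentnum': [1, 1, 1, 1, 2, 2, 2, 3, 3]})

# for each row, count the thresholds <= vol in that segment's row of bounds
vols['group'] = [np.searchsorted(bounds.loc[seg].values, v, side='right')
                 for v, seg in zip(vols['vol'], vols['segmentnum'])]
print(vols)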

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1    25
2    45
Name: values
How can I get
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
Does anyone have a simpler way to do it?
This is not so direct but I found it very intuitive (the use of map to create new columns from another column) and can be applied to many other cases:
gb = df.groupby('A').sum()['values']

def getvalue(x):
    return gb[x]

df['sum'] = df['A'].map(getvalue)
df
In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
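
For completeness, a merge-based variant of the join answer above (my own sketch):

agg = (df.groupby('A', as_index=False)['values'].sum()
         .rename(columns={'values': 'sum_values_A'}))
df = df.merge(agg, on='A')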
