Using pandas' groupby with shifting - python-3.x

I am looking to use pd.rolling_mean in a groupby operation. I want each group to contain a rolling mean of the previous elements within the same group. Here is an example:
id val
0 1
0 2
0 3
1 4
1 5
2 6
Grouping by id, this should be transformed into:
id val
0 nan
0 1
0 1.5
1 nan
1 4
2 nan

I believe you want pd.Series.expanding
df.groupby('id').val.apply(lambda x: x.expanding().mean().shift())
0 NaN
1 1.0
2 1.5
3 NaN
4 4.0
5 NaN
Name: val, dtype: float64
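For reference, a minimal self-contained sketch of this approach, assuming the example data from the question. It uses transform instead of apply so the result stays aligned with the original index, and the column name prev_mean is only illustrative:
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 2],
                   'val': [1, 2, 3, 4, 5, 6]})

# expanding mean within each group, shifted so each row only sees earlier rows;
# transform keeps the result aligned with df's original index
df['prev_mean'] = df.groupby('id')['val'].transform(lambda x: x.expanding().mean().shift())
print(df)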

I think you need groupby with shift and rolling; the window size can be set to a scalar:
df['val']=df.groupby('id')['val'].apply(lambda x: x.shift().rolling(2, min_periods=1).mean())
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
Thanks to 3novak for the comment - you can set the window size to the maximum group length:
f = lambda x: x.shift().rolling(df['id'].value_counts().iloc[0], min_periods=1).mean()
df['val'] = df.groupby('id')['val'].apply(f)
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
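As an aside, the maximum group length used above can also be computed with groupby().size().max(), which gives the same value as df['id'].value_counts().iloc[0]. A small self-contained sketch, assuming the example data and using transform so the result aligns with the original index:
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 2],
                   'val': [1, 2, 3, 4, 5, 6]})

# size of the largest group; equivalent to df['id'].value_counts().iloc[0]
max_len = df.groupby('id')['val'].size().max()

# shifted rolling mean within each group, with a window wide enough to cover any group
df['val'] = df.groupby('id')['val'].transform(
    lambda x: x.shift().rolling(max_len, min_periods=1).mean())
print(df)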

Related

Replace only leading NaN values in Pandas dataframe

I have a dataframe of time series data, in which data reporting starts at different times (columns) for different observation units (rows). Prior to the first reported datapoint for each unit, the dataframe contains NaN values, e.g.
0 1 2 3 4 ...
A NaN NaN 4 5 6 ...
B NaN 7 8 NaN 10...
C NaN 2 11 24 17...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN). So the result on the example above would be:
0 1 2 3 4 ...
A 0 0 4 5 6 ...
B 0 7 8 NaN 10...
C 0 2 11 24 17...
(Note the retained NaN for row B col 3)
I could iterate through the dataframe row-by-row, identify the first index of a non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
Use notna + cumsum along rows; cells where the cumulative sum is still zero are the leading NaNs:
df[df.notna().cumsum(1) == 0] = 0
df
0 1 2 3 4
A 0.0 0.0 4 5.0 6
B 0.0 7.0 8 NaN 10
C 0.0 2.0 11 24.0 17
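For reference, a minimal self-contained version of this approach, assuming the example data from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({0: [np.nan, np.nan, np.nan],
                   1: [np.nan, 7, 2],
                   2: [4, 8, 11],
                   3: [5, np.nan, 24],
                   4: [6, 10, 17]}, index=['A', 'B', 'C'])

# cumulative count of non-NaN values along each row; positions still at 0
# have not seen any data yet, i.e. they are leading NaNs
df[df.notna().cumsum(axis=1) == 0] = 0
print(df)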
Here is another way using cumprod() and apply()
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.fillna(0,limit = s.loc[x.name]),axis=1)
Output:
0 1 2 3 4
A 0.0 0.0 4.0 5.0 6.0
B 0.0 7.0 8.0 NaN 10.0
C 0.0 2.0 11.0 24.0 17.0
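For completeness, a self-contained sketch of the cumprod/fillna variant, assuming the same example frame; note it relies on every row having at least one leading NaN, since fillna does not accept limit=0:
import pandas as pd
import numpy as np

df = pd.DataFrame({0: [np.nan, np.nan, np.nan],
                   1: [np.nan, 7, 2],
                   2: [4, 8, 11],
                   3: [5, np.nan, 24],
                   4: [6, 10, 17]}, index=['A', 'B', 'C'])

# number of leading NaNs per row: the cumulative product stays 1 only while
# every cell seen so far in the row is NaN
s = df.isna().cumprod(axis=1).sum(axis=1)

# fill at most that many NaNs from the left in each row
out = df.apply(lambda x: x.fillna(0, limit=s.loc[x.name]), axis=1)
print(out)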

Replacing constant values with nan

import pandas as pd
data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print(df)
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
Expected result:
  col1 newCol1
0    1       1
1    3       3
2    3     NaN
3    1       1
4    2       2
5    3       3
6    2       2
7    2     NaN
Try where combined with shift:
df['col2'] = df.col1.where(df.col1.ne(df.col1.shift()))
df
Out[191]:
col1 col2
0 1 1.0
1 3 3.0
2 3 NaN
3 1 1.0
4 2 2.0
5 3 3.0
6 2 2.0
7 2 NaN
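For reference, a minimal self-contained sketch; the mask line is an equivalent alternative added for illustration (where keeps values where the condition is True, mask blanks them out where it is True):
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

# keep a value only where it differs from the previous row, otherwise NaN
df['col2'] = df['col1'].where(df['col1'].ne(df['col1'].shift()))

# equivalent: blank out values that equal the previous row
df['col3'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))

print(df)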

Pandas groupby and conditional check on multiple columns

I have a dataframe like so:
id date status value
1 2009-06-17 1 NaN
1 2009-07-17 B NaN
1 2009-08-17 A NaN
1 2009-09-17 5 NaN
1 2009-10-17 0 0.55
2 2010-07-17 B NaN
2 2010-08-17 A NaN
2 2010-09-17 0 0.00
Now I want to group by id and then check if value becomes non-zero after status changes to A. So for the group with id=1, status does change to A and after that (in terms of date) the value also becomes non-zero. But for the group with id=2, even after status changes to A, the value does not become non-zero. Please note that if status does not change to A then I don't even need to check value.
So finally I want a new dataframe like this:
id check
1 True
2 False
Use:
print (df)
id date status value
0 1 2009-06-17 1 NaN
1 1 2009-07-17 B NaN
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
5 2 2010-07-17 B NaN
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
8 3 2010-08-17 R NaN
9 3 2010-09-17 0 0.00
idx = df['id'].unique()
#filter A values
m = df['status'].eq('A')
#keep all rows from A onward per group
df1 = df[m.groupby(df['id']).cumsum().gt(0)]
print (df1)
id date status value
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
#compare with 0, test that no value in the group is 0, and finally add back all possible ids with reindex
df2 = (df1['value'].ne(0)
.groupby(df1['id'])
.all()
.reindex(idx, fill_value=False)
.reset_index(name='check'))
print (df2)
id check
0 1 True
1 2 False
2 3 False
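An alternative, not from the original answer, is to express the same check as a single groupby apply; a sketch assuming the extended data above (the helper name check is just for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id':     [1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
    'date':   ['2009-06-17', '2009-07-17', '2009-08-17', '2009-09-17', '2009-10-17',
               '2010-07-17', '2010-08-17', '2010-09-17', '2010-08-17', '2010-09-17'],
    'status': ['1', 'B', 'A', '5', '0', 'B', 'A', '0', 'R', '0'],
    'value':  [np.nan, np.nan, np.nan, np.nan, 0.55, np.nan, np.nan, 0.0, np.nan, 0.0],
})

def check(g):
    # rows from the first 'A' onward; empty if the group never reaches 'A'
    after_a = g.loc[g['status'].eq('A').cumsum().gt(0), 'value']
    return bool(len(after_a)) and bool(after_a.ne(0).all())

res = df.groupby('id').apply(check).reset_index(name='check')
print(res)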

Subtract from all columns in dataframe row by the value in a Series when indexes match

I am trying to subtract 1 from all columns in the rows of a DataFrame that have a matching index in a list.
For example, if I have a DataFrame like this one:
df = pd.DataFrame({'AMOS Admin': [1,1,0,0,2,2], 'MX Programs': [0,0,1,1,0,0], 'Material Management': [2,2,2,2,1,1]})
print(df)
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 0 1 2
3 0 1 2
4 2 0 1
5 2 0 1
I want to subtract 1 from all columns where index is in [2, 3] so that the end result is:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
Having found no way to do this I created a Series:
sr = pd.Series([1,1], index=['2', '3'])
print(sr)
2 1
3 1
dtype: int64
However, applying the sub method as per this question results in a DataFrame with all NaN and new rows at the bottom.
AMOS Admin MX Programs Material Management
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Any help would be most appreciated.
Thanks,
Juan
Use reindex with your sr, then subtract using the underlying values:
df.loc[:] = df.values - sr.reindex(df.index, fill_value=0).values[:, None]
df
Out[1117]:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
If what you want to do is that specific, why don't you just:
df.loc[[2, 3], :] = df.loc[[2, 3], :].subtract(1)
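A likely explanation for the all-NaN result in the question: sr was built with the string labels '2' and '3', which do not align with df's integer index, so sub matches nothing and appends the unmatched labels as new rows. A minimal sketch of the index-aligned subtraction, assuming an integer-indexed series:
import pandas as pd

df = pd.DataFrame({'AMOS Admin': [1, 1, 0, 0, 2, 2],
                   'MX Programs': [0, 0, 1, 1, 0, 0],
                   'Material Management': [2, 2, 2, 2, 1, 1]})

# integer labels so they align with df.index
sr = pd.Series([1, 1], index=[2, 3])

# align sr on the row index, fill unmatched rows with 0, then subtract across all columns
result = df.sub(sr.reindex(df.index, fill_value=0), axis=0)
print(result)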

Pandas series.groupby().apply( .sum() ), .sum() not summing values

I have the following test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
'HOUR': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
'CIGFT': [np.NaN,12000,2500,73300,73300,np.NaN,np.NaN,np.NaN,np.NaN,12000,100,100,15000,2500,np.NaN,15000,11000,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN]})
cigs = pd.DataFrame()
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
cigs['cigcount'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).count())
df.fillna(value='-', inplace=True)
cigs['cigminus'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
tfile = open('test_COUNT_manual.txt', 'a')
tfile.write(cigs.to_string())
tfile.close()
I wind up with the following results:
The dataframe:
CIGFT HOUR MONTH
0 NaN 0 1
1 12000.0 0 2
2 2500.0 0 3
3 73300.0 0 1
4 73300.0 0 1
5 NaN 0 1
6 NaN 0 1
7 NaN 0 1
8 NaN 0 1
9 12000.0 0 2
10 100.0 0 3
11 100.0 0 2
12 15000.0 0 2
13 2500.0 0 3
14 NaN 0 2
15 15000.0 0 1
16 11000.0 0 1
17 NaN 0 1
18 NaN 0 1
19 NaN 0 1
20 NaN 0 1
21 NaN 0 1
The results written to the file:
cigsum cigcount cigminus
MONTH HOUR
1 0 4 14 14
2 0 4 5 5
3 0 3 3 3
My issue is that .sum() is not summing the values; it is doing a count of the non-null values. When I replace the null values with a dash, .sum() produces the same result as count().
So what do I use to get the sum of the values if .sum() does not do it?
Series.sum() returns the sum of the series values, excluding NA/null values by default, as mentioned in the official docs. The real issue is that (c >= 0.0) builds a boolean mask, so .sum() adds up True values - that is, it counts them - instead of summing the data. The lambda already receives each group as a Series, so just sum it directly.
Do this:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: c.sum())
The result of this code will be:
MONTH HOUR
1 0 172600.0
2 0 39100.0
3 0 5100.0
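For reference, a minimal self-contained sketch showing that the lambda is not needed at all here; calling sum directly on the grouped column (with NaN skipped by default) gives the same totals:
import pandas as pd
import numpy as np

df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
                   'HOUR': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                   'CIGFT': [np.nan,12000,2500,73300,73300,np.nan,np.nan,np.nan,np.nan,12000,100,100,15000,2500,np.nan,15000,11000,np.nan,np.nan,np.nan,np.nan,np.nan]})

# GroupBy.sum() skips NaN by default, so no lambda is required
cigs = df.groupby(['MONTH', 'HOUR'])['CIGFT'].sum()
print(cigs)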
