counting the values in multiple columns at once - python-3.x

I have a dataframe, df, as below:

Index  DateTimestamp        a   b   c
0      2017-08-03 00:00:00  ta  bc  tt
1      2017-08-03 00:00:00  re
3      2017-08-03 00:00:00      cv  ma
4      2017-08-04 00:00:00
5      2017-09-04 00:00:00      cv
:      :                    :   :   :
I want to group by 1 day and count the values in each column, ignoring the empty values in each column. So the output will be:

Index                a  b  c
2017-08-03 00:00:00  2  2  2
2017-08-04 00:00:00  0  1  0
I have tried this, but it is not what I want:
df2 = df.groupby(pd.Grouper(key='DateTimestamp', freq='1D'))['a','b','c'].apply(pd.Series.count)

Use dt.floor or dt.date to remove the times, with GroupBy.count to exclude missing values from the count:
print (df)
   Index       DateTimestamp    a    b    c
0      0 2017-08-03 00:00:00   ta   bc   tt
1      1 2017-08-03 00:00:00   re  NaN  NaN
2      3 2017-08-03 00:00:00  NaN   cv   ma
3      4 2017-08-04 00:00:00  NaN  NaN  NaN
4      5 2017-09-04 00:00:00  NaN   cv  NaN
df2 = df.groupby(df['DateTimestamp'].dt.floor('d'))[['a','b','c']].count()
# another solution
# df2 = df.groupby(df['DateTimestamp'].dt.date)[['a','b','c']].count()
print (df2)
               a  b  c
DateTimestamp
2017-08-03     2  2  2
2017-08-04     0  0  0
2017-09-04     0  1  0
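For reference, the original pd.Grouper attempt also works once the columns are selected with a list; note that, unlike dt.floor, Grouper with a daily frequency behaves like resample and emits a 0-count row for every day between the minimum and maximum date. A minimal sketch, assuming DateTimestamp is already datetime64 (the frame below is a hypothetical reconstruction of the sample):

import pandas as pd

df = pd.DataFrame({'DateTimestamp': pd.to_datetime(['2017-08-03', '2017-08-03', '2017-08-03',
                                                    '2017-08-04', '2017-09-04']),
                   'a': ['ta', 're', None, None, None],
                   'b': ['bc', None, 'cv', None, 'cv'],
                   'c': ['tt', None, 'ma', None, None]})

# count() skips missing values; every day between min and max gets a row
df2 = df.groupby(pd.Grouper(key='DateTimestamp', freq='1D'))[['a','b','c']].count()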
EDIT:
print (df)
   Index       DateTimestamp   a   b   c
0      0 2017-08-03 00:00:00  ta  bc  tt
1      1 2017-08-03 00:00:00  re
2      3 2017-08-03 00:00:00      cv  ma
3      4 2017-08-04 00:00:00
4      5 2017-09-04 00:00:00      cv
Or, if the a, b, c columns contain empty strings instead of NaN (possibly mixed with numeric values), cast to string and compare against '':
c = ['a','b','c']
df2 = df[c].astype(str).ne('').groupby(df['DateTimestamp'].dt.floor('d')).sum().astype(int)
print (df2)
               a  b  c
DateTimestamp
2017-08-03     2  2  2
2017-08-04     0  0  0
2017-09-04     0  1  0
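An alternative sketch for the empty-string case is to turn '' back into NaN first, so the plain count from above applies again (assumes the columns hold strings):

import numpy as np

c = ['a','b','c']
# empty strings become missing values, which count() then skips
df2 = df[c].replace('', np.nan).groupby(df['DateTimestamp'].dt.floor('d')).count()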

Related

Pandas groupby and conditional check on multiple columns

I have a dataframe like so:
id        date  status  value
 1  2009-06-17       1    NaN
 1  2009-07-17       B    NaN
 1  2009-08-17       A    NaN
 1  2009-09-17       5    NaN
 1  2009-10-17       0   0.55
 2  2010-07-17       B    NaN
 2  2010-08-17       A    NaN
 2  2010-09-17       0   0.00
Now I want to group by id and then check whether value becomes non-zero after status changes to A. For the group with id=1, status does change to A, and after that (in terms of date) value also becomes non-zero. But for the group with id=2, even after status changes to A, value does not become non-zero. Note that if status never changes to A, value does not need to be checked at all.
So finally I want a new dataframe like this:
id  check
 1   True
 2  False
Use:
print (df)
   id        date status  value
0   1  2009-06-17      1    NaN
1   1  2009-07-17      B    NaN
2   1  2009-08-17      A    NaN
3   1  2009-09-17      5    NaN
4   1  2009-10-17      0   0.55
5   2  2010-07-17      B    NaN
6   2  2010-08-17      A    NaN
7   2  2010-09-17      0   0.00
8   3  2010-08-17      R    NaN
9   3  2010-09-17      0   0.00
idx = df['id'].unique()
# mask rows where status is 'A'
m = df['status'].eq('A')
# keep all rows from the first 'A' onwards, per group
df1 = df[m.groupby(df['id']).cumsum().gt(0)]
print (df1)
   id        date status  value
2   1  2009-08-17      A    NaN
3   1  2009-09-17      5    NaN
4   1  2009-10-17      0   0.55
6   2  2010-08-17      A    NaN
7   2  2010-09-17      0   0.00
# compare against 0 and test per group that no value equals 0,
# then add back all possible ids with reindex
df2 = (df1['value'].ne(0)
                   .groupby(df1['id'])
                   .all()
                   .reindex(idx, fill_value=False)
                   .reset_index(name='check'))
print (df2)
   id  check
0   1   True
1   2  False
2   3  False
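The same logic can also be written as one function applied per group, which some readers may find easier to follow. A sketch with a hypothetical helper name, equivalent under the same data assumptions:

def check_after_a(g):
    # cummax makes the 'A' flag sticky, keeping every row from the first 'A' on
    after_a = g.loc[g['status'].eq('A').cummax(), 'value']
    # False when 'A' never occurs; otherwise require that no later value is 0
    return bool(len(after_a)) and bool(after_a.ne(0).all())

df2 = df.groupby('id').apply(check_after_a).rename('check').reset_index()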

Pandas - Filling missing dates within groups with different time ranges

I'm working with a dataset which has monthly information about several users, and each user has a different time range. There is also missing "time" data for each user. What I would like to do is fill in the missing month data for each user, based on that user's time range (from min time to max time, in months).
I've read approaches to similar situations using resample and reindex from here, but I'm not getting the desired output / there is a row mismatch after filling the missing months.
Any help/pointers would be much appreciated.
-Luc
Tried using resample and reindex, but not getting the desired output:
x = pd.DataFrame({'user': ['a','a','b','b','c','a','a','b','a','c','c','b'],
                  'dt': ['2015-01-01','2015-02-01','2016-01-01','2016-02-01',
                         '2017-01-01','2015-05-01','2015-07-01','2016-05-01',
                         '2015-08-01','2017-03-01','2017-08-01','2016-09-01'],
                  'val': [1,33,2,1,5,4,2,5,66,7,5,1]})
   user          dt  val
0     a  2015-01-01    1
1     a  2015-02-01   33
2     b  2016-01-01    2
3     b  2016-02-01    1
4     c  2017-01-01    5
5     a  2015-05-01    4
6     a  2015-07-01    2
7     b  2016-05-01    5
8     a  2015-08-01   66
9     c  2017-03-01    7
10    c  2017-08-01    5
11    b  2016-09-01    1
What I would like to see is: for each user, generate the missing months based on the min and max date for that user, and fill val for those months with 0.
Create a DatetimeIndex, so it is possible to use groupby with a custom lambda function and Series.asfreq:
x['dt'] = pd.to_datetime(x['dt'])
x = (x.set_index('dt')
      .groupby('user')['val']
      .apply(lambda x: x.asfreq('MS', fill_value=0))
      .reset_index())
print (x)
   user         dt  val
0     a 2015-01-01    1
1     a 2015-02-01   33
2     a 2015-03-01    0
3     a 2015-04-01    0
4     a 2015-05-01    4
5     a 2015-06-01    0
6     a 2015-07-01    2
7     a 2015-08-01   66
8     b 2016-01-01    2
9     b 2016-02-01    1
10    b 2016-03-01    0
11    b 2016-04-01    0
12    b 2016-05-01    5
13    b 2016-06-01    0
14    b 2016-07-01    0
15    b 2016-08-01    0
16    b 2016-09-01    1
17    c 2017-01-01    5
18    c 2017-02-01    0
19    c 2017-03-01    7
20    c 2017-04-01    0
21    c 2017-05-01    0
22    c 2017-06-01    0
23    c 2017-07-01    0
24    c 2017-08-01    5
Or use Series.reindex with the min and max datetimes per group:
x = (x.set_index('dt')
      .groupby('user')['val']
      .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                               x.index.max(),
                                               freq='MS'),
                                 fill_value=0))
      .rename_axis(('user','dt'))
      .reset_index())
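Both variants assume at most one row per user and month; asfreq is reindex-based, so duplicate index entries per group would raise. If duplicate months are possible, a resample-based sketch that sums duplicates and still fills the gaps with 0 could look like this:

x2 = (x.set_index('dt')
       .groupby('user')['val']
       .resample('MS')
       .sum()              # duplicate months are combined, empty months sum to 0
       .reset_index())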

Subtract from all columns in dataframe row by the value in a Series when indexes match

I am trying to subtract 1 from all columns in the rows of a DataFrame that have a matching index in a list.
For example, if I have a DataFrame like this one:
df = pd.DataFrame({'AMOS Admin': [1,1,0,0,2,2], 'MX Programs': [0,0,1,1,0,0], 'Material Management': [2,2,2,2,1,1]})
print(df)
   AMOS Admin  MX Programs  Material Management
0           1            0                    2
1           1            0                    2
2           0            1                    2
3           0            1                    2
4           2            0                    1
5           2            0                    1
I want to subtract 1 from all columns where index is in [2, 3] so that the end result is:
   AMOS Admin  MX Programs  Material Management
0           1            0                    2
1           1            0                    2
2          -1            0                    1
3          -1            0                    1
4           2            0                    1
5           2            0                    1
Having found no way to do this, I created a Series:
sr = pd.Series([1,1], index=['2', '3'])
print(sr)
2    1
3    1
dtype: int64
However, applying the sub method as per this question results in a DataFrame with all NaN and new rows at the bottom.
   AMOS Admin  MX Programs  Material Management
0         NaN          NaN                  NaN
1         NaN          NaN                  NaN
2         NaN          NaN                  NaN
3         NaN          NaN                  NaN
4         NaN          NaN                  NaN
5         NaN          NaN                  NaN
2         NaN          NaN                  NaN
3         NaN          NaN                  NaN
Any help would be most appreciated.
Thanks,
Juan
Using reindex with your sr, then subtracting the underlying values:
df.loc[:] = df.values - sr.reindex(df.index, fill_value=0).values[:, None]
df
Out[1117]:
   AMOS Admin  MX Programs  Material Management
0           1            0                    2
1           1            0                    2
2          -1            0                    1
3          -1            0                    1
4           2            0                    1
5           2            0                    1
If what you want to do is that specific, why don't you just:
df.loc[[2, 3], :] = df.loc[[2, 3], :].subtract(1)
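Note that the original attempt failed because sr was built with a string index (['2', '3']), which never aligns with the frame's integer index. With an integer index the whole operation reduces to an aligned sub; a sketch:

sr = pd.Series([1, 1], index=[2, 3])   # integer index, matching df.index
df2 = df.sub(sr.reindex(df.index, fill_value=0), axis=0)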

diagonally subtract different columns in python

I want to create another column in the dataframe which holds a difference value. The difference is calculated by subtracting across different rows of different columns, within each unique date value.
I tried looking through various Stack Overflow links but didn't find the answer.
The difference should be the value obtained by subtracting the ATD of the 1st row from the ATA of the 2nd row, and so on, within each unique date value. For example, the ATD of 1st January cannot be combined with the ATA of 2nd January.
For example:
The difference column's first value should be NaN.
The second value should be 50 minutes (17:13:00 - 16:23:00).
But the ATD of 01-01-2019 should not be subtracted from the ATA of 02-01-2019.
You want to apply a shift grouped by Date and then subtract the shifted column from ATD:
>>> df = pd.DataFrame({'ATA': range(0, 365), 'ATD': range(10, 375), 'Date': pd.date_range(start="2018-01-01", end="2018-12-31")})
>>> df['ATD'] = df['ATD'] / 6.0
>>> df = pd.concat([df, df, df, df])
>>> df['shifted_ATA'] = df.groupby('Date')['ATA'].transform('shift')
>>> df['result'] = df['ATD'] - df['shifted_ATA']
>>> df = df.sort_values(by='Date', ascending=True)
>>> df.head(20)
   ATA       ATD       Date  shifted_ATA    result
0    0  1.666667 2018-01-01          NaN       NaN
0    0  1.666667 2018-01-01          0.0  1.666667
0    0  1.666667 2018-01-01          0.0  1.666667
0    0  1.666667 2018-01-01          0.0  1.666667
1    1  1.833333 2018-01-02          NaN       NaN
1    1  1.833333 2018-01-02          1.0  0.833333
1    1  1.833333 2018-01-02          1.0  0.833333
1    1  1.833333 2018-01-02          1.0  0.833333
2    2  2.000000 2018-01-03          2.0  0.000000
2    2  2.000000 2018-01-03          NaN       NaN
2    2  2.000000 2018-01-03          2.0  0.000000
2    2  2.000000 2018-01-03          2.0  0.000000
3    3  2.166667 2018-01-04          3.0 -0.833333
3    3  2.166667 2018-01-04          3.0 -0.833333
3    3  2.166667 2018-01-04          NaN       NaN
3    3  2.166667 2018-01-04          3.0 -0.833333
4    4  2.333333 2018-01-05          4.0 -1.666667
4    4  2.333333 2018-01-05          4.0 -1.666667
4    4  2.333333 2018-01-05          4.0 -1.666667
4    4  2.333333 2018-01-05          NaN       NaN
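Applied to data shaped like the question's, where ATA and ATD are actual times, the same shift-within-group yields Timedelta differences. A hypothetical sketch; the column contents are invented around the 50-minute example:

import pandas as pd

df = pd.DataFrame({'Date': ['01-01-2019', '01-01-2019', '02-01-2019'],
                   'ATD': ['16:23:00', '18:00:00', '08:00:00'],
                   'ATA': ['16:00:00', '17:13:00', '09:00:00']})
df['ATA'] = pd.to_datetime(df['Date'] + ' ' + df['ATA'], format='%d-%m-%Y %H:%M:%S')
df['ATD'] = pd.to_datetime(df['Date'] + ' ' + df['ATD'], format='%d-%m-%Y %H:%M:%S')

# ATA of the current row minus ATD of the previous row, never crossing a Date boundary
df['difference'] = df['ATA'] - df.groupby('Date')['ATD'].shift()
# row 0: NaT, row 1: 0 days 00:50:00, row 2: NaT (first row of a new date)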

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values (ordered by the Date variable) of Var1 at the ID level (column 1); otherwise it takes a value of NA.
Sample Data and Output:
ID      Date  Var1  SumOfPrevious5OccurencesAtIDLevel
 1  1/1/2018     0                                 NA
 1  1/2/2018     1                                 NA
 1  1/3/2018     2                                 NA
 1  1/4/2018     3                                 NA
 2  1/1/2018     4                                 NA
 2  1/2/2018     5                                 NA
 2  1/3/2018     6                                 NA
 2  1/4/2018     7                                 NA
 2  1/5/2018     8                                 NA
 2  1/6/2018     9                                 30
 2  1/7/2018    10                                 35
 2  1/8/2018    11                                 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# sort by ID and datetime, in case the rows are not already ordered
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
    ID       Date  Var1  SumOfPrevious5OccurencesAtIDLevel   new
0    1 2018-01-01     0                                NaN   NaN
1    1 2018-01-02     1                                NaN   NaN
2    1 2018-01-03     2                                NaN   NaN
3    1 2018-01-04     3                                NaN   NaN
4    2 2018-01-01     4                                NaN   NaN
5    2 2018-01-02     5                                NaN   NaN
6    2 2018-01-03     6                                NaN   NaN
7    2 2018-01-04     7                                NaN   NaN
8    2 2018-01-05     8                                NaN   NaN
9    2 2018-01-06     9                               30.0  30.0
10   2 2018-01-07    10                               35.0  35.0
11   2 2018-01-08    11                               40.0  40.0
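The shift can equivalently be applied before the rolling sum, which makes the "previous 5 values" reading explicit; a sketch of the same computation:

# shift first, then sum the 5-wide window: same result as rolling(5).sum().shift()
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.shift().rolling(5).sum())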
