The fastai library does not seem to be working in my Python environment, so I tried to add the feature myself using the following lines of code. The objective is to identify whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday.
The code is as follows:
data['mon_fri'] = 0
for i in range(0, len(data)):
    if (data['Dayofweek'][i] == 0 or data['Dayofweek'][i] == 4):
        data['mon_fri'][i] = 1
    else:
        data['mon_fri'][i] = 0
When I run this, I get the following error:
KeyError: 'Dayofweek'
Can anyone help me with this?
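The KeyError just means there is no column named 'Dayofweek' in your frame, so it has to be created first. A minimal sketch of a fix, assuming your data has a datetime column (here called 'Date'; swap in your real column name) and using a vectorized check instead of the loop:
import pandas as pd

# hypothetical frame standing in for your data; 'Date' is an assumed column name
data = pd.DataFrame({'Date': pd.date_range('2017-01-02', periods=7, freq='D')})
data['Dayofweek'] = data['Date'].dt.dayofweek                  # Monday=0 ... Sunday=6
data['mon_fri'] = data['Dayofweek'].isin([0, 4]).astype(int)   # 1 for Mon/Fri, else 0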
New answer showing the entire dataframe.
In [51]: df = pd.DataFrame({'col1':range(9)})
In [52]: df['d'] = pd.date_range('2016-12-31','2017-01-08',freq='D')
In [53]: df
Out[53]:
col1 d
0 0 2016-12-31
1 1 2017-01-01
2 2 2017-01-02
3 3 2017-01-03
4 4 2017-01-04
5 5 2017-01-05
6 6 2017-01-06
7 7 2017-01-07
8 8 2017-01-08
Now add a column for the day of the week:
In [54]: df['dow'] = df['d'].dt.dayofweek
In [55]: df
Out[55]:
col1 d dow
0 0 2016-12-31 5
1 1 2017-01-01 6
2 2 2017-01-02 0
3 3 2017-01-03 1
4 4 2017-01-04 2
5 5 2017-01-05 3
6 6 2017-01-06 4
7 7 2017-01-07 5
8 8 2017-01-08 6
Finally, do the calculation: 1 where the day of week is 1 or 4 (Tuesday/Friday in this example; the original question's Monday/Friday would be 0 and 4), 0 for other days.
In [56]: df['feature'] = df['dow'].apply(lambda x: int((x==1) or (x==4)))
In [57]: df
Out[57]:
col1 d dow feature
0 0 2016-12-31 5 0
1 1 2017-01-01 6 0
2 2 2017-01-02 0 0
3 3 2017-01-03 1 1
4 4 2017-01-04 2 0
5 5 2017-01-05 3 0
6 6 2017-01-06 4 1
7 7 2017-01-07 5 0
8 8 2017-01-08 6 0
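As a follow-up sketch, the apply with a lambda can be replaced by Series.isin, which is vectorized and reads a little more directly (same df as above):
In [58]: df['feature'] = df['dow'].isin([1, 4]).astype(int)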
Assuming you are using pandas, you can just use the built-in dayofweek accessor (here flagging dayofweek 1 or 4, i.e. Tuesday/Friday):
In [32]: d = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
In [33]: d
Out[33]:
2016-12-31 2016-12-31
2017-01-01 2017-01-01
2017-01-02 2017-01-02
2017-01-03 2017-01-03
2017-01-04 2017-01-04
2017-01-05 2017-01-05
2017-01-06 2017-01-06
2017-01-07 2017-01-07
2017-01-08 2017-01-08
Freq: D, dtype: datetime64[ns]
In [34]: s = (d.dt.dayofweek == 1) | (d.dt.dayofweek == 4)
In [35]: s
Out[35]:
2016-12-31 False
2017-01-01 False
2017-01-02 False
2017-01-03 True
2017-01-04 False
2017-01-05 False
2017-01-06 True
2017-01-07 False
2017-01-08 False
Freq: D, dtype: bool
Then convert to 1/0 simply by
In [39]: t = s.apply(lambda x: int(x==True))
In [40]: t
Out[40]:
2016-12-31 0
2017-01-01 0
2017-01-02 0
2017-01-03 1
2017-01-04 0
2017-01-05 0
2017-01-06 1
2017-01-07 0
2017-01-08 0
Freq: D, dtype: int64
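Note that the lambda is not really needed for the conversion; casting the boolean Series directly gives the same 1/0 result:
In [41]: t = s.astype(int)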
Related
I'm working with a dataset that has monthly information about several users, and each user has a different time range. There is also missing month data for each user. What I would like to do is fill in the missing months for each user, based on that user's time range (from their min date to their max date, in months).
I've read approaches to similar situations using resample and reindex here, but I'm not getting the desired output; there is a row mismatch after filling in the missing months.
Any help/pointers would be much appreciated.
-Luc
I tried using resample and reindex, but did not get the desired output:
x = pd.DataFrame({'user': ['a','a','b','b','c','a','a','b','a','c','c','b'],
                  'dt': ['2015-01-01','2015-02-01','2016-01-01','2016-02-01',
                         '2017-01-01','2015-05-01','2015-07-01','2016-05-01',
                         '2015-08-01','2017-03-01','2017-08-01','2016-09-01'],
                  'val': [1,33,2,1,5,4,2,5,66,7,5,1]})
dt user val
0 2015-01-01 a 1
1 2015-02-01 a 33
2 2016-01-01 b 2
3 2016-02-01 b 1
4 2017-01-01 c 5
5 2015-05-01 a 4
6 2015-07-01 a 2
7 2016-05-01 b 5
8 2015-08-01 a 66
9 2017-03-01 c 7
10 2017-08-01 c 5
11 2016-09-01 b 1
What I would like to see is: for each user, generate the missing months between that user's min and max dates, and fill val for those months with 0.
Create a DatetimeIndex, so that it is possible to use groupby with a custom lambda function and Series.asfreq:
x['dt'] = pd.to_datetime(x['dt'])
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.asfreq('MS', fill_value=0))
.reset_index())
print (x)
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 0
3 a 2015-04-01 0
4 a 2015-05-01 4
5 a 2015-06-01 0
6 a 2015-07-01 2
7 a 2015-08-01 66
8 b 2016-01-01 2
9 b 2016-02-01 1
10 b 2016-03-01 0
11 b 2016-04-01 0
12 b 2016-05-01 5
13 b 2016-06-01 0
14 b 2016-07-01 0
15 b 2016-08-01 0
16 b 2016-09-01 1
17 c 2017-01-01 5
18 c 2017-02-01 0
19 c 2017-03-01 7
20 c 2017-04-01 0
21 c 2017-05-01 0
22 c 2017-06-01 0
23 c 2017-07-01 0
24 c 2017-08-01 5
Or use Series.reindex with the min and max datetimes per group:
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), freq='MS'), fill_value=0))
.rename_axis(('user','dt'))
.reset_index())
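As a design note: asfreq is the shorter spelling when every period between each group's first and last dates should be filled, while the reindex version spells out the target range explicitly, which makes it easier to adapt, e.g. to pad every user to one global date range instead of a per-user one.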
I have a dataframe where I'm trying to do an expanding sum of values and group them by date.
Specifically, my data looks like:
creationDateTime OK Fail
2017-01-06 21:30:00 4 0
2017-01-06 21:35:00 4 0
2017-01-06 21:36:00 4 0
2017-01-07 21:48:00 3 1
2017-01-07 21:53:00 4 0
2017-01-08 21:22:00 3 1
2017-01-08 21:27:00 3 1
2017-01-09 21:49:00 3 1
and I'm trying to get something similar to:
creationDateTime OK Fail RollingOK RollingFail
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
I've figured out how to do a rolling sum of the values by using:
data_aggregated['RollingOK'] = data_aggregated['OK'].expanding(0).sum()
data_aggregated['RollingFail'] = data_aggregated['Fail'].expanding(0).sum()
But I'm not sure how I can alter this to get the rolling sums grouped by day, since the code above does a rolling sum over all the rows, without grouping by day.
Any help is very much appreciated.
Use DataFrameGroupBy.cumsum with specified columns after groupby:
#if DatetimeIndex
idx = data_aggregated.index.date
#if column
#idx = data_aggregated['creationDateTime'].dt.date
data_aggregated[['RollingOK','RollingFail']] = (data_aggregated.groupby(idx)[['OK','Fail']]
                                                               .cumsum())
print (data_aggregated)
OK Fail RollingOK RollingFail
creationDateTime
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
You can also work with all columns:
data_aggregated = (data_aggregated.join(data_aggregated.groupby(idx)
.cumsum()
.add_prefix('Rolling')))
print (data_aggregated)
OK Fail RollingOK RollingFail
creationDateTime
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
Your solution should be changed to:
data_aggregated[['RollingOK','RollingFail']] = (data_aggregated.groupby(idx)[['OK','Fail']]
                                                               .expanding(0)
                                                               .sum()
                                                               .reset_index(level=0, drop=True))
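For context: groupby followed by expanding returns a result indexed by (group key, original index), so reset_index(level=0, drop=True) drops the group level to align the sums back to the original rows before assignment.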
You can use the following (if the 1st column, creationDateTime, is a regular column rather than the index):
df['RollingOK']=df.groupby(df.creationDateTime.dt.date)['OK'].cumsum()
df['RollingFail']=df.groupby(df.creationDateTime.dt.date)['Fail'].cumsum()
print(df)
creationDateTime OK Fail RollingOK RollingFail
0 2017-01-06 21:30:00 4 0 4 0
1 2017-01-06 21:35:00 4 0 8 0
2 2017-01-06 21:36:00 4 0 12 0
3 2017-01-07 21:48:00 3 1 3 1
4 2017-01-07 21:53:00 4 0 7 1
5 2017-01-08 21:22:00 3 1 3 1
6 2017-01-08 21:27:00 3 1 6 2
7 2017-01-09 21:49:00 3 1 3 1
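Note the symmetry with the previous answer: when creationDateTime is a regular column you group by df.creationDateTime.dt.date, and when it is the DatetimeIndex you group by df.index.date instead.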
I want to create another column in the dataframe which holds a difference value. The difference is calculated by subtracting values in different rows of different columns, within each unique date value.
I tried looking through various Stack Overflow links but didn't find the answer.
The difference should be the value obtained by subtracting the ATD of the 1st row from the ATA of the 2nd row, and so on, within each unique date value. For instance, the ATA of 1st January cannot be subtracted from the ATD of 2nd January.
For example:
The difference column's first value should be NaN.
The second value should be 50 mins (17:13:00 - 16:23:00).
But the ATD of 02-01-2019 should not be differenced against the ATA of 01-01-2019.
You want to apply a shift grouped by Date and then subtract the shifted column from ATD:
>>> df = pd.DataFrame({'ATA':range(0,365),'ATD':range(10,375),'Date':pd.date_range(start="2018-01-01",end="2018-12-31")})
>>> df['ATD'] = df['ATD']/6.0
>>> df = pd.concat([df,df,df,df])
>>> df['shifted_ATA'] = df.groupby('Date')['ATA'].transform('shift')
>>> df['result'] = df['ATD'] - df['shifted_ATA']
>>> df = df.sort_values(by='Date', ascending=[1])
>>> df.head(20)
ATA ATD Date shifted_ATA result
0 0 1.666667 2018-01-01 NaN NaN
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
1 1 1.833333 2018-01-02 NaN NaN
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 NaN NaN
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 2.0 0.000000
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 NaN NaN
3 3 2.166667 2018-01-04 3.0 -0.833333
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 NaN NaN
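Mapped back to the question's actual columns, a minimal sketch (the timestamps below are made up to reproduce the 50-minute example; it assumes ATA and ATD are parseable datetimes and Date identifies the day):
import pandas as pd

# hypothetical data shaped like the question's example
df = pd.DataFrame({
    'Date': ['01-01-2019', '01-01-2019', '02-01-2019'],
    'ATD':  ['2019-01-01 16:23:00', '2019-01-01 17:45:00', '2019-01-02 09:55:00'],
    'ATA':  ['2019-01-01 16:00:00', '2019-01-01 17:13:00', '2019-01-02 09:30:00'],
})
df['ATA'] = pd.to_datetime(df['ATA'])
df['ATD'] = pd.to_datetime(df['ATD'])

# ATA of each row minus ATD of the previous row, restricted to the same Date;
# the first row of every day gets NaT since there is no previous ATD that day
df['difference'] = df['ATA'] - df.groupby('Date')['ATD'].shift()
print(df)   # the second row shows 0 days 00:50:00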
I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values (ordered by the Date variable) of Var1 at an ID level (column 1); where fewer than 5 previous values exist, it should take a value of NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
#if not already sorted by ID and datetime
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
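The trailing shift() is what makes the window cover the previous 5 values: rolling(5).sum() alone includes the current row, and shifting the result down by one means row i receives the sum of rows i-5 through i-1 within its ID group.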
I have the following data frame:
date my_count
--------------------------
2017-01-01 6
2017-01-04 5
2017-01-05 3
2017-01-08 8
I would like to pad the skipped dates with my_count = 0, so the padded data frame will look like:
date my_count
--------------------------
2017-01-01 6
2017-01-02 0
2017-01-03 0
2017-01-04 5
2017-01-05 3
2017-01-06 0
2017-01-07 0
2017-01-08 8
Other than checking the data frame line by line, is there a more elegant way to do this? Thanks!
1st option, resample:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.resample('D').sum().fillna(0).reset_index())
date my_count
0 2017-01-01 6.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
2nd option, reindex by date_range:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.reindex(pd.date_range('2017-01-01', '2017-01-08')).fillna(0))
my_count
2017-01-01 6.0
2017-01-02 0.0
2017-01-03 0.0
2017-01-04 5.0
2017-01-05 3.0
2017-01-06 0.0
2017-01-07 0.0
2017-01-08 8.0
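One thing to watch with both options above: the missing dates are introduced as NaN before fillna, so my_count comes back as float (6.0, 0.0, ...). The asfreq answer below keeps the integer dtype because fill_value=0 is applied directly.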
If the values of the DatetimeIndex are unique, you can use asfreq, or reindex by the min and max values of the index (or by the first and last values, if the DatetimeIndex is sorted):
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.asfreq('D', fill_value=0).reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
rng = pd.date_range(df.index.min(), df.index.max())
#alternative
#rng = pd.date_range(df.index[0], df.index[-1])
print(df.reindex(rng, fill_value=0).rename_axis('date').reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
If the DatetimeIndex values are not unique, you get:
ValueError: cannot reindex from a duplicate axis
Then you need resample with some aggregate function like mean, or groupby with a Grouper, and finally replace the NaNs using fillna:
print (df)
date my_count
0 2017-01-01 4 <-duplicate date
1 2017-01-01 6 <-duplicate date
2 2017-01-04 5
3 2017-01-05 3
4 2017-01-08 8
df['date'] = pd.to_datetime(df['date'])
print(df.resample('D', on='date')['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
df = df.set_index('date')
print(df.groupby(pd.Grouper(freq='D'))['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
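As a closing note, resample('D', on='date') and groupby(pd.Grouper(freq='D')) are equivalent here; the difference is only ergonomic: on= lets resample work directly on a column, while Grouper operates on the DatetimeIndex after set_index.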