Replace timeseries missing values with previous year's value - python-3.x

As the title suggests, I have an hourly df that looks like this:
date_time traffic_volume
date_time
2012-10-02 09:00:00 2012-10-02 09:00:00 5545.0
2012-10-02 10:00:00 2012-10-02 10:00:00 4516.0
2012-10-02 11:00:00 2012-10-02 11:00:00 NaN
2012-10-02 12:00:00 2012-10-02 12:00:00 NaN
2012-10-02 13:00:00 2012-10-02 13:00:00 NaN
2012-10-02 14:00:00 2012-10-02 14:00:00 NaN
2012-10-02 15:00:00 2012-10-02 15:00:00 5584.0
2012-10-02 16:00:00 2012-10-02 16:00:00 6015.0
The majority of the NaNs I imputed using
df['traffic_volume'] = df['traffic_volume'].interpolate(method='time')
The problem now is that for a certain subset of time-series (the remaining NaN's), I want to impute by putting the same value of that day but last year.
I used
df['traffic_volume'] = df.apply(lambda x: df.loc[ x['date_time'] + pd.offsets.DateOffset(years=-1)]['traffic_volume'] if x['traffic_volume']==np.NaN else x['traffic_volume'], axis=1)
The line of code ran, but my NaNs weren't imputed. My question is why? And if there is a better way, what is it?
Thank you.
P.S. The reason I don't want to use bfill, ffill, or interpolate is that the runs of NaNs are too long and the data would lose granularity.

The fix is to use pd.isna(x['traffic_volume']) instead of x['traffic_volume']==np.NaN for the if condition in the lambda. The reason the initial line ran but imputed nothing: NaN never compares equal to anything, not even to itself, so x['traffic_volume']==np.NaN is always False and the lambda always returned the existing value. Missingness must be tested with pd.isna (or np.isnan), never with ==.
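For illustration, a minimal sketch of both points (the vectorized fill is an assumed alternative to the row-wise apply, not from the original post; it assumes df keeps the DatetimeIndex shown above and that the one-year shift creates no duplicate timestamps, e.g. no leap-day collisions):
import numpy as np
import pandas as pd

np.nan == np.nan   # False: NaN never compares equal, even to itself
pd.isna(np.nan)    # True: the correct missingness test

# Vectorized alternative: relabel last year's values one year forward,
# then fill only the remaining gaps by index alignment.
last_year = df['traffic_volume'].copy()
last_year.index = last_year.index + pd.offsets.DateOffset(years=1)
df['traffic_volume'] = df['traffic_volume'].fillna(last_year)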


How to extract missing datetime interval in python

I have a dataframe where the date column is at 15-minute intervals. I want to find the missing datetime intervals; id should be copied from the previous line, but value should be NaN.
date value id
2021-12-02 07:00:00 12456677 693214
2021-01-02 07:30:00 12456677 693214
2021-01-02 07:45:00 12456677 693214
2021-01-02 08:00:00 12456677 693214
2021-01-02 08:15:00 12456665 693215
2021-01-02 08:45:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
Expected output is:
date value id
2021-01-02 08:30:00 NAN 693215
2021-01-02 09:15:00 NAN 693217
2021-01-03 09:30:00 NAN 693217
2021-01-03 09:45:00 NAN 693217
2021-01-03 10:00:00 NAN 693217
I am trying:
df['Datetime'] = pd.to_datetime(df['date'])
df[ df['Datetime'].diff() > pd.Timedelta('15min') ]
but it just gives rows adjacent to the gaps, not the missing dates and times themselves. It showed me this output:
date value id
2021-01-02 08:15:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
Can someone please guide me on how I can extract the missing dates and times?
Thanks in advance.
Use Series.asfreq per group to get the missing intervals:
#create DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#add 15-minute intervals per day and per id
df1 = (df.groupby([pd.Grouper(freq='D'), 'id'])['value']
.apply(lambda x: x.asfreq('15min'))
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
id date value
0 693214 2021-01-02 07:30:00 12456677.0
1 693214 2021-01-02 07:45:00 12456677.0
2 693214 2021-01-02 08:00:00 12456677.0
3 693215 2021-01-02 08:15:00 12456665.0
4 693215 2021-01-02 08:30:00 NaN
5 693215 2021-01-02 08:45:00 12456665.0
6 693217 2021-01-03 09:00:00 12456666.0
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
12 693217 2021-01-03 10:30:00 12456666.0
13 693214 2021-12-02 07:00:00 12456677.0
Test for missing values with boolean indexing:
df2 = df1[df1['value'].isna()]
print (df2)
id date value
4 693215 2021-01-02 08:30:00 NaN
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
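An alternative sketch (an assumption-laden variation, not the original answer): reindex each id onto its own complete 15-minute grid, which also bridges gaps that cross a day boundary, then keep only the rows that were missing. It assumes df is already indexed by date as above:
full = (df.groupby('id')['value']
          .apply(lambda g: g.reindex(pd.date_range(g.index.min(),
                                                   g.index.max(), freq='15min')))
          .rename_axis(['id', 'date'])
          .reset_index())
missing = full[full['value'].isna()]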

How to extract date in which the hour of the peak value occurred?

I have an hourly time series that starts in 2013 and ends in 2020, as below, and I want to plot only the day in each year on which the system load reached its peak:
date_time system_load
2013-01-01 00:00:00 1.0
2013-01-01 01:00:00 0.9
2013-01-01 02:00:00 0.5
...
2020-12-31 21:00:00 2.1
2020-12-31 22:00:00 1.8
2020-12-31 23:00:00 0.8
The intended dataframe has one day (24 hours) per year:
date_time system_load
2013-07-09 00:00:00 3.1
2013-07-09 02:00:00 3.0
2013-07-09 03:00:00 4.8
2013-07-09 04:00:00 2.6
...
2013-07-09 21:00:00 3.7
2013-07-09 22:00:00 3.9
2013-07-09 23:00:00 5.1
2014-09-09 00:00:00 4.1
2014-09-09 02:00:00 5.3
2014-09-09 03:00:00 6.0
2014-09-09 04:00:00 4.8
...
2014-09-09 21:00:00 3.5
2014-09-09 22:00:00 2.6
2014-09-09 23:00:00 1.6
...
...
2020-06-01 00:00:00 4.2
2020-06-01 02:00:00 3.6
2020-06-01 03:00:00 3.9
2020-06-01 04:00:00 2.8
...
2020-06-01 21:00:00 2.7
2020-06-01 22:00:00 4.8
2020-06-01 23:00:00 3.8
Get only the date and year parts from the date_time column.
Group by the year column and get the row containing the max value of the system_load column in each group.
Get all rows from the original dataframe whose date matches the date with the max system_load value.
Plot the bar chart:
df['date_time'] = pd.to_datetime(df['date_time']) # Ensure the `date_time` column is datetime type
df['just_date'] = df['date_time'].dt.date
df['year'] = df['date_time'].dt.year
idx = df.groupby(['year'])['system_load'].transform(max) == df['system_load']
df[df['just_date'].isin(df[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
If several days in one year have the same max system_load value, the above code returns all of them. If you want to keep only the first such day, you can use pandas.DataFrame.idxmax():
idx = df.groupby(['year'])['system_load'].idxmax()
df[df['just_date'].isin(df.loc[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
Here's an approach to solve your problem:
let sourcedf contain the input data in the form of two columns 'TimeStamp' & 'Load'
Then do the following:
sourcedf['Date'] = sourcedf.apply(lambda row: row['TimeStamp'].date(), axis=1)
dfg = sourcedf.groupby('Date')
ldList = dfg['Load'].max().to_list()                             # daily maxima
tgtDate = dfg.max().index.to_list()[ldList.index(max(ldList))]   # date of the peak
dfout = sourcedf[sourcedf['Date'] == tgtDate]
dfout will then contain just the day on which the overall max load was experienced.
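Note that this finds the single highest-load day across the whole dataset. For one peak day per year, as asked, a hedged sketch in the same spirit (assuming the 'TimeStamp'/'Load' columns above, with 'TimeStamp' already parsed as datetimes):
peak_idx = sourcedf.groupby(sourcedf['TimeStamp'].dt.year)['Load'].idxmax()
peak_dates = sourcedf.loc[peak_idx, 'TimeStamp'].dt.date
dfout = sourcedf[sourcedf['TimeStamp'].dt.date.isin(peak_dates)]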

How to find the AVG, and STD between fixed time period using Pandas

My dataset df looks like this:
DateTimeVal Open
2017-01-01 17:00:00 5.1532
2017-01-01 17:01:00 5.3522
2017-01-01 17:02:00 5.4535
2017-01-01 17:03:00 5.3567
2017-01-01 17:04:00 5.1512
....
It is a minute-resolution dataset.
In my calculation, a single day (24 hours) is defined as 17:00:00 Sunday to 16:59:00 Monday, and so on for the other days.
What I want to do is find the AVG and STD over each such 24-hour window, from 17:00:00 Sunday to 16:59:00 Monday and so on, for all days.
What did I do?
I did the rolling to find the AVG, but it aggregates over a fixed number of rows rather than my 17:00-to-16:59 time range:
# day avg and 7-day rolling avg
df = (
    df.assign(DAY_AVG=df.rolling(window=1*24*60)['Open'].mean(),
              DAY7_AVG=df.rolling(window=7*24*60)['Open'].mean())  # 7DAY_AVG is not a valid keyword
      .groupby(df['DateTimeVal'].dt.date)
      .last()
)
I need help with these two things:
How do I find the AVG and STD between fixed time periods?
How do I find the AVG and STD between fixed time periods for 7-day and 14-day rolling windows?
Use resample with base:
#Create empty dataframe for 2 days
df = pd.DataFrame(index = pd.date_range('2017-07-01', periods=48, freq='1H'))
#Set value equal to 1 from 17:00 to 16:59 next day
df.loc['2017-07-01 17:00:00': '2017-07-02 16:59:59', 'Value'] = 1
print(df)
Output:
Value
2017-07-01 00:00:00 NaN
2017-07-01 01:00:00 NaN
2017-07-01 02:00:00 NaN
2017-07-01 03:00:00 NaN
2017-07-01 04:00:00 NaN
2017-07-01 05:00:00 NaN
2017-07-01 06:00:00 NaN
2017-07-01 07:00:00 NaN
2017-07-01 08:00:00 NaN
2017-07-01 09:00:00 NaN
2017-07-01 10:00:00 NaN
2017-07-01 11:00:00 NaN
2017-07-01 12:00:00 NaN
2017-07-01 13:00:00 NaN
2017-07-01 14:00:00 NaN
2017-07-01 15:00:00 NaN
2017-07-01 16:00:00 NaN
2017-07-01 17:00:00 1.0
2017-07-01 18:00:00 1.0
2017-07-01 19:00:00 1.0
2017-07-01 20:00:00 1.0
2017-07-01 21:00:00 1.0
2017-07-01 22:00:00 1.0
2017-07-01 23:00:00 1.0
2017-07-02 00:00:00 1.0
2017-07-02 01:00:00 1.0
2017-07-02 02:00:00 1.0
2017-07-02 03:00:00 1.0
2017-07-02 04:00:00 1.0
2017-07-02 05:00:00 1.0
2017-07-02 06:00:00 1.0
2017-07-02 07:00:00 1.0
2017-07-02 08:00:00 1.0
2017-07-02 09:00:00 1.0
2017-07-02 10:00:00 1.0
2017-07-02 11:00:00 1.0
2017-07-02 12:00:00 1.0
2017-07-02 13:00:00 1.0
2017-07-02 14:00:00 1.0
2017-07-02 15:00:00 1.0
2017-07-02 16:00:00 1.0
2017-07-02 17:00:00 NaN
2017-07-02 18:00:00 NaN
2017-07-02 19:00:00 NaN
2017-07-02 20:00:00 NaN
2017-07-02 21:00:00 NaN
2017-07-02 22:00:00 NaN
2017-07-02 23:00:00 NaN
Now use resample with base=17:
df.resample('24H', base=17).sum()
Output:
Value
2017-06-30 17:00:00 0.0
2017-07-01 17:00:00 24.0
2017-07-02 17:00:00 0.0
Update for minute sampling:
df = pd.DataFrame({'Value': 0}, index = pd.date_range('2018-10-01', '2018-10-03', freq='1T'))
df.loc['2018-10-01 15:00:00':'2018-10-02 18:59:50', 'Value'] = 1
df.resample('24H', base=17).agg(['sum','mean'])
Output:
Value
sum mean
2018-09-30 17:00:00 120 0.117647
2018-10-01 17:00:00 1440 1.000000
2018-10-02 17:00:00 120 0.285036
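Two hedged follow-ups, not from the original answer. In pandas 1.1+ the base argument of resample is deprecated in favour of offset, and the 7-day/14-day part of the question can be handled with time-based rolling windows over the minute data (a sketch, assuming DateTimeVal is set as a sorted index):
df = df.set_index('DateTimeVal').sort_index()
# fixed 17:00-to-16:59 windows, pandas >= 1.1 spelling of base=17
daily = df['Open'].resample('24H', offset='17h').agg(['mean', 'std'])
# 7- and 14-day rolling AVG and STD at minute resolution
roll7 = df['Open'].rolling('7D').agg(['mean', 'std'])
roll14 = df['Open'].rolling('14D').agg(['mean', 'std'])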

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find the daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate with a daily Grouper, using idxmax/idxmin for the datetimes of the max and min temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
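As a cosmetic variation (an aside, not part of the original answer), named aggregation in pandas 0.25+ yields friendlier column names; applied to the datetime-indexed frame from just before the agg step:
out = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(
    Tmax_time='idxmax', Tmin_time='idxmin', Tmax='max', Tmin='min')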

Setting start time from previous night without dates from CSV using pandas

I would like to run time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can the time be set so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: the date in the last line of code is added automatically (it can be seen when df['Time'] is executed), so I used the same date-and-time format for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should treat your timestamps as pd.Timedelta values and add a day to the samples that fall before your start time.
Create some example data:
import numpy as np
import pandas as pd

d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])
df.to_csv('data.csv')
df
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read in, make the index a timedelta, add a day to the timedeltas that fall before the start time, then assign back to the index:
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.to_datetime(s)
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
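An explanatory aside (not from the original answer) on why the last step works: pd.Timestamp(0) is the Unix epoch, 1970-01-01, which is the implicit date pandas attached to the timedeltas above, so adding the difference shifts the whole index onto the chosen date without touching the times:
pd.Timestamp('2015-06-06') - pd.Timestamp(0)   # Timedelta('16592 days 00:00:00')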
