I have a pd.DataFrame
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
In the df above, hour 22 is missing. I want every hour to be included in the dataframe, like this:
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
0 2017-01-01 22:00:00 2017 1 1 7 22
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
How can I build a function that detects the missing hours and inserts them into the dataframe?
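IIUC, you can use resample with ffill and bfill, then average the two so that each missing hour falls midway between its neighbours:
s = df.set_index('utc_time').resample('1H')
(s.ffill() + s.bfill()) / 2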
Out[163]:
year month day weekday hour
utc_time
2017-01-01 21:00:00 2017 1 1 7 21
2017-01-01 22:00:00 2017 1 1 7 22
2017-01-01 23:00:00 2017 1 1 7 23
2017-01-02 00:00:00 2017 1 2 1 0
2017-01-02 01:00:00 2017 1 2 1 1
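As an alternative to averaging the filled values, here is a minimal sketch that detects the missing hours explicitly and then rebuilds the derived columns from the timestamps. It assumes utc_time is (or can be parsed as) a datetime column and that weekday follows the question's convention of Monday=1 through Sunday=7; the names full_range, missing and filled are only illustrative.

```python
import pandas as pd

# Sample rows from the question (hour 22 is absent).
df = pd.DataFrame({
    "utc_time": pd.to_datetime([
        "2017-01-01 21:00:00", "2017-01-01 23:00:00",
        "2017-01-02 00:00:00", "2017-01-02 01:00:00",
    ])
})

# Every hour between the first and last timestamp.
full_range = pd.date_range(
    df["utc_time"].min(), df["utc_time"].max(), freq=pd.Timedelta(hours=1)
)

# Hours that are in the full range but not in the data.
missing = full_range.difference(pd.DatetimeIndex(df["utc_time"]))

# Rebuild the frame on the full range and derive the other columns.
filled = pd.DataFrame({"utc_time": full_range})
filled["year"] = filled["utc_time"].dt.year
filled["month"] = filled["utc_time"].dt.month
filled["day"] = filled["utc_time"].dt.day
filled["weekday"] = filled["utc_time"].dt.dayofweek + 1  # assumed: Monday=1 ... Sunday=7
filled["hour"] = filled["utc_time"].dt.hour
```

Deriving the columns from the timestamp avoids relying on the average of neighbouring values, which would not give the right hour or day for a gap that spans midnight.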
Hi, I am trying to create a counter on my Trend column that compares each row with the row two positions above it (skipping one row) and resets when the string values differ. For example, row 9 gets a count of 2 because row 7, the previous row after skipping one, has the same value and was counted as 1. The count resets to 1 at row 11, since its value differs from row 9.
Is there any way I could do this?
DateTimeStarted 50% Quantile 50Q shift 2H Trend Count
0 2020-12-18 15:00:00 554.0 NaN Flat 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1
You can shift() the Trend column by 2 and check if it equals Trend:
df['Counter'] = df.Trend.shift(2).eq(df.Trend).astype(int).add(1)
I named it Counter here for comparison:
DateTimeStarted 50%Quantile 50Qshift2H Trend Count Counter
0 2020-12-18 15:00:00 554.0 NaN Flat 1 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1 1
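To see the comparison step by step, here is a small self-contained sketch using only the Trend values of rows 0 to 11 from the question. The intermediate name two_back is illustrative. Note that this approach only distinguishes "same as two rows back" (2) from "different" (1); a run longer than two alternating matches would still show 2, so a true running count would need a different, cumulative approach.

```python
import pandas as pd

# Trend values for rows 0-11 of the question's table.
df = pd.DataFrame({"Trend": [
    "Flat", "Flat", "Down", "Down", "Up", "Up",
    "Down", "Down", "Up", "Down", "Up", "Up",
]})

# Value found two rows earlier (NaN for the first two rows).
two_back = df["Trend"].shift(2)

# True where the value matches the one two rows back; True -> 2, False -> 1.
df["Counter"] = two_back.eq(df["Trend"]).astype(int).add(1)
```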
I have a situation where the month and day are swapped for some dates in my dataframe. For example, here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The date is stored as a string. As you can see, the date is in the format yyyy-dd-mm till 12th of Jan and then becomes yyyy-mm-dd. The dataframe consists of 3 years worth data and this pattern repeats for all months for all years.
My expected output is to standardize the dates to the yyyy-mm-dd format, like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote, and it gets the job done. Basically, I split the date string and do some string manipulation. However, as you can see, it's not too pretty. I would like to know whether there is a more elegant solution than using df.apply and loops.
def func(x):
    d = x.split('-')
    print(d)
    if int(d[1]) <= 12 and int(d[2]) <= 12:
        # both fields could be a month, so treat the string as yyyy-dd-mm and swap
        d = [d[0], d[2], d[1]]
        x = '-'.join(d)
        return x
    else:
        return x

df['work_date'] = df['work_date'].apply(func)
Since the rows are in order, each date appears only once, and every day is included consecutively, you could simply rebuild the column from a date range:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-15', freq='1D')
# The end points can be df['work_date'].min(), df['work_date'].max(), or a literal string, depending on which of them is already in the correct format. The range must contain exactly one date per row.
df
Out[1]:
work_date Date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also use nested try / except blocks, as shown below:
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
    df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
    try:
        # swap day and month in the maximum
        s = maxx.split('-')
        df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
    except ValueError:
        # otherwise swap day and month in the minimum
        s = minn.split('-')
        df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df
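If the rows are not guaranteed to be consecutive, another option is to parse each string directly. The following is a rough sketch, not a tested solution for the full three-year dataset: it assumes, like the code in the question, that any string which is a valid yyyy-dd-mm date really is in that format, and that everything else is yyyy-mm-dd. The column name fixed is only illustrative.

```python
import pandas as pd

# A few of the strings from the question, in both formats.
df = pd.DataFrame({"work_date": [
    "2018-01-01", "2018-02-01", "2018-03-01", "2018-12-01",
    "2018-01-13", "2018-01-14", "2018-01-15",
]})

# Try yyyy-dd-mm first; strings whose last field is > 12 become NaT.
day_first = pd.to_datetime(df["work_date"], format="%Y-%d-%m", errors="coerce")

# Fall back to yyyy-mm-dd for the rows that could not be parsed.
month_first = pd.to_datetime(df["work_date"], format="%Y-%m-%d", errors="coerce")

df["fixed"] = day_first.fillna(month_first)
```

This avoids apply and explicit loops, and the result is a datetime column rather than a string; df["fixed"].dt.strftime("%Y-%m-%d") converts it back to strings if needed.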
I'm totally new to time series analysis and I'm trying to work through examples available online.
This is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-26 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
    '4H',
    kind='timestamp',
    on='Datetime',
    label='left'
).sum()
The function is very similar to groupby: it groups the data into chunks based on the column specified in on=. In this case we use the timestamps and chunks of 4 hours.
Finally, you need to apply some kind of aggregation, in this case sum(), to reduce all elements of each group to a single value per time chunk.
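Here is a small self-contained sketch of the same idea that also builds the window label the question asks for. It uses only the five Count values shown in the question, passes the window as a pd.Timedelta instead of a frequency string, and aggregates only the Count column; the names window and sampled are illustrative.

```python
import pandas as pd

# First five hourly rows from the question.
data = pd.DataFrame({
    "Datetime": pd.date_range("2012-08-25", periods=5, freq=pd.Timedelta(hours=1)),
    "Count": [8, 2, 6, 2, 2],
})

window = pd.Timedelta(hours=4)

# Sum Count within each 4-hour bin, labelled by the bin's left edge.
sampled = data.resample(window, on="Datetime")["Count"].sum().reset_index()

# Build a "start - end" label for each bin.
end = sampled["Datetime"] + window
sampled["4-Hour-window"] = (
    sampled["Datetime"].dt.strftime("%H:%M:%S") + " - " + end.dt.strftime("%H:%M:%S")
)
```

Selecting the Count column before calling sum() keeps columns such as ID or Hour from being summed along with it.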
I have a list of data with the total number of orders per day, and I would like to calculate the average number of orders for each day of the week. For example, the average number of orders on Mondays.
0 2018-01-01 00:00:00 3162
1 2018-01-02 00:00:00 1146
2 2018-01-03 00:00:00 396
3 2018-01-04 00:00:00 848
4 2018-01-05 00:00:00 1624
5 2018-01-06 00:00:00 3052
6 2018-01-07 00:00:00 3674
7 2018-01-08 00:00:00 1768
8 2018-01-09 00:00:00 1190
9 2018-01-10 00:00:00 382
10 2018-01-11 00:00:00 3170
1. Make sure your date column is in datetime format (it looks like it already is).
2. Add a column that converts the date to the day of the week.
3. Group by the day of the week and take the average.
df['Date'] = pd.to_datetime(df['Date']) # Step 1
df['DayofWeek'] =df['Date'].dt.day_name() # Step 2
df.groupby(['DayofWeek']).mean() # Step 3
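The three steps above can be run end to end on the sample data from the question. This is a sketch: the question does not show column names, so Date and Orders are assumed here, and the reindex step is an optional addition that puts the result in calendar order.

```python
import pandas as pd

# The eleven rows shown in the question (column names are assumed).
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=11, freq="D"),
    "Orders": [3162, 1146, 396, 848, 1624, 3052, 3674, 1768, 1190, 382, 3170],
})

df["DayofWeek"] = df["Date"].dt.day_name()

# Average orders per weekday, sorted Monday to Sunday instead of alphabetically.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
avg = df.groupby("DayofWeek")["Orders"].mean().reindex(order)
```

With this sample, Monday averages the two Mondays in the data (2018-01-01 and 2018-01-08).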
I am working with this dataset:
TPdata:
id Tp1 Sp2 time
A 1 7 08:00:00
B 2 8 09:00:00
C 3 9 18:30:00
D 4 10 20:00:00
E 5 11 08:00:00
F 6 12 09:00:00
I would like to change the entries 08:00:00 in column time to 'early'. I thought this would work, but it doesn't:
TPdata$time[TPdata$time == 18:30:00] <- "early"
Can anyone help?