How to create 4 hour time interval in Time Series Analysis (python)


I'm new to time series analysis and am working through examples available online.
This is what I have so far:
import pandas as pd
# Time-based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-26 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00

I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
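sampled_data = data.resample(
    '4h',            # 4-hour bins (older pandas versions spell this '4H')
    on='Datetime',   # column that holds the timestamps
    label='left'     # label each bin by its start time
)['Count'].sum()     # aggregate only Count, not ID/Hour/Minute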
The function is very similar to groupby: it groups the data into chunks based on the column specified in on=. In this case the column holds timestamps and each chunk covers 4 hours.
Finally, you need to apply some kind of aggregation, in this case sum(), to reduce all elements of each group to a single value per time chunk.
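As a minimal sketch of the resampling approach described above: the eight hourly rows below are made-up sample data (only the column names Datetime and Count come from the question), and the 4-Hour-window label column is built by hand purely as an illustration. A pd.Timedelta is passed as the bin width here instead of a frequency string.

```python
import pandas as pd

# Made-up hourly sample data; only the column names come from the question.
data = pd.DataFrame({
    "Datetime": pd.date_range("2012-08-25 00:00", periods=8, freq=pd.Timedelta(hours=1)),
    "Count": [8, 2, 6, 2, 2, 2, 2, 2],
})

window = pd.Timedelta(hours=4)

# Sum Count within each 4-hour bin; hourly bins are labelled by their left edge by default.
sampled = data.resample(window, on="Datetime")["Count"].sum().reset_index()

# Build a readable "start - end" label for each bin.
start = sampled["Datetime"]
end = start + window
sampled["4-Hour-window"] = start.dt.strftime("%H:%M:%S") + " - " + end.dt.strftime("%H:%M:%S")

print(sampled)
```

Selecting the Count column before aggregating keeps unrelated numeric columns such as an ID out of the sum.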

Related

Pandas counter that counts by skipping a row and reset on different values

Hi, I am trying to create a counter on my Trend column that compares each row with the row two positions earlier (that is, skipping one row) and resets when the string values differ. For example, row 9 gets a count of 2 because row 7, the previous row after skipping one, has the same value and a count of 1. Row 11 resets to 1 because its value differs from row 9.
Is there any way I could do this?
DateTimeStarted 50% Quantile 50Q shift 2H Trend Count
0 2020-12-18 15:00:00 554.0 NaN Flat 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1

You can shift() the Trend column by 2 and check if it equals Trend:
df['Counter'] = df.Trend.shift(2).eq(df.Trend).astype(int).add(1)
I named it Counter here for comparison:
DateTimeStarted 50%Quantile 50Qshift2H Trend Count Counter
0 2020-12-18 15:00:00 554.0 NaN Flat 1 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1 1
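The shift-and-compare idea can be tried in isolation with a short sketch. The Trend values below are the first twelve rows of the table in the question; the other columns are left out for brevity.

```python
import pandas as pd

# Trend values for rows 0-11 of the question's table.
df = pd.DataFrame({
    "Trend": ["Flat", "Flat", "Down", "Down", "Up", "Up",
              "Down", "Down", "Up", "Down", "Up", "Up"],
})

# Compare each row with the row two positions earlier:
# a match gives True (1), a mismatch or missing value gives False (0); adding 1 yields 2 or 1.
df["Counter"] = df["Trend"].shift(2).eq(df["Trend"]).astype(int).add(1)

print(df)
```

Note that this counter only ever takes the values 1 and 2, since it checks a single earlier row rather than accumulating over a longer run of matches.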

Replace values of duplicate rows based on the same datetime in another column but keep the last row unchanged

Hi, I have a data frame that looks like this. For rows that share the same datetime, I need to keep the last row as 1 and replace the remaining ones with 0. Is there any way to do this?
DateTimeStarted Value
0 2020-12-19 16:00:00 1
1 2020-12-19 16:00:00 1
2 2020-12-19 16:00:00 1
3 2020-12-19 16:00:00 1
4 2020-12-19 16:00:00 1
5 2020-12-19 16:00:00 1
6 2020-12-19 16:00:00 1
7 2020-12-19 16:00:00 1
8 2020-12-19 16:00:00 1
9 2020-12-19 16:00:00 1
10 2020-12-19 16:00:00 1
11 2020-12-19 16:00:00 1
12 2020-12-19 16:00:00 1
13 2020-12-19 16:00:00 1
14 2020-12-19 16:00:00 1
15 2020-12-19 16:00:00 1
16 2020-12-19 16:00:00 1
17 2020-12-19 16:00:00 1
18 2020-12-19 16:00:00 1
19 2020-12-26 18:00:00 1
20 2020-12-26 18:00:00 1
21 2020-12-27 13:00:00 0
22 2020-12-27 14:00:00 0
23 2020-12-27 15:00:00 0
24 2020-12-27 15:00:00 0
25 2020-12-27 17:00:00 0
The solution should look like this. Values that are already 0 should remain unchanged.
DateTimeStarted Value
0 2020-12-19 16:00:00 0
1 2020-12-19 16:00:00 0
2 2020-12-19 16:00:00 0
3 2020-12-19 16:00:00 0
4 2020-12-19 16:00:00 0
5 2020-12-19 16:00:00 0
6 2020-12-19 16:00:00 0
7 2020-12-19 16:00:00 0
8 2020-12-19 16:00:00 0
9 2020-12-19 16:00:00 0
10 2020-12-19 16:00:00 0
11 2020-12-19 16:00:00 0
12 2020-12-19 16:00:00 0
13 2020-12-19 16:00:00 0
14 2020-12-19 16:00:00 0
15 2020-12-19 16:00:00 0
16 2020-12-19 16:00:00 0
17 2020-12-19 16:00:00 0
18 2020-12-19 16:00:00 1
19 2020-12-26 18:00:00 0
20 2020-12-26 18:00:00 1
21 2020-12-27 13:00:00 0
22 2020-12-27 14:00:00 0
23 2020-12-27 15:00:00 0
24 2020-12-27 15:00:00 0
25 2020-12-27 17:00:00 0

Try this:
((~df.DateTimeStarted.duplicated(keep='last')) & (df.Value.ne(0))).astype(int)
Output:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 1
19 0
20 1
21 0
22 0
23 0
24 0
25 0
dtype: int32
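A small self-contained sketch of the same expression, assuming a shortened, made-up version of the question's data (three distinct datetimes instead of the full 26 rows), with the result written back to the Value column:

```python
import pandas as pd

# Shortened, made-up version of the question's data.
df = pd.DataFrame({
    "DateTimeStarted": pd.to_datetime(
        ["2020-12-19 16:00"] * 3 + ["2020-12-26 18:00"] * 2 + ["2020-12-27 15:00"] * 2
    ),
    "Value": [1, 1, 1, 1, 1, 0, 0],
})

# True only for the last row of each datetime.
is_last = ~df["DateTimeStarted"].duplicated(keep="last")

# Keep 1 only where the row is the last of its datetime and was non-zero before.
df["Value"] = (is_last & df["Value"].ne(0)).astype(int)

print(df)
```

The ne(0) check is what keeps rows that were already 0 at 0, even when they are the last row for their datetime.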

Use the duplicated() method and store the last row of each datetime in a variable:
uniquedf=df[~df.duplicated(subset=['DateTimeStarted'],keep='last')]
Now set the 'Value' column of your df to 0:
df['Value']=0
Then use the reindex() and fillna() methods to fill the remaining rows from df:
result=uniquedf.reindex(df.index).fillna(df)
Finally, change the dtype of the 'Value' column back to int with astype():
result['Value']=result['Value'].astype(int)
Now if you print result, you will get your desired output.

Apply timestamp convert to date to multiple columns in Python

I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for one timestamp, but how could I apply it to those two columns?
Thanks for your help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01

Use DataFrame.apply with a list of column names and to_datetime with the parameter unit='ms':
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates add lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
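To see why the unit matters, here is a small sketch using the first two rows of the question's data. Without a unit, pd.to_datetime interprets plain numbers as nanoseconds since the epoch, which is why dividing by 1000 still lands in 1970. The variable names wrong and converted are only illustrative.

```python
import pandas as pd

# First two rows of the question's data (epoch timestamps in milliseconds).
df = pd.DataFrame({
    "id": [1, 2],
    "start_date": [1578448800000, 1582164000000],
    "end_date": [1583632800000, 1582250400000],
})

# Without a unit the numbers are read as nanoseconds, so the result stays in 1970.
wrong = pd.to_datetime(df["start_date"] / 1000)

# With unit="ms" the values are read as milliseconds since the epoch.
cols = ["start_date", "end_date"]
converted = df[cols].apply(pd.to_datetime, unit="ms")

print(wrong.dt.date.tolist())
print(converted)
```

The converted values are timezone-naive timestamps; any conversion to a local timezone would be a separate step.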

How to convert the time difference between two rows from minutes to hours

I have a dataset with a datetime column, and I want the time difference from one row to the next in my CSV file.
I wrote code to get the time difference in minutes. Now I want to convert that difference to hours.
That means:
a time difference of 30 minutes should be 0.5 h
a time difference of 120 minutes should be 2 h
But when I tried it, the result did not match the format I need. I simply divided the time difference by 60.
My code:
df1['time_diff'] = pd.to_datetime(df1["time"])
print(df1['time_diff'])
0 2019-08-09 06:15:00
1 2019-08-09 06:45:00
2 2019-08-09 07:45:00
3 2019-08-09 09:00:00
4 2019-08-09 09:25:00
5 2019-08-09 09:30:00
6 2019-08-09 11:00:00
7 2019-08-09 11:30:00
8 2019-08-09 13:30:00
9 2019-08-09 13:50:00
10 2019-08-09 15:00:00
11 2019-08-09 15:25:00
12 2019-08-09 16:25:00
13 2019-08-09 18:00:00
df1['delta'] = (df1['time_diff']-df1['time_diff'].shift()).fillna(0)
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)
The result was posted as a screenshot, which is not available here.
After dividing by 60:
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)/60
This result was also posted as a screenshot. Comparing the two, the first shows 30 minutes, but after converting to hours the value shows only as 1.
I need 30 minutes to be converted to 0.5 hours.
Expected output:
time_diff in min expected output of time_diff in hour
0 0
30 0.5
60 1
75 1.25
25 0.4167
5 0.083
90 1.5
30 0.5
120 2
20 0.333
70 1.167
25 0.4167
60 1
95 1.583
Can anyone help me to solve this error?

I suggest using Series.dt.total_seconds and dividing by 60 for minutes and by 3600 for hours:
df1['datetimes'] = pd.to_datetime(df1['date']+ ' ' + df1['time'], dayfirst=True)
df1['delta'] = df1['datetimes'].diff().fillna(pd.Timedelta(0))
td = df1['delta'].dt.total_seconds()
df1['time_diff in min'] = td.div(60).astype(int)
df1['time_diff in hour'] = td.div(3600)
print (df1)
datetimes delta time_diff in min time_diff in hour
0 2019-08-09 06:15:00 00:00:00 0 0.000000
1 2019-08-09 06:45:00 00:30:00 30 0.500000
2 2019-08-09 07:45:00 01:00:00 60 1.000000
3 2019-08-09 09:00:00 01:15:00 75 1.250000
4 2019-08-09 09:25:00 00:25:00 25 0.416667
5 2019-08-09 09:30:00 00:05:00 5 0.083333
6 2019-08-09 11:00:00 01:30:00 90 1.500000
7 2019-08-09 11:30:00 00:30:00 30 0.500000
8 2019-08-09 13:30:00 02:00:00 120 2.000000
9 2019-08-09 13:50:00 00:20:00 20 0.333333
10 2019-08-09 15:00:00 01:10:00 70 1.166667
11 2019-08-09 15:25:00 00:25:00 25 0.416667
12 2019-08-09 16:25:00 01:00:00 60 1.000000
13 2019-08-09 18:00:00 01:35:00 95 1.583333
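The same computation as a self-contained sketch, assuming a single column of full datetime strings (the first four timestamps from the question) rather than separate date and time columns. The column names minutes and hours are illustrative.

```python
import pandas as pd

# First four timestamps from the question, as full datetime strings.
df1 = pd.DataFrame({
    "time": ["2019-08-09 06:15", "2019-08-09 06:45",
             "2019-08-09 07:45", "2019-08-09 09:00"],
})

df1["datetimes"] = pd.to_datetime(df1["time"])

# Row-to-row difference; the first row has no predecessor, so fill it with a zero timedelta.
delta = df1["datetimes"].diff().fillna(pd.Timedelta(0))

# Convert the timedeltas to seconds once, then scale to minutes and hours.
seconds = delta.dt.total_seconds()
df1["minutes"] = (seconds / 60).astype(int)
df1["hours"] = seconds / 3600

print(df1)
```

Keeping the hours column as a float is what preserves fractional values such as 0.5.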

Insert missing datetime in DataFrame

I have a pd.DataFrame
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
In the df above, hour 22 is missing. I want every hour included in the DataFrame, like:
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
0 2017-01-01 22:00:00 2017 1 1 7 22
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
How can I build a function that detects the missing hours and inserts them into the DataFrame?

If I understand correctly, you can resample to hourly frequency and average the forward-filled and backward-filled values:
s=df.set_index('utc_time').resample('1H')
(s.ffill()+s.bfill())/2
Out[163]:
year month day weekday hour
utc_time
2017-01-01 21:00:00 2017 1 1 7 21
2017-01-01 22:00:00 2017 1 1 7 22
2017-01-01 23:00:00 2017 1 1 7 23
2017-01-02 00:00:00 2017 1 2 1 0
2017-01-02 01:00:00 2017 1 2 1 1
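Averaging the forward-filled and backward-filled values works for the single gap in this example, but it can give wrong calendar fields when a gap spans midnight (for instance, averaging hour 23 and hour 1). As an alternative sketch, not taken from the answer above, the full hourly range can be built first and the calendar columns derived from it. The weekday convention (Monday = 1, Sunday = 7) is inferred from the question's sample rows.

```python
import pandas as pd

# The four timestamps from the question; hour 22 is missing.
df = pd.DataFrame({
    "utc_time": pd.to_datetime([
        "2017-01-01 21:00", "2017-01-01 23:00",
        "2017-01-02 00:00", "2017-01-02 01:00",
    ]),
})

# Every hour between the first and last timestamp, inclusive.
full_index = pd.date_range(
    df["utc_time"].min(), df["utc_time"].max(), freq=pd.Timedelta(hours=1)
)

# Rebuild the calendar columns directly from the completed timestamps.
out = pd.DataFrame({"utc_time": full_index})
out["year"] = out["utc_time"].dt.year
out["month"] = out["utc_time"].dt.month
out["day"] = out["utc_time"].dt.day
out["weekday"] = out["utc_time"].dt.dayofweek + 1  # Monday=1 ... Sunday=7
out["hour"] = out["utc_time"].dt.hour

print(out)
```

If the original frame carries measured values in addition to the calendar fields, those columns would still need their own fill strategy after a merge on utc_time.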
