I'm totally new to Time Series Analysis and I'm trying to work on examples available online
this is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
'4H',
kind='timestamp',
on='Datetime',
label='left'
).sum()
The function is very similar to groupby and groups the data into chunks of the column specified in on=, in this case we use timestamps and chunks of 4 hours.
Finally, you need to use some kind of disaggregation, in this case sum(), to convert all elements of each group into a single element per timechunk
Related
Hi I am trying to create a counter that counts my trend column by skipping a row and reset itself if the string values are different. For example on row 9 it will count 2 since the previous skipped row it was counted with a 1. But it resets back to one since the value at row 11 is different from row 9.
Is there anyway I could do this?
DateTimeStarted 50% Quantile 50Q shift 2H Trend Count
0 2020-12-18 15:00:00 554.0 NaN Flat 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1
You can shift() the Trend column by 2 and check if it equals Trend:
df['Counter'] = df.Trend.shift(2).eq(df.Trend).astype(int).add(1)
I named it Counter here for comparison:
DateTimeStarted 50%Quantile 50Qshift2H Trend Count Counter
0 2020-12-18 15:00:00 554.0 NaN Flat 1 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1 1
Hi I have a data frame that looks like this. Based on the same datetime, I need to keep the last row as 1 and replace the remaining ones as 0. Is there anyway for me to do this?
DateTimeStarted Value
0 2020-12-19 16:00:00 1
1 2020-12-19 16:00:00 1
2 2020-12-19 16:00:00 1
3 2020-12-19 16:00:00 1
4 2020-12-19 16:00:00 1
5 2020-12-19 16:00:00 1
6 2020-12-19 16:00:00 1
7 2020-12-19 16:00:00 1
8 2020-12-19 16:00:00 1
9 2020-12-19 16:00:00 1
10 2020-12-19 16:00:00 1
11 2020-12-19 16:00:00 1
12 2020-12-19 16:00:00 1
13 2020-12-19 16:00:00 1
14 2020-12-19 16:00:00 1
15 2020-12-19 16:00:00 1
16 2020-12-19 16:00:00 1
17 2020-12-19 16:00:00 1
18 2020-12-19 16:00:00 1
19 2020-12-26 18:00:00 1
20 2020-12-26 18:00:00 1
21 2020-12-27 13:00:00 0
22 2020-12-27 14:00:00 0
23 2020-12-27 15:00:00 0
24 2020-12-27 15:00:00 0
25 2020-12-27 17:00:00 0
The solution should look like this. The values 0 should also remained unchanged.
DateTimeStarted Value
0 2020-12-19 16:00:00 0
1 2020-12-19 16:00:00 0
2 2020-12-19 16:00:00 0
3 2020-12-19 16:00:00 0
4 2020-12-19 16:00:00 0
5 2020-12-19 16:00:00 0
6 2020-12-19 16:00:00 0
7 2020-12-19 16:00:00 0
8 2020-12-19 16:00:00 0
9 2020-12-19 16:00:00 0
10 2020-12-19 16:00:00 0
11 2020-12-19 16:00:00 0
12 2020-12-19 16:00:00 0
13 2020-12-19 16:00:00 0
14 2020-12-19 16:00:00 0
15 2020-12-19 16:00:00 0
16 2020-12-19 16:00:00 0
17 2020-12-19 16:00:00 0
18 2020-12-19 16:00:00 1
19 2020-12-26 18:00:00 0
20 2020-12-26 18:00:00 1
21 2020-12-27 13:00:00 0
22 2020-12-27 14:00:00 0
23 2020-12-27 15:00:00 0
24 2020-12-27 15:00:00 0
25 2020-12-27 17:00:00 0
Try this:
((~df.DateTimeStarted.duplicated(keep='last')) & (df.Value.ne(0))).astype(int)
Output:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 1
19 0
20 1
21 0
22 0
23 0
24 0
25 0
dtype: int32
Just use duplicated() method and stored unique value in a variable:
uniquedf=df[~df.duplicated(subset=['DateTimeStarted'],keep='last')]
Now set 'Value' column of you df equal to 0:
df['Value']=0
Then make use of reindex() method and fillna() method:
result=uniquedf.reindex(df.index).fillna(df)
Finally change the dtype of 'Value' column by astype() method:
result['Value']=result['Value'].astype(int)
Now if you print result you will get your desired output
I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for one timestamp, but how could I apply it to those two columns?
Thanks for your kind helps.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also try with pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()) which give a incorrect result.
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
Use DataFrame.apply with list of columns names and to_datetime with parameter unit='ms':
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates add lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
Here I have dataset with datetime. Here I want to get time different value row by row in my csv file.
So I wrote the code to get the time different value in minutes. Then I want to convert that time different in hour.
That means;
if time difference value is 30 minutes. in hours 0.5h
if 120 min > 2h
But when I tried to it, it doesn't match with my required format. I just divide that time difference with 60.
my code:
df1['time_diff'] = pd.to_datetime(df1["time"])
print(df1['time_diff'])
0 2019-08-09 06:15:00
1 2019-08-09 06:45:00
2 2019-08-09 07:45:00
3 2019-08-09 09:00:00
4 2019-08-09 09:25:00
5 2019-08-09 09:30:00
6 2019-08-09 11:00:00
7 2019-08-09 11:30:00
8 2019-08-09 13:30:00
9 2019-08-09 13:50:00
10 2019-08-09 15:00:00
11 2019-08-09 15:25:00
12 2019-08-09 16:25:00
13 2019-08-09 18:00:00
df1['delta'] = (df1['time_diff']-df1['time_diff'].shift()).fillna(0)
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)
then the result:
After dividing by 60:
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)/60
result:
comparing each two images you can see in my first picture 30 min is there when I tries to convert into hours it is not showing and it just showing 1 only.
But have to convert 30 min as 0.5 hr.
Expected output:
[![
time_diff in min expected output of time_diff in hour
0 0
30 0.5
60 1
75 1.25
25 0.4167
5 0.083
90 1.5
30 0.5
120 2
20 0.333
70 1.33
25 0.4167
60 1
95 1.583
Can anyone help me to solve this error?
I suggest use Series.dt.total_seconds with divide by 60 and 3600:
df1['datetimes'] = pd.to_datetime(df1['date']+ ' ' + df1['time'], dayfirst=True)
df1['delta'] = df1['datetimes'].diff().fillna(pd.Timedelta(0))
td = df1['delta'].dt.total_seconds()
df1['time_diff in min'] = td.div(60).astype(int)
df1['time_diff in hour'] = td.div(3600)
print (df1)
datetimes delta time_diff in min time_diff in hour
0 2019-08-09 06:15:00 00:00:00 0 0.000000
1 2019-08-09 06:45:00 00:30:00 30 0.500000
2 2019-08-09 07:45:00 01:00:00 60 1.000000
3 2019-08-09 09:00:00 01:15:00 75 1.250000
4 2019-08-09 09:25:00 00:25:00 25 0.416667
5 2019-08-09 09:30:00 00:05:00 5 0.083333
6 2019-08-09 11:00:00 01:30:00 90 1.500000
7 2019-08-09 11:30:00 00:30:00 30 0.500000
8 2019-08-09 13:30:00 02:00:00 120 2.000000
9 2019-08-09 13:50:00 00:20:00 20 0.333333
10 2019-08-09 15:00:00 01:10:00 70 1.166667
11 2019-08-09 15:25:00 00:25:00 25 0.416667
12 2019-08-09 16:25:00 01:00:00 60 1.000000
13 2019-08-09 18:00:00 01:35:00 95 1.583333
I have a pd.DataFrame
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
In the df above, hour 22 doesn't show up. I want every hour include in the dataframe, like:
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
0 2017-01-01 22:00:00 2017 1 1 7 22
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
How to build function to detect the missing hour and insert into the dataframe ?
IIUC resample +bfill and ffill
s=df.set_index('utc_time').resample('1H')
(s.ffill()+s.bfill())/2
Out[163]:
year month day weekday hour
utc_time
2017-01-01 21:00:00 2017 1 1 7 21
2017-01-01 22:00:00 2017 1 1 7 22
2017-01-01 23:00:00 2017 1 1 7 23
2017-01-02 00:00:00 2017 1 2 1 0
2017-01-02 01:00:00 2017 1 2 1 1