I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for a single timestamp, but how can I apply it to both columns?
Thanks for your kind help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
Use DataFrame.apply with a list of column names and to_datetime with the parameter unit='ms':
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: To get dates only, add a lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
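As an aside, the 1970-01-01 dates from pd.to_datetime(df['start_date']/1000) come from the default unit: without unit='ms' the numbers are interpreted as nanoseconds since the epoch, so even after dividing by 1000 they all land within the first couple of seconds of 1970. A minimal sketch to verify, using two of the sample values:
import pandas as pd

df = pd.DataFrame({'start_date': [1578448800000, 1582164000000]})

# default unit is nanoseconds -> everything collapses onto 1970-01-01
print (pd.to_datetime(df['start_date'] / 1000).dt.date)

# unit='ms' -> the intended dates
print (pd.to_datetime(df['start_date'], unit='ms').dt.date)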
I have a dataframe whose date column is at 15-minute intervals. I want to find the missing datetime intervals; id should be copied from the previous row, but value should be NaN.
date value id
2021-12-02 07:00:00 12456677 693214
2021-01-02 07:30:00 12456677 693214
2021-01-02 07:45:00 12456677 693214
2021-01-02 08:00:00 12456677 693214
2021-01-02 08:15:00 12456665 693215
2021-01-02 08:45:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
The expected output is:
date value id
2021-01-02 08:30:00 NAN 693215
2021-01-02 09:15:00 NAN 693217
2021-01-03 09:30:00 NAN 693217
2021-01-03 09:45:00 NAN 693217
2021-01-03 10:00:00 NAN 693217
I am trying:
df['Datetime'] = pd.to_datetime(df['date'])
df[ df['Datetime'].diff() > pd.Timedelta('15min') ]
but it just gives the times after which a date is missing, not the missing dates and times themselves. It shows this output:
date value id
2021-01-02 08:15:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
Can someone please guide me on how to extract the missing dates and times?
Thanks in advance.
Use Series.asfreq per group to get the missing intervals:
# create a DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# add a 15-minute index per day and per id
df1 = (df.groupby([pd.Grouper(freq='D'), 'id'])['value']
         .apply(lambda x: x.asfreq('15min'))
         .reset_index(level=0, drop=True)
         .reset_index())
print (df1)
id date value
0 693214 2021-01-02 07:30:00 12456677.0
1 693214 2021-01-02 07:45:00 12456677.0
2 693214 2021-01-02 08:00:00 12456677.0
3 693215 2021-01-02 08:15:00 12456665.0
4 693215 2021-01-02 08:30:00 NaN
5 693215 2021-01-02 08:45:00 12456665.0
6 693217 2021-01-03 09:00:00 12456666.0
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
12 693217 2021-01-03 10:30:00 12456666.0
13 693214 2021-12-02 07:00:00 12456677.0
Then select the rows with missing values using boolean indexing:
df2 = df1[df1['value'].isna()]
print (df2)
id date value
4 693215 2021-01-02 08:30:00 NaN
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
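For a quick end-to-end check, here is a minimal self-contained sketch that rebuilds the sample data from the question and runs the same steps:
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-12-02 07:00:00', '2021-01-02 07:30:00', '2021-01-02 07:45:00',
             '2021-01-02 08:00:00', '2021-01-02 08:15:00', '2021-01-02 08:45:00',
             '2021-01-03 09:00:00', '2021-01-03 10:30:00'],
    'value': [12456677, 12456677, 12456677, 12456677,
              12456665, 12456665, 12456666, 12456666],
    'id': [693214, 693214, 693214, 693214, 693215, 693215, 693217, 693217],
})

# expand to a 15-minute grid per day and per id, then keep only the missing rows
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df1 = (df.groupby([pd.Grouper(freq='D'), 'id'])['value']
         .apply(lambda x: x.asfreq('15min'))
         .reset_index(level=0, drop=True)
         .reset_index())
print (df1[df1['value'].isna()])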
Hi, I am trying to create a counter that counts my Trend column by skipping a row and resets itself if the string values are different. For example, row 9 counts 2 since the row two positions back (row 7) has the same value and was counted with a 1, but row 11 resets back to 1 since its value is different from row 9's.
Is there any way I could do this?
DateTimeStarted 50% Quantile 50Q shift 2H Trend Count
0 2020-12-18 15:00:00 554.0 NaN Flat 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1
You can shift() the Trend column by 2 and check if it equals Trend:
df['Counter'] = df.Trend.shift(2).eq(df.Trend).astype(int).add(1)
I named it Counter here for comparison:
DateTimeStarted 50%Quantile 50Qshift2H Trend Count Counter
0 2020-12-18 15:00:00 554.0 NaN Flat 1 1
1 2020-12-18 16:00:00 593.0 NaN Flat 1 1
2 2020-12-18 17:00:00 534.0 554.0 Down 1 1
3 2020-12-18 18:00:00 562.0 593.0 Down 1 1
4 2020-12-18 19:00:00 552.0 534.0 Up 1 1
5 2020-12-18 20:00:00 592.0 562.0 Up 1 1
6 2020-12-19 08:00:00 511.0 552.0 Down 1 1
7 2020-12-19 09:00:00 584.0 592.0 Down 1 1
8 2020-12-19 10:00:00 576.0 511.0 Up 1 1
9 2020-12-19 11:00:00 545.5 584.0 Down 2 2
10 2020-12-19 12:00:00 609.5 576.0 Up 2 2
11 2020-12-19 13:00:00 548.0 545.5 Up 1 1
12 2020-12-19 14:00:00 565.0 609.5 Down 1 1
13 2020-12-19 15:00:00 575.0 548.0 Up 2 2
14 2020-12-19 16:00:00 570.0 565.0 Up 1 1
15 2020-12-19 17:00:00 557.0 575.0 Down 1 1
16 2020-12-19 18:00:00 578.0 570.0 Up 2 2
17 2020-12-19 19:00:00 578.5 557.0 Up 1 1
18 2020-12-21 08:00:00 543.0 578.0 Down 1 1
19 2020-12-21 09:00:00 558.0 578.5 Down 1 1
20 2020-12-21 10:00:00 570.0 543.0 Up 1 1
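The logic is easiest to see on a small toy Trend series (hypothetical values, not the frame above): shift(2) looks two rows back, eq marks equal values, and astype(int).add(1) maps True/False to 2/1.
import pandas as pd

trend = pd.Series(['Down', 'Down', 'Up', 'Down', 'Up', 'Up'], name='Trend')

# row equals the row two positions back -> 2, otherwise (including the NaN from shift) -> 1
counter = trend.shift(2).eq(trend).astype(int).add(1)
print (pd.concat([trend, counter.rename('Counter')], axis=1))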
I'm totally new to time series analysis and I'm trying to work through examples available online.
This is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function; see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
    '4H',
    kind='timestamp',
    on='Datetime',
    label='left'
).sum()
The function is very similar to groupby: it splits the data into chunks based on the column specified in on=, in this case timestamps in chunks of 4 hours.
Finally, you need some kind of aggregation, in this case sum(), to combine all elements of each group into a single value per time chunk.
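If you also want the 4-Hour-window label column from your expected output, here is a hedged sketch (not tested against Train_SU63ISt.csv; it assumes the data frame with the Datetime and Count columns from your snippet) that builds the label from the left edge of each bin:
import pandas as pd

# aggregate Count into 4-hour bins, keeping the left bin edge as the timestamp
sampled = (data.resample('4H', on='Datetime', label='left')['Count']
               .sum()
               .reset_index())

# label each bin as "start - end"; the end wraps to 00:00:00 after the 20:00 bin
start = sampled['Datetime'].dt.strftime('%H:%M:%S')
end = (sampled['Datetime'] + pd.Timedelta(hours=4)).dt.strftime('%H:%M:%S')
sampled['4-Hour-window'] = start + ' - ' + end
print (sampled.head())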
df['index_day'] = df.index.floor('d')
My dataframe (df.head()) is:
index_day P2_Qa ... P2_Qcon P2_m
2019-01-10 17:00:00 2019-01-10 93.599342 ... 107.673342 14.962424
2019-01-10 17:01:00 2019-01-10 90.833884 ... 104.658384 14.343642
2019-01-10 17:02:00 2019-01-10 90.907001 ... 104.601001 14.568892
2019-01-10 17:03:00 2019-01-10 93.579973 ... 107.115473 14.884902
2019-01-10 17:04:00 2019-01-10 93.688072 ... 107.168072 14.831412
I'm looping over every day:
import seaborn as sns
import matplotlib.pyplot as plt

j = 0
for day, i in df.groupby('index_day'):
    sns.jointplot(x='P2_Tam', y='P2_Qa', data=i, kind='reg')
    j = j + 1
    plt.savefig(str(j) + '.png')
This gives me one regression plot per 24-hour day. However, I want such plots for nights only: one loop iteration = one night = one plot, covering 18:00 until 6:00 the next morning, rather than one iteration per 24-hour day. How do I do that?
I think you can first filter to nights only with DataFrame.between_time and then group by 12H with base=6:
rng = pd.date_range('2017-04-03', periods=35, freq='H')
df = pd.DataFrame({'a': range(35)}, index=rng)
df = df.between_time('18:00:01', '6:00')
print (df)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
for i, g in df.groupby(pd.Grouper(freq='12H', base=6, closed='right')):
    if not g.empty:
        print (g)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
a
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
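Small note: the base= argument was deprecated in pandas 1.1 and removed later, so on a recent pandas the equivalent grouping (to the best of my knowledge) uses offset=:
# same 12-hour bins anchored at 06:00, for pandas versions without base=
for i, g in df.groupby(pd.Grouper(freq='12H', offset='6H', closed='right')):
    if not g.empty:
        print (g)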
EDIT:
If you want to select the 12 hours after a given start time, one possible solution is DataFrame.truncate:
rng = pd.date_range('2017-04-03', periods=35, freq='2H')
df = pd.DataFrame({'a': range(35)}, index=rng)
dates = df.index.floor('d').unique()
for s, e in zip(dates + pd.Timedelta(18, unit='H'),
                dates + pd.Timedelta(30, unit='H')):
    df1 = df.truncate(s, e)
    if not df1.empty:
        print (df1)
a
2017-04-03 18:00:00 9
2017-04-03 20:00:00 10
2017-04-03 22:00:00 11
2017-04-04 00:00:00 12
2017-04-04 02:00:00 13
2017-04-04 04:00:00 14
2017-04-04 06:00:00 15
a
2017-04-04 18:00:00 21
2017-04-04 20:00:00 22
2017-04-04 22:00:00 23
2017-04-05 00:00:00 24
2017-04-05 02:00:00 25
2017-04-05 04:00:00 26
2017-04-05 06:00:00 27
a
2017-04-05 18:00:00 33
2017-04-05 20:00:00 34
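Tying this back to the plotting loop, here is a hedged sketch of one regression plot per night (it assumes df has a DatetimeIndex and the P2_Tam and P2_Qa columns; the file names are illustrative):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# one plot per night window (18:00 -> 06:00 of the next day)
dates = df.index.floor('d').unique()
starts = dates + pd.Timedelta(18, unit='H')
ends = dates + pd.Timedelta(30, unit='H')
for j, (s, e) in enumerate(zip(starts, ends), start=1):
    night = df.truncate(s, e)
    if not night.empty:
        sns.jointplot(x='P2_Tam', y='P2_Qa', data=night, kind='reg')
        plt.savefig(f'night_{j}.png')
        plt.close()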
I have a dataset with datetimes, and I want to get the time difference row by row in my CSV file.
So I wrote code to get the time difference in minutes, and then I want to convert that difference into hours.
That means:
if the time difference is 30 minutes, in hours it is 0.5 h
if it is 120 min, then 2 h
But when I tried it, the result doesn't match my required format; I just divided the time difference by 60.
My code:
df1['time_diff'] = pd.to_datetime(df1["time"])
print(df1['time_diff'])
0 2019-08-09 06:15:00
1 2019-08-09 06:45:00
2 2019-08-09 07:45:00
3 2019-08-09 09:00:00
4 2019-08-09 09:25:00
5 2019-08-09 09:30:00
6 2019-08-09 11:00:00
7 2019-08-09 11:30:00
8 2019-08-09 13:30:00
9 2019-08-09 13:50:00
10 2019-08-09 15:00:00
11 2019-08-09 15:25:00
12 2019-08-09 16:25:00
13 2019-08-09 18:00:00
df1['delta'] = (df1['time_diff']-df1['time_diff'].shift()).fillna(0)
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)
Then the result gives the time difference in minutes.
After dividing by 60:
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64')% (24*60)/60
Comparing the two results, the first one contains 30 min, but when I try to convert it into hours the 0.5 does not appear; it just shows 1. It has to convert 30 min to 0.5 hr.
Expected output:
time_diff in min    expected time_diff in hour
0 0
30 0.5
60 1
75 1.25
25 0.4167
5 0.083
90 1.5
30 0.5
120 2
20 0.333
70 1.33
25 0.4167
60 1
95 1.583
Can anyone help me solve this?
I suggest using Series.dt.total_seconds and dividing by 60 and 3600:
df1['datetimes'] = pd.to_datetime(df1['date']+ ' ' + df1['time'], dayfirst=True)
df1['delta'] = df1['datetimes'].diff().fillna(pd.Timedelta(0))
td = df1['delta'].dt.total_seconds()
df1['time_diff in min'] = td.div(60).astype(int)
df1['time_diff in hour'] = td.div(3600)
print (df1)
datetimes delta time_diff in min time_diff in hour
0 2019-08-09 06:15:00 00:00:00 0 0.000000
1 2019-08-09 06:45:00 00:30:00 30 0.500000
2 2019-08-09 07:45:00 01:00:00 60 1.000000
3 2019-08-09 09:00:00 01:15:00 75 1.250000
4 2019-08-09 09:25:00 00:25:00 25 0.416667
5 2019-08-09 09:30:00 00:05:00 5 0.083333
6 2019-08-09 11:00:00 01:30:00 90 1.500000
7 2019-08-09 11:30:00 00:30:00 30 0.500000
8 2019-08-09 13:30:00 02:00:00 120 2.000000
9 2019-08-09 13:50:00 00:20:00 20 0.333333
10 2019-08-09 15:00:00 01:10:00 70 1.166667
11 2019-08-09 15:25:00 00:25:00 25 0.416667
12 2019-08-09 16:25:00 01:00:00 60 1.000000
13 2019-08-09 18:00:00 01:35:00 95 1.583333
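As a quick self-contained check of the conversion (timestamps taken from the question):
import pandas as pd

times = pd.Series(pd.to_datetime(['2019-08-09 06:15:00', '2019-08-09 06:45:00',
                                  '2019-08-09 07:45:00', '2019-08-09 09:00:00']))
seconds = times.diff().fillna(pd.Timedelta(0)).dt.total_seconds()

print (seconds.div(60).astype(int).tolist())   # [0, 30, 60, 75] minutes
print (seconds.div(3600).round(4).tolist())    # [0.0, 0.5, 1.0, 1.25] hours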