df['index_day'] = df.index.floor('d')
My dataframe looks like this (output of df.head()):
index_day P2_Qa ... P2_Qcon P2_m
2019-01-10 17:00:00 2019-01-10 93.599342 ... 107.673342 14.962424
2019-01-10 17:01:00 2019-01-10 90.833884 ... 104.658384 14.343642
2019-01-10 17:02:00 2019-01-10 90.907001 ... 104.601001 14.568892
2019-01-10 17:03:00 2019-01-10 93.579973 ... 107.115473 14.884902
2019-01-10 17:04:00 2019-01-10 93.688072 ... 107.168072 14.831412
I'm looping over every day:
j = 0
for day, i in df.groupby('index_day'):
    sns.jointplot(x='P2_Tam', y='P2_Qa', data=i, kind='reg')
    j = j + 1
    plt.savefig(str(j) + '.png')  # j is an int, so convert it before concatenating
This gives me one regression plot per 24-hour day. However, I want such plots for nights only: each loop iteration should cover one night, from 18:00 until 06:00 the next morning, so that one night = one loop = one plot. How do I do that?
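I think you can first filter with DataFrame.between_time to keep only the nights and then loop over 12H groups with base=6 (base was deprecated in pandas 1.1 and removed in 2.0; on newer versions pass offset='6H' instead):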
rng = pd.date_range('2017-04-03', periods=35, freq='H')
df = pd.DataFrame({'a': range(35)}, index=rng)
df = df.between_time('18:00:01', '6:00')
print (df)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
for i, g in df.groupby(pd.Grouper(freq='12H', base=6, closed='right')):
    if not g.empty:
        print (g)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
a
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
EDIT:
If you want to select the 12 hours following a start time, one possible solution uses DataFrame.truncate:
rng = pd.date_range('2017-04-03', periods=35, freq='2H')
df = pd.DataFrame({'a': range(35)}, index=rng)
dates = df.index.floor('d').unique()
for s, e in zip(dates + pd.Timedelta(18, unit='H'),
                dates + pd.Timedelta(30, unit='H')):
    df1 = df.truncate(s, e)
    if not df1.empty:
        print (df1)
a
2017-04-03 18:00:00 9
2017-04-03 20:00:00 10
2017-04-03 22:00:00 11
2017-04-04 00:00:00 12
2017-04-04 02:00:00 13
2017-04-04 04:00:00 14
2017-04-04 06:00:00 15
a
2017-04-04 18:00:00 21
2017-04-04 20:00:00 22
2017-04-04 22:00:00 23
2017-04-05 00:00:00 24
2017-04-05 02:00:00 25
2017-04-05 04:00:00 26
2017-04-05 06:00:00 27
a
2017-04-05 18:00:00 33
2017-04-05 20:00:00 34
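Another way to get one group per night, sketched here as an alternative to the Grouper and truncate approaches rather than as the answer's own method: shift every timestamp back by 18 hours and floor it to the day, so that 18:00 through 06:00 of the next morning share one key. The hourly toy data and the names nights and night_start are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical hourly data covering three full days.
rng = pd.date_range("2017-04-03", periods=72, freq="h")
df = pd.DataFrame({"a": range(72)}, index=rng)

# Keep only the rows between 18:00 and 06:00 (both ends included by default).
nights = df.between_time("18:00", "06:00")

# Shifting back by 18 hours maps 18:00 to midnight of the same day and
# 06:00 of the next morning to 12:00 of that same day, so flooring to the
# day labels every row with the date on which its night started.
night_start = (nights.index - pd.Timedelta(hours=18)).floor("D")

sizes = nights.groupby(night_start).size()
print(sizes)

for start, g in nights.groupby(night_start):
    print(start.date(), g.index.min(), g.index.max())
```

Inside that loop, each group g could be passed to sns.jointplot(..., data=g) the same way the question does per day. The first and last groups are partial nights because the toy data starts at midnight and stops at 23:00.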
I have a dataframe that looks like this; its index is a NumPy datetime64:
(index) data
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
…… ……
2017-01-04 00:00:00 73
2017-01-04 01:00:00 nan
2017-01-04 02:00:00 75
…… ……
Now I want to split the data into consecutive, non-overlapping windows, each 72 hours wide, like this:
windows1:
(index) data
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
…… ……
2017-01-03 23:00:00 72
windows2:
(index) data
2017-01-04 00:00:00 73
# data of 2017-01-04 01:00:00 is nan, removed
2017-01-04 02:00:00 75
…… ……
2017-01-06 23:00:00 144
So how can I do this with DataFrame.rolling or Series.rolling? If there is no easy answer, I will work with the index directly.
A 72H rolling window can be computed with df.rolling('72H').sum() (or any aggregation other than sum).
But it looks like you don't want a rolling window but rather a groupby with floor:
for k,g in df.groupby(df.index.floor('72H')):
    print(f'New group: {k}\n', g.head(), '\n')
output:
New group: 2016-12-31 00:00:00
data
index
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
2017-01-01 03:00:00 4
2017-01-01 04:00:00 5
New group: 2017-01-03 00:00:00
data
index
2017-01-03 00:00:00 49
2017-01-03 01:00:00 50
2017-01-03 02:00:00 51
2017-01-03 03:00:00 52
2017-01-03 04:00:00 53
To compute, for example, the mean:
df.groupby(df.index.floor('72H')).mean()
data
index
2016-12-31 24.5
2017-01-03 73.0
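Alternative, which counts the 72H windows from the first timestamp instead of from the epoch-aligned boundaries that floor uses: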
group = (df.index-df.index[0])//pd.Timedelta('72H')
df.groupby(group).mean()
Used input:
df = pd.DataFrame({'index': pd.date_range('2017-01-01', '2017-01-05', freq='1H'),
                   'data': np.arange(1, 98)}).set_index('index')
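To make the difference between the two groupings concrete, here is a small sketch on exactly six days of hourly data (an assumed input, chosen so that two full 72-hour windows exist). floor aligns its bins to multiples of 72 hours counted from 1970-01-01, while the integer-division key starts counting at the first row.

```python
import numpy as np
import pandas as pd

# Assumed input: 144 hourly rows, i.e. exactly two 72-hour windows.
idx = pd.date_range("2017-01-01", periods=144, freq="h")
df = pd.DataFrame({"data": np.arange(1, 145)}, index=idx)

window = pd.Timedelta(hours=72)

# Window number counted from the first timestamp: 0 for the first 72 hours,
# 1 for the next 72 hours, and so on.
anchored_key = (df.index - df.index[0]) // window
anchored_sizes = df.groupby(anchored_key).size()

# floor() snaps to epoch-aligned 72-hour boundaries, which do not have to
# coincide with the first row of the data.
floored_sizes = df.groupby(df.index.floor("72h")).size()

print(anchored_sizes)
print(floored_sizes)
```

With this input the anchored key yields two windows of 72 rows each, whereas the floored key splits the same rows into three uneven groups. Rows with NaN can be dropped inside each window with g.dropna() if needed.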
I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for a single timestamp, but how can I apply it to both columns?
Thanks for your help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
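Use DataFrame.apply with a list of column names and to_datetime with the parameter unit='ms' (without unit, to_datetime reads the numbers as nanoseconds since the epoch, which is why every value landed on 1970-01-01):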
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates only, add a lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
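As a quick check of why the attempt in the question returned 1970-01-01, the sketch below converts the first two rows of the sample both ways. The variable names wrong and right are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "start_date": [1578448800000, 1582164000000],
    "end_date": [1583632800000, 1582250400000],
})

# Without a unit, the numbers are read as nanoseconds since 1970-01-01,
# so values around 1.5e9 are only about 1.5 seconds after the epoch.
wrong = pd.to_datetime(df["start_date"] / 1000)

# With unit="ms" the original integers are read as milliseconds.
cols = ["start_date", "end_date"]
right = df[cols].apply(pd.to_datetime, unit="ms")

print(wrong.dt.date.tolist())
print(right)
```

The converted values are naive timestamps in UTC; if the source system recorded local time, a timezone conversion would still be needed afterwards.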
I'm totally new to time series analysis and I'm trying to work through examples available online.
This is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-26 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
    '4H',
    on='Datetime',
    label='left'
).sum()
The function is very similar to groupby: it groups the rows into time chunks based on the column given in on=; here we use the Datetime column and chunks of 4 hours.
Finally, you need some kind of aggregation, in this case sum(), to reduce all elements of each group to a single value per time chunk.
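A minimal sketch of that idea on made-up data (the real Train_SU63ISt.csv is not used here): it sums only the Count column per 4-hour chunk and then builds the window label the question asks for. The column name 4-Hour-window comes from the question; the toy values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two days of hourly counts.
data = pd.DataFrame({
    "Datetime": pd.date_range("2012-08-25", periods=48, freq="h"),
    "Count": range(48),
})

# Sum the counts in 4-hour chunks, labelled by the left edge of each chunk.
windows = (
    data.resample("4h", on="Datetime", label="left")["Count"]
    .sum()
    .reset_index()
)

# Build a "start - end" label from the left edge and the left edge + 4 hours.
end = windows["Datetime"] + pd.Timedelta(hours=4)
windows["4-Hour-window"] = (
    windows["Datetime"].dt.strftime("%H:%M:%S")
    + " - "
    + end.dt.strftime("%H:%M:%S")
)

print(windows.head(6))
```

Selecting the Count column before aggregating avoids summing unrelated columns such as ID or Hour.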
Before coming here to ask, I searched a lot on the internet and in the documentation.
My problem is as follows:
I have a dataframe like this:
date dir vel
0 2006-02-12 17:00:00 181.00 3.92
1 2006-02-12 19:00:00 17.88 5.10
2 2006-02-12 21:00:00 214.75 3.73
3 2006-02-13 00:00:00 165.53 2.16
4 2006-02-13 01:00:00 189.44 2.94
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
18 2006-02-13 21:00:00 62.88 1.76
19 2006-02-13 22:00:00 188.03 2.94
I need to find the runs of consecutive (hourly) dates and drop every run that contains fewer than 3 dates. The result should be the following dataframe:
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
So far I have used the following script (inspired by this answer: Find group of consecutive dates in Pandas DataFrame)
(note: the DataFrame is named estreito):
dt = estreito['date']
hour = pd.Timedelta('1H')
in_block = ((dt - dt.shift(-1)).abs() == hour) | (dt.diff() == hour)
filt = estreito.loc[in_block]
breaks = filt['date'].diff() != hour
groups = breaks.cumsum()
for _, frame in filt.groupby(groups):
    print(frame, end='\n\n')
The print output is something like this:
date dir vel
3 2006-02-13 00:00:00 165.53 2.16
4 2006-02-13 01:00:00 189.44 2.94
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
date dir vel
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
How can I save the output in a new DataFrame, filtering out the groups with fewer than 3 consecutive dates?
Is there a different way to do this analysis? Perhaps there is an easier way to get the desired result.
Thanks in advance.
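Use diff with cumsum to create a group key that increments whenever the gap to the previous row is not exactly one hour:
s=df.date.diff().ne(pd.Timedelta(hours=1)).cumsum()
Then use transform('count') on that key to get the length of each run, and slice the original df, keeping runs of at least 3 rows:
df[s.groupby(s).transform('count').ge(3)]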
Out[983]:
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
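The same run-length idea can be tried on a small made-up frame. This is a sketch under assumptions: the dates below are invented, and the key compares the full time difference with one hour, because the seconds component alone reports 3600 for a gap of one day and one hour as well.

```python
import pandas as pd

# Invented timestamps: two singletons, a run of 2, a run of 3,
# and a last row that is 25 hours after the previous one.
dates = pd.to_datetime([
    "2006-02-12 17:00", "2006-02-12 19:00",
    "2006-02-13 00:00", "2006-02-13 01:00",
    "2006-02-13 04:00", "2006-02-13 05:00", "2006-02-13 06:00",
    "2006-02-14 07:00",
])
df = pd.DataFrame({"date": dates, "vel": range(len(dates))})

gaps = df["date"].diff()

# A new run starts wherever the gap is not exactly one hour
# (the first row has a NaT gap, which also counts as "not one hour").
run_id = gaps.ne(pd.Timedelta(hours=1)).cumsum()

# Length of the run each row belongs to, aligned with the original rows.
run_len = run_id.groupby(run_id).transform("count")

# Keep only the rows that sit in a run of at least 3 consecutive hours.
kept = df[run_len.ge(3)]
print(kept)

# The seconds component of the 25-hour gap is 3600, just like a 1-hour gap.
print(gaps.dt.seconds.iloc[-1])
```

The threshold in ge(3) is the minimum run length to keep, so it can be adjusted without touching the rest of the logic.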
I would like to run a time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can I set the time so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: The date in the last line of code is automatically added, seen when df['Time'] is executed, so I inserted the same format with date in the last line for 22:00:00.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should treat your timestamps as pd.Timedeltas and add a day to the samples that fall before your start time.
Create some example data:
import numpy as np
import pandas as pd
d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])
df.to_csv('data.csv')
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read the file in, convert the index to timedeltas, add a day to the timedeltas before the start time, then assign the result back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.DatetimeIndex(pd.Timestamp(0) + s)  # offsets from the epoch become datetimes
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
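Put together, the steps can be sketched without the CSV round trip. The five sample times, the start time of 22:00 and the date 2015-06-06 are assumptions taken from the example above; offsets and start are illustrative names.

```python
import pandas as pd

# Time-of-day strings as they might appear in the file, starting at 22:00.
times = ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "09:00:00"]
df = pd.DataFrame({"value": range(len(times))}, index=times)

# Interpret each time of day as an offset from midnight.
offsets = pd.Series(pd.to_timedelta(df.index))

# Anything earlier than the 22:00 start belongs to the following morning,
# so push those offsets forward by one day.
start = pd.Timedelta(hours=22)
offsets = offsets.where(offsets >= start, offsets + pd.Timedelta(days=1))

# Attach the offsets to the date on which the recording started.
df.index = pd.DatetimeIndex(pd.Timestamp("2015-06-06") + offsets)
print(df)
```

Because the index is now increasing, a plot of df starts at 22:00 and ends at 09:00 the next morning, with no wrap-around at midnight.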