Creating a loop of 12 hours in dataframe with timestamp index - python-3.x

df['index_day'] = df.index.floor('d')
My dataframe (the output of df.head()):
index_day P2_Qa ... P2_Qcon P2_m
2019-01-10 17:00:00 2019-01-10 93.599342 ... 107.673342 14.962424
2019-01-10 17:01:00 2019-01-10 90.833884 ... 104.658384 14.343642
2019-01-10 17:02:00 2019-01-10 90.907001 ... 104.601001 14.568892
2019-01-10 17:03:00 2019-01-10 93.579973 ... 107.115473 14.884902
2019-01-10 17:04:00 2019-01-10 93.688072 ... 107.168072 14.831412
I'm looping over every day:
j = 0
for day, i in df.groupby('index_day'):
    sns.jointplot(x='P2_Tam', y='P2_Qa', data=i, kind='reg')
    j = j + 1
    plt.savefig(str(j) + '.png')  # j is an int, so convert it before concatenating
This gives me regression plots for one day (24 hours). However, I want such plots for nights only, looping over 12-hour windows where one night = one loop = one plot, from 18:00 until 6:00 the next morning. How do I loop so that each iteration covers 18:00 to 6:00 of the next day rather than 24 hours of one day?

I think you can first filter only the nights with DataFrame.between_time and then loop in 12-hour groups with base=6:
rng = pd.date_range('2017-04-03', periods=35, freq='H')
df = pd.DataFrame({'a': range(35)}, index=rng)
df = df.between_time('18:00:01', '6:00')
print (df)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
for i, g in df.groupby(pd.Grouper(freq='12H', base=6, closed='right')):
    if not g.empty:
        print (g)
a
2017-04-03 00:00:00 0
2017-04-03 01:00:00 1
2017-04-03 02:00:00 2
2017-04-03 03:00:00 3
2017-04-03 04:00:00 4
2017-04-03 05:00:00 5
2017-04-03 06:00:00 6
a
2017-04-03 19:00:00 19
2017-04-03 20:00:00 20
2017-04-03 21:00:00 21
2017-04-03 22:00:00 22
2017-04-03 23:00:00 23
2017-04-04 00:00:00 24
2017-04-04 01:00:00 25
2017-04-04 02:00:00 26
2017-04-04 03:00:00 27
2017-04-04 04:00:00 28
2017-04-04 05:00:00 29
2017-04-04 06:00:00 30
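As an aside, base= was deprecated in pandas 1.1 in favour of offset=; on a recent pandas the same grouping can be sketched like this (same toy data as above):

```python
import pandas as pd

rng = pd.date_range('2017-04-03', periods=35, freq='h')
df = pd.DataFrame({'a': range(35)}, index=rng)
df = df.between_time('18:00:01', '06:00')

# offset='6h' shifts the 12-hour bin edges to 06:00/18:00, like base=6 did
groups = [g for _, g in df.groupby(pd.Grouper(freq='12h', offset='6h', closed='right'))
          if not g.empty]
```

Each element of groups is then one night (or the leading partial night), matching the output above.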
EDIT:
If you want to select the 12 hours after a start time, one possible solution uses DataFrame.truncate:
rng = pd.date_range('2017-04-03', periods=35, freq='2H')
df = pd.DataFrame({'a': range(35)}, index=rng)
dates = df.index.floor('d').unique()
for s, e in zip(dates + pd.Timedelta(18, unit='H'),
                dates + pd.Timedelta(30, unit='H')):
    df1 = df.truncate(s, e)
    if not df1.empty:
        print (df1)
a
2017-04-03 18:00:00 9
2017-04-03 20:00:00 10
2017-04-03 22:00:00 11
2017-04-04 00:00:00 12
2017-04-04 02:00:00 13
2017-04-04 04:00:00 14
2017-04-04 06:00:00 15
a
2017-04-04 18:00:00 21
2017-04-04 20:00:00 22
2017-04-04 22:00:00 23
2017-04-05 00:00:00 24
2017-04-05 02:00:00 25
2017-04-05 04:00:00 26
2017-04-05 06:00:00 27
a
2017-04-05 18:00:00 33
2017-04-05 20:00:00 34
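Putting this together with the plotting loop from the question (column names P2_Tam and P2_Qa taken from the question; the seaborn call is left as a comment so the sketch runs without a plotting backend), one hedged way to get one file per night is:

```python
import numpy as np
import pandas as pd

# toy hourly stand-in for the question's minute data
idx = pd.date_range('2019-01-10 12:00', periods=48, freq='h')
df = pd.DataFrame({'P2_Tam': np.arange(48.0), 'P2_Qa': np.arange(48.0) * 2}, index=idx)

nights = df.between_time('18:00:01', '06:00')
plots = []
for start, g in nights.groupby(pd.Grouper(freq='12h', offset='6h', closed='right')):
    if g.empty:
        continue
    fname = start.strftime('%Y-%m-%d') + '.png'   # bin label is the 18:00 left edge
    # sns.jointplot(x='P2_Tam', y='P2_Qa', data=g, kind='reg'); plt.savefig(fname)
    plots.append((fname, len(g)))
```

Each iteration sees exactly one 18:00-to-06:00 night, and the file name carries the date the night started.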

Related

How to select data series with special steps by Dataframe.rolling in pandas?

I have a dataframe which looks like this; the index is a numpy datetime64:
(index) data
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
…… ……
2017-01-04 00:00:00 73
2017-01-04 01:00:00 nan
2017-01-04 02:00:00 75
…… ……
Now I want to get the data in rolling windows that are each 72 hours wide, with no intersection between two windows, like this:
windows1:
(index) data
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
…… ……
2017-01-03 23:00:00 72
windows2:
(index) data
2017-01-04 00:00:00 73
# data of 2017-01-04 01:00:00 is nan, removed
2017-01-04 02:00:00 75
…… ……
2017-01-06 23:00:00 144
So how can I realize this with DataFrame.rolling or Series.rolling? If there is no easy answer, I will use the index itself to solve the problem.
A 72-hour rolling window can be achieved with df.rolling('72H').sum() (or any aggregation other than sum).
But it looks like you don't want a rolling window but rather a groupby with floor:
for k, g in df.groupby(df.index.floor('72H')):
    print(f'New group: {k}\n', g.head(), '\n')
output:
New group: 2016-12-31 00:00:00
data
index
2017-01-01 00:00:00 1
2017-01-01 01:00:00 2
2017-01-01 02:00:00 3
2017-01-01 03:00:00 4
2017-01-01 04:00:00 5
New group: 2017-01-03 00:00:00
data
index
2017-01-03 00:00:00 49
2017-01-03 01:00:00 50
2017-01-03 02:00:00 51
2017-01-03 03:00:00 52
2017-01-03 04:00:00 53
To compute, for example, the mean:
df.groupby(df.index.floor('72H')).mean()
data
index
2016-12-31 24.5
2017-01-03 73.0
alternative
group = (df.index-df.index[0])//pd.Timedelta('72H')
df.groupby(group).mean()
Used input:
df = pd.DataFrame({'index': pd.date_range('2017-01-01', '2017-01-05', freq='1H'),
                   'data': np.arange(1, 98)}).set_index('index')
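Note that floor('72H') anchors the bins to the epoch, which is why the first group label above is 2016-12-31 rather than the first timestamp. The elapsed-time alternative anchors at the first row instead; a quick sketch on the same input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'index': pd.date_range('2017-01-01', '2017-01-05', freq='h'),
                   'data': np.arange(1, 98)}).set_index('index')

# integer group id = number of full 72-hour blocks since the first timestamp
group = (df.index - df.index[0]) // pd.Timedelta('72h')
means = df.groupby(group).mean()
```

With this anchoring the first window is exactly 2017-01-01 00:00 through 2017-01-03 23:00, as the question asked.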

Apply timestamp convert to date to multiple columns in Python

I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for one timestamp, but how can I apply it to those two columns?
Thanks for your kind help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
Use DataFrame.apply with a list of column names and to_datetime with the parameter unit='ms':
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates only, add a lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
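For the record, the /1000 attempt in the question fails because to_datetime without unit= treats plain numbers as nanoseconds since the epoch, so every value lands within the first seconds of 1970. A small sketch with the first start_date value from the question:

```python
import pandas as pd

ms = 1578448800000  # first start_date from the question, in milliseconds

# without unit=, the number is read as nanoseconds -> first second of 1970
wrong = pd.to_datetime(ms / 1000)

# declaring the unit gives the intended timestamp
right = pd.to_datetime(ms, unit='ms')
```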

How to create 4 hour time interval in Time Series Analysis (python)

I'm totally new to time series analysis and I'm trying to work through examples available online.
This is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
    '4H',
    kind='timestamp',
    on='Datetime',
    label='left'
).sum()
The function is very similar to groupby: it groups the data into chunks based on the column specified in on=; in this case we use the timestamps and chunks of 4 hours.
Finally, you need some kind of aggregation, in this case sum(), to convert all elements of each group into a single element per time chunk.
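Since the snippet above was marked untested, here is a runnable sketch on made-up hourly counts (the kind= argument is the default and was later deprecated, so it is dropped); the "4-Hour-window" label from the desired output can be rebuilt from the left bin edge:

```python
import pandas as pd

data = pd.DataFrame({
    'Datetime': pd.date_range('2012-08-25', periods=12, freq='h'),
    'Count': [8, 2, 6, 2, 2, 4, 6, 8, 2, 2, 6, 4],  # made-up values
})

# sum the hourly counts inside each 4-hour bucket, labelled by the left edge
sampled = data.resample('4h', on='Datetime', label='left')['Count'].sum()

# rebuild the "HH:MM:SS - HH:MM:SS" window strings from the bin edges
windows = [f"{t:%H:%M:%S} - {t + pd.Timedelta('4h'):%H:%M:%S}" for t in sampled.index]
```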

How to delete rows with less than 3 consecutive date values in the Dataframe

Before coming here to ask, I searched a lot on the internet and in the documentation.
My problem is as follows:
I have a dataframe like that:
date dir vel
0 2006-02-12 17:00:00 181.00 3.92
1 2006-02-12 19:00:00 17.88 5.10
2 2006-02-12 21:00:00 214.75 3.73
3 2006-02-13 00:00:00 165.53 2.16
4 2006-02-13 01:00:00 189.44 2.94
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
18 2006-02-13 21:00:00 62.88 1.76
19 2006-02-13 22:00:00 188.03 2.94
I just need to find the sequences of consecutive dates and drop those lasting fewer than 3 dates. So I would get the following dataframe as a result:
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
So far I have used the following script (inspired by this answer: Find group of consecutive dates in Pandas DataFrame).
(Note: the DataFrame name is estreito.)
dt = estreito['date']
hour = pd.Timedelta('1H')
in_block = ((dt - dt.shift(-1)).abs() == hour) | (dt.diff() == hour)
filt = estreito.loc[in_block]
breaks = filt['date'].diff() != hour
groups = breaks.cumsum()
for _, frame in filt.groupby(groups):
    print(frame, end='\n\n')
The print output is something like this:
date dir vel
3 2006-02-13 00:00:00 165.53 2.16
4 2006-02-13 01:00:00 189.44 2.94
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
date dir vel
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
How can I save the output in a new DataFrame, filtering out the groups with fewer than 3 consecutive dates?
Is there a different way to do this analysis? Perhaps there is an easier way to get the desired result.
Thanks in advance.
Using diff with cumsum to create a group key:
s = df.date.diff().dt.total_seconds().ne(60*60).cumsum()
Then count each key with transform and slice the original df (ge(3) keeps runs of at least 3 consecutive dates):
df[s.groupby(s).transform('count').ge(3)]
Out[983]:
date dir vel
5 2006-02-13 04:00:00 152.88 2.55
6 2006-02-13 05:00:00 188.03 3.73
7 2006-02-13 06:00:00 158.50 1.37
8 2006-02-13 07:00:00 189.44 2.55
9 2006-02-13 08:00:00 152.88 1.37
10 2006-02-13 10:00:00 109.28 0.20
11 2006-02-13 11:00:00 248.50 0.98
12 2006-02-13 12:00:00 26.31 1.96
13 2006-02-13 13:00:00 19.28 6.08
14 2006-02-13 14:00:00 334.28 3.53
15 2006-02-13 15:00:00 338.50 2.75
16 2006-02-13 16:00:00 318.81 3.92
17 2006-02-13 17:00:00 323.03 3.73
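The same idea on a minimal reproducible frame, with two small tweaks worth flagging: total_seconds() instead of .dt.seconds (which wraps around for gaps longer than a day) and ge(3), so that "less than 3" is dropped while runs of exactly 3 are kept:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime([
    '2006-02-12 17:00', '2006-02-12 19:00',                      # isolated rows (2h gaps)
    '2006-02-13 00:00', '2006-02-13 01:00',                      # run of 2 -> dropped
    '2006-02-13 04:00', '2006-02-13 05:00', '2006-02-13 06:00',  # run of 3 -> kept
])})

# start a new run whenever the gap to the previous row is not exactly one hour
run_id = df['date'].diff().dt.total_seconds().ne(3600).cumsum()

# keep only runs of at least 3 consecutive hourly rows
out = df[run_id.groupby(run_id).transform('size').ge(3)]
```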

Setting start time from previous night without dates from CSV using pandas

I would like to run time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can the time be set so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: the date in the last line of code is added automatically (as seen when df['Time'] is executed), so I inserted the same format with a date for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should consider your timestamps as pd.Timedeltas and add a day to the samples before your start time.
Create some example data:
import pandas as pd
import numpy as np

d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])  # pd.np was removed in pandas 2.x
df.to_csv('data.csv')
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read in, make index a timedelta, add a day to timedeltas before the start time, then assign back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.to_datetime(s)
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
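The same wrap-around trick in isolation, without the CSV round trip (the clock times and anchor date here are made up):

```python
import pandas as pd

# overnight clock times as timedeltas, in recorded order
times = pd.Series(pd.to_timedelta(
    ['22:00:00', '23:00:00', '00:00:00', '05:00:00', '09:00:00']))

# anything earlier than the 22:00 start belongs to the next day
times[times < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')

# pin the first night to a concrete date
stamps = pd.Timestamp('2015-06-06') + times
```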
