Setting start time from previous night without dates from CSV using pandas - python-3.x

I would like to run time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can I set the time so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: pandas adds the date automatically (it is visible when df['Time'] is executed), so I used the same date format for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'

Treat your timestamps as pd.Timedelta values and add a day to the samples that fall before your start time.
Create some example data:
import numpy as np
import pandas as pd
d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
# pd.np is not available in current pandas, so call NumPy directly
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])
df.to_csv('data.csv')
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read in, make index a timedelta, add a day to timedeltas before the start time, then assign back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.to_datetime(s)
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
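Depending on the pandas version, converting a timedelta Series with pd.to_datetime may raise a TypeError. A minimal sketch of an alternative, which adds the timedeltas to a base date directly, could look like the following. The helper name overnight_index and its start and first_day parameters are made up for illustration, and the times are built in memory rather than read from a CSV.

```python
import pandas as pd

def overnight_index(times, start="22:00:00", first_day="2015-06-06"):
    """Map clock-time strings to datetimes for a night that starts on first_day."""
    td = pd.Series(pd.to_timedelta(list(times)))
    # Times earlier than the start belong to the following morning
    td = td.where(td >= pd.Timedelta(start), td + pd.Timedelta(days=1))
    # Timestamp + timedelta Series gives a datetime Series
    return pd.DatetimeIndex(pd.Timestamp(first_day) + td)

times = ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "09:00:00"]
idx = overnight_index(times)
```

The resulting idx can be assigned to df2.index in place of the two-step conversion, provided it has one entry per row.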

Related

How to extract date in which the hour of the peak value occurred?

I have an hourly time series that starts in 2013 and ends in 2020, as below, and I want to plot only the day on which the system load reached its peak:
date_time system_load
2013-01-01 00:00:00 1.0
2013-01-01 01:00:00 0.9
2013-01-01 02:00:00 0.5
...
2020-12-31 21:00:00 2.1
2020-12-31 22:00:00 1.8
2020-12-31 23:00:00 0.8
The intended dataframe has one day (24 hours) per year:
date_time system_load
2013-07-09 00:00:00 3.1
2013-07-09 02:00:00 3.0
2013-07-09 03:00:00 4.8
2013-07-09 04:00:00 2.6
...
2013-07-09 21:00:00 3.7
2013-07-09 22:00:00 3.9
2013-07-09 23:00:00 5.1
2014-09-09 00:00:00 4.1
2014-09-09 02:00:00 5.3
2014-09-09 03:00:00 6.0
2014-09-09 04:00:00 4.8
...
2014-09-09 21:00:00 3.5
2014-09-09 22:00:00 2.6
2014-09-09 23:00:00 1.6
...
...
2020-06-01 00:00:00 4.2
2020-06-01 02:00:00 3.6
2020-06-01 03:00:00 3.9
2020-06-01 04:00:00 2.8
...
2020-06-01 21:00:00 2.7
2020-06-01 22:00:00 4.8
2020-06-01 23:00:00 3.8
Get the date and year parts from the date_time column.
Group by the year column and find the rows holding the maximum system_load in each group.
Select all rows from the original dataframe whose date matches a date on which system_load reached the yearly maximum.
Plot the bar chart.
df['date_time'] = pd.to_datetime(df['date_time']) # Ensure the `date_time` column is datetime type
df['just_date'] = df['date_time'].dt.date
df['year'] = df['date_time'].dt.year
idx = df.groupby(['year'])['system_load'].transform('max') == df['system_load']
df[df['just_date'].isin(df[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
If several days in one year share the same maximum system_load value, the code above returns all of them. To keep only the first such day, you can use idxmax() instead:
idx = df.groupby(['year'])['system_load'].idxmax()
df[df['just_date'].isin(df.loc[idx]['just_date'])].plot.bar(x='date_time', y='system_load', rot=45)
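A self-contained sketch of the idxmax approach on made-up data follows. The random values, the two-year range and the names peak_rows, peak_days and peak_df are assumptions for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data over two years (random values, for illustration only)
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", "2014-12-31 23:00", freq="h")
df = pd.DataFrame({"date_time": idx, "system_load": rng.random(len(idx))})

# Row label of each yearly maximum, then the calendar day it falls on
peak_rows = df.groupby(df["date_time"].dt.year)["system_load"].idxmax()
peak_days = df.loc[peak_rows, "date_time"].dt.normalize()

# Keep all 24 hours of each peak day
peak_df = df[df["date_time"].dt.normalize().isin(peak_days)]
```

peak_df can then be passed to plot.bar(x='date_time', y='system_load') as in the answer above.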
Here's an approach to solve your problem:
Let sourcedf contain the input data in two columns, 'Date_Time' and 'Load'.
Then do the following:
# Extract the calendar date of each timestamp
sourcedf['Date'] = sourcedf.apply(lambda row: row['Date_Time'].date(), axis = 1)
dfg = sourcedf.groupby('Date')
# Daily maximum load, one value per date
ldList = dfg['Load'].max().to_list()
# Date whose daily maximum is the largest overall
tgtDate = dfg.max().index.to_list()[ldList.index(max(ldList))]
dfout = sourcedf[sourcedf['Date'] == tgtDate]
dfout will then contain only the rows for the date on which the maximum load occurred.

Panda df converting dtype - How to make this faster?

I have a pandas dataframe with 3 columns with dates in them. They look like this and are not datetime objects yet:
In [32]:dfp[["Datum", "Start", "Ende"]]
Out[32]:
Datum Start Ende
0 4.2.2020 00:00:00 2.1.2018 08:00:00 14.12.2021 08:00:00
1 4.2.2020 00:00:00 2.1.2018 08:00:00 14.12.2021 08:00:00
2 4.2.2020 00:00:00 2.1.2018 08:00:00 14.12.2021 08:00:00
3 4.2.2020 00:00:00 2.1.2018 08:00:00 14.12.2021 08:00:00
4 4.2.2020 00:00:00 2.1.2018 08:00:00 14.12.2021 08:00:00
... ... ...
473474 4.8.2020 00:00:00 2.1.2014 08:00:00 29.12.2018 08:00:00
473475 4.8.2020 00:00:00 2.1.2014 08:00:00 29.12.2018 08:00:00
473476 4.8.2020 00:00:00 2.1.2014 08:00:00 29.12.2018 08:00:00
473477 4.8.2020 00:00:00 2.1.2014 08:00:00 29.12.2018 08:00:00
473478 4.8.2020 00:00:00 2.1.2014 08:00:00 29.12.2018 08:00:00
[473479 rows x 3 columns]
So I am turning them into datetime objects with this code:
dfp["Datum"] = (dfp["Datum"].apply(lambda x: x.replace(" 00:00:00", ""))
).apply(lambda x: datetime.strptime(x, '%d.%m.%Y'))
dfp["Start"] = (dfp["Start"].apply(lambda x: x.replace(" 08:00:00", ""))
).apply(lambda x: datetime.strptime(x, '%d.%m.%Y'))
dfp["Ende"] = (dfp["Ende"].apply(lambda x: x.replace(" 08:00:00", ""))
).apply(lambda x: datetime.strptime(x, '%d.%m.%Y'))
After that, pandas is able to recognise these values as datetime objects. However, this takes 12 seconds to run, which seems quite a long time compared to all the other pieces of code I have with pandas. Is this a bad way of coding what I did here?
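One commonly suggested alternative is to let pd.to_datetime parse each column in a single call with an explicit format, which usually avoids the per-row overhead of apply. The sketch below assumes every value follows the day.month.year hour:minute:second pattern shown above. The two-row frame is a stand-in for the real data, and no timings were measured here.

```python
import pandas as pd

# Small stand-in for the frame in the question
dfp = pd.DataFrame({
    "Datum": ["4.2.2020 00:00:00", "4.8.2020 00:00:00"],
    "Start": ["2.1.2018 08:00:00", "2.1.2014 08:00:00"],
    "Ende": ["14.12.2021 08:00:00", "29.12.2018 08:00:00"],
})

for col in ["Datum", "Start", "Ende"]:
    # Parse the whole column at once, then drop the time part
    parsed = pd.to_datetime(dfp[col], format="%d.%m.%Y %H:%M:%S")
    dfp[col] = parsed.dt.normalize()
```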

How to filter by Time from DateTime values in Pandas

My df dataset looks like this:
time Open
2017-01-03 06:00:00 5.2475
2017-01-03 07:00:00 5.2475
2017-01-03 08:00:00 5.2180
2017-01-03 09:00:00 5.2128
2017-01-03 10:00:00 5.2128
2017-01-04 06:00:00 5.4122
2017-01-04 07:00:00 5.4122
2017-01-04 08:00:00 5.2123
2017-01-04 09:00:00 5.2475
2017-01-04 10:00:00 5.2475
2017-01-05 07:00:00 5.2180
2017-01-05 08:00:00 5.2128
2017-01-05 09:00:00 5.4122
2017-01-05 10:00:00 5.4122
....
I want to filter time values starting from '07:00:00' and include next 3 values
My new df should look like this:
time Open
2017-01-03 07:00:00 5.2475
2017-01-03 08:00:00 5.2180
2017-01-03 09:00:00 5.2128
2017-01-04 07:00:00 5.4122
2017-01-04 08:00:00 5.2123
2017-01-04 09:00:00 5.2475
2017-01-05 07:00:00 5.2180
2017-01-05 08:00:00 5.2128
2017-01-05 09:00:00 5.4122
....
Here, we are not including the '06:00:00' or '10:00:00' since we are only getting the data starting from '07:00:00' and the next 3 values
We need to preserve the order of the original df and just remove unwanted data in between that does not match the criteria of starting from '07:00:00' and 3 values after '07:00:00'
What did I do?
I tried to filter by selecting the time part but it only gives me one value when I do this:
df[df.index.time == datetime.time(7, 0)]
but I want the next 3 values. Doing head(3) does not work:
df[df.index.time == datetime.time(7, 0)].head(3)
Can you please help me?
Use between_time to select rows by time of day (both end times are included by default):
df = pd.DataFrame(data={"time":["2017-01-03 07:00:00","2017-01-03 06:00:00","2017-01-03 08:00:00","2017-01-03 10:00:00"],
"open":[5,5,5,4]})
df['time'] = pd.to_datetime(df['time'])
df.set_index("time",inplace=True)
res = df.between_time('07:00:00','09:00:00')
print(res)
open
time
2017-01-03 07:00:00 5
2017-01-03 08:00:00 5
In addition, to keep only certain dates:
date_list = ['2017-01-03', '2017-01-02', '2017-01-07']
res = res[res.index.normalize().isin(date_list)]
To exclude the last date, you can do:
res = res[(res.index >= '2017-01-02') & (res.index < '2017-01-07')]
Compare the index times with 07:00 and build a helper Series with Series.cumsum. Rows where the helper is 0 come before the first 07:00 match, so remove them, then use GroupBy.head to keep the first 3 rows of each group:
s = pd.Series(df.index.time == datetime.time(7, 0), index=df.index).cumsum()
df = df[s != 0].groupby(s).head(3)
print (df)
Open
time
2017-01-03 07:00:00 5.2475
2017-01-03 08:00:00 5.2180
2017-01-03 09:00:00 5.2128
2017-01-04 07:00:00 5.4122
2017-01-04 08:00:00 5.2123
2017-01-04 09:00:00 5.2475
2017-01-05 07:00:00 5.2180
2017-01-05 08:00:00 5.2128
2017-01-05 09:00:00 5.4122
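As a self-contained sketch of the cumsum and head technique, the following uses a small made-up frame. The Open values are just row counters, and the names starts, block and out are assumptions for illustration.

```python
import datetime
import pandas as pd

idx = pd.to_datetime([
    "2017-01-03 06:00", "2017-01-03 07:00", "2017-01-03 08:00",
    "2017-01-03 09:00", "2017-01-03 10:00",
    "2017-01-04 07:00", "2017-01-04 08:00",
    "2017-01-04 09:00", "2017-01-04 10:00",
])
df = pd.DataFrame({"Open": range(len(idx))}, index=idx)

# True at each 07:00 row; the running sum labels the block each 07:00 starts
starts = pd.Series(df.index.time == datetime.time(7, 0), index=df.index)
block = starts.cumsum()

# Block 0 holds rows before the first 07:00; keep 3 rows of every other block
keep = block != 0
out = df[keep].groupby(block[keep]).head(3)
```

Because head(3) counts rows rather than clock times, a day with missing hours would still contribute three rows starting at 07:00.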
If you need to filter by hours and by dates, use boolean indexing with isin:
date_list = ['2017-01-03', '2017-01-02', '2017-01-07']
df = df[df.index.hour.isin([7,8,9]) & df.index.floor('d').isin(date_list)]
print (df)
Open
time
2017-01-03 07:00:00 5.2475
2017-01-03 08:00:00 5.2180
2017-01-03 09:00:00 5.2128
Or by times and dates:
date_list = ['2017-01-03', '2017-01-02', '2017-01-07']
times = [datetime.time(7, 0), datetime.time(8, 0), datetime.time(9, 0)]
df = df[np.isin(df.index.time, times) & df.index.floor('d').isin(date_list)]
print (df)
Open
time
2017-01-03 07:00:00 5.2475
2017-01-03 08:00:00 5.2180
2017-01-03 09:00:00 5.2128

How to reset x tick labels in matplotlib

I have 5 time series in DataFrames, and each of them has a different time scale.
For example, data1 is from 4/15 0:00 to 4/16 0:00, data2 is from 9/16 06:30 to 7:00.
All these data are in different DataFrames and I want to draw graphs of them using matplotlib. I want to set the number of x tick labels to 5 and put the date of the data only on the leftmost x tick label. I tried the code below but I couldn't get the graphs I wanted.
fig = plt.figure(figsize=(15, 3))
for i in range(1,6): # because I have 5 DataFrames in 'df_event_num'
    ax = plt.subplot(150+i)
    plt.title('event_num{}'.format(i))
    df_event_num[i-1]['Load_Avg'].plot(color=colors_2018[i-1])
    ax.tick_params(rotation=270)
fig.tight_layout()
And I got a graph like this
Again, I want to set the numbers of x tick labels to 5 and put a date just on the leftmost x tick label on every graph. And hopefully, I want to rotate the characters of x tick labels.
Could anyone teach me how to get graphs I want?
df_event_num has 5 DataFrames and I want to make time series graphs of the column data named 'Load_Avg'.
Here is the sample data of 'df_event_num'.
print(df_event_num[0]['Load_Avg'])
>>>
TIMESTAMP
2018-04-15 00:00:00 406.2
2018-04-15 00:30:00 407.4
2018-04-15 01:00:00 409.6
2018-04-15 01:30:00 403.3
2018-04-15 02:00:00 405.0
2018-04-15 02:30:00 401.8
2018-04-15 03:00:00 401.1
2018-04-15 03:30:00 401.0
2018-04-15 04:00:00 402.3
2018-04-15 04:30:00 402.5
2018-04-15 05:00:00 404.3
2018-04-15 05:30:00 404.7
2018-04-15 06:00:00 417.0
2018-04-15 06:30:00 438.9
2018-04-15 07:00:00 466.4
2018-04-15 07:30:00 476.6
2018-04-15 08:00:00 499.3
2018-04-15 08:30:00 523.1
2018-04-15 09:00:00 550.2
2018-04-15 09:30:00 590.2
2018-04-15 10:00:00 604.4
2018-04-15 10:30:00 622.4
2018-04-15 11:00:00 657.7
2018-04-15 11:30:00 737.2
2018-04-15 12:00:00 775.0
2018-04-15 12:30:00 819.0
2018-04-15 13:00:00 835.0
2018-04-15 13:30:00 848.0
2018-04-15 14:00:00 858.0
2018-04-15 14:30:00 866.0
2018-04-15 15:00:00 874.0
2018-04-15 15:30:00 879.0
2018-04-15 16:00:00 883.0
2018-04-15 16:30:00 889.0
2018-04-15 17:00:00 893.0
2018-04-15 17:30:00 894.0
2018-04-15 18:00:00 895.0
2018-04-15 18:30:00 897.0
2018-04-15 19:00:00 895.0
2018-04-15 19:30:00 898.0
2018-04-15 20:00:00 899.0
2018-04-15 20:30:00 900.0
2018-04-15 21:00:00 903.0
2018-04-15 21:30:00 904.0
2018-04-15 22:00:00 905.0
2018-04-15 22:30:00 906.0
2018-04-15 23:00:00 906.0
2018-04-15 23:30:00 907.0
2018-04-16 00:00:00 909.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[1]['Load_Avg'])
>>>
TIMESTAMP
2018-04-25 06:30:00 1133.0
2018-04-25 07:00:00 1159.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[2]['Load_Avg'])
TIMESTAMP
2018-06-28 09:30:00 925.0
2018-06-28 10:00:00 1008.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[3]['Load_Avg'])
>>>
TIMESTAMP
2018-09-08 00:00:00 769.3
2018-09-08 00:30:00 772.4
2018-09-08 01:00:00 778.3
2018-09-08 01:30:00 787.5
2018-09-08 02:00:00 812.0
2018-09-08 02:30:00 825.0
2018-09-08 03:00:00 836.0
2018-09-08 03:30:00 862.0
2018-09-08 04:00:00 884.0
2018-09-08 04:30:00 905.0
2018-09-08 05:00:00 920.0
2018-09-08 05:30:00 926.0
2018-09-08 06:00:00 931.0
2018-09-08 06:30:00 942.0
2018-09-08 07:00:00 948.0
2018-09-08 07:30:00 956.0
2018-09-08 08:00:00 981.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[4]['Load_Avg'])
>>>
TIMESTAMP
2018-09-30 21:00:00 252.2
2018-09-30 21:30:00 256.5
2018-09-30 22:00:00 264.1
2018-09-30 22:30:00 271.1
2018-09-30 23:00:00 277.7
2018-09-30 23:30:00 310.0
2018-10-01 00:00:00 331.6
2018-10-01 00:30:00 356.3
2018-10-01 01:00:00 397.2
2018-10-01 01:30:00 422.4
2018-10-01 02:00:00 444.2
2018-10-01 02:30:00 464.7
2018-10-01 03:00:00 477.2
2018-10-01 03:30:00 487.2
2018-10-01 04:00:00 494.7
2018-10-01 04:30:00 515.2
2018-10-01 05:00:00 527.6
2018-10-01 05:30:00 537.5
2018-10-01 06:00:00 541.7
Freq: 30T, Name: Load_Avg, dtype: float64
I modified your code a little bit:
You do not need to use range() to loop; you can iterate directly over the list of DataFrames.
Use the created ax subplot to set the data and the title on it.
Create 5 linearly spaced ticks on the x-axis based on the first and last index of the individual dataframe: pd.to_datetime(np.linspace(df.index[0].value, df.index[-1].value, 5))
Use just the last value as a label and replace all others with empty strings: ts_names = ['','','','',ts_loc[-1]]
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
colors_2018 = ['red', 'blue', 'green', 'yellow', 'orange', 'brown']
fig = plt.figure(figsize=(15, 4))
for i, df in enumerate(df_event_num): # because I have 5 DataFrames in 'df_event_num'
    # If the index is not a Timestamp-type already:
    df.index = pd.to_datetime(df.index)
    ax = plt.subplot(1, 5, i + 1)
    ax.plot(df['Load_Avg'], color=colors_2018[i])
    # Titles are numbered from 1, as in the question
    ax.set_title('event_num{}'.format(i + 1))
    # x-Axis locations of 5 timestamps
    ts_loc = pd.to_datetime(np.linspace(df.index[0].value, df.index[-1].value, 5))
    ax.set_xticks(ts_loc, minor=False)
    # Names of the timestamps (only last shown)
    ts_names = ['', '', '', '', ts_loc[-1]]
    ax.set_xticklabels(ts_names, rotation="vertical")
fig.tight_layout()
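If the date should appear on the leftmost tick instead, as the question asks, one possible variation is sketched below for a single subplot. The stand-in data, the label formats and the non-interactive Agg backend are assumptions made so the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for one of the DataFrames in df_event_num
index = pd.date_range("2018-04-15 00:00", "2018-04-16 00:00", freq="30min")
df = pd.DataFrame({"Load_Avg": range(len(index))}, index=index)

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(df.index, df["Load_Avg"])

# Five evenly spaced tick positions between the first and last timestamp
ticks = pd.date_range(df.index[0], df.index[-1], periods=5)
# Date and time on the leftmost tick, time only on the others
labels = [ticks[0].strftime("%m/%d %H:%M")]
labels += [t.strftime("%H:%M") for t in ticks[1:]]

ax.set_xticks(ticks.to_pydatetime())
ax.set_xticklabels(labels, rotation=270)
fig.tight_layout()
```

The same tick and label logic can be placed inside the loop over df_event_num, once per subplot.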

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes it impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate by Grouper with daily frequency, using idxmax and idxmin to get the datetimes of the max and min temperature:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
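If the original integer row labels are also needed, a possible variation is to keep the RangeIndex and group by the calendar date of the DT column, so that idxmax and idxmin return row labels. This is a sketch on a shortened copy of the sample data; the names by_day, rows_max, rows_min and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "DT": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 05:00",
        "2015-01-02 00:00", "2015-01-02 03:00", "2015-01-02 05:00",
    ]),
    "T-C": [-2.5, -2.1, -2.0, 1.1, 0.5, 0.7],
})

# Group by calendar day while keeping the RangeIndex,
# so idxmax/idxmin return the original row labels
by_day = df.groupby(df["DT"].dt.date)["T-C"]
rows_max = by_day.idxmax()
rows_min = by_day.idxmin()

# Look the timestamps up from the original rows
result = pd.DataFrame({
    "row_max": rows_max,
    "time_max": df.loc[rows_max, "DT"].to_numpy(),
    "row_min": rows_min,
    "time_min": df.loc[rows_min, "DT"].to_numpy(),
})
```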
