How to extract missing datetime intervals in Python - python-3.x

I have a dataframe where the date column is at 15-minute intervals. I want to find the missing datetime intervals. id should be copied from the previous row, but value should be NaN.
date value id
2021-12-02 07:00:00 12456677 693214
2021-01-02 07:30:00 12456677 693214
2021-01-02 07:45:00 12456677 693214
2021-01-02 08:00:00 12456677 693214
2021-01-02 08:15:00 12456665 693215
2021-01-02 08:45:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
Expected output is:
date value id
2021-01-02 08:30:00 NaN 693215
2021-01-03 09:15:00 NaN 693217
2021-01-03 09:30:00 NaN 693217
2021-01-03 09:45:00 NaN 693217
2021-01-03 10:00:00 NaN 693217
2021-01-03 10:15:00 NaN 693217
I am trying
df['Datetime'] = pd.to_datetime(df['date'])
df[ df['Datetime'].diff() > pd.Timedelta('15min') ]
but it only gives a time after which data is missing, not the missing dates and times themselves. It showed me this output:
date value id
2021-01-02 08:15:00 12456665 693215
2021-01-03 09:00:00 12456666 693217
2021-01-03 10:30:00 12456666 693217
Can someone please guide me on how I can extract the missing dates and times?
Thanks in advance.

Use Series.asfreq per group to get the missing intervals:
# create a DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# add a 15-minute index per day and per id
df1 = (df.groupby([pd.Grouper(freq='D'), 'id'])['value']
.apply(lambda x: x.asfreq('15min'))
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
id date value
0 693214 2021-01-02 07:30:00 12456677.0
1 693214 2021-01-02 07:45:00 12456677.0
2 693214 2021-01-02 08:00:00 12456677.0
3 693215 2021-01-02 08:15:00 12456665.0
4 693215 2021-01-02 08:30:00 NaN
5 693215 2021-01-02 08:45:00 12456665.0
6 693217 2021-01-03 09:00:00 12456666.0
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
12 693217 2021-01-03 10:30:00 12456666.0
13 693214 2021-12-02 07:00:00 12456677.0
Test for missing values with boolean indexing:
df2 = df1[df1['value'].isna()]
print (df2)
id date value
4 693215 2021-01-02 08:30:00 NaN
7 693217 2021-01-03 09:15:00 NaN
8 693217 2021-01-03 09:30:00 NaN
9 693217 2021-01-03 09:45:00 NaN
10 693217 2021-01-03 10:00:00 NaN
11 693217 2021-01-03 10:15:00 NaN
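For completeness, here is a self-contained sketch of the whole pipeline on the question's data, runnable as-is (pandas only; adjust '15min' if your data uses a different interval):
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'date': ['2021-12-02 07:00:00', '2021-01-02 07:30:00', '2021-01-02 07:45:00',
             '2021-01-02 08:00:00', '2021-01-02 08:15:00', '2021-01-02 08:45:00',
             '2021-01-03 09:00:00', '2021-01-03 10:30:00'],
    'value': [12456677, 12456677, 12456677, 12456677,
              12456665, 12456665, 12456666, 12456666],
    'id': [693214, 693214, 693214, 693214,
           693215, 693215, 693217, 693217],
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# resample each (day, id) group onto a 15-minute grid; gaps become NaN
df1 = (df.groupby([pd.Grouper(freq='D'), 'id'])['value']
         .apply(lambda x: x.asfreq('15min'))
         .reset_index(level=0, drop=True)
         .reset_index())

# keep only the missing intervals
print(df1[df1['value'].isna()])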

Related

Filter for timestamp with maximum number of records for each date and extract the filtered rows into another df

I have a dataframe with a timestamp column, another date column, and a price column.
The timestamp column holds roughly every-5-minute data for a specific hour (between 10 am and 11 am) that I am pulling out.
Eg:
Timestamp EndDate Price
2021-01-01 10:00:00 2021-06-30 08:00:00 100
2021-01-01 10:00:00 2021-09-30 08:00:00 105
2021-01-01 10:05:00 2021-03-30 08:00:00 102
2021-01-01 10:05:00 2021-06-30 08:00:00 100
2021-01-01 10:05:00 2021-09-30 08:00:00 105
2021-01-01 10:10:00 2021-03-30 08:00:00 102
2021-01-01 10:10:00 2021-06-30 08:00:00 100
2021-01-02 10:00:00 2021-06-30 08:00:00 100
2021-01-02 10:00:00 2021-09-30 08:00:00 105
2021-01-02 10:00:00 2021-03-30 08:00:00 102
2021-01-02 10:00:00 2021-06-30 08:00:00 100
2021-01-02 10:05:00 2021-09-30 08:00:00 105
2021-01-02 10:05:00 2021-03-30 08:00:00 102
2021-01-02 10:05:00 2021-06-30 08:00:00 100
For the snapshot every 5 min, some end up with 3 records, some with 2, some with 4 records.
Within that hour (or day) I want to pull out the set of records that contains the maximum number of records, so for the 1st of Jan in the above example it should pull out the 10:05 data, and for the 2nd of Jan the 10:00 data. If there are multiple sets with the same maximum number of records, then it can pull out the latest time for that day.
I am not sure how I can do this efficiently; perhaps use a count?
You can split the timestamp for easier use, so I did this:
import numpy as np
import pandas as pd
filename=(r'C:xxxxxx\Example2.xlsx')
df0=pd.read_excel(filename)
df0['new_date'] = [d.date() for d in df0['Timestamp']]
df0['new_time'] = [d.time() for d in df0['Timestamp']]
This yields separate new_date and new_time columns. Then we can use groupby() and apply() to count values as follows:
df = (df0.groupby('new_date')['new_time']
         .apply(lambda x: x.value_counts().index[0])
         .reset_index())
This yields, for each day, the time that appears most often.
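Note that value_counts gives the most frequent time per day, but it does not implement the asker's tie-break (prefer the latest time when counts are equal) and it returns only the time, not the records. Here is a sketch of one way to do both, assuming the column names from the question:
import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['day'] = df['Timestamp'].dt.date

# count records per (day, timestamp); after sorting by count then time,
# the last row per day has the max count, with ties broken by latest time
counts = df.groupby(['day', 'Timestamp']).size().reset_index(name='n')
best = counts.sort_values(['day', 'n', 'Timestamp']).groupby('day').tail(1)

# pull out the full set of records for each winning timestamp
result = df.merge(best[['day', 'Timestamp']], on=['day', 'Timestamp'])
print(result)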

Pandas set_index creates NoneType object without inplace=True

I'm facing a weird behavior with pandas' set_index function. I initially have this dataframe:
Unnamed: 0 Timestamps PM10
0 NaN NaT PM10
1 NaN NaT µg/m³
2 NaN 2018-12-31 23:00:00 10.76
3 NaN 2018-12-31 22:00:00 9.46
4 NaN 2018-12-31 21:00:00 8.67
... ... ... ...
8682 NaN 2018-01-01 04:00:00 25.14
8683 NaN 2018-01-01 03:00:00 31.34
8684 NaN 2018-01-01 02:00:00 36.28
8685 NaN 2018-01-01 01:00:00 21.78
8686 NaN 2018-01-01 00:00:00 20.59
I want to drop the first two rows and set the Timestamps column as the index, so I do this:
df_final = df.drop([0,1]).set_index('Timestamps', drop=True)
and I get this dataframe:
Unnamed: 0 PM10
Timestamps
2018-12-31 23:00:00 NaN 10.76
2018-12-31 22:00:00 NaN 9.46
2018-12-31 21:00:00 NaN 8.67
2018-12-31 20:00:00 NaN 10.42
2018-12-31 19:00:00 NaN 10.04
... ... ...
2018-01-01 04:00:00 NaN 25.14
2018-01-01 03:00:00 NaN 31.34
2018-01-01 02:00:00 NaN 36.28
2018-01-01 01:00:00 NaN 21.78
2018-01-01 00:00:00 NaN 20.59
So far so good, but finally I want to re-index the PM10 column by a new time index I have created called t_index, so I do this:
data_write = df_final.PM10[-1::-1].reindex(t_index)
That is where I get an error:
TypeError: 'NoneType' object is not iterable
After some debugging I have concluded that set_index is causing this but I can't figure out why, any help is appreciated!
After some trial and error I managed to make this work and here is the code that does it:
df = df.drop([0,1]).drop("Unnamed: 0", axis=1).set_index('Timestamps', drop=True)
df = df.sort_values(by="Timestamps", ascending=True)
year = 2018
start_index = '{}-01-01 00:00:00'.format(year) # define start of the year
end_index = '{}-12-31 23:00:00'.format(year) # define end of the year
t_index = pd.date_range(start=start_index, end=end_index, freq='1h').strftime("%Y-%m-%d %H:%M:%S")  # pd.DatetimeIndex(start=..., end=...) was removed; pd.date_range is the replacement
df_final = pd.to_numeric(df.PM10).resample('H').mean().reindex(t_index)
Still not sure what was causing the error, or why the .asfreq method did not work.
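For what it's worth, the most common way to end up with a NoneType around set_index is assigning the result of an inplace=True call, which returns None. Whether that was the cause here is a guess, but the pitfall itself is easy to demonstrate:
import pandas as pd

df = pd.DataFrame({'Timestamps': pd.date_range('2018-01-01', periods=3, freq='h'),
                   'PM10': [20.59, 21.78, 36.28]})

bad = df.copy().set_index('Timestamps', inplace=True)  # inplace=True returns None
print(bad)                                             # None -> later NoneType errors

good = df.set_index('Timestamps')                      # assign the returned frame
print(good)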

Apply timestamp convert to date to multiple columns in Python

I want to convert two timestamp columns start_date and end_date to normal date columns:
id start_date end_date
0 1 1578448800000 1583632800000
1 2 1582164000000 1582250400000
2 3 1582509600000 1582596000000
3 4 1583373600000 1588557600000
4 5 1582509600000 1582596000000
5 6 1582164000000 1582250400000
6 7 1581040800000 1586224800000
7 8 1582423200000 1582509600000
8 9 1583287200000 1583373600000
The following code works for one timestamp, but how can I apply it to both columns?
Thanks for your kind help.
import datetime
timestamp = datetime.datetime.fromtimestamp(1500000000)
print(timestamp.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2017-07-14 10:40:00
I also tried pd.to_datetime(df['start_date']/1000).apply(lambda x: x.date()), which gives an incorrect result; without a unit, to_datetime reads the integers as nanoseconds since the epoch, so every value collapses to 1970-01-01:
0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
5 1970-01-01
6 1970-01-01
7 1970-01-01
8 1970-01-01
Use DataFrame.apply with a list of column names and to_datetime with the parameter unit='ms':
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(pd.to_datetime, unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
EDIT: For dates, add a lambda function with Series.dt.date:
cols = ['start_date', 'end_date']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, unit='ms').dt.date)
print (df)
id start_date end_date
0 1 2020-01-08 2020-03-08
1 2 2020-02-20 2020-02-21
2 3 2020-02-24 2020-02-25
3 4 2020-03-05 2020-05-04
4 5 2020-02-24 2020-02-25
5 6 2020-02-20 2020-02-21
6 7 2020-02-07 2020-04-07
7 8 2020-02-23 2020-02-24
8 9 2020-03-04 2020-03-05
Or convert each column separately:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms')
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms')
print (df)
id start_date end_date
0 1 2020-01-08 02:00:00 2020-03-08 02:00:00
1 2 2020-02-20 02:00:00 2020-02-21 02:00:00
2 3 2020-02-24 02:00:00 2020-02-25 02:00:00
3 4 2020-03-05 02:00:00 2020-05-04 02:00:00
4 5 2020-02-24 02:00:00 2020-02-25 02:00:00
5 6 2020-02-20 02:00:00 2020-02-21 02:00:00
6 7 2020-02-07 02:00:00 2020-04-07 02:00:00
7 8 2020-02-23 02:00:00 2020-02-24 02:00:00
8 9 2020-03-04 02:00:00 2020-03-05 02:00:00
And for dates:
df['start_date'] = pd.to_datetime(df['start_date'], unit='ms').dt.date
df['end_date'] = pd.to_datetime(df['end_date'], unit='ms').dt.date
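A quick check of the unit handling explains the 1970 results in the question:
import pandas as pd

print(pd.to_datetime(1578448800000, unit='ms'))  # 2020-01-08 02:00:00
# without unit, integers are read as nanoseconds since the epoch:
print(pd.to_datetime(1578448800000))             # 1970-01-01 00:00:01.578448800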

How to reset x tick labels in matplotlib

I have 5 time series in DataFrames, and each of them has a different time scale.
For example, data1 is from 4/15 0:00 to 4/16 0:00, and data2 is from 9/16 06:30 to 7:00.
All these data are in different DataFrames, and I want to draw graphs of them using matplotlib. I want to set the number of x tick labels to 5 and put a date only on the leftmost x tick label. I tried the code below, but I couldn't get the graphs I wanted.
fig = plt.figure(figsize=(15, 3))
for i in range(1, 6):  # because I have 5 DataFrames in 'df_event_num'
    ax = plt.subplot(150 + i)
    plt.title('event_num{}'.format(i))
    df_event_num[i-1]['Load_Avg'].plot(color=colors_2018[i-1])
    ax.tick_params(rotation=270)
fig.tight_layout()
And I got a graph like this (figure omitted).
Again, I want to set the number of x tick labels to 5 and put a date only on the leftmost x tick label of every graph. And hopefully, I want to rotate the x tick labels.
Could anyone show me how to get the graphs I want?
df_event_num holds 5 DataFrames, and I want to make time series graphs of the column named 'Load_Avg'.
Here is the sample data of 'df_event_num'.
print(df_event_num[0]['Load_Avg'])
>>>
TIMESTAMP
2018-04-15 00:00:00 406.2
2018-04-15 00:30:00 407.4
2018-04-15 01:00:00 409.6
2018-04-15 01:30:00 403.3
2018-04-15 02:00:00 405.0
2018-04-15 02:30:00 401.8
2018-04-15 03:00:00 401.1
2018-04-15 03:30:00 401.0
2018-04-15 04:00:00 402.3
2018-04-15 04:30:00 402.5
2018-04-15 05:00:00 404.3
2018-04-15 05:30:00 404.7
2018-04-15 06:00:00 417.0
2018-04-15 06:30:00 438.9
2018-04-15 07:00:00 466.4
2018-04-15 07:30:00 476.6
2018-04-15 08:00:00 499.3
2018-04-15 08:30:00 523.1
2018-04-15 09:00:00 550.2
2018-04-15 09:30:00 590.2
2018-04-15 10:00:00 604.4
2018-04-15 10:30:00 622.4
2018-04-15 11:00:00 657.7
2018-04-15 11:30:00 737.2
2018-04-15 12:00:00 775.0
2018-04-15 12:30:00 819.0
2018-04-15 13:00:00 835.0
2018-04-15 13:30:00 848.0
2018-04-15 14:00:00 858.0
2018-04-15 14:30:00 866.0
2018-04-15 15:00:00 874.0
2018-04-15 15:30:00 879.0
2018-04-15 16:00:00 883.0
2018-04-15 16:30:00 889.0
2018-04-15 17:00:00 893.0
2018-04-15 17:30:00 894.0
2018-04-15 18:00:00 895.0
2018-04-15 18:30:00 897.0
2018-04-15 19:00:00 895.0
2018-04-15 19:30:00 898.0
2018-04-15 20:00:00 899.0
2018-04-15 20:30:00 900.0
2018-04-15 21:00:00 903.0
2018-04-15 21:30:00 904.0
2018-04-15 22:00:00 905.0
2018-04-15 22:30:00 906.0
2018-04-15 23:00:00 906.0
2018-04-15 23:30:00 907.0
2018-04-16 00:00:00 909.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[1]['Load_Avg'])
>>>
TIMESTAMP
2018-04-25 06:30:00 1133.0
2018-04-25 07:00:00 1159.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[2]['Load_Avg'])
TIMESTAMP
2018-06-28 09:30:00 925.0
2018-06-28 10:00:00 1008.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[3]['Load_Avg'])
>>>
TIMESTAMP
2018-09-08 00:00:00 769.3
2018-09-08 00:30:00 772.4
2018-09-08 01:00:00 778.3
2018-09-08 01:30:00 787.5
2018-09-08 02:00:00 812.0
2018-09-08 02:30:00 825.0
2018-09-08 03:00:00 836.0
2018-09-08 03:30:00 862.0
2018-09-08 04:00:00 884.0
2018-09-08 04:30:00 905.0
2018-09-08 05:00:00 920.0
2018-09-08 05:30:00 926.0
2018-09-08 06:00:00 931.0
2018-09-08 06:30:00 942.0
2018-09-08 07:00:00 948.0
2018-09-08 07:30:00 956.0
2018-09-08 08:00:00 981.0
Freq: 30T, Name: Load_Avg, dtype: float64
print(df_event_num[4]['Load_Avg'])
>>>
TIMESTAMP
2018-09-30 21:00:00 252.2
2018-09-30 21:30:00 256.5
2018-09-30 22:00:00 264.1
2018-09-30 22:30:00 271.1
2018-09-30 23:00:00 277.7
2018-09-30 23:30:00 310.0
2018-10-01 00:00:00 331.6
2018-10-01 00:30:00 356.3
2018-10-01 01:00:00 397.2
2018-10-01 01:30:00 422.4
2018-10-01 02:00:00 444.2
2018-10-01 02:30:00 464.7
2018-10-01 03:00:00 477.2
2018-10-01 03:30:00 487.2
2018-10-01 04:00:00 494.7
2018-10-01 04:30:00 515.2
2018-10-01 05:00:00 527.6
2018-10-01 05:30:00 537.5
2018-10-01 06:00:00 541.7
Freq: 30T, Name: Load_Avg, dtype: float64
I modified your code a little bit:
You do not need range() to loop; you can iterate directly over the list of DataFrames.
Use the created ax subplot to set the data and the title on it.
Create 5 linearly spaced ticks on the x-axis based on the first and last index of the individual dataframe: pd.to_datetime(np.linspace(df.index[0].value, df.index[-1].value, 5))
Use just the last value as a label, and replace all others with empty strings: ts_names = ['','','','',ts_loc[-1]]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

colors_2018 = ['red', 'blue', 'green', 'yellow', 'orange', 'brown']
fig = plt.figure(figsize=(15, 4))
for i, df in enumerate(df_event_num):  # I have 5 DataFrames in 'df_event_num'
    ax = plt.subplot(1, 5, i + 1)
    ax.plot(df['Load_Avg'], color=colors_2018[i])
    ax.set_title('event_num{}'.format(i))
    # if the index is not a Timestamp type already:
    df.index = pd.to_datetime(df.index)
    # x-axis locations of 5 timestamps
    ts_loc = pd.to_datetime(np.linspace(df.index[0].value, df.index[-1].value, 5))
    ax.set_xticks(ts_loc, minor=False)
    # names of the timestamps (only the last one shown)
    ts_names = ['', '', '', '', ts_loc[-1]]
    ax.set_xticklabels(ts_names, rotation="vertical")
fig.tight_layout()
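Alternatively, if you would rather let matplotlib manage the ticks, a LinearLocator can pin the count at 5 and a FuncFormatter can blank every label but the leftmost. This is a sketch of a different technique than the answer above, and it assumes the data was plotted with ax.plot so the x-axis uses matplotlib date units:
import matplotlib.dates as mdates
import matplotlib.ticker as mticker

# 5 evenly spaced major ticks across the visible x-range
ax.xaxis.set_major_locator(mticker.LinearLocator(5))

# show a date only on the leftmost tick (pos == 0), blank elsewhere
def leftmost_date(x, pos):
    return mdates.num2date(x).strftime('%m/%d %H:%M') if pos == 0 else ''

ax.xaxis.set_major_formatter(mticker.FuncFormatter(leftmost_date))
ax.tick_params(axis='x', rotation=270)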

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find the daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate by Grouper with daily frequency, using idxmax and idxmin to get the datetimes of the max and min temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
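If you prefer friendlier column names, the same aggregation can be written with named aggregation; a sketch (the tmax_time/tmin_time names are mine, not from the answer above):
import pandas as pd

df['DT'] = pd.to_datetime(df['DT'])
out = (df.set_index('DT')
         .groupby(pd.Grouper(freq='D'))['T-C']
         .agg(tmax_time='idxmax', tmin_time='idxmin', tmax='max', tmin='min'))
print(out)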
