Adding Minutes to Pandas DatetimeIndex

I have an index that contains dates:
DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
'2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
'2004-01-14', '2004-01-15',
...
'2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
'2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
'2015-12-30', '2015-12-31'],
dtype='datetime64[ns]', length=3021, freq=None)
Now, for each day, I would like to generate every minute within that day (24*60 = 1440 minutes) and build an index containing all days and minutes.
The result should look like:
['2004-01-02 00:00:00', '2004-01-02 00:01:00', ..., '2004-01-02 23:59:00',
'2004-01-03 00:00:00', '2004-01-03 00:01:00', ..., '2004-01-03 23:59:00',
...
'2015-12-31 00:00:00', '2015-12-31 00:01:00', ..., '2015-12-31 23:59:00']
Is there a smart trick for this?

You should be able to use .asfreq() here:
>>> import pandas as pd
>>> days = pd.date_range(start='2018-01-01', periods=10)
>>> df = pd.DataFrame(list(range(len(days))), index=days)
>>> df.asfreq('min')
0
2018-01-01 00:00:00 0.0
2018-01-01 00:01:00 NaN
2018-01-01 00:02:00 NaN
2018-01-01 00:03:00 NaN
2018-01-01 00:04:00 NaN
2018-01-01 00:05:00 NaN
2018-01-01 00:06:00 NaN
# ...
>>> df.shape
(10, 1)
>>> df.asfreq('min').shape
(12961, 1)
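Since the question asks for an index rather than a NaN-filled frame, note that taking .index of the upsampled frame is enough:
>>> minute_index = df.asfreq('min').index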
If that doesn't work for some reason, you might also want to have a look into pd.MultiIndex.from_product(); then pd.to_datetime() on the concatenated result.
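For the from_product route, a minimal sketch (the short days range below is a stand-in for the question's full index): build the Cartesian product of the days and the 1440 minute offsets of a day, then flatten it by adding the two levels together.
import pandas as pd

days = pd.date_range('2004-01-02', periods=3, freq='D')  # stand-in for the real index
minutes = pd.timedelta_range(start='0 min', periods=1440, freq='min')
idx = pd.MultiIndex.from_product([days, minutes])
# Adding a DatetimeIndex level and a TimedeltaIndex level gives a flat minute-level index
minute_index = idx.get_level_values(0) + idx.get_level_values(1)
# DatetimeIndex(['2004-01-02 00:00:00', '2004-01-02 00:01:00', ...])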

Related

How to get employee count by Hour and Date using pySpark / python?

I have employee IDs and their clock-in and clock-out timings by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate time series at hourly frequency between adjusted clock in and clock
# out time
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
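If the exact Date / Hour / emp_count layout is wanted, a hedged follow-up sketch (the string formats for the columns are my assumptions):
counts = hours.groupby(hours).count()
idx = pd.DatetimeIndex(counts.index)
result = pd.DataFrame({
    "Date": idx.strftime("%m/%d/%Y"),       # date part of each hourly bucket
    "Hour": idx.strftime("%H:%M"),          # hour-of-day label
    "emp_count": counts.to_numpy(),
})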

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift the value column by one year using df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value via df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also tried df['value'].shift(freq=relativedelta(years=+1)), but it raises: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by NaNs in the value column
No, I don't think so. It's probably because your 2nd sample doesn't span a leap day: when the range contains Feb 29, shifting it by one year rolls it back to Feb 28, which collides with the shifted Feb 28 and produces a duplicated index.
Reproducible error when the range includes a leap day:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days, includes Feb 29)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017, 2018 and 2019 (365 days each); no Feb 29 in the range
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove the duplicated index, in our case '2021-02-28', by using resample('D') with an aggregation function: first, last, min, max, mean, sum or a custom one:
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
...     .resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note: you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()], which keeps only the first occurrence of each duplicated timestamp.
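A minimal sketch applying that alternative to the original value column (keep='first' is an assumption; pick whichever occurrence fits your data):
shifted = df['value'].shift(freq=pd.DateOffset(years=1))
# Drop the duplicated 2021-02-28 entry, then assign back; pandas aligns on the index
df['value'] = shifted.loc[~shifted.index.duplicated(keep='first')]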

Copy a single row value and apply it as a column using Pandas

My dataset looks like this:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0536
2018-01-01 00:01:00 1.0527
2018-01-01 00:02:00 1.0558
2018-01-01 00:03:00 1.0534
2018-01-01 00:04:00 1.0524
...
What I want to do is get the value at 05:00:00 each day, create a new column called OpenVal_5AM, and put that day's value in the new column. The new df will look like this:
# df2 - minute based dataset with 05:00:00 Open value
date Open OpenVal_5AM
2018-01-01 00:00:00 1.0536 1.0133
2018-01-01 00:01:00 1.0527 1.0133
2018-01-01 00:02:00 1.0558 1.0133
2018-01-01 00:03:00 1.0534 1.0133
2018-01-01 00:04:00 1.0524 1.0133
...
Since this is minute-based data, the new OpenVal_5AM column will repeat the same value across all 1440 data points of each day, because we are just grabbing the value at one point in time per day and broadcasting it into a new column.
What did I do?
I used this step:
df['OpenVal_5AM'] = df.groupby(df.date.dt.date,sort=False).Open.dt.hour.between(5, 5)
That's the closest I could get, but it does not work.
Here's my suggestion:
df['OpenVal_5AM'] = df.apply(lambda r: df.Open.loc[r.name.replace(hour=5, minute=0)], axis=1)
Disclaimer: I didn't test it with a huge dataset; so I don't know how it'll perform in that situation.
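For large frames, a vectorized sketch may be faster (it assumes every day really has a row at exactly 05:00:00; between_time and the normalized-index mapping are my additions, not part of the answer above):
# Rows at exactly 05:00, keyed by their date (midnight-normalized)
five_am = df.between_time('05:00', '05:00')['Open']
five_am.index = five_am.index.normalize()
# Map each row's date to that day's 05:00 value
df['OpenVal_5AM'] = df.index.normalize().map(five_am)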

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has data values by hour; the hour is also the index of this lookup dataframe. The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to use the values from this lookup dataframe as multipliers to create a column in another dataframe, which has a datetime index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' column depends on the corresponding hour's value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Could someone suggest a faster way?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
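An equivalent one-liner sketch using .map() instead of the join (it assumes the same 'multiplier' rename as above):
df['data-3'] = df['data-2'] * df.index.hour.map(df_lookup['multiplier']).to_numpy()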

How to get all indexes which had a particular value in last row of a Pandas DataFrame?

For a sample DataFrame like,
>>> import pandas as pd
>>> index = pd.date_range(start='1/1/2018', periods=6, freq='15T')
>>> data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
>>> df = pd.DataFrame(data, index=index, columns=['tou'])
>>> df
tou
2018-01-01 00:00:00 ON_PEAK
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 00:30:00 ON_PEAK
2018-01-01 00:45:00 ON_PEAK
2018-01-01 01:00:00 OFF_PEAK
2018-01-01 01:15:00 OFF_PEAK
How do I get all indexes for which the tou value is not ON_PEAK but the value of the row before them is ON_PEAK? I.e., the output would be:
['2018-01-01 00:15:00', '2018-01-01 01:00:00']
Or, if it's easier, all rows with ON_PEAK plus the first row after them, i.e.:
['2018-01-01 00:00:00', '2018-01-01 00:15:00', '2018-01-01 00:30:00', '2018-01-01 00:45:00', '2018-01-01 01:00:00']
You need to find rows where tou is not ON_PEAK and the previous tou, found using pandas.shift(), is ON_PEAK. Note that positive values in shift give the nth previous value and negative values give the nth next value in the dataframe.
df.loc[(df['tou'] != 'ON_PEAK') & (df['tou'].shift(1) == 'ON_PEAK')]
Output:
tou
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 01:00:00 OFF_PEAK
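For the alternative output (all ON_PEAK rows plus the first row following each ON_PEAK run), a sketch that simply ORs the two conditions:
df.loc[(df['tou'] == 'ON_PEAK') | (df['tou'].shift(1) == 'ON_PEAK')]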
