Iterate over unique date and hour in the pandas dataframe to run a function - python-3.x

Hi, I am currently running a for loop over the unique dates in the dataframe and passing each subset to a function.
However, what I want is to iterate over each unique date-and-hour combination (e.g. 2020-12-18 15:00, 2020-12-18 16:00) in my dataframe. Is there any possible way to do this?
This is my code and a sample of my dataframe.
for day in df['DateTime'].dt.day.unique():
    testdf = df[df['DateTime'].dt.day == day]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
DateTime Values
0 2020-12-18 15:00:00 554.0
1 2020-12-18 15:00:00 594.0
2 2020-12-18 15:00:00 513.0
3 2020-12-18 16:00:00 651.0
4 2020-12-18 16:00:00 593.0
5 2020-12-18 17:00:00 521.0
6 2020-12-18 17:00:00 539.0
7 2020-12-18 17:00:00 534.0
8 2020-12-18 18:00:00 562.0
9 2020-12-19 08:00:00 511.0
10 2020-12-19 09:00:00 512.0
11 2020-12-19 09:00:00 584.0
12 2020-12-19 09:00:00 597.0
13 2020-12-22 09:00:00 585.0
14 2020-12-22 09:00:00 620.0
15 2020-12-22 09:00:00 593.0

You can use groupby if you need to process all dates in the DataFrame:
for day, testdf in df.groupby('DateTime'):
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT: If you need to filter only some dates from a list, use:
for date in ['2020-12-18 15:00', '2020-12-18 16:00']:
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT1: To run the test once per distinct DateTime value in the column:
for date in df['DateTime'].unique():
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
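If the timestamps were not already aligned to whole hours, the same idea works by flooring them first. A minimal sketch, assuming mk is pymannkendall as in the question, and collecting results with pd.concat (DataFrame.append was removed in pandas 2.0):
import pandas as pd

results = []
# dt.floor('H') truncates e.g. 15:07 and 15:42 to the same 15:00 bucket
for hour, testdf in df.groupby(df['DateTime'].dt.floor('H')):
    output = mk.original_test(testdf['Values'], alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s",
                      6: "var_s", 7: "slope", 8: "intercept"},
                     axis=1, inplace=True)
    results.append(output_df)

result_df = pd.concat(results, ignore_index=True)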

Related

how to get employee count by Hour and Date using pySpark / python?

I have employee ids and their clock-in and clock-out timings by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate a time series at hourly frequency between the adjusted clock-in
# and clock-out times
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
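To reshape the counts into the Date/Hour/emp_count layout of df2, one hedged sketch (dates come out zero-padded, e.g. '12/05/2021', unlike the hand-written example):
counts = hours.groupby(hours).count()
idx = pd.DatetimeIndex(counts.index)  # explode() can leave an object-dtype index
df_out = pd.DataFrame({
    "Date": idx.strftime("%m/%d/%Y"),
    "Hour": idx.strftime("%H:%M"),
    "emp_count": counts.to_numpy(),
})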

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift the value column by one year using df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value with df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also tried df['value'].shift(freq=relativedelta(years=+1)), but it raises: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column
No, I don't think so. It's because your second sample contains no February 29: when the data includes Feb 29 of a leap year, shifting by one year maps both Feb 28 and Feb 29 onto Feb 28 of the following year, creating a duplicated index label.
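To see the collision directly (a two-line check):
>>> pd.Timestamp('2020-02-28') + pd.DateOffset(years=1)
Timestamp('2021-02-28 00:00:00')
>>> pd.Timestamp('2020-02-29') + pd.DateOffset(years=1)
Timestamp('2021-02-28 00:00:00')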
Reproducible error with a range that includes a Feb 29:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below, whose range contains no Feb 29, works:
# 2017, 2018 and 2019 (365 days each) plus 2020-01-01
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove the duplicated index, in our case '2021-02-28', by using resample('D') and an aggregate function such as first, last, min, max, mean, sum, or a custom one:
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note: you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()], which keeps only the first row for each duplicated label.
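For example, a minimal sketch of that variant (keep='first' is Index.duplicated's default):
shifted = df.B.shift(freq=pd.DateOffset(years=1))
# Drop the second row carrying the duplicated '2021-02-28' label, then assign back
df.B = shifted.loc[~shifted.index.duplicated(keep='first')]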

Function to calculate timespan for a certain event

I have a pandas dataframe which looks like this
timestamp phase
2019-07-01 07:10:00 a
2019-07-01 07:11:00 a
2019-07-01 07:12:00 b
2019-07-01 07:13:00 b
2019-07-01 07:17:00 a
2019-07-01 07:19:00 a
2019-07-01 07:20:00 c
I am working on a function that creates a dataframe with a duration for every phase, until it hits the next phase.
I already have a solution, but I have no clue how to write this as a user-defined function, as I am new to Python.
This is my "static" solution:
df['prev_phase'] = df["phase"].shift(1)
df['next_phase'] = df["phase"].shift(-1)
dfshift = df[df.next_phase != df.prev_phase]
dfshift["delta"] = (dfshift["timestamp"]-dfshift["timestamp"].shift()).fillna(0)
dfshift["helpcolumn"] = dfshift["phase"].shift(1)
dfshift2 = dfshift[dfshift.helpcolumn == dfshift["phase"]]
dfshift3 = dfshift2[["timestamp","phase","delta"]]
dfshift3["deltaminutes"] = dfshift3['delta'] / np.timedelta64(60, 's')
This gives me the following output (example):
timestamp phase delta deltam
2019-05-01 06:44:00 a 0 days 04:51:00 291.0
2019-05-01 07:25:00 b 0 days 00:40:00 40.0
2019-05-01 21:58:00 a 0 days 14:32:00 872.0
2019-05-01 22:07:00 c 0 days 00:08:00 8.0
I just need this in a function.
Thanks in advance
Edit for @Tom
timestamp phase
2019-05-05 08:58:00 a
2019-05-05 08:59:00 a
2019-05-05 09:00:00 b
2019-05-05 09:01:00 b
2019-05-05 09:02:00 b
2019-05-05 09:03:00 b
...
...
2019-05-05 09:38:00 b
2019-05-05 09:39:00 c
2019-05-05 09:40:00 c
2019-05-05 09:41:00 c
Those are the two columns + index.
df = pd.DataFrame({"timestamp": ["2019-07-01 07:10:00",
                                 "2019-07-01 07:11:00",
                                 "2019-07-01 07:12:00",
                                 "2019-07-01 07:13:00",
                                 "2019-07-01 07:17:00",
                                 "2019-07-01 07:19:00",
                                 "2019-07-01 07:20:00"],
                   "phase": ["a", "a", "b", "b", "a", "a", "c"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])
# Create a 'phase_id' column to track when phase changes
df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
# Groupby new 'phase_id' variable and get time range for each phase
df_tdiff = df.groupby("phase_id").diff().reset_index()
df_tdiff.columns = ['phase_id', 'timediff']
# Merge this to old dataframe
df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
This then gives:
df_new
timestamp phase phase_id timediff
0 2019-07-01 07:10:00 a 1 00:01:00
1 2019-07-01 07:11:00 a 1 00:01:00
2 2019-07-01 07:12:00 b 3 00:01:00
3 2019-07-01 07:13:00 b 3 00:01:00
4 2019-07-01 07:17:00 a 5 00:02:00
5 2019-07-01 07:19:00 a 5 00:02:00
6 2019-07-01 07:20:00 c 7 NaT
Finally:
df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
df_new
timestamp phase timediff
0 2019-07-01 07:10:00 a 00:01:00
1 2019-07-01 07:12:00 b 00:01:00
2 2019-07-01 07:17:00 a 00:02:00
3 2019-07-01 07:20:00 c NaT
Of course, if you need that all as a function (as originally requested), then:
def get_phase_timediff(df):
    # Create a 'phase_id' column to track when phase changes
    df['phase_id'] = df['phase'].ne(df['phase'].shift()) + df.index
    # Groupby new 'phase_id' variable and get time range for each phase
    df_tdiff = df.groupby("phase_id").diff().reset_index()
    df_tdiff.columns = ['phase_id', 'timediff']
    # Merge this to old dataframe
    df_new = pd.merge(df, df_tdiff, on=["phase_id"], how="left")
    # Groupby 'phase_id' again for final output
    df_new = df_new.groupby("phase_id").first().reset_index(drop=True)
    return df_new
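Note that the ne(...) + df.index trick only assigns a shared id when each run of equal phases is at most two rows long; for runs of arbitrary length (like the long b run in the edit above), here is a hedged sketch using the more common cumsum labelling, assuming the desired duration is last minus first timestamp within each run:
def phase_durations(df):
    # Each phase change starts a new run; cumsum turns the change flags into run ids
    run_id = df['phase'].ne(df['phase'].shift()).cumsum()
    return (df.groupby(run_id)
              .agg(timestamp=('timestamp', 'first'),
                   phase=('phase', 'first'),
                   timediff=('timestamp', lambda s: s.iloc[-1] - s.iloc[0]))
              .reset_index(drop=True))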

Check whether a certain datetime value is missing in a given period

I have a df with DateTime index as follows:
DateTime
2017-01-02 15:00:00
2017-01-02 16:00:00
2017-01-02 18:00:00
....
....
2019-12-07 22:00:00
2019-12-07 23:00:00
Now, I want to know whether any time is missing in the 1-hour interval. For instance, between the 2nd and 3rd readings one hour is missing, since we went from 16:00 to 18:00. Is it possible to detect this?
Create a date_range from the minimal to the maximal datetime, then filter out the existing values with Index.isin, using ~ to invert the boolean mask:
print (df)
DateTime
0 2017-01-02 15:00:00
1 2017-01-02 16:00:00
2 2017-01-02 18:00:00
r = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='H')
print (r)
DatetimeIndex(['2017-01-02 15:00:00', '2017-01-02 16:00:00',
'2017-01-02 17:00:00', '2017-01-02 18:00:00'],
dtype='datetime64[ns]', freq='H')
out = r[~r.isin(df['DateTime'])]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', freq='H')
Another idea is to create a DatetimeIndex with a helper column, change the frequency with Series.asfreq, and filter for the index values with missing values:
s = df[['DateTime']].assign(val=1).set_index('DateTime')['val'].asfreq('H')
print (s)
DateTime
2017-01-02 15:00:00 1.0
2017-01-02 16:00:00 1.0
2017-01-02 17:00:00 NaN
2017-01-02 18:00:00 1.0
Freq: H, Name: val, dtype: float64
out = s.index[s.isna()]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', name='DateTime', freq='H')
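A third option, a hedged sketch not in the answers above: compare consecutive timestamps directly with diff and flag any step larger than one hour:
ts = df['DateTime'].sort_values()
gaps = ts.diff()
# Each flagged row is the first reading after a gap
print(ts[gaps > pd.Timedelta('1H')])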
Is it safe to assume that the datetime format will always be the same? If yes, why don't you extract the "hour" values from your respective timestamps and compare them to the interval you desire, e.g.:
import re

# store some datetime values for show
datetimes = [
    "2017-01-02 15:00:00",
    "2017-01-02 16:00:00",
    "2017-01-02 18:00:00",
    "2019-12-07 22:00:00",
    "2019-12-07 23:00:00"
]

# extract hour value via regex (the first match is always the hours in this format)
findHour = re.compile(r"\d{2}(?=:)")
prevx = findHour.findall(datetimes[0])[0]

# simple comparison: compare to previous value, calculate difference,
# set previous value to current value
for x in datetimes[1:]:
    cmp = findHour.findall(x)[0]
    diff = int(cmp) - int(prevx)
    if diff > 1:
        print("Missing Timestamp(s) between {} and {} hours!".format(prevx, cmp))
    prevx = cmp
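Note that the bare-hour comparison misfires across day boundaries (the jump from 2017-01-02 18:00 to 2019-12-07 22:00 differs by only 4 in the hour digits). A hedged sketch that parses the full timestamp instead:
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
parsed = [datetime.strptime(t, fmt) for t in datetimes]
for prev, cur in zip(parsed, parsed[1:]):
    gap_hours = (cur - prev).total_seconds() / 3600
    if gap_hours > 1:
        print('Missing reading(s) between {} and {}'.format(prev, cur))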

Indicate whether datetime of row is in a daterange

I'm trying to create dummy variables for holidays in a dataset. I have a couple of date ranges (pd.date_range()) containing holidays, and a dataframe to which I would like to append a dummy indicating whether the datetime of each row falls in one of the specified holiday ranges.
Small example:
ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if branch is never entered, so matching the dates doesn't work. Is there any way to do this? Alternative methods to arrive at dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
Better to use dt.date with Series.isin, and last convert the boolean mask to integers - Trues become 1:
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare against the DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = df.index.date.isin(ChristmasBreak).astype(int)
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print (df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
# Note: with an hourly DatetimeIndex this matches only the midnight rows,
# because ChristmasBreak holds plain dates; use df.index.normalize().isin(...)
# to match whole days
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
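If there are several holiday ranges, as the question mentions, a hedged extension of the isin pattern above (the second range here is made up for illustration):
breaks = [
    pd.date_range('2014-12-20', '2015-01-04'),  # Christmas break
    pd.date_range('2015-04-03', '2015-04-06'),  # hypothetical Easter weekend
]
holiday_days = breaks[0].union(breaks[1]).date
df['Holiday'] = df['Date'].dt.date.isin(holiday_days).astype(int)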
