How to get employee count by hour and date using PySpark / Python?

I have employee IDs and their clock-in and clock-out times by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)

Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate a time series at hourly frequency between the adjusted
# clock-in and clock-out times
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
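Since the title also asks about PySpark, here is a rough, hedged sketch of the same floor/ceil-and-explode idea in Spark. It assumes a Spark DataFrame sdf with Clockin/Clockout already cast to timestamp (the column names are from the question; sdf and the intermediate column names in_h/out_h are hypothetical):
from pyspark.sql import functions as F

# Floor clock-in to the hour; "ceil" clock-out to the next full hour by
# stepping back 1 second before truncating, so an on-the-hour clock-out
# maps to itself.
hours = (
    sdf
    .withColumn("in_h", F.date_trunc("hour", F.col("Clockin")))
    .withColumn(
        "out_h",
        F.date_trunc("hour", F.col("Clockout") - F.expr("INTERVAL 1 SECOND"))
        + F.expr("INTERVAL 1 HOUR"),
    )
    # One row per hour bucket in the (in_h, out_h] interval
    .withColumn(
        "hour",
        F.explode(F.expr("sequence(in_h + INTERVAL 1 HOUR, out_h, INTERVAL 1 HOUR)")),
    )
)
emp_count = hours.groupBy("hour").count().orderBy("hour")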

Related

Iterate over unique date and hour in the pandas dataframe to run a function

Hi, I am currently running a for loop over the unique dates in the dataframe to pass each subset to a function.
However, what I want is to iterate over the unique date and hour combinations (e.g. 2020-12-18 15:00, 2020-12-18 16:00) in my dataframe. Is there any possible way to do this?
This is my code and a sample of my dataframe.
for day in df['DateTime'].dt.day.unique():
    testdf = df[df['DateTime'].dt.day == day]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
DateTime Values
0 2020-12-18 15:00:00 554.0
1 2020-12-18 15:00:00 594.0
2 2020-12-18 15:00:00 513.0
3 2020-12-18 16:00:00 651.0
4 2020-12-18 16:00:00 593.0
5 2020-12-18 17:00:00 521.0
6 2020-12-18 17:00:00 539.0
7 2020-12-18 17:00:00 534.0
8 2020-12-18 18:00:00 562.0
9 2020-12-19 08:00:00 511.0
10 2020-12-19 09:00:00 512.0
11 2020-12-19 09:00:00 584.0
12 2020-12-19 09:00:00 597.0
13 2020-12-22 09:00:00 585.0
14 2020-12-22 09:00:00 620.0
15 2020-12-22 09:00:00 593.0
You can use groupby if you need to filter by all dates in the DataFrame:
for day, testdf in df.groupby('DateTime'):
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT: If you need to filter only some dates from a list, use:
for date in ['2020-12-18 15:00', '2020-12-18 16:00']:
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
EDIT 1: To iterate over every date value in the column:
for date in df['DateTime']:
    testdf = df[df['DateTime'] == date]
    testdf.set_index('DateTimeStarted', inplace=True)
    output = mk.original_test(testdf, alpha=0.05)
    output_df = pd.DataFrame(output).T
    output_df.rename({0: "Trend", 1: "h", 2: "p", 3: "z", 4: "Tau", 5: "s", 6: "var_s", 7: "slope", 8: "intercept"}, axis=1, inplace=True)
    result_df = result_df.append(output_df)
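Note that all three variants group or filter on the exact timestamp values. If DateTime also carried minutes or seconds, a variant (an assumption, not part of the answer above) would be to group by the hour-floored timestamp instead:
# Group by the hour bucket instead of the exact timestamp value
for hour, testdf in df.groupby(df['DateTime'].dt.floor('H')):
    output = mk.original_test(testdf['Values'], alpha=0.05)
    # ...same renaming and appending as above...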

Finding the frequency distribution of values in a column

I have a df (8360 rows × 3 columns):
Time A B
0 01.01.2018 00:00:00 0.019098 32.437083
1 01.01.2018 01:00:00 0.018871 32.462083
2 01.01.2018 02:00:00 0.018643 32.487083
3 01.01.2018 03:00:00 0.018416 32.512083
4 01.01.2018 04:00:00 0.018189 32.537083
5 01.01.2018 05:00:00 0.017961 32.562083
6 01.01.2018 06:00:00 0.017734 33.189708
7 01.01.2018 07:00:00 0.017507 34.122968
8 01.01.2018 08:00:00 0.017279 32.897831
9 01.01.2018 09:00:00 0.017052 32.482338
and want to group the df by the numeric values in column B. I want to find out in which value ranges the numbers in the column are increasing/decreasing the most (a frequency distribution).
Right now I just use df.describe() and play with the numbers.
For example, I found out that there are 300 values smaller than 1:
new_df = df[df['B'] < 1]
Is there a specific function to help me with this task?
To get an idea about the distribution of the values, just plot a histogram. For example, in a Jupyter notebook:
%matplotlib inline
df.B.hist()
or compute a cumulative frequency histogram with scipy:
import scipy.stats
scipy.stats.cumfreq(df.B)
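If you want a tabular frequency distribution rather than a plot, pandas' own binning may also help (a sketch, assuming the df from the question; the bin count of 10 is arbitrary):
# Count how many values of column B fall into each of 10 equal-width bins
print(df['B'].value_counts(bins=10).sort_index())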

Check whether a certain datetime value is missing in a given period

I have a df with DateTime index as follows:
DateTime
2017-01-02 15:00:00
2017-01-02 16:00:00
2017-01-02 18:00:00
....
....
2019-12-07 22:00:00
2019-12-07 23:00:00
Now, I want to know whether any time is missing in the 1-hour interval. For instance, an hour is missing between the 2nd and 3rd readings, as we went from 16:00 to 18:00. Is it possible to detect this?
Create a date_range from the minimal to the maximal datetime, then filter the values with Index.isin, using ~ to invert the boolean mask:
print (df)
DateTime
0 2017-01-02 15:00:00
1 2017-01-02 16:00:00
2 2017-01-02 18:00:00
r = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='H')
print (r)
DatetimeIndex(['2017-01-02 15:00:00', '2017-01-02 16:00:00',
'2017-01-02 17:00:00', '2017-01-02 18:00:00'],
dtype='datetime64[ns]', freq='H')
out = r[~r.isin(df['DateTime'])]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', freq='H')
Another idea is to create a DatetimeIndex from a helper column, change the frequency with Series.asfreq, and filter the index positions that have missing values:
s = df[['DateTime']].assign(val=1).set_index('DateTime')['val'].asfreq('H')
print (s)
DateTime
2017-01-02 15:00:00 1.0
2017-01-02 16:00:00 1.0
2017-01-02 17:00:00 NaN
2017-01-02 18:00:00 1.0
Freq: H, Name: val, dtype: float64
out = s.index[s.isna()]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', name='DateTime', freq='H')
Is it safe to assume that the datetime format will always be the same? If yes, why don't you extract the "hour" values from your respective timestamps and compare them to the interval you desire, e.g.:
import re

# store some datetime values for show
datetimes = [
    "2017-01-02 15:00:00",
    "2017-01-02 16:00:00",
    "2017-01-02 18:00:00",
    "2019-12-07 22:00:00",
    "2019-12-07 23:00:00"
]
# extract the hour value via regex (the first match is always the hours in this format)
findHour = re.compile(r"\d{2}(?=:)")
prevx = findHour.findall(datetimes[0])[0]
# simple comparison: compare to the previous value, calculate the difference,
# then set the previous value to the current value
for x in datetimes[1:]:
    cmp = findHour.findall(x)[0]
    diff = int(cmp) - int(prevx)
    if diff > 1:
        print("Missing Timestamp(s) between {} and {} hours!".format(prevx, cmp))
    prevx = cmp
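Note that the regex approach compares hour values only, so it can misfire across day boundaries. A format-independent alternative (a sketch, not part of either answer above) is to parse the strings with pandas and check consecutive differences:
import pandas as pd

ts = pd.to_datetime(pd.Series(datetimes))
# timestamps that follow a gap of more than one hour
print(ts[ts.diff() > pd.Timedelta(hours=1)])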

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' column depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
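An equivalent without the temporary columns (a sketch, assuming df_lookup has already been renamed to the single 'multiplier' column as above) maps the hour index directly:
# Look up each row's hour via Index.map, then multiply elementwise
df['data-3'] = df['data-2'] * df.index.hour.map(df_lookup['multiplier'])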

Adding Minutes to Pandas DatetimeIndex

I have an index that contains dates.
DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
'2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
'2004-01-14', '2004-01-15',
...
'2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
'2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
'2015-12-30', '2015-12-31'],
dtype='datetime64[ns]', length=3021, freq=None)
Now, for each day, I would like to generate every minute (24*60 = 1440 minutes) within that day and make an index with all days and minutes.
The result should look like:
['2004-01-02 00:00:00', '2004-01-02 00:01:00', ..., '2004-01-02 23:59:00',
'2004-01-03 00:00:00', '2004-01-03 00:01:00', ..., '2004-01-03 23:59:00',
...
'2015-12-31 00:00:00', '2015-12-31 00:01:00', ..., '2015-12-31 23:59:00']
Is there a smart trick for this?
You should be able to use .asfreq() here:
>>> import pandas as pd
>>> days = pd.date_range(start='2018-01-01', periods=10)
>>> df = pd.DataFrame(list(range(len(days))), index=days)
>>> df.asfreq('min')
0
2018-01-01 00:00:00 0.0
2018-01-01 00:01:00 NaN
2018-01-01 00:02:00 NaN
2018-01-01 00:03:00 NaN
2018-01-01 00:04:00 NaN
2018-01-01 00:05:00 NaN
2018-01-01 00:06:00 NaN
# ...
>>> df.shape
(10, 1)
>>> df.asfreq('min').shape
(12961, 1)
If that doesn't work for some reason, you might also want to have a look into pd.MultiIndex.from_product(); then pd.to_datetime() on the concatenated result.
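For reference, here is a sketch of that pd.MultiIndex.from_product() idea, using timedelta addition as a slight variation on the pd.to_datetime() suggestion (days is the index from the example above):
# Pair every day with all 1440 minute-of-day offsets, then add the two levels
minutes = pd.timedelta_range(start='0min', periods=24 * 60, freq='min')
mi = pd.MultiIndex.from_product([days, minutes])
full_index = mi.get_level_values(0) + mi.get_level_values(1)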
