Finding the frequency distribution of values in a column - python-3.x

I have a df (8360 rows x 3 columns):
Time A B
0 01.01.2018 00:00:00 0.019098 32.437083
1 01.01.2018 01:00:00 0.018871 32.462083
2 01.01.2018 02:00:00 0.018643 32.487083
3 01.01.2018 03:00:00 0.018416 32.512083
4 01.01.2018 04:00:00 0.018189 32.537083
5 01.01.2018 05:00:00 0.017961 32.562083
6 01.01.2018 06:00:00 0.017734 33.189708
7 01.01.2018 07:00:00 0.017507 34.122968
8 01.01.2018 08:00:00 0.017279 32.897831
9 01.01.2018 09:00:00 0.017052 32.482338
and want to group the df by the numeric values of column B. I want to find out in what range the values in the column are increasing/decreasing the most (frequency distribution).
Right now I just use df.describe() and play with the numbers.
For example, I found out that there are 300 values smaller than 1:
new_df = df[df['B'] < 1]
Is there a specific function to help me with this task?

To get an idea about the distribution of values, just plot a histogram. For example, in a Jupyter notebook:
%matplotlib inline
df.B.hist()
Or compute a cumulative frequency histogram with SciPy:
import scipy.stats
scipy.stats.cumfreq(df.B)
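If you need explicit counts per value range rather than a plot, binning with pd.cut plus value_counts is a natural fit (a minimal sketch; the bin edges below are arbitrary, adjust them to your data):
import pandas as pd
# bin B into value ranges and count how many observations fall into each bin
bins = [0, 1, 10, 20, 30, 40]  # assumed edges, purely for illustration
pd.cut(df['B'], bins=bins).value_counts().sort_index()
Sorting by index keeps the bins in numeric order instead of ordering them by count.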

Related

How to get employee count by Hour and Date using pySpark / python?

I have employee ids and their clock-in and clock-out times by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate time series at hourly frequency between adjusted clock in and clock
# out time
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
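If you need the exact Date / Hour / emp_count layout of the expected output, the counts can be reshaped afterwards (a sketch reusing the hours series from above; the date and hour string formats are assumed from your example):
counts = hours.groupby(hours).count()
result = pd.DataFrame({
    "Date": counts.index.strftime("%m/%d/%Y"),
    "Hour": counts.index.strftime("%H:%M"),
    "emp_count": counts.to_numpy(),
})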

Pandas Get date timestamp which belong to businesshours or Businessdays

I want to find out all the business hours and business days from the given list. I followed a couple of docs about pandas offsets but could not figure it out. I followed Stack Overflow as well; here is a similar question, but no luck.
>>> d = {'hours': ['2020-02-11 13:44:53', '2020-02-12 13:44:53', '2020-02-11 8:44:53', '2020-02-02 13:44:53']}
>>> df = pd.DataFrame(d)
>>> df
hours
0 2020-02-11 13:44:53
1 2020-02-12 13:44:53
2 2020-02-11 8:44:53
3 2020-02-02 13:44:53
>>> y = df['hours']
>>> from pandas.tseries.offsets import *
>>> y.apply(pd.Timestamp).asfreq(BDay())
1970-01-01 NaT
Freq: B, Name: hours, dtype: datetime64[ns]
>>> y.apply(pd.Timestamp).asfreq(BusinessHour())
Series([], Freq: BH, Name: hours, dtype: datetime64[ns])
I suppose you are looking for something like:
bh = pd.offsets.BusinessHour()  # avoids unnecessary imports
y.apply(pd.Timestamp).apply(bh.rollforward)
The result is:
0 2020-02-11 13:44:53
1 2020-02-12 13:44:53
2 2020-02-11 09:00:00
3 2020-02-03 09:00:00
Name: hours, dtype: datetime64[ns]
So:
the first two timestamps have not been changed (they are within business hours);
the third (2020-02-11 8:44:53) has been advanced to 9:00 (the start of the business day);
the fourth (2020-02-02 13:44:53, a Sunday) has been advanced to the next business day (Monday) at 9:00.
Or, if you only want to check whether a particular date/hour is within business hours, run:
y.apply(pd.Timestamp).apply(bh.is_on_offset)  # this method was called onOffset in pandas < 1.1
The result is:
0 True
1 True
2 False
3 False
Name: hours, dtype: bool
meaning that the last two dates/hours are outside business hours.
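The question also asks about business days; a minimal sketch for that part, assuming a business day simply means Monday through Friday (no holiday calendar):
bd = pd.offsets.BDay()
# True for timestamps that fall on a weekday, False for weekends
y.apply(pd.Timestamp).apply(bd.is_on_offset)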

Pandas dataframe - Setting with copy warning

I would like to forward-fill an hourly dataframe so that the value for hour 1 gets forward-filled to every hour 1 on the following days, and the same for each of the 24 hours.
The dataframe looks like this:
Timestamp input1 input2 input3
… … … ..
01.01.2018 00:00 2 5 4
01.01.2018 01:00 3 3 2
01.01.2018 02:00 5 6 1
…
01.01.2018 22:00 2 0 1
01.01.2018 23:00 5 3 3
02.01.2018 00:00 6 2 5
02.01.2018 01:00 3 6 4
02.01.2018 02:00 3 9 6
02.01.2018 03:00 5 1 7
…
02.01.2018 23:00 2 5 1
03.01.2018 00:00 NaN NaN NaN
…
03.01.2018 23:00 NaN NaN NaN
I am using the following code for this:
for hr in range(0, 24):
    df.loc[df.index.hour == hr, Inputs] = df.loc[df.index.hour == hr, Inputs].fillna(method='ffill')
This works.
Unfortunately I am getting a Warning Message:
\Python\WPy-3670_32bit\python-3.6.7\lib\site-packages\pandas\core\indexing.py:543: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
How can I solve this so that I do not get the warning anymore?
The resulting df should have the NaNs filled.
I executed your code and didn't get the mentioned warning (nor any other).
Using loc is exactly the right way to avoid such warnings (as the message itself suggests).
Maybe you are using an older version of Pandas?
Upgrade to 0.25 if you have an older version and try again.
Another suspicion: maybe the warning pertains to some other instruction
in your code (one without loc)?
This also works:
df[df.index.hour == hr] = df[df.index.hour == hr].fillna(method="ffill")
It is very similar to .loc, but does not tend to raise as many SettingWithCopy warnings.
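As an alternative that avoids both the loop and the warning entirely, a single grouped forward-fill does the same thing (a sketch, assuming the index is a DatetimeIndex and Inputs is the list of columns to fill):
# forward-fill within each hour-of-day group in one pass
df[Inputs] = df.groupby(df.index.hour)[Inputs].ffill()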

Groupby expanding count - elements changing of group at different time stamps

I have a HUGE DataFrame that looks as follows (this is just an example to illustrate the problem):
id timestamp target_time interval
1 08:00:00 10:20:00 (10-11]
1 08:30:00 10:21:00 (10-11]
1 09:10:00 11:30:00 (11-12]
2 09:15:00 10:15:00 (10-11]
2 09:35:00 10:11:00 (10-11]
3 09:45:00 11:12:00 (11-12]
...
I would like to create a series looking as follows:
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 1
09:35:00 1
(11-12] 09:10:00 1
09:45:00 2
The objective is to count, for each time interval, how many unique ids had their corresponding target_time within the interval at their timestamp. Note that the target_time for each id can change at different timestamps. For instance, for the id 1 the interval is (10-11] from 08:00:00 to 08:30:00, but then it changes to (11-12] at 09:10:00. Therefore, at 09:15:00 I do not want to count the id 1 in the resulting Series.
I tried a groupby -> expanding -> np.unique approach, but it does not provide the result that I want:
df.set_index('timestamp').groupby('interval').id.expanding().apply(lambda x: np.unique(x).shape[0])
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 2
09:35:00 2
(11-12] 09:10:00 1
09:45:00 2
Any hint on how I can approach this problem? I want to make use of pandas routines as much as possible, in order to reduce computation time, since the length of the DataFrame is 1453076...
Many thanks in advance!
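One possible direction, sketched under the assumption that each id belongs to exactly one interval at any given time: treat every row as the id entering its current interval and leaving its previous one, then cumulatively sum those +1/-1 events per interval. Unlike the expanding approach, an id that moves to a new interval is subtracted from its old one:
import pandas as pd

df = df.sort_values('timestamp')
# each id's previous interval; NaN on the id's first appearance
prev = df.groupby('id')['interval'].shift()
enter = df[['interval', 'timestamp']].assign(delta=1)
mask = prev.notna()
leave = pd.DataFrame({'interval': prev[mask],
                      'timestamp': df.loc[mask, 'timestamp'],
                      'delta': -1})
events = pd.concat([enter, leave])
counts = (events.groupby(['interval', 'timestamp'])['delta'].sum()
                .groupby(level='interval').cumsum())
This produces a row for every (interval, timestamp) at which the membership changes, including zero counts when the last id leaves an interval, which differs slightly from the desired layout but carries the same information.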

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to 'data-2' depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
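Mapping the hour index directly onto the lookup values also works and skips the temporary hour column (a sketch, assuming df_lookup has the single multiplier column as renamed above):
# align each row's hour with its lookup value, then multiply positionally
mult = df.index.hour.map(df_lookup['multiplier'])
df['data-3'] = df['data-2'] * mult.to_numpy()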
