Groupby expanding count - elements changing group at different timestamps - python-3.x

I have a huge DataFrame that looks as follows (this is just an example to illustrate the problem):
id timestamp target_time interval
1 08:00:00 10:20:00 (10-11]
1 08:30:00 10:21:00 (10-11]
1 09:10:00 11:30:00 (11-12]
2 09:15:00 10:15:00 (10-11]
2 09:35:00 10:11:00 (10-11]
3 09:45:00 11:12:00 (11-12]
...
I would like to create a series looking as follows:
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 1
09:35:00 1
(11-12] 09:10:00 1
09:45:00 2
The objective is to count, for each time interval, how many unique ids had their corresponding target_time within the interval at their timestamp. Note that the target_time for each id can change at different timestamps. For instance, for the id 1 the interval is (10-11] from 08:00:00 to 08:30:00, but then it changes to (11-12] at 09:10:00. Therefore, at 09:15:00 I do not want to count the id 1 in the resulting Series.
I tried a groupby -> expanding -> np.unique approach, but it does not give the result that I want:
df.set_index('timestamp').groupby('interval').id.expanding().apply(lambda x: np.unique(x).shape[0])
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 2
09:35:00 2
(11-12] 09:10:00 1
09:45:00 2
Any hint on how I can approach this problem? I want to use pandas routines as much as possible to keep the computation time down, since the DataFrame has 1453076 rows...
Many thanks in advance!
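For what it's worth, here is a minimal plain-loop sketch of the bookkeeping described above, using the example frame from the question (it is not vectorized, so treat it as a reference for the logic rather than something to run on 1453076 rows): each id carries a "current interval", and moving an id to a new interval decrements the count of the interval it left.
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3],
    'timestamp': ['08:00:00', '08:30:00', '09:10:00', '09:15:00', '09:35:00', '09:45:00'],
    'interval': ['(10-11]', '(10-11]', '(11-12]', '(10-11]', '(10-11]', '(11-12]'],
}).sort_values('timestamp')

counts = {}    # current number of unique ids per interval
current = {}   # interval each id currently belongs to
rows = []
for _, row in df.iterrows():
    prev = current.get(row['id'])
    if prev != row['interval']:
        if prev is not None:
            counts[prev] -= 1                      # the id left its old interval
        counts[row['interval']] = counts.get(row['interval'], 0) + 1
        current[row['id']] = row['interval']
    rows.append((row['interval'], row['timestamp'], counts[row['interval']]))

result = (pd.DataFrame(rows, columns=['interval', 'timestamp', 'unique_ids'])
            .set_index(['interval', 'timestamp'])
            .sort_index())
This reproduces the expected series above. The same +1/-1 "event" idea can be vectorized (detect interval changes per id with groupby().shift(), then take a cumulative sum per interval), which would be the route to take at this scale.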

Related

Define a function to classify records within a df and add new columns. Pandas dfs

I have a list of about 20 dfs and I want to clean the data for analysis.
Can there be a function that loops through all the dfs in the list and performs the tasks below, if all the columns are the same?
Create a column [time_class] that classifies each arrival time as "early" or "late" by comparing it with the [appt_time] column. Next, I want to classify each record as "early_yes", "early_no", "late_yes" or "late_no" in another column called [time_response]. This column would check the values of [time_class], [YES] and [NO]: if a record is 'early' and has a '1' for yes, then the [time_response] column should say "early_yes". Then I want a frequency table counting the [time_response] occurrences; the table headers will be the values from the [time_response] column.
How can I check to make sure the time columns are reading as times in pandas?
How can I change the values in the yes and no column to 'yes' and 'no' instead of the 1's?
Each df has this format for these specific columns:
Arrival_time  Appt_Time  YES  NO
07:25:00      08:00      1
08:24:00      08:40      1
08:12:00      09:00           1
09:20:00      09:30           1
10:01:00      10:00           1
09:33:00      09:30      1
10:22:00      10:20      1
10:29:00      10:30           1
I also have an age column in each df that I have tried binning using the cut() method, and I usually get an error saying the input must be a one-dimensional array. Does this mean I cannot use this method if the df has columns other than just age?
How can you define a function to check the age column and create bins grouped by 10 [20-100], then use these bins to create a frequency table? Ideally I'd like the freq table to be columns in each df. I am using pandas.
Any help is appreciated!!
UPDATE: When I try to compare arrival time and scheduled time, I get: TypeError: '<=' not supported between instances of 'int' and 'datetime.time'
Hopefully this helps you get started - you'll see that there are a few useful methods, like replace in pandas and select from the numpy library. Also, if you want to apply any of this code to multiple dataframes that are all in the same format, you'll want to wrap it in a function.
import numpy as np
import pandas as pd
### this code helps recreate the df you posted
df = pd.DataFrame({
    "Arrival_time": ['07:25:00', '08:24:00', '08:12:00', '09:20:00', '10:01:00', '09:33:00', '10:22:00', '10:29:00'],
    "Appt_Time": ['08:00', '08:40', '09:00', '09:30', '10:00', '09:30', '10:20', '10:30'],
    "YES": ['1', '1', '', '', '', '1', '', '1'],
    "NO": ['', '', '1', '1', '1', '', '1', '']})
df.Arrival_time = pd.to_datetime(df.Arrival_time, format='%H:%M:%S').dt.time
df.Appt_Time = pd.to_datetime(df.Appt_Time, format='%H:%M').dt.time
### end here
# you can start using the code from this line onward:
# creates "time_class" column based on Arrival_time being before Appt_Time
df["time_class"] = (df.Arrival_time <= df.Appt_Time).replace({True: "early", False: "late"})
# creates a new column "time_response" based on conditions
# this may need to change depending on whether your "YES" and "NO" columns
# hold strings or ints... I assumed strings, so modify this code as needed
conditions = [
    (df.time_class == "early") & (df.YES == '1'),
    (df.time_class == "early") & (df.YES != '1'),
    (df.time_class == "late") & (df.YES == '1'),
    (df.time_class == "late") & (df.YES != '1')]
choices = ["early_yes", "early_no", "late_yes", "late_no"]
df["time_response"] = np.select(conditions, choices)
# creates a new df to sum up each time_response
df_time_response_count = pd.DataFrame({"Counts": df["time_response"].value_counts()})
# replace '1' with "YES" in the YES column and with "NO" in the NO column
df.YES = df.YES.replace({'1': "YES"})
df.NO = df.NO.replace({'1': "NO"})
Output:
>>> df
  Arrival_time Appt_Time  YES  NO time_class time_response
0     07:25:00  08:00:00  YES          early     early_yes
1     08:24:00  08:40:00  YES          early     early_yes
2     08:12:00  09:00:00       NO      early      early_no
3     09:20:00  09:30:00       NO      early      early_no
4     10:01:00  10:00:00       NO       late       late_no
5     09:33:00  09:30:00  YES           late      late_yes
6     10:22:00  10:20:00       NO       late       late_no
7     10:29:00  10:30:00  YES          early     early_yes
>>> df_time_response_count
Counts
early_yes 3
late_no 2
early_no 2
late_yes 1
To answer your question about binning, I think np.linspace() is the easiest way to create the bins you want.
So I'll add some random ages between 20 and 100 to the df:
df['age'] = [21,31,34,26,46,70,56,55]
So the dataframe looks like this:
df
  Arrival_time Appt_Time  YES  NO time_class time_response  age
0     07:25:00  08:00:00  YES          early     early_yes   21
1     08:24:00  08:40:00  YES          early     early_yes   31
2     08:12:00  09:00:00       NO      early      early_no   34
3     09:20:00  09:30:00       NO      early      early_no   26
4     10:01:00  10:00:00       NO       late       late_no   46
5     09:33:00  09:30:00  YES           late      late_yes   70
6     10:22:00  10:20:00       NO       late       late_no   56
7     10:29:00  10:30:00  YES          early     early_yes   55
Then use the value_counts method in pandas with the bins parameter:
df_age_counts = pd.DataFrame({"Counts": df.age.value_counts(bins = np.linspace(20,100,9))})
df_age_counts = df_age_counts.sort_index()
Output:
>>> df_age_counts
Counts
(19.999, 30.0] 2
(30.0, 40.0] 2
(40.0, 50.0] 1
(50.0, 60.0] 2
(60.0, 70.0] 1
(70.0, 80.0] 0
(80.0, 90.0] 0
(90.0, 100.0] 0
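On the side question of checking that the time columns really parsed as times: after the to_datetime(...).dt.time conversion above, the columns hold datetime.time objects, so the dtype still displays as object; a quick sanity check is to look at the element type itself, for example:
# dtype stays 'object' because datetime.time is a plain Python object
print(df[['Arrival_time', 'Appt_Time']].dtypes)
# inspect the actual element type of one value
print(type(df['Arrival_time'].iloc[0]))  # <class 'datetime.time'>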

Finding the frequency distribution of values in a column

I have a df (8360 rows x 3 columns)
Time A B
0 01.01.2018 00:00:00 0.019098 32.437083
1 01.01.2018 01:00:00 0.018871 32.462083
2 01.01.2018 02:00:00 0.018643 32.487083
3 01.01.2018 03:00:00 0.018416 32.512083
4 01.01.2018 04:00:00 0.018189 32.537083
5 01.01.2018 05:00:00 0.017961 32.562083
6 01.01.2018 06:00:00 0.017734 33.189708
7 01.01.2018 07:00:00 0.017507 34.122968
8 01.01.2018 08:00:00 0.017279 32.897831
9 01.01.2018 09:00:00 0.017052 32.482338
and I want to group the df by the numeric value of column B, to find out in which ranges the values in the column occur most and least often (a frequency distribution).
Right now I just use df.describe() and play with the numbers.
For example, I found out that there are 300 values smaller than 1:
new_df = df[df['B'] < 1]
Is there a specific function to help me with this task?
To get an idea of the distribution of the values, just plot a histogram. For example, in a Jupyter notebook:
%matplotlib inline
df.B.hist()
or compute a cumulative frequency histogram with scipy:
import scipy.stats
scipy.stats.cumfreq(df.B)
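If you want actual counts per value range rather than a plot, a small sketch (assuming you choose the bin edges yourself) is pd.cut combined with value_counts:
import pandas as pd

# bin column B into ranges and count how many values fall into each
bins = [0, 1, 10, 20, 30, 40]   # example edges; adjust to your data
freq = pd.cut(df['B'], bins=bins).value_counts().sort_index()
print(freq)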

Dividing two dataframes gives NaN

I have two dataframes: one with a metric summed over the whole month, the other with the metric as of the last day of the month. The former (monthly_profit) looks like this:
profit
yyyy_mm_dd
2018-01-01 8797234233.0
2018-02-01 3464234233.0
2018-03-01 5676234233.0
...
2019-10-01 4368234233.0
While the latter (monthly_employees) looks like this:
employees
yyyy_mm_dd
2018-01-31 924358
2018-02-28 974652
2018-03-31 146975
...
2019-10-31 255589
I want to get profit per employee, so I've done this:
profit_per_employee = (monthly_profit['profit']/monthly_employees['employees'])*100
This is the output that I get:
yyyy_mm_dd
2018-01-01 NaN
2018-01-31 NaN
2018-02-01 NaN
2018-02-28 NaN
How could I fix this? The reason that one dataframe is the last day of the month and the other is the first day of the month is due to rolling vs non-rolling data.
monthly_profit is the result of grouping and summing daily profit data:
monthly_profit = df.groupby(['yyyy_mm_dd'])[['profit']].sum()
monthly_profit = monthly_profit.resample('MS').sum()
While monthly_employees is a running total, so I need to take the current value for the last day of each month:
monthly_employees = df.groupby(['yyyy_mm_dd'])[['employees']].sum()
monthly_employees = monthly_employees.groupby([monthly_employees.index.year, monthly_employees.index.month]).tail(1)
Change MS to M to resample to month-end dates, so that the two DatetimeIndex match:
monthly_profit = monthly_profit.resample('M').sum()
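As a minimal sketch of why this fixes it (with made-up numbers): once both objects carry month-end dates, the division aligns on the index instead of producing all NaN:
import pandas as pd

profit = pd.Series([8797.0, 3464.0],
                   index=pd.to_datetime(['2018-01-31', '2018-02-28']))
employees = pd.Series([924358, 974652],
                      index=pd.to_datetime(['2018-01-31', '2018-02-28']))

print((profit / employees) * 100)   # matching indexes -> real numbers, no NaN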

How to join Minute based time-range with Date using Pandas?

My dataset df looks like this:
DateTimeVal Open
2017-01-01 17:00:00 5.1532
2017-01-01 17:01:00 5.3522
2017-01-01 17:02:00 5.4535
2017-01-01 17:03:00 5.3567
2017-01-01 17:04:00 5.1512
....
It is minute-based data.
The time values currently start at 17:00:00, but I want to change only the time part so that it starts at 00:00:00 and runs minute by minute up to 23:59:00.
The current time starts at 17:00:00, increments each minute, and ends at 16:59:00. There are 1440 rows in total, so I can confirm it is minute-based 24-hour data.
My new df should looks like this:
DateTimeVal Open
2017-01-01 00:00:00 5.1532
2017-01-01 00:01:00 5.3522
2017-01-01 00:02:00 5.4535
2017-01-01 00:03:00 5.3567
2017-01-01 00:04:00 5.1512
....
Here, we did not change anything except the Time part.
What did I do?
My logic was to remove the time part and then populate it with new time values.
Here is what I did:
pd.DatetimeIndex(df['DateTimeVal'].astype(str).str.rsplit(' ', 1).str[0], dayfirst=True)
But I do not know how to add the new Time data. Could you please help?
How about subtracting 17 hours from your DateTimeVal:
df['DateTimeVal'] -= pd.Timedelta(hours=17)
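A quick check on a couple of rows from the question (just a sketch) shows the shift does what you want, including rolling 17:00:00 back to midnight of the same day:
import pandas as pd

df = pd.DataFrame({
    'DateTimeVal': pd.to_datetime(['2017-01-01 17:00:00', '2017-01-01 17:01:00']),
    'Open': [5.1532, 5.3522],
})
df['DateTimeVal'] -= pd.Timedelta(hours=17)
print(df)
#           DateTimeVal    Open
# 0 2017-01-01 00:00:00  5.1532
# 1 2017-01-01 00:01:00  5.3522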

Apply a value to max values in a groupby

I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1
The idea is to write an anonymous function that operates on each of your groups and feed it to your groupby using apply (casting the resulting booleans to 0/1):
df['Value'] = df.groupby('ID', as_index=False).apply(lambda x: x.Time == max(x.Time)).values.astype(int)
Assuming that your 'Time' column is already a datetime64 then you want to groupby on 'ID' column and then call transform to apply a lambda to create a series with an index aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is because the dtype returned is a datetime for some reason rather than a boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]
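Note that in more recent pandas versions the transform comparison returns a plain boolean Series, so the .dt.nanosecond workaround is no longer needed; an equivalent (a sketch, assuming a current pandas) would be:
# compare each Time against its group's max and cast the booleans to 0/1
df['Value'] = df.groupby('ID')['Time'].transform('max').eq(df['Time']).astype(int)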
