Define a function to classify records within a df and add new columns (pandas dfs) - python-3.x

I have a list of about 20 dfs and I want to clean the data for analysis.
Is there a way to write a function that loops through all the dfs in the list and performs the tasks below, given that all the columns are the same?
Create a column [time_class] that classifies each arrival time as "early" or "late" by comparing it with the [appt_time] column. Next I want to classify each record as "early_yes", "early_no", "late_yes" or "late_no" in another column called [time_response]; this column would check the values of [time_class], [YES] and [NO]. For example, if a record is 'early' and has '1' for yes, then the [time_response] column should say "early_yes". Then I want a frequency table to count the [time_response] occurrences, with the table headers taken from the [time_response] column.
How can I check to make sure the time columns are reading as times in pandas?
How can I change the values in the yes and no columns to 'yes' and 'no' instead of the 1's?
Each df has this format for these specific columns:
Arrival_time  Appt_Time  YES  NO
07:25:00      08:00      1
08:24:00      08:40      1
08:12:00      09:00           1
09:20:00      09:30           1
10:01:00      10:00           1
09:33:00      09:30      1
10:22:00      10:20           1
10:29:00      10:30      1
I also have an age column in each df that I have tried binning using the cut() method, and I usually get an error that the input must be a one-dimensional array. Does this mean I cannot use this method if the df has columns other than just the age?
How can you define a function to check the age column, create bins grouped by 10 over [20-100], and then use these bins to create a frequency table? Ideally I'd like the frequency table to become columns in each df. I am using pandas.
Any help is appreciated!!
UPDATE: When I try to compare arrival time and scheduled time, I get a TypeError: '<=' not supported between instances of 'int' and 'datetime.time'.

Hopefully this helps you get started - you'll see there are a few useful methods like replace in pandas and select from the numpy library. Also, if you want to apply any of this code to multiple dataframes that all share the same format, you'll want to wrap it in a function.
import numpy as np
import pandas as pd
### this code helps recreate the df you posted
df = pd.DataFrame({
    "Arrival_time": ['07:25:00', '08:24:00', '08:12:00', '09:20:00', '10:01:00', '09:33:00', '10:22:00', '10:29:00'],
    "Appt_Time": ['08:00', '08:40', '09:00', '09:30', '10:00', '09:30', '10:20', '10:30'],
    "YES": ['1', '1', '', '', '', '1', '', '1'],
    "NO": ['', '', '1', '1', '1', '', '1', '']})
df.Arrival_time = pd.to_datetime(df.Arrival_time, format='%H:%M:%S').dt.time
df.Appt_Time = pd.to_datetime(df.Appt_Time, format='%H:%M').dt.time
### end here
# you can start using the code from this line onward:
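# quick sanity check that the time columns parsed as times: after .dt.time the
# dtype shows as 'object' because each value is a datetime.time object, so the
# simplest check is to look at one value directly (this also explains the
# TypeError in your update - the comparison only works once BOTH columns hold times)
print(type(df.Arrival_time.iloc[0]))  # expect <class 'datetime.time'>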
# creates "time_class" column based on Arrival_time being before Appt_Time
df["time_class"] = (df.Arrival_time <= df.Appt_Time).replace({True: "early", False: "late"})
# creates a new column "time_response" based on conditions
# this may need to be changed depending on whether your "YES" and "NO" columns
# are a string or an int... I just assumed a string so you can modify this code as needed
conditions = [
    (df.time_class == "early") & (df.YES == '1'),
    (df.time_class == "early") & (df.YES != '1'),
    (df.time_class == "late") & (df.YES == '1'),
    (df.time_class == "late") & (df.YES != '1')]
choices = ["early_yes", "early_no", "late_yes", "late_no"]
df["time_response"] = np.select(conditions, choices)
# creates a new df to sum up each time_response
df_time_response_count = pd.DataFrame({"Counts": df["time_response"].value_counts()})
# replace '1' with 'YES' in the YES column and '1' with 'NO' in the NO column
df.YES = df.YES.replace({'1': "YES"})
df.NO = df.NO.replace({'1': "NO"})
Output:
>>> df
  Arrival_time Appt_Time  YES NO time_class time_response
0     07:25:00  08:00:00  YES        early      early_yes
1     08:24:00  08:40:00  YES        early      early_yes
2     08:12:00  09:00:00      NO     early       early_no
3     09:20:00  09:30:00      NO     early       early_no
4     10:01:00  10:00:00      NO      late        late_no
5     09:33:00  09:30:00  YES         late       late_yes
6     10:22:00  10:20:00      NO      late        late_no
7     10:29:00  10:30:00  YES        early      early_yes
>>> df_time_response_count
Counts
early_yes 3
late_no 2
early_no 2
late_yes 1
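Since the question mentions wanting the table headers to come from the time_response column, one small extra step (my reading of what you're after) is to transpose the counts so the categories become columns:
df_time_response_wide = df_time_response_count.T  # columns: early_yes, late_no, early_no, late_yes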
To answer your question about binning, I think np.linspace() is the easiest way to create the bins you want.
So I'll add some random ages between 20 and 100 to the df:
df['age'] = [21,31,34,26,46,70,56,55]
So the dataframe looks like this:
df
  Arrival_time Appt_Time  YES NO time_class time_response  age
0     07:25:00  08:00:00  YES        early      early_yes   21
1     08:24:00  08:40:00  YES        early      early_yes   31
2     08:12:00  09:00:00      NO     early       early_no   34
3     09:20:00  09:30:00      NO     early       early_no   26
4     10:01:00  10:00:00      NO      late        late_no   46
5     09:33:00  09:30:00  YES         late       late_yes   70
6     10:22:00  10:20:00      NO      late        late_no   56
7     10:29:00  10:30:00  YES        early      early_yes   55
Then use the value_counts method in pandas with the bins parameter:
df_age_counts = pd.DataFrame({"Counts": df.age.value_counts(bins = np.linspace(20,100,9))})
df_age_counts = df_age_counts.sort_index()
Output:
>>> df_age_counts
Counts
(19.999, 30.0] 2
(30.0, 40.0] 2
(40.0, 50.0] 1
(50.0, 60.0] 2
(60.0, 70.0] 1
(70.0, 80.0] 0
(80.0, 90.0] 0
(90.0, 100.0] 0
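To tie this back to your first two questions: pd.cut raises the "input must be one dimensional" error when you pass it the whole df; pass just the age column and it works. And to run all of this over your ~20 dataframes, wrap it in a function, roughly like the sketch below (clean_one, dfs and age_bin are placeholder names, and it assumes every df has the columns used above):
def clean_one(df):
    df = df.copy()
    df.Arrival_time = pd.to_datetime(df.Arrival_time, format='%H:%M:%S').dt.time
    df.Appt_Time = pd.to_datetime(df.Appt_Time, format='%H:%M').dt.time
    df["time_class"] = (df.Arrival_time <= df.Appt_Time).replace({True: "early", False: "late"})
    conditions = [
        (df.time_class == "early") & (df.YES == '1'),
        (df.time_class == "early") & (df.YES != '1'),
        (df.time_class == "late") & (df.YES == '1'),
        (df.time_class == "late") & (df.YES != '1')]
    df["time_response"] = np.select(conditions, ["early_yes", "early_no", "late_yes", "late_no"])
    # pd.cut is fine here because df['age'] is one-dimensional (a single Series)
    df["age_bin"] = pd.cut(df["age"], bins=np.linspace(20, 100, 9))
    return df

cleaned = [clean_one(d) for d in dfs]  # dfs = your list of ~20 dataframes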

Related

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that day, so I am trying to get the latest day before the 12th of each month. So the above would become:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can do it with groupby and transform, building a per-month mask that marks the latest day falling before the 12th:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x: x == x[x.dt.day < n].max())
df_sub = df[m].copy()
You can try filtering the dataframe where the day is less than 12, then take the last row of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range to daily frequency and forward-filling; note that this reports the balance as of the 11th itself (2020-03-11 below) rather than the last recorded day before the 12th:
# work on a copy, with the value column shortened to 'Number'
df2 = df.rename(columns={'Accumulative Number': 'Number'})
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

Adjust the overlapping dates in groupby with priority from another column

As the title suggests, I am working on a problem to find overlapping dates based on ID and adjust the overlapping dates based on priority (weight). The following piece of code helped find the overlapping dates.
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                 .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                 .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce priority (weight) and adjust Start_date by it. In the image below, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck on setting up the logic.
Dataframe:
op_d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
        'Start_date': ['9/1/2020', '10/10/2020', '11/18/2020', '4/1/2015', '5/12/2016', '4/1/2015', '5/15/2016', '8/1/2018'],
        'End_date': ['10/9/2020', '11/25/2020', '12/31/2020', '5/31/2016', '12/31/2016', '5/29/2016', '9/25/2018', '10/15/2020'],
        'Weight': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then add a day to End_date, shift it, and assign the result to Start_date wherever the overlap column is True (this assumes Start_date and End_date were already parsed with pd.to_datetime):
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date = arr,Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
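On the priority part of the question: yes, a dictionary mapping the letter weights to numbers works. One sketch (weight_rank is my name, not from the answer above) is to sort each ID group by that rank first, so the higher-priority row always comes first and the shift above pushes the lower-priority start date forward:
# hypothetical ranking: A takes precedence over B, B over C
weight_rank = {'A': 0, 'B': 1, 'C': 2}
df['rank'] = df['Weight'].map(weight_rank)
df = df.sort_values(['ID', 'rank']).drop(columns='rank').reset_index(drop=True)
# ...then recompute the overlap mask and the np.where assignment as above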

select a row based on the highest value of a column python

The DataFrame looks something like this: name of person and weight at a given date.
Name date w
1 Mike 2019-01-21 89.1
2 Mike 2018-11-12 88.1
3 Mike 2018-03-14 87.2
4 Hans 2019-03-21 66.5
5 Hans 2018-03-12 57.4
6 Hans 2017-04-21 55.3
7 Hans 2016-10-12 nan
I want to select the last time Hans has logged in his weight. So the answer would be
4 Hans 2019-03-21 66.5
Here's what I successfully managed to do:
# select Hans data that don't have nans
cond = ( data['Name'] == 'Hans' )
a = data.loc[ cond ]
a = a.dropna()
# get the index of the most recent weight
b = a['date'].str.split('-', expand=True)  # split the date to get the year
Now b looks like this:
print(b)
#4 2019 03 21
#5 2018 03 12
#6 2017 04 21
How can I extract the row with index=4 and then get the weight?
I cannot use idxmax because the dates are strings, not floats.
You cannot use idxmax on strings, but a workaround is to use NumPy's argmax with iloc (note it must be applied to the filtered frame df2, not df):
df2 = df.query('Name == "Hans"')
# older versions
# df2.iloc[[df2['date'].values.argmax()]]
# >= 0.24
df2.iloc[[df2['date'].to_numpy().argmax()]]
Name date w
4 Hans 2019-03-21 66.5
Another trick is to convert the date to integer using to_datetime. You can then use idxmax with loc as usual.
df2.loc[[pd.to_datetime(df2['date']).astype('int64').idxmax()]]
Name date w
4 Hans 2019-03-21 66.5
To do this for each person, use GroupBy.idxmax with loc:
df.loc[pd.to_datetime(df.date).astype('int64').groupby(df['Name']).idxmax().values]
   Name        date     w
4  Hans  2019-03-21  66.5
1  Mike  2019-01-21  89.1
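As a hedged aside on the premise: if you convert the column to a real datetime dtype up front, idxmax works on it directly in recent pandas versions, with no integer cast needed:
df = df.dropna(subset=['w'])             # drop rows with no logged weight
df['date'] = pd.to_datetime(df['date'])  # real datetime dtype
df.loc[df.groupby('Name')['date'].idxmax()]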

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
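A closely related variant, assuming df_lookup can be squeezed to a Series keyed by hour: map each timestamp's hour straight to its multiplier, which avoids the temporary column:
mult = df_lookup.squeeze()  # one-column dataframe -> Series indexed by hour
df['data-3'] = df['data-2'] * df.index.hour.map(mult).to_numpy()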

Determining the number of unique entries left after experiencing a specific item in pandas

I have a data frame with three columns: timestamp, lecture_id, and userid.
I am trying to write a loop that will count up the number of students who dropped (were never seen again) after experiencing a specific lecture. The goal is to ultimately have a fourth column that shows the number of students remaining after exposure to a specific lecture.
I'm having trouble writing this in Python; I tried a for loop, which never finished (I have 13m rows).
import pandas as pd
import numpy as np
ids = list(np.random.randint(0,5,size=(100, 1)))
users = list(np.random.randint(0,10,size=(100, 1)))
dates = list(pd.date_range('20130101',periods=100, freq = 'H'))
dft = pd.DataFrame(
    {'lecture_id': ids,
     'userid': users,
     'timestamp': dates})
I want to make a new data frame that shows, for every lecture, how many of the users who experienced it never came back (dropped).
Not sure if this is exactly what you want, and there may be a simpler way, but this could be one way to do it:
import pandas as pd
import numpy as np
np.random.seed(42)
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101',periods=100, freq = 'H'))
df = pd.DataFrame({'lecture_id': ids, 'userid': users, 'timestamp': dates})
# get the row label of each user's last (latest) appearance
last_seen = df.groupby('userid').timestamp.idxmax()
df['remaining'] = df.userid.nunique()
tmp = np.zeros(len(df))
tmp[last_seen.values] = 1
df['remaining'] = (df['remaining'] - tmp.cumsum()).astype(int)
df[-10:]
where the last 10 entries are:
lecture_id timestamp userid remaining
90 2 2013-01-04 18:00:00 9 6
91 0 2013-01-04 19:00:00 5 6
92 2 2013-01-04 20:00:00 6 6
93 2 2013-01-04 21:00:00 3 5
94 0 2013-01-04 22:00:00 6 4
95 2 2013-01-04 23:00:00 7 4
96 4 2013-01-05 00:00:00 0 3
97 1 2013-01-05 01:00:00 5 2
98 1 2013-01-05 02:00:00 7 1
99 0 2013-01-05 03:00:00 4 0
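If the end goal is the per-lecture drop count the question describes (how many users were never seen again after a given lecture), one way, sketched under the same setup, is to take each user's final row and count those rows by lecture:
# each user's last appearance (row labels, assuming the default RangeIndex)
last_rows = df.loc[df.groupby('userid').timestamp.idxmax()]
# users whose final lecture was this one, i.e. who dropped after it
drops_per_lecture = last_rows.groupby('lecture_id').userid.nunique()
print(drops_per_lecture)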
