How to get the total count for a certain date from a dataframe having a datetime column - python-3.x

I am new to the pandas DataFrame.
From MySQL I have loaded the following dataset into a DataFrame. How can I get the total count for a particular date in Jupyter? Also, how can I set up a DatePicker widget in Jupyter so that selecting a date range in the calendar shows the total count for that selection?
To be more specific:
1) Get the total count for today's date (by inputting only the date) from the RegistrationDate column
2) Get the total count for the last 7 days (by inputting only the date) from the RegistrationDate column
3) Get the total count by selecting a date range from a DatePicker widget, using the RegistrationDate column
No RegistrationDate
0 7 2019-07-23 12:23:25
1 9 2019-07-23 03:23:25
2 11 2019-07-23 08:10:10
3 13 2019-07-22 09:23:25
4 15 2019-07-22 04:01:02
5 17 2019-07-21 12:23:25
6 19 2019-07-20 12:23:25
7 21 2019-07-19 12:23:25
8 67 2019-06-04 12:23:25
9 68 2019-06-05 12:23:25
10 69 2019-06-06 12:23:25

First, index by date
Set the index to 'RegistrationDate' using
df.set_index('RegistrationDate', inplace=True)
Objective 1
Get user input for the date using
today = input('date: ')   # e.g. 2019-07-22 04:01:02
count1 = df.loc[today, 'No']
will return
15
Objective 3
Ensure that your df['RegistrationDate'] is a datetime Series
df['RegistrationDate'] = pd.to_datetime(df['RegistrationDate'])
Get user inputs for the start and end dates
start_date = input("start date:\t")
end_date = input("end date:\t")
Create a Boolean mask, making sure the input dates are datetime.datetime objects, datetime strings, or pd.Timestamp values
mask = (df['RegistrationDate'] > start_date) & (df['RegistrationDate'] <= end_date)
Assign the masked rows to temp_df and sum the 'No' column
temp_df = df.loc[mask]
total_in_range = temp_df['No'].sum()
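The last-7-days count (objective 2) and the Jupyter DatePicker were not shown above; here is a minimal sketch, assuming df still has RegistrationDate as an ordinary column and that the ipywidgets package is available in the notebook (the names show_count, start_picker and end_picker are just placeholders):

import pandas as pd
import ipywidgets as widgets

df['RegistrationDate'] = pd.to_datetime(df['RegistrationDate'])

# Objective 1: count of rows registered today (compare the date part only)
today = pd.Timestamp.today().normalize()
count_today = (df['RegistrationDate'].dt.normalize() == today).sum()

# Objective 2: count of rows registered in the last 7 days
week_ago = today - pd.Timedelta(days=7)
count_last_7 = ((df['RegistrationDate'] >= week_ago) &
                (df['RegistrationDate'] < today + pd.Timedelta(days=1))).sum()

# Objective 3: two DatePicker widgets driving the count interactively
start_picker = widgets.DatePicker(description='Start')
end_picker = widgets.DatePicker(description='End')

def show_count(start, end):
    if start is None or end is None:
        return
    # add one day so the chosen end date is inclusive
    mask = (df['RegistrationDate'] >= pd.Timestamp(start)) & \
           (df['RegistrationDate'] < pd.Timestamp(end) + pd.Timedelta(days=1))
    print('Total count:', mask.sum())

widgets.interact(show_count, start=start_picker, end=end_picker)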

Related

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours from a lookback period of 5 hours. I was using the below statements to accomplish this:
df_thld=df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5))& (df.PR_Rate>50) & (df.CNT_12571 < 30)]
df_thld.loc[:,'HR_CNT'] = df_thld.groupby('ID')['Date'].nunique().to_frame('HR_CNT').reset_index()
df_thld[df_thld['HR_CNT'] > 3]
The problem with this approach is that, since the lookback period requirement is 5 hours, HR_CNT can count non-consecutive hours as breaching the criteria.
My dataset is as below:
DataFrame
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above DataFrame the threshold was not breached at 11:00 hrs, but my code counts 10:00, 12:00 and 13:00 as the hours that breached the threshold instead of 12:00, 13:00 and 14:00 as required. Each ID may or may not breach this criteria in a single day. Any idea how I can fix this issue?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly. An example of this dataframe is illustrated below as df. From this dataframe, you want to filter only those rows which satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
If and only if the threshold is surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the set of entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # keep only the rows breaching both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # rolling count of breaching rows inside each dur-hour window
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # walk backwards, collecting only full runs of dur consecutive hours
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
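As a side note, if every hour is present as a row (one row per ID per hour, sorted by Date), the consecutive-hours condition can also be expressed with a run-length trick instead of a rolling count; a rough sketch, using the same column names as above and assuming a single ID (for several IDs, apply it per ID group):

# mark rows that breach both thresholds
breach = (df['PR_Rate'] > 50) & (df['CT_1257'] < 30)
# give every run of consecutive equal values its own id
run_id = (breach != breach.shift()).cumsum()
# length of the run each row belongs to
run_len = breach.groupby(run_id).transform('size')
# keep breaching rows that sit inside a run of at least 3 hours
result = df[breach & (run_len >= 3)]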

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. So, I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing in order to get the "balance" as of the 11th day of each month?
You can do groupby with factorize
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe where the day is less than 12, then take the last of each group (grouped by month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, by expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

Reshape a pandas DataFrame using combination of row values in two columns

I have data for multiple customers in a data frame as below:
Customer_id event_type month mins_spent
1 live CM 10
1 live CM1 10
1 catchup CM2 20
1 live CM2 30
2 live CM 45
2 live CM1 30
2 catchup CM2 20
2 live CM2 20
I need the result data frame to have one row for each customer, where each column is a combination of the month and event_type values and the cell value is mins_spent. The result data frame is as below:
Customer_id CM_live CM_catchup CM1_live CM1_catchup CM2_live CM2_catchup
1 10 0 10 0 30 20
2 45 0 30 0 20 20
Is there an efficient way to do this instead of iterating over the input data frame and creating the new data frame?
You can use pivot_table:
import numpy as np

# pivot your data frame
p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'], aggfunc=np.sum)
# flatten multi indexed columns with list comprehension
p.columns = ['_'.join(col) for col in p.columns]
CM_live CM1_live CM2_catchup CM2_live
Customer_id
1 10 10 20 30
2 45 30 20 20
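If the missing month/event combinations should show up as zero columns, exactly as in the desired output (e.g. CM_catchup), a fill_value plus an explicit column reindex could be added; a small sketch along the same lines:

import numpy as np
import pandas as pd

p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'], aggfunc=np.sum, fill_value=0)
# build every month/event combination so absent pairs become 0 columns
all_pairs = pd.MultiIndex.from_product([['CM', 'CM1', 'CM2'], ['live', 'catchup']],
                                       names=['month', 'event_type'])
p = p.reindex(columns=all_pairs, fill_value=0)
p.columns = ['_'.join(col) for col in p.columns]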
You can create a new column (key) by concatenating columns month and event_type, and then use pivot() to reshape your data.
(df.assign(key = lambda d: d['month'] + '_' + d['event_type'])
   .pivot(
       index='Customer_id',
       columns='key',
       values='mins_spent'
   ))

binning with months column

I have a data frame which contains the fields casenumber, count and CREATEDDATE, where CREATEDDATE holds the month. I want to bin the count values into ranges and break them down according to the CREATEDDATE column.
I have a data frame as below:
casenumber count CREATEDDATE
3820516 1 jan
3820547 1 jan
3820554 2 feb
3820562 1 feb
3820584 1 march
4226616 1 april
4226618 2 may
4226621 2 may
4226655 1 june
4226663 1 june
Here I used the code below, but it did not match my requirement:
import pandas as pd
import numpy as np
df = pd.read_excel(r"")
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = df.groupby(pd.cut(df['CREATEDDATE'],bins,labels=names))['casenumber'].size().reset_index(name='No_of_times_statuschanged')
CREATEDDATE No_of_times_statuschanged
0 0-1 2092
1 1-4 9062
2 4-8 12578
3 8-15 3858
4 15+ 0
I got the above data as output, but what I expect is the ranges broken down month by month, based on the cases per month.
The expected output should be like:
CREATEDDATE jan feb march april may june
0-1 1 2 3 4 5 6
1-4 3 0 6 7 8 9
4-8 4 6 3 0 9 2
8-15 0 3 4 5 8 9
Use crosstab, changing CREATEDDATE to count in pd.cut, and set the column order by subsetting with a list of column names:
#add another months if necessary
months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names), df['CREATEDDATE'])[months]
print (df1)
CREATEDDATE jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
Another idea is to use ordered categoricals:
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names),
pd.Categorical(df['CREATEDDATE'], ordered=True, categories=months))
print (df1)
col_0 jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0

Split dates into time ranges in pandas

14 [2018-03-14, 2018-03-13, 2017-03-06, 2017-02-13]
15 [2017-07-26, 2017-06-09, 2017-02-24]
16 [2018-09-06, 2018-07-06, 2018-07-04, 2017-10-20]
17 [2018-10-03, 2018-09-13, 2018-09-12, 2018-08-3]
18 [2017-02-08]
This is my data; every ID has its own dates, which range between 2017-02-05 and 2018-06-30. I need to split the dates into 5 time ranges of 4 months each, so that for the first 4 months every ID has only the dates in that time range (from 2017-02-05 to 2017-06-05), like this:
14 [2017-03-06, 2017-02-13]
15 [2017-02-24]
16 [null] # or delete empty rows, it doesn't matter
17 [null]
18 [2017-02-08]
then for 2017-06-05 to 2017-10-05, and so on for each 4-month range. Also, I can't use nested for loops because the data is too big. This is what I tried so far:
months_4 = individual_dates.copy()
for _ in months_4['Date']:
    _ = np.where(pd.to_datetime(_) <= pd.to_datetime('2017-9-02'), _, np.datetime64('NaT'))
and
months_8 = individual_dates.copy()
range_8 = pd.date_range(start='2017-9-02', end='2017-11-02')
for _ in months_8['Date']:
    _ = _[np.isin(_, range_8)]
This achieved absolutely no result; the data stays the same no matter what.
Update: I did what you said
individual_dates['Date'] = individual_dates['Date'].str.strip('[]').str.split(', ')
df = pd.DataFrame({
    'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID' : individual_dates['ClientId'].repeat(individual_dates['Date'].str.len())
})
df
and here is the result
Date ID
0 '2018-06-30T00:00:00.000000000' '2018-06-29T00... 14
1 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 15
2 '2018-03-14T00:00:00.000000000' '2018-03-13T00... 16
3 '2017-12-14T00:00:00.000000000' '2017-03-28T00... 17
4 '2017-05-30T00:00:00.000000000' '2017-05-22T00... 18
5 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 19
6 '2017-03-27T00:00:00.000000000' '2017-03-26T00... 20
7 '2017-12-15T00:00:00.000000000' '2017-11-20T00... 21
8 '2017-07-05T00:00:00.000000000' '2017-07-04T00... 22
9 '2017-12-12T00:00:00.000000000' '2017-04-06T00... 23
10 '2017-05-21T00:00:00.000000000' '2017-05-07T00... 24
For better performance I suggest converting the lists into a flat column and then filtering with isin and boolean indexing:
from itertools import chain

df = pd.DataFrame({
    'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID' : individual_dates['ID'].repeat(individual_dates['Date'].str.len())
})
range_8 = pd.date_range(start='2017-02-05', end='2017-06-05')
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'].isin(range_8)]
print (df)
Date ID
0 2017-03-06 14
0 2017-02-13 14
1 2017-02-24 15
4 2017-02-08 18
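To cover all five 4-month windows instead of just the first one, the flattened frame above could be bucketed in a single pass with pd.cut; a sketch, assuming the windows are 4 months long starting 2017-02-05 (the window labels P1 to P5 are made up here):

# five 4-month windows need six boundary dates
edges = pd.to_datetime(['2017-02-05', '2017-06-05', '2017-10-05',
                        '2018-02-05', '2018-06-05', '2018-10-05'])
df['window'] = pd.cut(df['Date'], bins=edges, right=False,
                      labels=['P1', 'P2', 'P3', 'P4', 'P5'])
# dates per ID inside each window
per_window = df.groupby(['window', 'ID'])['Date'].apply(list)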
