Pandas changing dates near each other - python-3.x

I have a pandas dataframe with dates and users which looks like this-
date = ['1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user =['A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
I am trying to find all dates that are next to each other (e.g. Jan-1 and Jan-2) and convert them to a single date, so that both entries become the earlier of the two. There are over a million entries. The data comes from a scan that is triggered nightly and sometimes runs over into the next day.
Update-
I want to consolidate the date of the scan so that I can visualize the results properly. Right now there are many entries on the day the scan starts but very few on the day it overflows into. The primary date and time are stored separately, so I am not losing any data. The User column comes from the file of usernames being scanned, and the Date column stores the date on which each was scanned.
So far I was able to read the dataframe and then sort it based on the date to have the entries one after the other.
The output should look like the following -
Is there a Pythonic way of doing this?

One issue to consider is the case of multiple consecutive days and how you want to handle these. The following code sets the day to the first of the consecutive days in each block:
import pandas as pd
from datetime import timedelta
# prepend two dates to show multiple consecutive days "use-case"
date = ['12/31/2019','1/1/2020','1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user = ['Z','Z','A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
# first convert to datetime to allow date operations
df.Date = pd.to_datetime(df.Date)
# check if the date is one day after the row before (by shifting the Date column)
df['isConsecutive'] = (df.Date == df.Date.shift()+pd.DateOffset(1))
# get number of consecutive days in each block
df['numConsecutive'] = df.isConsecutive.groupby((~df.isConsecutive).cumsum()).cumsum()
# convert to timedelta
df.numConsecutive = df.numConsecutive.apply(lambda x: timedelta(days=x))
# subtract this offset from Date to get the first day of the block
df['NewDate'] = df.Date - df.numConsecutive
print(df)
This returns:
Date User isConsecutive numConsecutive NewDate
0 2019-12-31 Z False 0 days 2019-12-31
1 2020-01-01 Z True 1 days 2019-12-31
2 2020-01-02 A True 2 days 2019-12-31
3 2020-01-09 B False 0 days 2020-01-09
4 2020-01-10 C True 1 days 2020-01-09
5 2020-01-17 B False 0 days 2020-01-17
6 2020-01-18 A True 1 days 2020-01-17
7 2020-01-24 A False 0 days 2020-01-24
8 2020-01-25 B True 1 days 2020-01-24
9 2019-05-17 C False 0 days 2019-05-17
10 2019-05-18 A True 1 days 2019-05-17
11 2019-05-24 A False 0 days 2019-05-24
12 2019-05-29 B False 0 days 2019-05-29
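The shift-based check above compares each row with the row directly before it, so it relies on the row order and on each date appearing only once in a row-to-row sequence. With over a million entries, many rows will probably share the same date. A minimal sketch of one way to handle that, assuming it is acceptable to work on the sorted unique dates and map the result back (the sample data and the names block_id, first_of_block and mapping are made up for illustration):

```python
import pandas as pd

# Illustrative sample with repeated dates and unsorted rows (not the asker's real data)
df = pd.DataFrame({
    "Date": ["1/2/2020", "1/2/2020", "1/3/2020", "1/9/2020",
             "1/10/2020", "5/17/2019", "5/18/2019", "5/19/2019"],
    "User": ["A", "B", "C", "B", "A", "C", "A", "B"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Work on the sorted unique dates so duplicates and row order do not matter
unique_dates = pd.Series(df["Date"].unique()).sort_values().reset_index(drop=True)

# A new block starts whenever the gap to the previous unique date is not exactly one day
block_id = (unique_dates.diff() != pd.Timedelta(days=1)).cumsum()

# The first (earliest) date of each block of consecutive days
first_of_block = unique_dates.groupby(block_id).transform("min")

# Map every original row to the first date of its block
mapping = pd.Series(first_of_block.to_numpy(), index=unique_dates.to_numpy())
df["NewDate"] = df["Date"].map(mapping)
print(df)
```

Because the grouping runs on the unique dates only, the expensive part scales with the number of distinct days rather than with the number of rows.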

Related

Calculating a duration from two dates in different time zones

I have a CSV file with trip data:
Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-04:00
3,08/30/21 01:29 AM,08/30/21 01:30 AM,GMT-04:00,GMT-04:00
4,08/30/21 01:45 AM,08/30/21 03:06 AM,GMT-04:00,GMT-04:00
5,08/30/21 03:08 AM,08/30/21 03:58 AM,GMT-04:00,GMT-04:00
6,08/30/21 03:59 AM,08/30/21 04:15 AM,GMT-04:00,GMT-04:00
I can read this file into a dataframe:
trips = pd.read_csv("trips.csv", sep=',')
What I would like to accomplish is to add a column 'duration' which gives me the trip duration in minutes. The trip duration has to be calculated as the difference between the trip arrival time and the trip departure time. In the above table, the 'Depart Time' is relative to the 'Depart Timezone'. Similarly, the 'Arrive Time' is relative to the 'Arrive Timezone'.
Note that in the above example, the arrival and departure dates, as well as the arrival and departure time zones happen to be the same, but this does not hold in general for my data.
What you have are UTC offsets (GMT-04:00 is four hours behind UTC); you can join the date/time column and respective offset column by ' ' and parse to_datetime. You can then calculate duration (timedelta) from the resulting tz-aware datetime columns. Ex:
# make datetime columns:
df['dt_depart'] = pd.to_datetime(df['Depart Time'] + ' ' + df['Depart Timezone'],
utc=True)
df['dt_arrive'] = pd.to_datetime(df['Arrive Time'] + ' ' + df['Arrive Timezone'],
utc=True)
Note: I'm using utc=True here in case there are mixed UTC offsets in the input. That gives e.g.
df['dt_depart']
Out[6]:
0 2021-08-29 17:00:00+00:00
1 2021-08-29 18:00:00+00:00
2 2021-08-29 21:29:00+00:00
3 2021-08-29 21:45:00+00:00
4 2021-08-29 23:08:00+00:00
5 2021-08-29 23:59:00+00:00
Name: dt_depart, dtype: datetime64[ns, UTC]
then
# calculate the travel duration (timedelta column):
df['traveltime'] = df['dt_arrive'] - df['dt_depart']
gives e.g.
df['traveltime']
Out[7]:
0 0 days 00:45:00
1 0 days 03:28:00
2 0 days 00:01:00
3 0 days 01:21:00
4 0 days 00:50:00
5 0 days 00:16:00
Name: traveltime, dtype: timedelta64[ns]
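One thing worth checking in the output above: 09:00 PM at four hours behind UTC would be 01:00 UTC on the next day, but the parsed value is 17:00 UTC, which suggests the "GMT-04:00" string was read with the opposite sign. That cancels out when both offsets are equal, but it would not when they differ. A hedged sketch of one way around it, assuming the offsets always look like "GMT±HH:MM": strip the "GMT" prefix and parse with an explicit format. The helper name to_utc is made up, and the GMT-05:00 arrival in the second row is a hypothetical change to the sample data to show mixed offsets. The last line also converts the timedelta to minutes, as asked in the question.

```python
import io
import pandas as pd

# Two rows from the question; the GMT-05:00 arrival offset is a hypothetical variation
csv_text = """Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-05:00
"""
trips = pd.read_csv(io.StringIO(csv_text))


def to_utc(time_col, tz_col):
    """Combine a local time column with a 'GMT±HH:MM' column and convert to UTC."""
    # Drop the 'GMT' prefix so that '-04:00' is parsed as a plain UTC offset via %z
    offset = tz_col.str.replace("GMT", "", regex=False)
    return pd.to_datetime(time_col + " " + offset,
                          format="%m/%d/%y %I:%M %p %z", utc=True)


trips["dt_depart"] = to_utc(trips["Depart Time"], trips["Depart Timezone"])
trips["dt_arrive"] = to_utc(trips["Arrive Time"], trips["Arrive Timezone"])

# Duration in minutes rather than as a timedelta
trips["duration"] = (trips["dt_arrive"] - trips["dt_depart"]).dt.total_seconds() / 60
print(trips[["Trip ID", "dt_depart", "dt_arrive", "duration"]])
```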

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours from a lookback period of 5 hours. I was using the below statements to accomplish this:
df_thld=df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5))& (df.PR_Rate>50) & (df.CNT_12571 < 30)]
df_thld.loc[:,'HR_CNT'] = df_thld.groupby('ID')['Date'].nunique().to_frame('HR_CNT').reset_index()
df_thld[df_thld['HR_CNT'] > 3]
The problem with this approach is that, because the lookback period is 5 hours, HR_CNT can count non-consecutive hours that breach the criteria.
My dataset is as below:
DataFrame
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above DataFrame, the threshold was not breached at 11:00, but the count treats 10, 12 and 13 as the breaching hours instead of 12, 13 and 14 as required. Each ID may or may not breach these criteria on a given day. Any idea how I can fix this?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly. An example of this dataframe is shown below as df. From this dataframe, you want to keep only those rows which satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
If and only if the threshold is surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the set of entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # keep only the rows that breach both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # number of breaching rows inside each trailing window of `dur` hours
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # walk backwards and collect the rows of every full window
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r-i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
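For comparison, here is a loop-free sketch of the same idea. It is not the method used above, just one possible alternative: it assumes the data is strictly hourly, uses the column names from the question's table (IDs, CT_12571), and the function name consecutive_breaches and its parameters are made up for illustration. Breaching rows are grouped into runs, where a run ends whenever the gap to the previous breaching row of the same ID is not exactly one hour.

```python
import pandas as pd

# Sample data from the question (11:00 does not breach the thresholds)
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-06-16 10:00", "2021-06-16 11:00", "2021-06-16 12:00",
                            "2021-06-16 13:00", "2021-06-16 14:00"]),
    "IDs": ["A1"] * 5,
    "CT_12571": [15, 31, 25, 13, 7],
    "PR_Rate": [50.487, 40.806, 52.302, 61.45, 73.805],
})


def consecutive_breaches(df, hours=3, lookback=5, pr_threshold=50, ct_threshold=30):
    """Return breaching rows that belong to a run of at least `hours` consecutive hours."""
    # restrict to the lookback window, then to rows breaching both thresholds
    recent = df[df["Date"] > df["Date"].max() - pd.Timedelta(hours=lookback)]
    hits = recent[(recent["PR_Rate"] > pr_threshold) & (recent["CT_12571"] < ct_threshold)]
    hits = hits.sort_values(["IDs", "Date"])
    # gap to the previous breaching row of the same ID (NaT for the first row of each ID)
    gap = hits.groupby("IDs")["Date"].diff()
    # a new run starts whenever the gap is not exactly one hour
    run_id = (gap != pd.Timedelta(hours=1)).cumsum()
    # length of the run each row belongs to
    run_len = hits.groupby(run_id)["Date"].transform("count")
    return hits[run_len >= hours]


result = consecutive_breaches(df)
print(result)
```

On the question's sample this keeps the 12:00, 13:00 and 14:00 rows and drops the isolated 10:00 breach.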

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
You can do groupby with factorize
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe to rows where the day of the month is less than 12, then take the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range to every day and forward filling:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0
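If the goal is "the balance as of the 11th", an as-of join is another option that avoids building a row for every calendar day. The following is only a sketch under that assumption: it builds one target date per month (the 11th) and uses pd.merge_asof to pick the latest row on or before it. The names targets, AsOf and balance are made up for illustration. Note that, unlike the per-month filters above, this would carry a value over from an earlier month if a month had no activity before the 12th.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Day": ["9/1/2020", "11/1/2020", "18/1/2020", "11/2/2020",
            "24/2/2020", "6/3/2020", "13/3/2020"],
    "Accumulative Number": [100, 102, 98, 105, 95, 120, 100],
})
df["Day"] = pd.to_datetime(df["Day"], format="%d/%m/%Y")
df = df.sort_values("Day")  # merge_asof needs sorted keys

# One target per month present in the data: the 11th of that month
month_starts = df["Day"].dt.to_period("M").drop_duplicates().dt.to_timestamp()
targets = pd.DataFrame({"AsOf": (month_starts + pd.Timedelta(days=10)).to_numpy()})

# For each target, take the latest row whose Day is on or before it
balance = pd.merge_asof(targets, df, left_on="AsOf", right_on="Day",
                        direction="backward")
print(balance)
```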

binning with months column

I have a data frame which contains the fields casenumber, count and CREATEDDATE. Here CREATEDDATE holds the month, and I want to build a data frame that arranges the counts into ranges according to the CREATEDDATE column.
I have a data frame as below:
casenumber count CREATEDDATE
3820516 1 jan
3820547 1 jan
3820554 2 feb
3820562 1 feb
3820584 1 march
4226616 1 april
4226618 2 may
4226621 2 may
4226655 1 june
4226663 1 june
I used the code below, but it did not match my requirement.
import pandas as pd
import numpy as np
df = pd.read_excel(r"")
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = df.groupby(pd.cut(df['CREATEDDATE'],bins,labels=names))['casenumber'].size().reset_index(name='No_of_times_statuschanged')
CREATEDDATE No_of_times_statuschanged
0 0-1 2092
1 1-4 9062
2 4-8 12578
3 8-15 3858
4 15+ 0
I got the above data as output, but what I expect is the ranges broken down month by month, based on the cases per month.
The expected output should look like:
CREATEDDATE jan feb march april may june
0-1 1 2 3 4 5 6
1-4 3 0 6 7 8 9
4-8 4 6 3 0 9 2
8-15 0 3 4 5 8 9
Use crosstab, passing the count column (instead of CREATEDDATE) to pd.cut, and fix the column order by selecting with a list of month names:
#add another months if necessary
months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names), df['CREATEDDATE'])[months]
print (df1)
CREATEDDATE jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
Another idea is use ordered categoricals:
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names),
pd.Categorical(df['CREATEDDATE'], ordered=True, categories=months))
print (df1)
col_0 jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
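One caveat with selecting by [months]: if the list contains a month that does not occur in the data, the column selection raises a KeyError. A small sketch of a variant that tolerates this, assuming missing months should simply show zero counts. It uses the sample rows from the question, and "july" is added to the list only to show the effect:

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "casenumber": [3820516, 3820547, 3820554, 3820562, 3820584,
                   4226616, 4226618, 4226621, 4226655, 4226663],
    "count": [1, 1, 2, 1, 1, 1, 2, 2, 1, 1],
    "CREATEDDATE": ["jan", "jan", "feb", "feb", "march",
                    "april", "may", "may", "june", "june"],
})

# "july" does not occur in the data; it is included here on purpose
months = ["jan", "feb", "march", "april", "may", "june", "july"]
bins = [0, 1, 4, 8, 15, np.inf]
names = ["0-1", "1-4", "4-8", "8-15", "15+"]

binned = pd.cut(df["count"], bins, labels=names)
# reindex orders the columns and fills months absent from the data with 0
table = pd.crosstab(binned, df["CREATEDDATE"]).reindex(columns=months, fill_value=0)
print(table)
```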

day of Year values starting from a particular date

I have a dataframe with a date column. The duration is 365 days starting from 02/11/2017 and ending at 01/11/2018.
Date
02/11/2017
03/11/2017
05/11/2017
.
.
01/11/2018
I want to add an adjacent column called Day_Of_Year as follows:
Date Day_Of_Year
02/11/2017 1
03/11/2017 2
05/11/2017 4
.
.
01/11/2018 365
I apologize if it's a very basic question, but unfortunately I haven't been able to start with this.
I could use datetime(), but that would return values such as 1 for 1st january, 2 for 2nd january and so on.. irrespective of the year. So, that wouldn't work for me.
First convert column to_datetime and then subtract datetime, convert to days and add 1:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(pd.Timestamp('2017-11-02')).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
Or subtract by first value of column:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Day_Of_Year'] = df['Date'].sub(df['Date'].iat[0]).dt.days + 1
print (df)
Date Day_Of_Year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
Using strftime with '%j'
s=pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%j').astype(int)
s-s.iloc[0]
Out[750]:
0 0
1 1
2 3
Name: Date, dtype: int32
#df['new']=s-s.iloc[0]
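Something to keep in mind with day-of-year differences: they only work while all dates are in the same calendar year. Once the range crosses into the next year, the day-of-year value starts again at 1 and the difference is no longer the elapsed number of days. A short sketch comparing both approaches on the question's dates (the variable names naive and elapsed are made up):

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["02/11/2017", "03/11/2017", "05/11/2017", "01/11/2018"]),
    format="%d/%m/%Y",
)

# Difference of day-of-year values: breaks once the year changes
doy = dates.dt.dayofyear
naive = doy - doy.iloc[0] + 1

# Difference of the timestamps themselves: counts elapsed days across years
elapsed = (dates - dates.iloc[0]).dt.days + 1

print(pd.DataFrame({"Date": dates, "naive": naive, "elapsed": elapsed}))
```

For 2018-11-01 the day-of-year difference gives 0, while subtracting the timestamps gives the expected 365.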
pandas has dayofyear. So put your column in the right format with pd.to_datetime and then apply Series.dt.dayofyear. Lastly, use some modulo arithmetic to express everything relative to your original start date (this assumes a 365-day span):
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['day of year'] = df['Date'].dt.dayofyear - df['Date'].dt.dayofyear[0] + 1
df['day of year'] = df['day of year'] + 365*((365 - df['day of year']) // 365)
Output
Date day of year
0 2017-11-02 1
1 2017-11-03 2
2 2017-11-05 4
3 2018-11-01 365
But I'm doing essentially the same as Jezrael in more lines of code, so my vote goes to her/him
