Calculating a duration from two dates in different time zones - python-3.x

I have a CSV file with trip data:
Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-04:00
3,08/30/21 01:29 AM,08/30/21 01:30 AM,GMT-04:00,GMT-04:00
4,08/30/21 01:45 AM,08/30/21 03:06 AM,GMT-04:00,GMT-04:00
5,08/30/21 03:08 AM,08/30/21 03:58 AM,GMT-04:00,GMT-04:00
6,08/30/21 03:59 AM,08/30/21 04:15 AM,GMT-04:00,GMT-04:00
I can read this file into a dataframe:
trips = pd.read_csv("trips.csv", sep=',')
What I would like to accomplish is to add a column 'duration' which gives me the trip duration in minutes. The trip duration has to be calculated as the difference between the trip arrival time and the trip departure time. In the above table, the 'Depart Time' is relative to the 'Depart Timezone'; similarly, the 'Arrive Time' is relative to the 'Arrive Timezone'.
Note that in the above example, the arrival and departure dates, as well as the arrival and departure time zones happen to be the same, but this does not hold in general for my data.

What you have are UTC offsets (GMT-04:00 conventionally means four hours behind UTC); you can join each date/time column with its offset column by ' ' and parse the result with to_datetime. You can then calculate the duration (a timedelta) from the resulting tz-aware datetime columns. Ex:
# make tz-aware datetime columns:
df['dt_depart'] = pd.to_datetime(df['Depart Time'] + ' ' + df['Depart Timezone'],
                                 utc=True)
df['dt_arrive'] = pd.to_datetime(df['Arrive Time'] + ' ' + df['Arrive Timezone'],
                                 utc=True)
Note: I'm using utc=True here in case there are mixed UTC offsets in the input. (Careful: the underlying dateutil parser treats 'GMT-04:00' POSIX-style, i.e. with the sign inverted relative to the ISO convention, which is why the UTC times below are four hours behind the local times; the durations are unaffected since both columns are parsed with the same convention.) That gives e.g.
df['dt_depart']
Out[6]:
0 2021-08-29 17:00:00+00:00
1 2021-08-29 18:00:00+00:00
2 2021-08-29 21:29:00+00:00
3 2021-08-29 21:45:00+00:00
4 2021-08-29 23:08:00+00:00
5 2021-08-29 23:59:00+00:00
Name: dt_depart, dtype: datetime64[ns, UTC]
then
# calculate the travel duration (timedelta column):
df['traveltime'] = df['dt_arrive'] - df['dt_depart']
gives e.g.
df['traveltime']
Out[7]:
0 0 days 00:45:00
1 0 days 03:28:00
2 0 days 00:01:00
3 0 days 01:21:00
4 0 days 00:50:00
5 0 days 00:16:00
Name: traveltime, dtype: timedelta64[ns]
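Since the goal is the duration in minutes, you can convert the timedelta column; a minimal sketch (the column name 'duration' is taken from the question):
# convert the timedelta column to minutes (float)
df['duration'] = df['traveltime'].dt.total_seconds() / 60
This also handles departure and arrival offsets that differ, since both columns were converted to UTC before subtracting.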

Related

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours from a lookback period of 5 hours. I was using the below statements to accomplish this:
df_thld = df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5)) & (df.PR_Rate > 50) & (df.CNT_12571 < 30)]
df_thld.loc[:, 'HR_CNT'] = df_thld.groupby('ID')['Date'].nunique().to_frame('HR_CNT').reset_index()
df_thld[df_thld['HR_CNT'] > 3]
The problem with this approach is that, since the lookback period requirement is 5 hours, HR_CNT can count non-consecutive hours as breaching the criteria.
My dataset is as below:
DataFrame
Date IDs CNT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above dataframe, the threshold was not breached at 11:00, but my code counts 10:00, 12:00 and 13:00 as the hours that breached the threshold, instead of 12:00, 13:00 and 14:00 as required. Each ID may or may not breach this criteria in a single day. Any idea how I can fix this issue?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly; an example of this dataframe is illustrated below as df. From this dataframe, you want to keep only those rows which satisfy both of the following conditions:
- PR_Rate > 50 and CNT_12571 < 30
- the thresholds are surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the block of entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # keep only the rows breaching both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # for each breaching row, count the breaching rows in the trailing `dur`-hour window
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # walk backwards; whenever a full window is found, collect its rows
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
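For comparison, a vectorized sketch of the same idea (filter_consecutive is a hypothetical helper; it assumes Date is already a datetime64 column and measurements are spaced exactly one hour apart): flag the breaching rows, start a new run whenever the hour gap or the ID changes, and keep only runs of at least dur rows.
import pandas as pd

def filter_consecutive(df, dur=3, pr_threshold=50, ct_threshold=30):
    # flag rows that breach both thresholds
    d = df[(df['PR_Rate'] > pr_threshold) & (df['CT_1257'] < ct_threshold)].copy()
    # a new run starts when the gap to the previous breaching row is not
    # exactly one hour, or when the ID changes
    new_run = d['Date'].diff().ne(pd.Timedelta(hours=1)) | d['IDs'].ne(d['IDs'].shift())
    # keep only runs spanning at least `dur` consecutive hours
    return d[d.groupby(new_run.cumsum())['Date'].transform('size') >= dur]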

How to remove successive similar numbers in a pandas DF column

I have a pandas DF with a column that can have 3 values: either 0, 1 or ' ' (see example below).
What I want to do is remove all successive numbers that are similar, so a 0 can never be followed by another 0 and a 1 can never be followed by another 1; instead, I want to replace these with ' '.
Current pandas DF
time  value
1:00  0
2:00
3:00  0
4:00  1
5:00
6:00
7:00  1
8:00  1
9:00  0
What I want
time  value
1:00  0
2:00
3:00
4:00  1
5:00
6:00
7:00
8:00
9:00  0
I tried to work with loops, but cannot find a clean way to refer to 'the next same value'.
Any simple solution for this?
An itertools solution:
import numpy as np
from itertools import chain, groupby

df.value = list(chain.from_iterable(
    [key, *[''] * (len(list(gr)) - 1)]
    for key, gr in groupby(df.value.replace('', np.nan).ffill())
))
- replacing empty strings with np.nan
- forward filling the NaNs to get streams of 0's and 1's
- grouping by 0's and 1's
- placing back the key (which is 0 or 1) along with the right number of empty strings (the group's length - 1)
- flattening these blocks with chain.from_iterable
- casting to a list to assign it back to the dataframe
to get
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0
We can use loc on value to drop the rows holding empty strings, then shift and compare the filtered rows to create a boolean mask; finally, mask the values with an empty string where the boolean mask holds True:
s = df['value'].loc[lambda x: x != '']
m = s.eq(s.shift())
df.loc[m[m].index, 'value'] = ''
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0

How can I extract data in python with the same value of fields using pandas

I have a dataset with fields id, time, date, name, etc. I want to extract the rows that have the same id and date. How can I do that?
For example
id time date
1 16:00 03/05/2020
2 16:00 03/05/2020
1 17:00 03/05/2020
1 16:00 04/05/2020
2 16:00 04/05/2020
Now I want to fetch:
1 16:00 03/05/2020
1 17:00 03/05/2020
You can groupby and filter:
df.groupby(['id', 'date']).filter(lambda s: len(s) > 1)
id time date
0 1 16:00 03/05/2020
2 1 17:00 03/05/2020
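An equivalent, typically faster alternative (a sketch, not tested on your data) marks every row that shares an id/date pair with another row:
# keep all rows whose (id, date) combination occurs more than once
df[df.duplicated(subset=['id', 'date'], keep=False)]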

Fastest, most efficient way to aggregate a large dataset in python

Let's say I'm measuring the speed over time of a car moving forward on a single axis, with a new measure every 10 minutes.
I have a column in my DataFrame called delta_x, which contains how much the car moved on my axis in the last 10 minutes, values are integers only.
Now let's say that I want to aggregate my data and keep only the amount of movement over each hour, but I want to optimize my code as much as possible because my dataset is extremely large. What's the most efficient way to achieve that?
df.head(9)
date time delta_x
0 01/01/2018 00:00 9
1 01/01/2018 00:10 9
2 01/01/2018 00:20 9
3 01/01/2018 00:30 9
4 01/01/2018 00:40 11
5 01/01/2018 00:50 12
6 01/01/2018 01:00 10
7 01/01/2018 01:10 10
8 01/01/2018 01:20 10
Currently my solution is to do the following:
for file in os.listdir('temp'):
    if file.endswith('.txt'):
        df = pd.read_csv(''.join(["./temp/", file]), header=None, delim_whitespace=True)
        df.columns = ['date', 'time', 'delta_x']
        df['hour'] = [datetime.strptime(x, '%H:%M').hour for x in df['time'].values]
        df = df.groupby(['date', 'hour']).agg({'delta_x': 'sum'})
Which outputs the correct result:
date hour delta_x
01/01/2018 0 59
But I was wondering, is there a better, faster and more efficient way, perhaps using NumPy ?
You can try the following packages, which are designed to speed up pandas operations:
https://github.com/jmcarpenter2/swifter
https://github.com/modin-project/modin
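Independent of those packages, the biggest cost in the snippet above is usually the per-row strptime call. A minimal vectorized sketch, assuming the time strings are always zero-padded HH:MM:
# slice the hour directly from the string instead of parsing every row
df['hour'] = df['time'].str.slice(0, 2).astype(int)
df = df.groupby(['date', 'hour'], as_index=False)['delta_x'].sum()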

Pandas changing dates near each other

I have a pandas dataframe with dates and users which looks like this-
date = ['1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user =['A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
I am trying to find all dates that are next to each other (e.g. Jan 1 and Jan 2) and convert them to a single date, so both entries would then get the earlier of the two. The number of entries is over a million. This data is created from a scan that triggers nightly and sometimes flows over into the next day.
Update-
I wanted to consolidate the date of the scan so that I can show the visualization properly. As it is now, the results have more entries on the day the scan starts but very few entries for the day into which the scan overflowed. A primary date and time is stored separately, so I am not losing any data. The user column comes from scanning a file with all the usernames, and the date stores the date when the file was scanned.
So far I was able to read the dataframe and then sort it based on the date to have the entries one after the other.
The output should look like the following -
Is there a pythonic way of doing this?
One issue to consider is the case of multiple consecutive days and how you want to handle these. The following code sets the day to the first of the consecutive days in each block:
import pandas as pd
from datetime import timedelta
# prepend two dates to show multiple consecutive days "use-case"
date = ['12/31/2019','1/1/2020','1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user = ['Z','Z','A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
# first convert to datetime to allow date operations
df.Date = pd.to_datetime(df.Date)
# check if the date is one day after the row before (by shifting the Date column)
df['isConsecutive'] = (df.Date == df.Date.shift()+pd.DateOffset(1))
# get number of consecutive days in each block
df['numConsecutive'] = df.isConsecutive.groupby((~df.isConsecutive).cumsum()).cumsum()
# convert to timedelta
df.numConsecutive = df.numConsecutive.apply(lambda x: timedelta(days=x))
# subtract this difference from Date
df['NewDate'] = df.Date - df.numConsecutive
print(df)
This returns:
Date User isConsecutive numConsecutive NewDate
0 2019-12-31 Z False 0 days 2019-12-31
1 2020-01-01 Z True 1 days 2019-12-31
2 2020-01-02 A True 2 days 2019-12-31
3 2020-01-09 B False 0 days 2020-01-09
4 2020-01-10 C True 1 days 2020-01-09
5 2020-01-17 B False 0 days 2020-01-17
6 2020-01-18 A True 1 days 2020-01-17
7 2020-01-24 A False 0 days 2020-01-24
8 2020-01-25 B True 1 days 2020-01-24
9 2019-05-17 C False 0 days 2019-05-17
10 2019-05-18 A True 1 days 2019-05-17
11 2019-05-24 A False 0 days 2019-05-24
12 2019-05-29 B False 0 days 2019-05-29
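As a side note, with over a million rows the apply-based conversion above can be replaced by the vectorized pd.to_timedelta; a sketch:
# vectorized equivalent of df.numConsecutive.apply(lambda x: timedelta(days=x))
df.numConsecutive = pd.to_timedelta(df.numConsecutive, unit='days')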
