How to remove successive similar numbers in a pandas DF column - python-3.x

I have a pandas DF with a column - this column can have 3 values, either 0, 1 or ' ' (see example below).
What I want to do is remove all successive numbers that are similar. So a 0 can never be followed by a 0 and a 1 can never be followed by 1. Instead I want to replace these by a ' '.
Current pandas DF
time  value
1:00  0
2:00
3:00  0
4:00  1
5:00
6:00
7:00  1
8:00  1
9:00  0
What I want
time  value
1:00  0
2:00
3:00
4:00  1
5:00
6:00
7:00
8:00
9:00  0
I tried to work with loops, but cannot find a clean way to refer to 'the next same value'.
Any simple solution for this?

An itertools solution:
from itertools import chain, groupby
import numpy as np

df.value = list(chain.from_iterable(
    [key, *[''] * (len(list(gr)) - 1)]
    for key, gr in groupby(df.value.replace("", np.nan).ffill())
))
replacing empty strings with np.nan
forward filling the NaNs to get streams of 0's and 1's
grouping by 0's and 1's
placing back the key (which is 0 or 1) along with some empty strings (group's length - 1)
flattening these blocks with chain.from_iterable
casting to a list to assign it back to the dataframe
to get
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0

We can use loc on value to drop the rows holding empty strings, then shift and compare the remaining rows to build a boolean mask. Finally, we set value to an empty string wherever the mask is True.
s = df['value'].loc[lambda x: x != '']
m = s.eq(s.shift())
df.loc[m[m].index, 'value'] = ''
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0
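To try the shift-and-compare approach end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's table, assuming blanks are stored as empty strings; the helper name blank_repeats is hypothetical and only wraps the three lines above so that the original frame is left untouched.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: blanks are empty strings)
df = pd.DataFrame({
    "time": ["1:00", "2:00", "3:00", "4:00", "5:00", "6:00", "7:00", "8:00", "9:00"],
    "value": [0, "", 0, 1, "", "", 1, 1, 0],
})


def blank_repeats(frame, col="value"):
    """Blank out every number that repeats the previous non-blank number."""
    out = frame.copy()
    s = out[col].loc[lambda x: x != ""]  # keep only the non-blank rows
    m = s.eq(s.shift())                  # True where a value equals the previous non-blank one
    out.loc[m[m].index, col] = ""        # blank those rows in the copy
    return out


result = blank_repeats(df)
print(result)
```

Because the comparison runs only on the non-blank rows, blanks between two equal numbers do not hide the repetition.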

Related

Time manipulations

Hello, I have to count how many people were scheduled for each hour in Excel. I transformed the start and end date/time values so that they contain only the time, and tried to subtract one from the other, but that only gives me a number of hours. What I need is the individual hours. Instead of:
starting on 9:00
ending on 17:00
I need this:
9:00
10:00
11:00
12:00
13:00
14:00
15:00
16:00
17:00
to count every hour that employee was at work. But I don't know how :(
Or is there a better way of doing that?
Assuming your table looks something like this:
Person         Start  End    09:00  10:00  11:00  12:00  13:00  14:00  15:00
Alice          08:35  16:35  1      1      1      1      1      1      1
Bob            09:35  17:35  0      1      1      1      1      1      1
Carl           10:35  18:35  0      0      1      1      1      1      1
Dan            11:35  19:35  0      0      0      1      1      1      1
Ed             12:35  20:35  0      0      0      0      1      1      1
Total present                1      2      3      4      5      5      5
You can compute the entries 0 or 1 in each cell under the times using the formula
=IF(AND((E$4>$C6);(E$4<=$D6));1;0)
In the formula, E$4 is a reference to the column header, e.g. "9:00", and $C6 and $D6 are references to the person's start and end times. They use mixed (partially absolute) references ($) so that the same formula can be copied and pasted into all the cells. The formula uses ; as the argument separator; if your locale uses commas, replace each ; with a comma.
The result will be 1 if the person was present at that time and 0 if not.
The "Total present" formulas just sum up the 1's and 0's in the column.
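If the schedule is available outside Excel, the same presence table can be sketched in pandas. This is only an illustration under assumptions: the frame shifts and the names presence and total_present are hypothetical, times are plain HH:MM strings on a single day, and the comparison mirrors the formula above (header time strictly after the start and no later than the end).

```python
import pandas as pd

# Hypothetical frame mirroring the example table above
shifts = pd.DataFrame({
    "Person": ["Alice", "Bob", "Carl", "Dan", "Ed"],
    "Start": ["08:35", "09:35", "10:35", "11:35", "12:35"],
    "End": ["16:35", "17:35", "18:35", "19:35", "20:35"],
})

# Convert HH:MM strings to timedeltas so they can be compared
start = pd.to_timedelta(shifts["Start"] + ":00")
end = pd.to_timedelta(shifts["End"] + ":00")

hours = [f"{h:02d}:00" for h in range(9, 16)]  # column headers 09:00 .. 15:00

# 1 if start < header time <= end, else 0 (same rule as the IF/AND formula)
presence = pd.DataFrame({
    h: ((start < pd.Timedelta(h + ":00")) & (end >= pd.Timedelta(h + ":00"))).astype(int)
    for h in hours
})
presence.insert(0, "Person", shifts["Person"])

total_present = presence[hours].sum()  # the "Total present" row
print(presence)
print(total_present)
```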

Calculating a duration from two dates in different time zones

I have a CSV file with trip data:
Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-04:00
3,08/30/21 01:29 AM,08/30/21 01:30 AM,GMT-04:00,GMT-04:00
4,08/30/21 01:45 AM,08/30/21 03:06 AM,GMT-04:00,GMT-04:00
5,08/30/21 03:08 AM,08/30/21 03:58 AM,GMT-04:00,GMT-04:00
6,08/30/21 03:59 AM,08/30/21 04:15 AM,GMT-04:00,GMT-04:00
I can read this file into a dataframe:
trips = pd.read_csv("trips.csv", sep=',')
What I would like to accomplish is to add a column 'duration' which gives me the trip duration in minutes. The trip duration has to be calculated as the difference between the trip arrival time and the trip departure time. In the above table, the 'depart time' is relative to the 'Depart Timezone'. Similarly, the 'Arrive Time' is relative to the 'Arrive Timezone'.
Note that in the above example, the arrival and departure dates, as well as the arrival and departure time zones happen to be the same, but this does not hold in general for my data.
What you have are UTC offsets (GMT-04:00 is four hours behind UTC). You can join each date/time column with its offset column using ' ' and parse the result with pd.to_datetime. You can then calculate the duration (a timedelta) from the resulting tz-aware datetime columns. Ex (with the CSV loaded into df):
# make datetime columns:
df['dt_depart'] = pd.to_datetime(df['Depart Time'] + ' ' + df['Depart Timezone'],
                                 utc=True)
df['dt_arrive'] = pd.to_datetime(df['Arrive Time'] + ' ' + df['Arrive Timezone'],
                                 utc=True)
Note: I'm using utc=True here in case there are mixed UTC offsets in the input. That gives e.g.
df['dt_depart']
Out[6]:
0 2021-08-29 17:00:00+00:00
1 2021-08-29 18:00:00+00:00
2 2021-08-29 21:29:00+00:00
3 2021-08-29 21:45:00+00:00
4 2021-08-29 23:08:00+00:00
5 2021-08-29 23:59:00+00:00
Name: dt_depart, dtype: datetime64[ns, UTC]
then
# calculate the travel duration (timedelta column):
df['traveltime'] = df['dt_arrive'] - df['dt_depart']
gives e.g.
df['traveltime']
Out[7]:
0 0 days 00:45:00
1 0 days 03:28:00
2 0 days 00:01:00
3 0 days 01:21:00
4 0 days 00:50:00
5 0 days 00:16:00
Name: traveltime, dtype: timedelta64[ns]
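The question asks for the duration in minutes, and the sample output above shows 09:00 PM at GMT-04:00 as 17:00 UTC rather than 01:00 UTC on the next day. That suggests the parser read the offset with the POSIX sign convention (dateutil is known to do this for strings such as GMT-04:00), which would distort durations whenever the two offsets differ. A minimal sketch that avoids the ambiguity, assuming the offsets always look like GMT±HH:MM and the timestamps follow the format of the sample rows: strip the GMT prefix and parse with an explicit format. The names to_utc and duration_min are hypothetical.

```python
import io
import pandas as pd

csv_text = """Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-04:00
3,08/30/21 01:29 AM,08/30/21 01:30 AM,GMT-04:00,GMT-04:00
4,08/30/21 01:45 AM,08/30/21 03:06 AM,GMT-04:00,GMT-04:00
5,08/30/21 03:08 AM,08/30/21 03:58 AM,GMT-04:00,GMT-04:00
6,08/30/21 03:59 AM,08/30/21 04:15 AM,GMT-04:00,GMT-04:00
"""
trips = pd.read_csv(io.StringIO(csv_text))


def to_utc(local, zone):
    """Combine local time strings with 'GMT-04:00'-style offsets into UTC timestamps."""
    offset = zone.str.replace("GMT", "", regex=False)  # 'GMT-04:00' -> '-04:00'
    return pd.to_datetime(local + " " + offset,
                          format="%m/%d/%y %I:%M %p %z", utc=True)


depart = to_utc(trips["Depart Time"], trips["Depart Timezone"])
arrive = to_utc(trips["Arrive Time"], trips["Arrive Timezone"])

# Timedelta -> minutes as a float column
trips["duration_min"] = (arrive - depart).dt.total_seconds() / 60
print(trips[["Trip ID", "duration_min"]])
```

With equal offsets on both ends, as in the sample rows, the durations match the timedelta output above; the explicit format only matters once departure and arrival offsets differ.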

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours from a lookback period of 5 hours. I was using the below statements to accomplish this:
df_thld=df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5))& (df.PR_Rate>50) & (df.CNT_12571 < 30)]
df_thld.loc[:,'HR_CNT'] = df_thld.groupby('ID')['Date'].nunique().to_frame('HR_CNT').reset_index()
df_thld[df_thld['HR_CNT'] > 3]
The problem with this approach is that, because the lookback period is 5 hours, HR_CNT can count any non-consecutive hours that breach this criterion.
My dataset is as below:
DataFrame
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above DataFrame, the threshold was not breached at 11:00, but the count picks up 10, 12 and 13 as the hours that breached the threshold instead of 12, 13 and 14 as required. Each ID may or may not have this criterion breached in a single day. Any idea how I can fix this issue?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly. An example of this dataframe is illustrated below as df. From this dataframe, you want to filter only those rows which satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
If and only if the threshold is surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the one made up of the entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # keep only the rows that satisfy both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # number of qualifying rows inside each trailing window of `dur` hours
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # walk backwards; when a window holds at least `dur` rows, collect its rows
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
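The same idea can also be sketched without an explicit loop. This is an alternative technique, not the function above: it labels runs of qualifying rows that are exactly one hour apart within each ID and keeps the runs that are long enough. It assumes hourly readings and the column names of the example frame; the name consecutive_breaches is hypothetical.

```python
import pandas as pd

# Example frame from the answer above
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-06-16 10:00", "2021-06-16 12:00", "2021-06-16 14:00",
        "2021-06-16 15:00", "2021-06-16 16:00",
    ]),
    "IDs": ["A1"] * 5,
    "CT_1257": [15, 31, 25, 13, 7],
    "PR_Rate": [50.487, 40.806, 52.302, 61.450, 73.805],
})


def consecutive_breaches(df, dur, pr_threshold, ct_threshold):
    """Keep qualifying rows that belong to a run of at least `dur` consecutive hours."""
    ok = df[(df["CT_1257"] < ct_threshold) & (df["PR_Rate"] > pr_threshold)]
    ok = ok.sort_values(["IDs", "Date"])
    # gap to the previous qualifying row of the same ID (NaT for the first row)
    gap = ok.groupby("IDs")["Date"].diff()
    # a new run starts whenever the gap is not exactly one hour
    run_id = (gap != pd.Timedelta(hours=1)).cumsum()
    run_len = ok.groupby(run_id)["Date"].transform("count")
    return ok[run_len >= dur]


result = consecutive_breaches(df, 3, 50, 30)
print(result)
```

Unlike filterFrame, this keeps the whole run when it is longer than dur hours, which may or may not be what is wanted.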

How Can I extract data in python with the same value of fields using pandas

I have a dataset with fields id, time, date, name, etc. I want to extract the rows that share the same id and date. How can I do that?
For example
id time date
1 16:00 03/05/2020
2 16:00 03/05/2020
1 17:00 03/05/2020
1 16:00 04/05/2020
2 16:00 04/05/2020
Now I want to fetch :
1 16:00 03/05/2020
1 17:00 03/05/2020
You can use groupby and filter:
df.groupby(['id', 'date']).filter(lambda s: len(s) > 1)
id time date
0 1 16:00 03/05/2020
2 1 17:00 03/05/2020
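A minimal runnable sketch of the above on the question's sample rows, together with a possible alternative using duplicated with keep=False, which marks every row whose (id, date) pair occurs more than once. The variable names by_filter and by_duplicated are hypothetical.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2],
    "time": ["16:00", "16:00", "17:00", "16:00", "16:00"],
    "date": ["03/05/2020", "03/05/2020", "03/05/2020", "04/05/2020", "04/05/2020"],
})

# Keep the groups that contain more than one row
by_filter = df.groupby(["id", "date"]).filter(lambda g: len(g) > 1)

# Alternative: boolean mask of rows whose (id, date) pair is repeated
by_duplicated = df[df.duplicated(subset=["id", "date"], keep=False)]

print(by_filter)
print(by_duplicated)
```

Both select the same rows here; the mask version avoids calling a Python function once per group.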

Getting the 30 mins max sum of Column B per day

So I have a first column with dates at different timestamps, and a second column with the data. Let the first column be A and the second column be B. I need to get, for each day, the maximum sum of the data within any 30-minute window.
So for example, for the data below,
dateTimeRead(YYYY-MM-DD HH-mm-ss) rain_value(mm) air_pressure(hPa)
1/2/2015 0:00 0 941.5675
1/2/2015 0:15 0 941.4625
1/2/2015 0:30 0 941.3
1/2/2015 0:45 0.1 941.2725
1/2/2015 1:00 0.2 941.12
1/2/2015 1:15 0.3 940.8625
1/2/2015 1:30 0.6 940.7575
1/2/2015 1:45 0.2 940.6075
1/2/2015 2:00 0 940.545
1/2/2015 2:15 0 940.27
1/2/2015 2:30 0 940.2125
1/2/2015 16:15 0 940.625
1/2/2015 16:30 0 940.69
1/2/2015 16:45 0 940.6175
1/2/2015 17:00 0 940.635
1/2/2015 19:00 0 941.9975
1/2/2015 20:45 0 942.7925
1/2/2015 21:00 0 942.745
1/2/2015 21:15 0 942.6325
1/2/2015 21:30 0 942.735
1/2/2015 21:45 0 942.765
1/2/2015 22:00 0 941.6
1/3/2015 2:15 0.1
1/3/2015 2:30 0.2 941.1275
1/3/2015 2:45 0.1 941.125
1/3/2015 3:00 0.1 940.955
1/3/2015 3:15 0 941.035
the desired output would be
Date Max Sum
1/2/2015 1.1
1/3/2015 0.4
and so on.
You can do this by keeping track of the 30-minute interval sums in a helper column, then using an array formula to calculate the max per day.
For example, let's suppose your data above is in columns A-C. (But we ignore the data in C and focus on column B as you have done in your example.) In $D$1, let's put your desired interval, 0:30. In column E, we'll keep track of, for each time in column A, what the sum of rain_value was for the last 30-minute window. To calculate this, you could paste the following formula in E2 and copy down the column (adjusting if you want > instead of >=, for example):
=SUMIFS(B:B,A:A,"<="&A2,A:A,">="&A2-$D$1)
// assumes the time interval is in $D$1
Now you have a column of data that includes the windows over which you want to take the max. One way to do this is by using the MAX formula as an array formula. First, create a new column F which just extracts the date part of the datetime in column A. You can do this just by putting =INT(A2) in cell F2 and copying down, for example.
Then, create a column G just for your dates (1/2/2015 and 1/3/2015 in your example). In column H, calculate the following array formula* in H2 and copy down to get the max of column E:
{=MAX(IF(F:F=G2,E:E))}
This will get the max per date.
*If you don't know how to execute array formulas, basically just type the formula =MAX(IF(F:F=G2,E:E)) into H2, but then instead of typing Enter, type Ctrl-Shift-Enter on Windows (or Cmd-Enter on a Mac). There are ways to do this last part without array formulas too, with clever use of SUMPRODUCT or INDEX.
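For comparison, the same two steps (a trailing 30-minute sum per reading, then the maximum per day) can be sketched in pandas. This is only an illustration under assumptions: the dates are read as month/day/year, only part of the sample data is included, and the names rain, window_sum and daily_max are hypothetical. closed="both" includes both window edges, matching the <= and >= conditions in the SUMIFS formula.

```python
import pandas as pd

# Part of the sample data (assumption: 1/2/2015 means 2 January 2015)
rain = pd.Series(
    [0.1, 0.2, 0.3, 0.6, 0.2, 0.0, 0.1, 0.2, 0.1, 0.1, 0.0],
    index=pd.to_datetime([
        "2015-01-02 00:45", "2015-01-02 01:00", "2015-01-02 01:15",
        "2015-01-02 01:30", "2015-01-02 01:45", "2015-01-02 02:00",
        "2015-01-03 02:15", "2015-01-03 02:30", "2015-01-03 02:45",
        "2015-01-03 03:00", "2015-01-03 03:15",
    ]),
    name="rain_value",
)

# Sum over the 30 minutes ending at each reading, both edges included
window_sum = rain.rolling("30min", closed="both").sum()

# Largest 30-minute sum on each calendar day
daily_max = window_sum.groupby(window_sum.index.normalize()).max()
print(daily_max)
```

As with the helper column in the spreadsheet, a window ending shortly after midnight would also include readings from the previous day.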
