How can I extract data in Python with the same value of fields using pandas - python-3.x

I have a dataset with fields id, time, date, name, etc. I want to extract the rows that share the same id and date. How can I do that?
For example:
id  time   date
1   16:00  03/05/2020
2   16:00  03/05/2020
1   17:00  03/05/2020
1   16:00  04/05/2020
2   16:00  04/05/2020
Now I want to fetch:
1   16:00  03/05/2020
1   17:00  03/05/2020

You can groupby and filter:
df.groupby(['id', 'date']).filter(lambda s: len(s) > 1)
   id   time        date
0   1  16:00  03/05/2020
2   1  17:00  03/05/2020
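For reference, a minimal runnable version of the above (column names taken from the example; groupby/filter keeps every (id, date) group that occurs more than once):
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 2, 1, 1, 2],
    'time': ['16:00', '16:00', '17:00', '16:00', '16:00'],
    'date': ['03/05/2020', '03/05/2020', '03/05/2020', '04/05/2020', '04/05/2020'],
})

# keep only the rows whose (id, date) pair appears more than once
print(df.groupby(['id', 'date']).filter(lambda s: len(s) > 1))
On large frames, df[df.duplicated(['id', 'date'], keep=False)] selects the same rows and is usually faster than a Python-level filter.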

Related

Time manipulations

Hello, I have to count how many people were scheduled during each hour in Excel. I transformed the starting and ending date/times to contain only the time, and based on that I tried to subtract the two, but subtraction only gives me a duration. What I need instead is the full list of hours, so that
starting on 9:00
ending on 17:00
becomes
9:00
10:00
11:00
12:00
13:00
14:00
15:00
16:00
17:00
so I can count every hour that the employee was at work. But I don't know how :(
Or is there a better way of doing that?
Assuming your table looks something like this:
Person         Start  End    09:00  10:00  11:00  12:00  13:00  14:00  15:00
Alice          08:35  16:35      1      1      1      1      1      1      1
Bob            09:35  17:35      0      1      1      1      1      1      1
Carl           10:35  18:35      0      0      1      1      1      1      1
Dan            11:35  19:35      0      0      0      1      1      1      1
Ed             12:35  20:35      0      0      0      0      1      1      1
Total present                    1      2      3      4      5      5      5
You can compute the entries 0 or 1 in each cell under the times using the formula
=IF(AND((E$4>$C6);(E$4<=$D6));1;0)
In the formula, E$4 is a reference to the column header, e.g. "09:00", while $C6 and $D6 are references to the person's start and end times. They use partial absolute references ($) so the same formula can be copied and pasted into all the cells.
The result will be 1 if the person was present at that time and 0 if not.
The "Total present" formulas just sum up the 1's and 0's in the column.

Calculating a duration from two dates in different time zones

I have a CSV file with trip data:
Trip ID,Depart Time,Arrive Time,Depart Timezone,Arrive Timezone
1,08/29/21 09:00 PM,08/29/21 09:45 PM,GMT-04:00,GMT-04:00
2,08/29/21 10:00 PM,08/30/21 01:28 AM,GMT-04:00,GMT-04:00
3,08/30/21 01:29 AM,08/30/21 01:30 AM,GMT-04:00,GMT-04:00
4,08/30/21 01:45 AM,08/30/21 03:06 AM,GMT-04:00,GMT-04:00
5,08/30/21 03:08 AM,08/30/21 03:58 AM,GMT-04:00,GMT-04:00
6,08/30/21 03:59 AM,08/30/21 04:15 AM,GMT-04:00,GMT-04:00
I can read this file into a dataframe:
trips = pd.read_csv("trips.csv", sep=',')
What I would like to accomplish is to add a column 'duration' which gives me the trip duration in minutes. The trip duration has to be calculated as the difference between the arrival time and the departure time. In the above table, the 'Depart Time' is relative to the 'Depart Timezone'. Similarly, the 'Arrive Time' is relative to the 'Arrive Timezone'.
Note that in the above example, the arrival and departure dates, as well as the arrival and departure time zones happen to be the same, but this does not hold in general for my data.
What you have are UTC offsets (GMT-04:00 is four hours behind UTC); you can join each date/time column with its respective offset column on a space and parse the result with to_datetime. You can then calculate the duration (a timedelta) from the resulting tz-aware datetime columns. Ex:
# make tz-aware datetime columns (df is the trips dataframe read above):
df = trips
df['dt_depart'] = pd.to_datetime(df['Depart Time'] + ' ' + df['Depart Timezone'],
                                 utc=True)
df['dt_arrive'] = pd.to_datetime(df['Arrive Time'] + ' ' + df['Arrive Timezone'],
                                 utc=True)
Note: I'm using utc=True here in case there are mixed UTC offsets in the input. That gives e.g.
df['dt_depart']
Out[6]:
0 2021-08-29 17:00:00+00:00
1 2021-08-29 18:00:00+00:00
2 2021-08-29 21:29:00+00:00
3 2021-08-29 21:45:00+00:00
4 2021-08-29 23:08:00+00:00
5 2021-08-29 23:59:00+00:00
Name: dt_depart, dtype: datetime64[ns, UTC]
then
# calculate the travel duration (timedelta column):
df['traveltime'] = df['dt_arrive'] - df['dt_depart']
gives e.g.
df['traveltime']
Out[7]:
0 0 days 00:45:00
1 0 days 03:28:00
2 0 days 00:01:00
3 0 days 01:21:00
4 0 days 00:50:00
5 0 days 00:16:00
Name: traveltime, dtype: timedelta64[ns]
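Since the goal was a duration in minutes, the timedelta column can be converted at the end, e.g.:
# timedelta -> minutes (float)
df['duration'] = df['traveltime'].dt.total_seconds() / 60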

How to remove successive similar numbers in a pandas DF column

I have a pandas DF with a column that can have 3 values: either 0, 1 or ' ' (see example below).
What I want to do is remove all successive numbers that are the same, so a 0 can never be followed by a 0 and a 1 can never be followed by a 1. Instead I want to replace these with a ' '.
Current pandas DF:
time  value
1:00  0
2:00
3:00  0
4:00  1
5:00
6:00
7:00  1
8:00  1
9:00  0
What I want:
time  value
1:00  0
2:00
3:00
4:00  1
5:00
6:00
7:00
8:00
9:00  0
I tried to work with loops, but cannot find a clean way to refer to 'the next same value'.
Any simple solution for this?
An itertools solution:
from itertools import chain, groupby
import numpy as np

df.value = list(chain.from_iterable(
    [key, *[''] * (len(list(gr)) - 1)]
    for key, gr in groupby(df.value.replace('', np.nan).ffill())
))
This works by:
- replacing the empty strings with np.nan
- forward-filling the NaNs to get runs of 0's and 1's
- grouping by those 0's and 1's
- placing back the key (which is 0 or 1) followed by empty strings (group's length - 1)
- flattening these blocks with chain.from_iterable
- casting to a list to assign it back to the dataframe
to get
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0
We can use loc on value to drop the rows containing empty strings, then shift and compare the filtered rows to create a boolean mask. Finally, we mask the values with an empty string where the boolean mask holds True:
# keep only the rows that hold an actual number
s = df['value'].loc[lambda x: x != '']
# True where a number repeats the previous (non-blank) number
m = s.eq(s.shift())
# blank out those repeats in the original frame
df.loc[m[m].index, 'value'] = ''
time value
0 1:00 0
1 2:00
2 3:00
3 4:00 1
4 5:00
5 6:00
6 7:00
7 8:00
8 9:00 0
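The same masking idea can also be written without the explicit index lookup; a sketch of an equivalent variant:
import numpy as np

s = df['value'].replace('', np.nan)
# True where a number equals the most recent number seen before it
repeat = s.notna() & s.eq(s.ffill().shift())
df.loc[repeat, 'value'] = ''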

Groupby expanding count - elements changing of group at different time stamps

I have a HUGE DataFrame that looks as follows (this is just an example to illustrate the problem):
id timestamp target_time interval
1 08:00:00 10:20:00 (10-11]
1 08:30:00 10:21:00 (10-11]
1 09:10:00 11:30:00 (11-12]
2 09:15:00 10:15:00 (10-11]
2 09:35:00 10:11:00 (10-11]
3 09:45:00 11:12:00 (11-12]
...
I would like to create a series looking as follows:
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 1
09:35:00 1
(11-12] 09:10:00 1
09:45:00 2
The objective is to count, for each time interval, how many unique ids had their corresponding target_time within the interval at their timestamp. Note that the target_time for each id can change at different timestamps. For instance, for the id 1 the interval is (10-11] from 08:00:00 to 08:30:00, but then it changes to (11-12] at 09:10:00. Therefore, at 09:15:00 I do not want to count the id 1 in the resulting Series.
I tried a groupby -> expanding -> np.unique approach, but it does not provide the result that I want:
df.set_index('timestamp').groupby('interval').id.expanding().apply(lambda x: np.unique(x).shape[0])
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 2
09:35:00 2
(11-12] 09:10:00 1
09:45:00 2
Any hint on how I can approach this problem? I want to make use of pandas routines as much as possible, in order to reduce computational time, since the length of the DataFrame is 1453076...
Many thanks in advance!
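One possible approach (a sketch, not from the thread): scan the rows in timestamp order while tracking each id's most recent interval, so an id that has moved to a new interval is no longer counted in its old one. A single O(n) pass with per-interval counters should be acceptable even at ~1.4M rows:
from collections import Counter
import pandas as pd

df = df.sort_values('timestamp')
state = {}          # id -> most recent interval
tally = Counter()   # interval -> number of ids currently targeting it
counts = []
for row in df.itertuples():
    old = state.get(row.id)
    if old is not None:
        tally[old] -= 1           # the id leaves its previous interval
    state[row.id] = row.interval
    tally[row.interval] += 1      # ...and joins the current one
    counts.append(tally[row.interval])

unique_ids = pd.Series(
    counts,
    index=pd.MultiIndex.from_frame(df[['interval', 'timestamp']]),
    name='unique_ids',
).sort_index()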

Apply a value to max values in a groupby

I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1
The idea is to write an anonymous function that operates on each of your groups and feed this to your groupby using apply:
df['Value'] = df.groupby('ID', as_index=False).apply(lambda x: x.Time == max(x.Time)).values.astype(int)
(the astype(int) turns the booleans into the 0/1 values shown above)
Assuming that your 'Time' column is already a datetime64, you want to group by the 'ID' column and then call transform to apply a lambda, creating a series with an index aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is because the dtype returned is a datetime for some reason rather than a boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]
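With a modern pandas the workaround shouldn't be needed, since the comparison can be done directly and cast to int; a sketch:
# 1 where the row's Time equals its group's max, else 0
df['Value'] = df['Time'].eq(df.groupby('ID')['Time'].transform('max')).astype(int)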
