How to filter data by conditions after Groupby in Python - pandas-groupby

I have data like this:
price   Date        Time
100     2021/01/01  9:00
200     2021/01/02  9:00
112     2021/01/01  9:01
223     2021/01/02  9:02
1145    2021/01/01  9:02
2214    2021/01/02  9:03
11      2021/01/01  9:03
20      2021/01/02  9:10
I need to get three values from each day: the price at 9:00, the price at 18:00 (there are more rows than shown), and one random price from that day excluding 9:00 and 18:00. Note that 9:00 is not the first time of the day and 18:00 is not the last.
I know I should use groupby, for example df.groupby('Date')['price'], but I don't know how to apply conditions to filter the data after the groupby.
Because I need to use these values for every day, I also need to get them back after filtering. The expected answer looks like [100, 112, 200] (100 is the price at 9:00, 112 is the random price, 200 is the price at 18:00).

I added some data to your dataframe:
import pandas
from io import StringIO
csv = StringIO("""price,date,time
100,2021/01/01,9:00
200,2021/01/02,9:00
1800,2021/01/01,18:00
2800,2021/01/02,18:00
112,2021/01/01,9:01
223,2021/01/02,9:02
1145,2021/01/01,9:02
2214,2021/01/02,9:03
11,2021/01/01,9:03
20,2021/01/02,9:10
1145,2021/01/01,19:02
2214,2021/01/02,11:03
11,2021/01/01,19:03
20,2021/01/02,3:10""")
df = pandas.read_csv(csv, index_col=None)
I know the next part is a mess and I hate pandas, but I hope you find the answer and get the idea.
Just run the code :)
grouped = df.groupby('date')
# keep only rows that are neither 9:00 nor 18:00
except18_9 = grouped.apply(lambda x: x[(x['time'] != '18:00') & (x['time'] != '9:00')]).reset_index(drop=True)
part1 = except18_9.groupby('date').sample(n=1)  # one random row per day
# keep only the 9:00 and 18:00 rows
part2 = grouped.apply(lambda x: x.loc[(x['time'] == '18:00') | (x['time'] == '9:00')]).reset_index(drop=True)
pandas.concat([part1, part2]).sort_values(['date', 'time'])
The final result contains, for each day, the 9:00 row, the 18:00 row, and one randomly sampled row (the sampled rows change between runs).
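For reference, a tidier sketch of the same idea (using the column names above) that avoids the apply calls:
fixed = df[df['time'].isin(['9:00', '18:00'])]        # the 9:00 and 18:00 rows
others = df[~df['time'].isin(['9:00', '18:00'])]      # everything else
random_pick = others.groupby('date').sample(n=1)      # one random row per day
result = pandas.concat([fixed, random_pick]).sort_values(['date', 'time'])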

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and to pivot it, but I can't get it into the shape I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Judging from the details you have provided, I think you are dealing with time-series data with values acquired at 02:00:00 and 03:00:00 on different dates. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we compute the unique days on which data was acquired:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the timestamp because %f only parses up to microseconds (six digits), while the timestamps carry nanosecond precision (nine digits). We then convert each string to a datetime object and take the unique dates.
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements to insert into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the next position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
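A more idiomatic alternative (just a sketch, assuming the df built above) is to pivot instead of looping; this also scales to the full 00:00 to 20:00 range without generating columns by hand:
df["date"] = pd.to_datetime(df["date"])  # parse the full timestamps
wide = (df.assign(day=df["date"].dt.date,
                  hour=df["date"].dt.strftime("%H:%M"))
          .pivot(index="day", columns="hour", values="value")
          .reset_index())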

Convert number into hours and minutes while reading CSV in Pandas

I have CSV file where the second column indicates a time point with the format HHMMSS.
ID;TIME
A;110500
B;090000
C;130200
This raises some questions for me:
Does pandas have a data format to represent a time point with hour, minutes and seconds but without the day, month, ...?
How can I convert that fields to such a format?
In plain Python I would iterate over the fields, but I am sure pandas has a more efficient way.
If there is no time of day format without date I could add a day-month-year date to that timepoint.
Here is an MWE:
import pandas
import io
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';')
print(df)
Results in
ID TIME
0 A 110500
1 B 90000
2 C 130200
But what I want to see is
ID TIME
0 A 11:05:00
1 B 9:00:00
2 C 13:02:00
Or much better cutting the seconds also
ID TIME
0 A 11:05
1 B 9:00
2 C 13:02
You could use the date_parser parameter of read_csv together with the time accessor:
csv.seek(0)  # rewind the StringIO, since the MWE above already consumed it
df = pandas.read_csv(csv, sep=';',
                     parse_dates=[1],  # need to know the position of the TIME column
                     date_parser=lambda x: pandas.to_datetime(x, format='%H%M%S').time)
print(df)
ID TIME
0 A 11:05:00
1 B 09:00:00
2 C 13:02:00
But doing it after reading works just as well. Note that TIME is read as an integer, so the leading zero of 090000 is lost and has to be restored before parsing:
csv.seek(0)
df = (pandas.read_csv(csv, sep=';')
      .assign(TIME=lambda x: pandas.to_datetime(x['TIME'].astype(str).str.zfill(6), format='%H%M%S').dt.time)
      # or .dt.strftime('%#H:%M') to drop the seconds (%#H on Windows, %-H on Linux)
     )
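Alternatively (a sketch), keep TIME as a string at read time so the leading zero survives and no zero-padding is needed:
csv.seek(0)
df = (pandas.read_csv(csv, sep=';', dtype={'TIME': str})
      .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.time))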

Iterate over pandas dataframe while updating values

I've looked through a bunch of similar questions, but I cannot figure out how to apply the principles to my own case. I'm therefore trying to work out a simple example I can build on; basically, I need the idiot's guide before I can look at more complex examples.
Consider a dataframe that contains a list of names and times, plus a known start time. I want to add a column with the finish time, calculated as starttime + Time.
import pandas as pd
import datetime
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
"Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
I know that for each case I can calculate the finish time using
finishtime = starttime + datetime.timedelta(minutes=df.iloc[0, 1])
what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.
I tried
df["FinishTime"] = np.nan
for row in df.itertuples():
    df.at[row, "FinishTime"] = starttine + datetime.datetime.timedelta(minutes=row.Time)
but it gave a lot of errors I couldn't unravel. How am I meant to do this?
I am aware that the standard advice on iterating over a dataframe is: don't. I'm not committed to iterating; I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.
Use pd.to_timedelta()
import datetime
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate", "Sarah", "Isabell", "Connie", "Elsa", "Anne", "Lin"],
                   "Time": [3, 6, 1, 7, 23, 3, 4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime = df.Time + starttime)
df['FinishTime'] = df.Time + starttime # as pointed out by Trenton McKinney, .assign() is only one way to create new columns
# creating with df['new_col'] has the benefit of not having to copy the full df
print(df)
Output
Name Time FinishTime
0 Kate 00:03:00 2020-02-04 00:03:00
1 Sarah 00:06:00 2020-02-04 00:06:00
2 Isabell 00:01:00 2020-02-04 00:01:00
3 Connie 00:07:00 2020-02-04 00:07:00
4 Elsa 00:23:00 2020-02-04 00:23:00
5 Anne 00:03:00 2020-02-04 00:03:00
6 Lin 00:04:00 2020-02-04 00:04:00
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
Avoid looping in pandas at all cost
Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.
Here is a great stackoverflow conversation about this very topic.
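For completeness, the asker's loop can be fixed; the key corrections are indexing df.at with row.Index and using datetime.timedelta (a sketch, still far slower than the vectorized version above):
import datetime
import pandas as pd

df = pd.DataFrame({"Name": ["Kate", "Sarah", "Isabell", "Connie", "Elsa", "Anne", "Lin"],
                   "Time": [3, 6, 1, 7, 23, 3, 4]})
starttime = datetime.datetime(2020, 2, 4)

df["FinishTime"] = pd.NaT
for row in df.itertuples():
    # row.Index is the row label; df.at expects (label, column)
    df.at[row.Index, "FinishTime"] = starttime + datetime.timedelta(minutes=row.Time)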

What is the best way to calculate the month difference between two dates with a format like yyyyMM in pyspark?

In my python df, I have columns MTH, old_dt
MTH old_dt
201901 2018-03-01
201902 2017-02-20
201903 2016-05-12
To calculate the month difference between the two columns, I use the following Python code:
df['mth'] = pd.to_datetime(df['MTH'], format='%Y%m')
df = df.assign(
    dif=(df.mth.dt.year - df.old_dt.dt.year) * 12 +
        (df.mth.dt.month - df.old_dt.dt.month) + 1
)
The result is an integer, which is exactly what I want.
Now, since my dataset is huge (more than 1 billion records), I have decided to move to pyspark, but I'm not sure how this works there. I searched online and saw a function month_difference, but it does not seem to be what I want.
Thanks for any help, and thanks Jens for the editing.
My expected output is :
MTH old_dt dif
201901 2018-03-01 11
201902 2017-02-20 25
201903 2016-05-12 35
Will this work, please? I was not able to open my AE to test it.
from pyspark.sql import functions as F

def mth_interval(df):
    df = df.withColumn("mth", F.to_date("MTH", "yyyyMM"))
    df = df.withColumn("month_diff", (F.year("mth") - F.year("old_dt")) * 12 +
                                     (F.month("mth") - F.month("old_dt")) + 1)
    return df
Thanks! Just tested it and it worked!
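For reference, Spark also ships a built-in months_between; truncating old_dt to the start of its month reproduces the same whole-month convention as the pandas code (a sketch, untested against the real data):
from pyspark.sql import functions as F

df = df.withColumn("mth", F.to_date("MTH", "yyyyMM"))
df = df.withColumn("month_diff",
                   (F.months_between("mth", F.trunc("old_dt", "month")) + 1).cast("int"))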

Finding start time and end time in a column

I have a data set that has employees clocking in and out. It looks like this (note two entries per employee):
Employee Date Time
Emp1 1/1/16 06:00
Emp1 1/1/16 13:00
Emp2 1/1/16 09:00
Emp2 1/1/16 17:00
Emp3 1/1/16 11:00
Emp3 1/1/16 18:00
I want to get the data to look like this:
Employee Date Start End
Emp1 1/1/16 06:00 13:00
Emp2 1/1/16 09:00 17:00
Emp3 1/1/16 11:00 18:00
I would like to get it into a data frame format so that I can do some calculations.
I currently have tried
df['start'] = np.where((df['employee']==df['employee']&df['date']==df['date']),df['time'].min())
I also tried:
df.groupby(['employee','date'])['time'].max()
How do I get two columns out of one?
I would recommend merging Date and Time into one DateTime column; that would greatly simplify your work. You can do something like this:
df['DateTime']=pd.to_datetime(df['Date']+" "+df['Time'])
df.groupby('Employee')['DateTime'].agg([min, max])
There are other options depending on the content of your data. If you know that all the entries fall on the same day, you can simply do:
# First convert Date and Time columns to DateTime type
df['Date'] = pd.to_datetime(df['Date']).dt.date
df['Time'] = pd.to_datetime(df['Time']).dt.time
df.groupby('Employee').agg([min, max])
no need to create a DateTime column in this case.
If you want the start and end times per employee per day, you can do:
# First convert Date and Time columns to DateTime type
df['Date'] = pd.to_datetime(df['Date']).dt.date
df['Time'] = pd.to_datetime(df['Time']).dt.time
df.groupby(['Employee','Date'])['Time'].agg([min, max])
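To get column names matching the desired Start/End layout, named aggregation is handy (a sketch built on the converted columns above):
out = (df.groupby(['Employee', 'Date'])['Time']
         .agg(Start='min', End='max')
         .reset_index())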