How to filter data by conditions after Groupby in Python - pandas-groupby

I have data like this:
price   Date        Time
100     2021/01/01  9:00
200     2021/01/02  9:00
112     2021/01/01  9:01
223     2021/01/02  9:02
1145    2021/01/01  9:02
2214    2021/01/02  9:03
11      2021/01/01  9:03
20      2021/01/02  9:10
I need to get three values from each day: the price at 9:00, the price at 18:00 (there are more rows than shown), and one random price from that day excluding 9:00 and 18:00. Note that 9:00 is not the first time of the day and 18:00 is not the last.
I know I should use groupby, for example df.groupby('Date')['price'], but I don't know how to apply conditions to filter the data after the groupby.
Because I need to use these values for every day, I also need to get them back after filtering. The expected answer looks like [100, 112, 200] (100 is the price at 9:00, 112 is the random price, 200 is the price at 18:00).

I added some data to your dataframe:
import pandas
from io import StringIO
csv = StringIO("""price,date,time
100,2021/01/01,9:00
200,2021/01/02,9:00
1800,2021/01/01,18:00
2800,2021/01/02,18:00
112,2021/01/01,9:01
223,2021/01/02,9:02
1145,2021/01/01,9:02
2214,2021/01/02,9:03
11,2021/01/01,9:03
20,2021/01/02,9:10
1145,2021/01/01,19:02
2214,2021/01/02,11:03
11,2021/01/01,19:03
20,2021/01/02,3:10""")
df = pandas.read_csv(csv, index_col=None)
I know the next part is a mess and I hate pandas, but I hope you find the answer and get the idea.
Just run the code :)
grouped = df.groupby('date')
# keep only rows that are neither 9:00 nor 18:00
except18_9 = grouped.apply(lambda x: x[(x['time'] != '18:00') & (x['time'] != '9:00')]).reset_index(drop=True)
part1 = except18_9.groupby('date').sample(n=1)  # one random row per day
# keep only the 9:00 and 18:00 rows
part2 = grouped.apply(lambda x: x.loc[(x['time'] == '18:00') | (x['time'] == '9:00')]).reset_index(drop=True)
pandas.concat([part1, part2]).sort_values(['date', 'time'])
The final result contains, for each day, the 9:00 row, the 18:00 row, and one randomly sampled row (the sampled rows change between runs).
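For reference, a tidier sketch of the same idea (using the column names above) that avoids the apply calls:
fixed = df[df['time'].isin(['9:00', '18:00'])]        # the 9:00 and 18:00 rows
others = df[~df['time'].isin(['9:00', '18:00'])]      # everything else
random_pick = others.groupby('date').sample(n=1)      # one random row per day
result = pandas.concat([fixed, random_pick]).sort_values(['date', 'time'])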

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and to pivot it, but I can't get it into the shape I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Judging from the details you have provided, I think you are dealing with time-series data with values acquired at 02:00:00 and 03:00:00 on different dates. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we compute the unique days on which data was acquired:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the timestamp because %f only parses up to microseconds (six digits), while the timestamps carry nanosecond precision (nine digits). We then convert each string to a datetime object and take the unique dates.
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements to insert into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the next position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
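A more idiomatic alternative (just a sketch, assuming the df built above) is to pivot instead of looping; this also scales to the full 00:00 to 20:00 range without generating columns by hand:
df["date"] = pd.to_datetime(df["date"])  # parse the full timestamps
wide = (df.assign(day=df["date"].dt.date,
                  hour=df["date"].dt.strftime("%H:%M"))
          .pivot(index="day", columns="hour", values="value")
          .reset_index())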

Convert number into hours and minutes while reading CSV in Pandas

I have CSV file where the second column indicates a time point with the format HHMMSS.
ID;TIME
A;110500
B;090000
C;130200
This raises some questions for me:
Does pandas have a data format to represent a time point with hour, minutes and seconds but without the day, month, ...?
How can I convert that fields to such a format?
In plain Python I would iterate over the fields, but I am sure pandas has a more efficient way.
If there is no time of day format without date I could add a day-month-year date to that timepoint.
Here is an MWE:
import pandas
import io
csv = io.StringIO('ID;TIME\nA;110500\nB;090000\nC;130200')
df = pandas.read_csv(csv, sep=';')
print(df)
Results in
ID TIME
0 A 110500
1 B 90000
2 C 130200
But what I want to see is
ID TIME
0 A 11:05:00
1 B 9:00:00
2 C 13:02:00
Or much better cutting the seconds also
ID TIME
0 A 11:05
1 B 9:00
2 C 13:02
You could use the date_parser parameter of read_csv together with the time accessor:
csv.seek(0)  # rewind the StringIO, since the MWE above already consumed it
df = pandas.read_csv(csv, sep=';',
                     parse_dates=[1],  # need to know the position of the TIME column
                     date_parser=lambda x: pandas.to_datetime(x, format='%H%M%S').time)
print(df)
ID TIME
0 A 11:05:00
1 B 09:00:00
2 C 13:02:00
But doing it after reading works just as well. Note that TIME is read as an integer, so the leading zero of 090000 is lost and has to be restored before parsing:
csv.seek(0)
df = (pandas.read_csv(csv, sep=';')
      .assign(TIME=lambda x: pandas.to_datetime(x['TIME'].astype(str).str.zfill(6), format='%H%M%S').dt.time)
      # or .dt.strftime('%#H:%M') to drop the seconds (%#H on Windows, %-H on Linux)
     )
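Alternatively (a sketch), keep TIME as a string at read time so the leading zero survives and no zero-padding is needed:
csv.seek(0)
df = (pandas.read_csv(csv, sep=';', dtype={'TIME': str})
      .assign(TIME=lambda x: pandas.to_datetime(x['TIME'], format='%H%M%S').dt.time))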

Iterate over pandas dataframe while updating values

I've looked through a bunch of similar questions, but I cannot figure out how to apply the principles to my own case. I'm therefore trying to work out a simple example I can build on; basically, I need the idiot's guide before I can look at more complex examples.
Consider a dataframe that contains a list of names and times, plus a known start time. I want to add a column with the finish time, calculated as starttime + Time.
import pandas as pd
import datetime
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
"Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
I know that for each case I can calculate the finish time using
finishtime = starttime + datetime.timedelta(minutes=df.iloc[0, 1])
what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.
I tried
df["FinishTime"] = np.nan
for row in df.itertuples():
    df.at[row, "FinishTime"] = starttine + datetime.datetime.timedelta(minutes=row.Time)
but it gave a lot of errors I couldn't unravel. How am I meant to do this?
I am aware that the standard advice on iterating over a dataframe is: don't. I'm not committed to iterating; I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.
Use pd.to_timedelta()
import datetime
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate", "Sarah", "Isabell", "Connie", "Elsa", "Anne", "Lin"],
                   "Time": [3, 6, 1, 7, 23, 3, 4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime = df.Time + starttime)
df['FinishTime'] = df.Time + starttime # as pointed out by Trenton McKinney, .assign() is only one way to create new columns
# creating with df['new_col'] has the benefit of not having to copy the full df
print(df)
Output
Name Time FinishTime
0 Kate 00:03:00 2020-02-04 00:03:00
1 Sarah 00:06:00 2020-02-04 00:06:00
2 Isabell 00:01:00 2020-02-04 00:01:00
3 Connie 00:07:00 2020-02-04 00:07:00
4 Elsa 00:23:00 2020-02-04 00:23:00
5 Anne 00:03:00 2020-02-04 00:03:00
6 Lin 00:04:00 2020-02-04 00:04:00
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
Avoid looping in pandas at all cost
Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.
Here is a great stackoverflow conversation about this very topic.
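For completeness, the asker's loop can be fixed; the key corrections are indexing df.at with row.Index and using datetime.timedelta (a sketch, still far slower than the vectorized version above):
import datetime
import pandas as pd

df = pd.DataFrame({"Name": ["Kate", "Sarah", "Isabell", "Connie", "Elsa", "Anne", "Lin"],
                   "Time": [3, 6, 1, 7, 23, 3, 4]})
starttime = datetime.datetime(2020, 2, 4)

df["FinishTime"] = pd.NaT
for row in df.itertuples():
    # row.Index is the row label; df.at expects (label, column)
    df.at[row.Index, "FinishTime"] = starttime + datetime.timedelta(minutes=row.Time)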

What is the best way to calculate the month difference between two dates with a format like yyyyMM in pyspark?

In my python df, I have columns MTH, old_dt
MTH old_dt
201901 2018-03-01
201902 2017-02-20
201903 2016-05-12
To calculate the month difference between the two columns, I use the following Python code:
df['mth'] = pd.to_datetime(df['MTH'], format='%Y%m')
df = df.assign(
    dif=(df.mth.dt.year - df.old_dt.dt.year) * 12 +
        (df.mth.dt.month - df.old_dt.dt.month) + 1
)
The result is an integer, which is exactly what I want.
Now, since my dataset is huge (more than 1 billion records), I have decided to move to pyspark, but I'm not sure how this works there. I searched online and saw a function month_difference, but it does not seem to be what I want.
Thanks for any help, and thanks Jens for the editing.
My expected output is :
MTH old_dt dif
201901 2018-03-01 11
201902 2017-02-20 25
201903 2016-05-12 35
Will this work, please? I was not able to open my AE to test it.
from pyspark.sql import functions as F

def mth_interval(df):
    df = df.withColumn("mth", F.to_date("MTH", "yyyyMM"))
    df = df.withColumn("month_diff", (F.year("mth") - F.year("old_dt")) * 12 +
                                     (F.month("mth") - F.month("old_dt")) + 1)
    return df
Thanks! Just tested it and it worked!
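For reference, Spark also ships a built-in months_between; truncating old_dt to the start of its month reproduces the same whole-month convention as the pandas code (a sketch, untested against the real data):
from pyspark.sql import functions as F

df = df.withColumn("mth", F.to_date("MTH", "yyyyMM"))
df = df.withColumn("month_diff",
                   (F.months_between("mth", F.trunc("old_dt", "month")) + 1).cast("int"))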

Finding start time and end time in a column

I have a data set that has employees clocking in and out. It looks like this (note two entries per employee):
Employee Date Time
Emp1 1/1/16 06:00
Emp1 1/1/16 13:00
Emp2 1/1/16 09:00
Emp2 1/1/16 17:00
Emp3 1/1/16 11:00
Emp3 1/1/16 18:00
I want to get the data to look like this:
Employee Date Start End
Emp1 1/1/16 06:00 13:00
Emp2 1/1/16 09:00 17:00
Emp3 1/1/16 11:00 18:00
I would like to get it into a data frame format so that I can do some calculations.
I currently have tried
df['start'] = np.where((df['employee']==df['employee']&df['date']==df['date']),df['time'].min())
I also tried:
df.groupby(['employee','date'])['time'].max()
How do I get two columns out of one?
I would recommend merging Date and Time into one DateTime column; that would greatly simplify your work. You can do something like this:
df['DateTime']=pd.to_datetime(df['Date']+" "+df['Time'])
df.groupby('Employee')['DateTime'].agg([min, max])
There are other options depending on the content of your data. If you know that all the entries fall on the same day, you can simply do:
# First convert Date and Time columns to DateTime type
df['Date'] = pd.to_datetime(df['Date']).dt.date
df['Time'] = pd.to_datetime(df['Time']).dt.time
df.groupby('Employee').agg([min, max])
no need to create a DateTime column in this case.
If you want the start and end times per employee per day, you can do:
# First convert Date and Time columns to DateTime type
df['Date'] = pd.to_datetime(df['Date']).dt.date
df['Time'] = pd.to_datetime(df['Time']).dt.time
df.groupby(['Employee','Date'])['Time'].agg([min, max])
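To get column names matching the desired Start/End layout, named aggregation is handy (a sketch built on the converted columns above):
out = (df.groupby(['Employee', 'Date'])['Time']
         .agg(Start='min', End='max')
         .reset_index())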