Apply a value to max values in a groupby - python-3.x

I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1

The idea is to write an anonymous function that operates on each of your groups and feed this to your groupby using apply:
df['Value']=df.groupby('ID',as_index=False).apply(lambda x : x.Time == max(x.Time)).values

Assuming that your 'Time' column is already a datetime64 then you want to groupby on 'ID' column and then call transform to apply a lambda to create a series with an index aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is because the dtype returned is a datetime for some reason rather than a boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]

Related

How to check if dates in a pandas column are after a date

I have a pandas dataframe
date
0 2010-03
1 2017-09-14
2 2020-10-26
3 2004-12
4 2012-04-01
5 2017-02-01
6 2013-01
I basically want to filter where dates are after 2015-12 (Dec 2015)
To get this:
date
0 2017-09-14
1 2020-10-26
2 2017-02-01
I tried this
df = df[(df['date']> "2015-12")]
but I'm getting an error
ValueError: Wrong number of items passed 17, placement implies 1
First for me working solution correct:
df = df[(df['date']> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
If convert to datetimes, which should be more robust for me working too:
df = df[(pd.to_datetime(df['date'])> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Detail:
print (pd.to_datetime(df['date']))
0 2010-03-01
1 2017-09-14
2 2020-10-26
3 2004-12-01
4 2012-04-01
5 2017-02-01
6 2013-01-01
Name: date, dtype: datetime64[ns]

how to reformate the index column in a dataframe?

I have the following dataframe:
0 1
0 0.224960 -1.376689
1 0.059706 -1.330823
2 -0.133850 -1.251549
3 -0.234644 -1.190972
4 -0.281469 -1.156635
... ... ...
295 0.655912 -1.040209
296 0.618599 -1.068238
297 0.594964 -1.109484
298 0.578758 -1.151496
299 0.570207 -1.179523
I added the index as a column and then generate fake time from this column like that:
df['timestamp'] = df.index
# convert the column (it's a string) to datetime type
datetime_series = pd.to_datetime(df['timestamp'])
# create datetime index passing the datetime series
datetime_index = pd.DatetimeIndex(datetime_series.values)
# set timestamp as datframe index
df=df.set_index('timestamp')
df
The result is:
0 1
timestamp
1970-01-01 00:00:00.000000000 0.224960 -1.376689
1970-01-01 00:00:00.000000001 0.059706 -1.330823
1970-01-01 00:00:00.000000002 -0.133850 -1.251549
1970-01-01 00:00:00.000000003 -0.234644 -1.190972
1970-01-01 00:00:00.000000004 -0.281469 -1.156635
... ... ...
1970-01-01 00:00:00.000000295 0.655912 -1.040209
1970-01-01 00:00:00.000000296 0.618599 -1.068238
1970-01-01 00:00:00.000000297 0.594964 -1.109484
1970-01-01 00:00:00.000000298 0.578758 -1.151496
1970-01-01 00:00:00.000000299 0.570207 -1.179523
I want the timestamp to be like 1970-01-01 00:00:00 then 1970-01-01 00:00:01 and so on.
This will be correct answer.
d = { i:"1970-01-01 00:00:00.{:0>9}".format(i) for i in df.index}
df.index = pd.Series(df.index).replace(d)
Tested like following:
tem = pd.DataFrame({'0':[1,2,3,4,5],'1':[3,4,5,6,7]},columns=['0','1'])
d = {0:"asdf",1:"asdf",2:"sdfs",3:"sdfs",4:"sdfs"}
tem.index = pd.Series(tem.index).replace(d)
tem print:
0 1
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
d print:
{0: 'asdf', 1: 'asdf', 2: 'sdfs', 3: 'sdfs', 4: 'sdfs'}
result tem print;
0 1
asdf 1 3
asdf 2 4
sdfs 3 5
sdfs 4 6
sdfs 5 7

Find no of days gap for a specific ID when it has a flag X in other column

I want to calculate the no. of days gap for when the 'flag' column is equal to 'X' for same IDs.
The dataframe that I have:
ID Date flag
1 1-1-2020 X
1 10-1-2020 null
1 15-1-2020 X
2 1-2-2020 X
2 10-2-2020 X
2 15-2-2020 X
3 15-2-2020 null
The dataframe I want:
ID Date flag no_of_days
1 1-1-2020 X 14
1 10-1-2020 null null
1 15-1-2020 X null
2 1-2-2020 X 9
2 10-2-2020 X 8
2 18-2-2020 X null
3 15-2-2020 null null
Thanks in advance.
First filter rows by X in boolean indexing and then subtract shifted column per groups by DataFrameGroupBy.shift and last convert timedeltas to days by Series.dt.days:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['new'] = df[df['flag'].eq('X')].groupby('ID')['Date'].shift(-1).sub(df['Date']).dt.days
print (df)
ID Date flag new
0 1 2020-01-01 X 14.0
1 1 2020-01-10 NaN NaN
2 1 2020-01-15 X NaN
3 2 2020-02-01 X 9.0
4 2 2020-02-10 X 8.0
5 2 2020-02-18 X NaN
6 3 2020-02-15 NaN NaN

Groupby dates quaterly in a pandas dataframe and find count for their occurence

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrence quaterly. Desired Output -
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First are dates converted to datetimes and then is used DataFrame.resample with on for get column with datetimes:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution is possible change to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
.size()
.reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64

Convert 6 digits date format to standard one in Pandas

I'm working with a dataframe has one messy date column with irregular format, ie:
date
0 19.01.01
1 19.02.01
2 1991/01/01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Is it possible convert it to standard format XXXX-XX-XX, which represents year-month-date? Thank you.
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Use pd.to_datetime with yearfirst=True
Ex:
df = pd.DataFrame({"date": ['19.01.01', '19.02.01', '1991/01/01', '1996-01-01', '1996-06-30', '1995-12-31', '1997-01-01']})
df['date'] = pd.to_datetime(df['date'], yearfirst=True).dt.strftime("%Y-%m-%d")
print(df)
Output:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
It depends of format, the most general solution is specify each format and use Series.combine_first:
date1 = pd.to_datetime(df['date'], format='%y.%m.%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%Y/%m/%d', errors='coerce')
date3 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
df['date'] = date1.combine_first(date2).combine_first(date3)
print (df)
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Try the following
df['date'].replace('\/|.','-', regex=True)
Use pd.to_datetime()
pd.to_datetime(df['date])
Output:
0 2001-01-19
1 2001-02-19
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Name: 0, dtype: datetime64[ns]

Resources