Finding the maximum value in a Pandas column for multiple rows - python-3.x

Please help me with the following example.
I have this DataFrame:
import pandas as pd

data = {'Client': ['1', '2', '3', '3', '3', '4'],
        'date1': ['2019-11-07', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-11'],
        'date2': ['2019-11-01', '2019-11-02', '2019-11-06', '2019-11-07', '2019-11-10', '2019-11-15']}
df = pd.DataFrame(data)
For each client I need to create a column with the maximum date2 value that is still less than that client's date1.
For example, for client 3 I need to get 2019-11-07.
Can this be done with a lambda function?

First use boolean indexing with Series.lt to keep only rows where date2 is less than date1, then get the index of the maximum date2 value per client with DataFrameGroupBy.idxmax and select those rows with loc:
df[['date1', 'date2']] = df[['date1', 'date2']].apply(pd.to_datetime)
df1 = df.loc[df[df['date2'].lt(df['date1'])].groupby('Client')['date2'].idxmax()]
print (df1)
  Client      date1      date2
0      1 2019-11-07 2019-11-01
1      2 2019-11-08 2019-11-02
3      3 2019-11-08 2019-11-07
Another solution: filter with DataFrame.query, sort with DataFrame.sort_values and remove duplicates with DataFrame.drop_duplicates:
df1 = (df.query('date2 < date1')
         .sort_values(['Client', 'date2'], ascending=[True, False])
         .drop_duplicates('Client'))
print (df1)
  Client      date1      date2
0      1 2019-11-07 2019-11-01
1      2 2019-11-08 2019-11-02
3      3 2019-11-08 2019-11-07
EDIT:
The last step is to map the values back to the original DataFrame with Series.map:
df['date2'] = df['Client'].map(df1.set_index('Client')['date2'])
print (df)
  Client      date1      date2
0      1 2019-11-07 2019-11-01
1      2 2019-11-08 2019-11-02
2      3 2019-11-08 2019-11-07
3      3 2019-11-08 2019-11-07
4      3 2019-11-08 2019-11-07
5      4 2019-11-11        NaT
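To answer the lambda question: yes, the same logic can be written with a lambda via GroupBy.apply and mapped back, though apply is typically slower than the vectorized solutions above. A minimal sketch, assuming the columns were already converted to datetimes as in the first step:
# lambda per group: max of date2 values that are below date1
s = df.groupby('Client').apply(lambda g: g.loc[g['date2'].lt(g['date1']), 'date2'].max())
df['date2'] = df['Client'].map(s)
For client 4 the filtered group is empty, so max() returns NaT, matching the output above.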

Related

How to check if dates in a pandas column are after a date

I have a pandas dataframe
date
0 2010-03
1 2017-09-14
2 2020-10-26
3 2004-12
4 2012-04-01
5 2017-02-01
6 2013-01
I basically want to filter where dates are after 2015-12 (Dec 2015)
To get this:
date
0 2017-09-14
1 2020-10-26
2 2017-02-01
I tried this
df = df[(df['date']> "2015-12")]
but I'm getting an error
ValueError: Wrong number of items passed 17, placement implies 1
First, your solution works correctly for me:
df = df[(df['date']> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Converting to datetimes, which should be more robust, works for me too:
df = df[(pd.to_datetime(df['date'])> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Detail:
print (pd.to_datetime(df['date']))
0 2010-03-01
1 2017-09-14
2 2020-10-26
3 2004-12-01
4 2012-04-01
5 2017-02-01
6 2013-01-01
Name: date, dtype: datetime64[ns]
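One detail worth hedging: after conversion, the string "2015-12" is parsed as the timestamp 2015-12-01, so dates from later in December 2015 would also pass the filter. If "after 2015-12" is meant as strictly after December 2015, an explicit month-end cutoff is clearer (a small sketch under that assumption):
df = df[pd.to_datetime(df['date']) > pd.Timestamp('2015-12-31')]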

Groupby dates quarterly in a pandas dataframe and find the count of their occurrence

My DataFrame "dataframe_time" looks like:
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrence per quarter. Desired output -
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First convert the dates to datetimes, then use DataFrame.resample with the on parameter to select the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
                    .size()
                    .reset_index(name='Count'))
print (df)
  INSERTED_UTC  Count
0   2018-03-31      3
1   2018-06-30      5
2   2018-09-30      1
3   2018-12-31      1
4   2019-03-31      1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
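If quarter-end dates like 2018-03-31 from the desired output are preferred over period labels like 2018Q1, the PeriodIndex can be converted back to timestamps. A sketch, assuming the value_counts result above (newer pandas versions return end-of-day timestamps from how='end', hence the normalize):
s = df.INSERTED_UTC.dt.to_period('Q').value_counts().sort_index()
s.index = s.index.to_timestamp(how='end').normalize()  # period labels -> quarter-end dates
print (s)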

Pandas Multiple Conditional Mean With Group By

New to python and pandas. I have a pandas DataFrame with a list of customer data, which includes customer name, reporting month and performance. I'm trying to get the first recorded performance for each customer.
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 7CGC 2020-01-01 1.125000
2 ACC 2019-11-01 1.216802
3 ACBH 2019-05-01 0.916667
4 ACBH 2019-06-01 0.893333
5 AKC 2019-10-01 4.163636
6 AKC 2019-11-01 3.915215
Desired output
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 ACC 2019-11-01 1.216802
2 ACBH 2019-05-01 0.916667
3 AKC 2019-10-01 4.163636
Use DataFrame.sort_values with GroupBy.first or DataFrame.drop_duplicates:
df.sort_values('ReportingMonth').groupby('CustomerName', as_index=False).first()
or
new_df = df.sort_values('ReportingMonth').drop_duplicates('CustomerName', keep='first')
print(new_df)
Output
CustomerName ReportingMonth Performance
3 ACBH 2019-05-01 0.916667
5 AKC 2019-10-01 4.163636
2 ACC 2019-11-01 1.216802
0 7CGC 2019-12-01 1.175000
If it is already sorted, you don't need to sort again.
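A third option that avoids sorting entirely: take, per customer, the index of the earliest ReportingMonth with GroupBy.idxmin and select those rows. A sketch, assuming ReportingMonth is converted to datetimes first:
# index of the earliest reporting month per customer
df['ReportingMonth'] = pd.to_datetime(df['ReportingMonth'])
new_df = df.loc[df.groupby('CustomerName')['ReportingMonth'].idxmin()]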

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' value depends on the corresponding hour's value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
Then assign back to the new column:
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
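Another vectorized route is Series.map keyed on the hour, which avoids both the join and the temporary column. A sketch, assuming df_lookup has a single value column as in the question (squeezed to a Series indexed by hour):
multipliers = df_lookup.squeeze()  # one-column DataFrame -> Series indexed by hour 0-23
df['data-3'] = df['data-2'] * pd.Series(df.index.hour, index=df.index).map(multipliers)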

Indicate whether datetime of row is in a daterange

I'm trying to get dummy variables for holidays in a dataset. I have a couple of date ranges (pd.date_range()) with holidays and a dataframe to which I would like to append a dummy indicating whether the datetime of each row falls in one of the specified holiday ranges.
Small example:
import numpy as np
import pandas as pd

ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if condition is never true, so matching the dates doesn't work. Is there any way to do this? Alternative methods to get dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
Better is to use dt.date with Series.isin, and last convert the boolean mask to integers (True becomes 1):
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare with a DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = df.index.date.isin(ChristmasBreak).astype(int)
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print (df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
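One caveat: isin on a DatetimeIndex compares full timestamps, so with an hourly index only the midnight rows would match the plain dates in ChristmasBreak. Normalizing the index first sidesteps that (a sketch under the setup above):
# drop the time component before membership test
df['Christmas'] = df.index.normalize().isin(ChristmasBreak).astype(int)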
