Pandas Multiple Conditional Mean With Group By - python-3.x

New to python and pandas. I have a pandas DataFrame with a list of customer data, which includes customer name, reporting month and performance. I'm trying to get the first recorded performance for each customer.
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 7CGC 2020-01-01 1.125000
2 ACC 2019-11-01 1.216802
3 ACBH 2019-05-01 0.916667
4 ACBH 2019-06-01 0.893333
5 AKC 2019-10-01 4.163636
6 AKC 2019-11-01 3.915215
Desired output
CustomerName ReportingMonth Performance
0 7CGC 2019-12-01 1.175000
1 ACC 2019-11-01 1.216802
2 ACBH 2019-05-01 0.916667
3 AKC 2019-10-01 4.163636

Use DataFrame.sort_values with GroupBy.first or DataFrame.drop_duplicates:
df.sort_values('ReportingMonth').groupby('CustomerName', as_index=False).first()
or
new_df = df.sort_values('ReportingMonth').drop_duplicates('CustomerName', keep='first')
print(new_df)
Output
CustomerName ReportingMonth Performance
3 ACBH 2019-05-01 0.916667
5 AKC 2019-10-01 4.163636
2 ACC 2019-11-01 1.216802
0 7CGC 2019-12-01 1.175000
If the data is already sorted by ReportingMonth, you don't need to sort again.
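As a further sketch (not part of the original answer), you could also pick the row with the earliest ReportingMonth per customer via GroupBy.idxmin; this assumes the column can be parsed by pd.to_datetime:
import pandas as pd

# parse the month column so idxmin compares real dates rather than strings
df['ReportingMonth'] = pd.to_datetime(df['ReportingMonth'])
# index label of the earliest ReportingMonth per customer, then select those rows
first_rows = df.loc[df.groupby('CustomerName')['ReportingMonth'].idxmin()]
print(first_rows)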

Related

How to check if dates in a pandas column are after a date

I have a pandas dataframe
date
0 2010-03
1 2017-09-14
2 2020-10-26
3 2004-12
4 2012-04-01
5 2017-02-01
6 2013-01
I basically want to filter where dates are after 2015-12 (Dec 2015)
To get this:
date
0 2017-09-14
1 2020-10-26
2 2017-02-01
I tried this
df = df[(df['date']> "2015-12")]
but I'm getting an error
ValueError: Wrong number of items passed 17, placement implies 1
First, your original solution works correctly for me:
df = df[(df['date']> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
If you convert the column to datetimes, which should be more robust, it also works for me:
df = df[(pd.to_datetime(df['date'])> "2015-12")]
print (df)
date
1 2017-09-14
2 2020-10-26
5 2017-02-01
Detail:
print (pd.to_datetime(df['date']))
0 2010-03-01
1 2017-09-14
2 2020-10-26
3 2004-12-01
4 2012-04-01
5 2017-02-01
6 2013-01-01
Name: date, dtype: datetime64[ns]
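Note that the string "2015-12" is interpreted as 2015-12-01, so dates inside December 2015 would also pass the filter. If "after Dec 2015" should mean strictly later than the whole month, a minimal sketch (not from the original answer) compares against the end of the month instead:
import pandas as pd

# parse once, then keep only dates after the last day of December 2015
dates = pd.to_datetime(df['date'])
df_after = df[dates > pd.Timestamp('2015-12-31')]
print(df_after)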

Display row with False values in validated pandas dataframe column [duplicate]

I was validating the 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True 17984
False 13
Name: Price, dtype: int64
And I'd love to display the rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean mask, which can be used to select the rows that cause errors.
This catches rows that are already NaN as well as rows containing non-numeric strings.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
If you update the Price column in place with .to_numeric() and then check for NaN, the original values are overwritten, so you can no longer see why those rows had to be coerced:
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64

Select top n columns based on another column

I have a dataframe like the one mimicked in the answer below, and I would like to filter it down to 2 rows per date, keeping the rows with the highest population for each date.
I know that pandas offers a method called nlargest:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
I have mimicked your dataframe below and provided a way forward to get the desired output; hope that helps.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
Your Desired Solution:
You can use the nlargest method along with set_index and groupby.
This is what you will get:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Now, to get the DataFrame back into its usual shape, reset the index, which gives you the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another way around:
With groupby and apply, use reset_index with drop=True and the level parameter:
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
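An alternative sketch (not part of the original answer): sort by population in descending order and keep the first 2 rows per date with GroupBy.head, which leaves the original columns untouched:
# sort so the largest populations come first, then take 2 rows per Date
out = (df.sort_values('population', ascending=False)
         .groupby('Date')
         .head(2)
         .sort_values('Date'))
print(out)
Unlike the nlargest-based versions, this keeps the original row index; chain reset_index(drop=True) if you want a clean 0..n index.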

Groupby dates quarterly in a pandas dataframe and find the count of their occurrence

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and find the count of their occurrence quarterly. Desired Output -
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together, but it raises an error:
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First, convert the dates to datetimes and then use DataFrame.resample with the on parameter to specify the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
.size()
.reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
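The desired output in the question also lists a trailing quarter (2019-06-30) with a count of 0, which none of the snippets above produce because the data ends in 2019Q1. A hedged sketch that extends the counts over an explicit quarter range (the 2019Q2 end point is an assumption taken from the desired output) could look like this:
import pandas as pd

# quarterly counts as above, sorted chronologically
counts = df.INSERTED_UTC.dt.to_period('Q').value_counts().sort_index()
# build the full range of quarters and fill the missing ones with 0
full_range = pd.period_range(counts.index.min(), '2019Q2', freq='Q')
counts = counts.reindex(full_range, fill_value=0)
print(counts)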

Finding the maximum value in a Pandas column for multiple rows

Please help me with the following example.
I have DataFrame:
data ={'Сlient':['1', '2', '3', '3', '3', '4'], \
'date1':['2019-11-07', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-11'], \
'date2':['2019-11-01', '2019-11-02', '2019-11-06', '2019-11-07', '2019-11-10', '2019-11-15'] }
df =pd.DataFrame(data)
I need to create a column with a date that, for each client, is the maximum date2 within that client's group, and it must be less than that client's date1.
For example, for client 3 I need to get 2019-11-07.
Can this be done with a lambda function?
First use boolean indexing with Series.lt to filter the rows where date2 is less than date1, then get the index values of the maximum date2 per client with DataFrameGroupBy.idxmax and select them with loc:
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df1 = df.loc[df[df['date2'].lt(df['date1'])].groupby('Сlient')['date2'].idxmax()]
print (df1)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
3 3 2019-11-08 2019-11-07
Another solution: filter with DataFrame.query, sort with DataFrame.sort_values and remove duplicates with DataFrame.drop_duplicates:
df1 = (df.query('date2 < date1')
         .sort_values(['Сlient','date2'], ascending=[True, False])
         .drop_duplicates('Сlient'))
print (df1)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
3 3 2019-11-08 2019-11-07
EDIT:
The last step is to use Series.map:
df['date2'] = df['Сlient'].map(df1.set_index('Сlient')['date2'])
print (df)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
2 3 2019-11-08 2019-11-07
3 3 2019-11-08 2019-11-07
4 3 2019-11-08 2019-11-07
5 4 2019-11-11 NaT
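To answer the lambda part of the question: yes, a sketch using groupby with a lambda (assuming date1 and date2 have already been converted to datetimes as above; the date2_max column name is just for illustration) could look like this:
# per client, take the largest date2 that is still below that client's date1;
# clients with no qualifying row get NaT
s = df.groupby('Сlient').apply(lambda g: g.loc[g['date2'].lt(g['date1']), 'date2'].max())
df['date2_max'] = df['Сlient'].map(s)
print(df)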
