Pandas - How do you groupby multiple columns and get the lowest value? - python-3.x

I have a data frame with 75+ columns. I am trying to eliminate rows and keep only the relevant ones for a test. I created a sample data set below. I know how I would tackle this in SQL with GROUP BY while still getting all the columns. How do I do this here? I have posted one of my many tries that made sense to me.
u_id = ['A123','A123','A123','A124','A124','A125']
year = [2016,2017,2018,2018,1997,2015]
text = ['text1','text2','text1','text1','text56','text100']
df = pd.DataFrame({'u_id': u_id,'year': year,'text':text})
df
Data Input
u_id year text
0 A123 2016 text1
1 A123 2017 text2
2 A123 2018 text1
3 A124 2018 text1
4 A124 1997 text56
5 A125 2015 text100
Tried:
df[df.groupby(['u_id','year'])['year'].min()]
# error: `KeyError: '[2016 2017 2018 1997 2018 2015] not in index'`
# The key exists here, so why the error? Is there a GROUP BY/HAVING equivalent?
Output Needed:
u_id year text ... col1 col2 ..... col_x
A123 2016 text1 ...
A124 1997 text56 ...
A125 2015 text100 ...

I think what you need is to group by u_id and keep the row with the minimum year. (Your attempt raises KeyError because groupby(...)['year'].min() returns year values, which df[...] then tries to look up as column labels.)
df["year"] = pd.to_numeric(df["year"])
# idxmin() gives the index label of each group's minimum-year row;
# .loc pulls those full rows back, all columns included.
newdf = df.loc[df.groupby(['u_id'])['year'].idxmin()].reset_index(drop=True)
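As a self-contained check (sample data from the question; the sort/drop_duplicates variant is an alternative of mine, not from the original answer):

```python
import pandas as pd

u_id = ['A123', 'A123', 'A123', 'A124', 'A124', 'A125']
year = [2016, 2017, 2018, 2018, 1997, 2015]
text = ['text1', 'text2', 'text1', 'text1', 'text56', 'text100']
df = pd.DataFrame({'u_id': u_id, 'year': year, 'text': text})

# idxmin() returns the index label of each group's minimum-year row;
# .loc then pulls back the full rows, every column included.
newdf = df.loc[df.groupby('u_id')['year'].idxmin()].reset_index(drop=True)

# An equivalent route: sort by year and keep the first row per u_id.
alt = (df.sort_values('year')
         .drop_duplicates('u_id')
         .sort_values('u_id')
         .reset_index(drop=True))
```

Both return one row per u_id with its earliest year (A123/2016, A124/1997, A125/2015).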


Filter and display all duplicated rows based on multiple columns in Pandas [duplicate]

Given a dataset as follows:
name month year
0 Joe December 2017
1 James January 2018
2 Bob April 2018
3 Joe December 2017
4 Jack February 2018
5 Jack April 2018
I need to filter and display all duplicated rows based on columns month and year in Pandas.
With the code below, I get:
df = df[df.duplicated(subset = ['month', 'year'])]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
Out:
name month year
3 Joe December 2017
5 Jack April 2018
But I want the result as follows:
name month year
0 Joe December 2017
1 Joe December 2017
2 Bob April 2018
3 Jack April 2018
How could I do that in Pandas?
The following code works, by adding keep = False:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
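A runnable sketch of the fix (sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Joe', 'James', 'Bob', 'Joe', 'Jack', 'Jack'],
    'month': ['December', 'January', 'April', 'December', 'February', 'April'],
    'year':  [2017, 2018, 2018, 2017, 2018, 2018],
})

# keep=False marks *every* member of a duplicate group, not just the
# second and later occurrences (the default keep='first' behavior).
dupes = df[df.duplicated(subset=['month', 'year'], keep=False)]
dupes = dupes.sort_values(by=['name', 'month', 'year'], ascending=False)
```

This keeps all four rows involved in a (month, year) duplicate: both Joe/December/2017 rows plus Bob and Jack in April 2018.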

Convert column values into rows in the order in which columns are present

Below is a sample dataframe I have. I need to convert each row into multiple rows based on month.
df = pd.DataFrame({'Jan': [100,200,300],
'Feb': [400,500,600],
'March':[700,800,900],
})
Desired output :
Jan 100
Feb 400
March 700
Jan 200
Feb 500
March 800
Jan 300
Feb 600
March 900
I tried the pandas melt function, but it groups all the Jan values together, then Feb, then March: three rows for Jan, then three for Feb, and the same for March. I want the interleaved output above. Could someone please help?
Use DataFrame.stack, then clean up with Series.reset_index and Series.rename_axis:
df1 = (df.stack()
.reset_index(level=0, drop=True)
.rename_axis('months')
.reset_index(name='val'))
Or use numpy: flatten the values row by row and repeat the column names with numpy.tile:
df1 = pd.DataFrame({'months': np.tile(df.columns, len(df)),
                    'val': df.values.ravel()})
print (df1)
months val
0 Jan 100
1 Feb 400
2 March 700
3 Jan 200
4 Feb 500
5 March 800
6 Jan 300
7 Feb 600
8 March 900
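The stack approach works because stack() walks the frame row by row, emitting (row, column) pairs in exactly the interleaved order wanted. For reference, melt can also produce this order if the original row index is kept as the primary sort key (my variation, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Jan': [100, 200, 300],
                   'Feb': [400, 500, 600],
                   'March': [700, 800, 900]})

# stack() interleaves Jan/Feb/March per original row.
df1 = (df.stack()
         .reset_index(level=0, drop=True)
         .rename_axis('months')
         .reset_index(name='val'))

# melt() groups by column, so restore row order by sorting on the
# preserved original index before dropping it.
df2 = (df.reset_index()
         .melt(id_vars='index', var_name='months', value_name='val')
         .sort_values('index', kind='stable')
         .drop(columns='index')
         .reset_index(drop=True))
```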

Binning with a months column

I have a data frame with the fields casenumber, count, and CREATEDDATE, where CREATEDDATE holds the month. I want to arrange the counts into ranges (bins) per month. The code I used below did not match my requirement.
My data frame is as below:
casenumber count CREATEDDATE
3820516 1 jan
3820547 1 jan
3820554 2 feb
3820562 1 feb
3820584 1 march
4226616 1 april
4226618 2 may
4226621 2 may
4226655 1 june
4226663 1 june
Here is the code I used, which did not match my requirement:
import pandas as pd
import numpy as np
df = pd.read_excel(r"")
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = df.groupby(pd.cut(df['CREATEDDATE'],bins,labels=names))['casenumber'].size().reset_index(name='No_of_times_statuschanged')
CREATEDDATE No_of_times_statuschanged
0 0-1 2092
1 1-4 9062
2 4-8 12578
3 8-15 3858
4 15+ 0
I got the above data as output, but what I expect is each range broken out month by month, based on the cases per month.
The expected output should look like:
CREATEDDATE jan feb march april may june
0-1 1 2 3 4 5 6
1-4 3 0 6 7 8 9
4-8 4 6 3 0 9 2
8-15 0 3 4 5 8 9
Use crosstab, passing count (not CREATEDDATE) to pd.cut, and reorder the columns by subsetting with a list of column names:
#add another months if necessary
months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names), df['CREATEDDATE'])[months]
print (df1)
CREATEDDATE jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
Another idea is to use an ordered categorical:
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names),
pd.Categorical(df['CREATEDDATE'], ordered=True, categories=months))
print (df1)
col_0 jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
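A self-contained version of the first solution, with the sample data from the question reconstructed inline (the real data comes from pd.read_excel, so this is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'casenumber': [3820516, 3820547, 3820554, 3820562, 3820584,
                   4226616, 4226618, 4226621, 4226655, 4226663],
    'count':      [1, 1, 2, 1, 1, 1, 2, 2, 1, 1],
    'CREATEDDATE': ['jan', 'jan', 'feb', 'feb', 'march',
                    'april', 'may', 'may', 'june', 'june'],
})

months = ['jan', 'feb', 'march', 'april', 'may', 'june']
bins = [0, 1, 4, 8, 15, np.inf]
names = ['0-1', '1-4', '4-8', '8-15', '15+']

# Bin the numeric count, cross-tabulate against the month, and
# reorder the month columns chronologically via the subset.
df1 = pd.crosstab(pd.cut(df['count'], bins, labels=names),
                  df['CREATEDDATE'])[months]
```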

Pandas Top n % of grouped sum

I work for a company and am trying to calculate which products produced the top 80% of Gross Revenue in different years.
Here is a short example of my data:
Part_no Revision Gross_Revenue Year
1 a 1 2014
2 a 2 2014
3 c 2 2014
4 c 2 2014
5 d 2 2014
I've been looking through various answers, and here's the best code I could come up with, but it is not working:
df1 = df[['Year', 'Part_No', 'Revision', 'Gross_Revenue']]
df1 = df1.groupby(['Year', 'Part_No','Revision']).agg({'Gross_Revenue':'sum'})
# print(df1.head())
a = 0.8
df2 = (df1.sort_values('Gross_Revenue', ascending = False)
.groupby(['Year', 'Part_No', 'Revision'], group_keys = False)
.apply(lambda x: x.head(int(len(x) * a )))
.reset_index(drop = True))
print(df2)
I'm trying to have the code return, for each year, the top products that brought in 80% of our company's revenue. I suspect it's the old 80/20 rule.
Thank you for your help.
You can use cumsum, dividing each year's running revenue total by that year's total (note this assumes the rows are sorted by Gross_Revenue in descending order within each year):
df[df.groupby('Year').Gross_Revenue.cumsum().div(df.groupby('Year').Gross_Revenue.transform('sum'), axis=0) < 0.8]
Out[589]:
Part_no Revision Gross_Revenue Year
1 2 a 2 2014
2 3 c 2 2014
3 4 c 2 2014
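Since the cumsum trick depends on row order, here is a runnable sketch with the explicit descending sort made visible (the sort is my addition, not from the original answer; sample data from the question, single year):

```python
import pandas as pd

df = pd.DataFrame({
    'Part_no': [1, 2, 3, 4, 5],
    'Revision': ['a', 'a', 'c', 'c', 'd'],
    'Gross_Revenue': [1, 2, 2, 2, 2],
    'Year': [2014, 2014, 2014, 2014, 2014],
})

# Sort largest-first within each year so the running share really
# accumulates the *top* products first.
df = df.sort_values(['Year', 'Gross_Revenue'], ascending=[True, False])

# Running share of each year's total revenue.
share = (df.groupby('Year')['Gross_Revenue'].cumsum()
           .div(df.groupby('Year')['Gross_Revenue'].transform('sum')))
top = df[share < 0.8]
```

With this sample, parts 2, 3, and 4 accumulate to under 80% of the 2014 total, matching the output shown above.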

Making a list from a pandas column containing multiple values

Let's use this as an example data set:
Year Breeds
0 2009 Collie
1 2010 Shepherd
2 2011 Collie, Shepherd
3 2012 Shepherd, Retriever
4 2013 Shepherd
5 2014 Shepherd, Bulldog
6 2015 Collie, Retriever
7 2016 Retriever, Bulldog
I want to create a list dogs in which dogs contains the unique dog breeds Collie, Shepherd, Retriever, Bulldog. I know it is as simple as calling .unique() on the appropriate column, but I am running into the issue of having more than one value in the Breeds column. Any ideas to circumvent that?
Thanks!
If you need to extract all possible values, use split:
df['new'] = df['Breeds'].str.split(', ')
For unique values per row, convert to sets:
df['new'] = df['Breeds'].str.split(', ').apply(lambda x: list(set(x)))
Or use a list comprehension:
df['new'] = [list(set(x.split(', '))) for x in df['Breeds']]
Use findall with a regex alternation (|) built from a list if you want to extract only certain values:
L = ["Collie", "Shepherd", "Retriever", "Bulldog"]
df['new'] = df['Breeds'].str.findall('|'.join(L))
If duplicates are possible:
df['new'] = df['Breeds'].str.findall('|'.join(L)).apply(lambda x: list(set(x)))
print (df)
Year Breeds new
0 2009 Collie [Collie]
1 2010 Shepherd [Shepherd]
2 2011 Collie, Shepherd [Collie, Shepherd]
3 2012 Shepherd, Retriever [Shepherd, Retriever]
4 2013 Shepherd [Shepherd]
5 2014 Shepherd, Bulldog [Shepherd, Bulldog]
6 2015 Collie, Retriever [Collie, Retriever]
7 2016 Retriever, Bulldog [Retriever, Bulldog]
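The question actually asks for one flat list of unique breeds rather than a per-row column. A short way to get that is split + explode + unique (my suggestion building on the split approach above):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016],
    'Breeds': ['Collie', 'Shepherd', 'Collie, Shepherd',
               'Shepherd, Retriever', 'Shepherd',
               'Shepherd, Bulldog', 'Collie, Retriever',
               'Retriever, Bulldog'],
})

# Split each cell into a list, explode to one breed per row,
# then take the unique values across the whole column.
dogs = df['Breeds'].str.split(', ').explode().unique().tolist()
```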
