Filter and display all duplicated rows based on multiple columns in Pandas [duplicate]

Given a dataset as follows:
    name     month  year
0    Joe  December  2017
1  James   January  2018
2    Bob     April  2018
3    Joe  December  2017
4   Jack  February  2018
5   Jack     April  2018
I need to filter and display all duplicated rows based on columns month and year in Pandas.
With the code below, I get:
df = df[df.duplicated(subset=['month', 'year'])]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)
Out:
   name     month  year
3   Joe  December  2017
5  Jack     April  2018
But I want the result as follows:
   name     month  year
0   Joe  December  2017
1   Joe  December  2017
2   Bob     April  2018
3  Jack     April  2018
How could I do that in Pandas?

The following code works by adding keep=False, which marks every occurrence of each duplicated group instead of skipping the first:
df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)
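For reference, a minimal self-contained sketch of the keep=False approach on the sample data (the reset_index at the end is an assumption, added only to reproduce the 0..3 index shown in the desired output):

import pandas as pd

df = pd.DataFrame({
    'name': ['Joe', 'James', 'Bob', 'Joe', 'Jack', 'Jack'],
    'month': ['December', 'January', 'April', 'December', 'February', 'April'],
    'year': [2017, 2018, 2018, 2017, 2018, 2018],
})

# keep=False marks every row of each duplicated (month, year) group,
# not just the rows after the first occurrence
dupes = df[df.duplicated(subset=['month', 'year'], keep=False)]
dupes = dupes.sort_values(by=['name', 'month', 'year'], ascending=False)
print(dupes.reset_index(drop=True))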

Related

Look up a date value from each cell in a column and return a year date dependent upon where date falls between two dates

I want to add a formula to each cell in column B (starting at B2) that looks up the Policy Year: it reads the date in the corresponding cell in column A and checks which inception/expiry range in D2:E5 that date falls into. The Policy Year itself sits in C2:C5. Below, column B shows the values I'd expect the formula to draw from column C.
COLUMN A           COLUMN B (EXPECTED)  COLUMN C   COLUMN D           COLUMN E
2 April 2017       2016                 2016       5 December 2016    4 December 2017
5 June 2017        2016                 2017       5 December 2017    4 December 2018
6 December 2017    2017                 2018       5 December 2018    4 December 2019
4 January 2018     2017                 2019       5 December 2019    4 December 2020
6 August 2018      2017
4 December 2018    2017
29 December 2018   2018
6 March 2020       2019
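The question asks for a spreadsheet formula, but in pandas (the language used elsewhere on this page) the same lookup can be sketched with merge_asof, which matches each date to the latest inception date at or before it. The frame and column names below are hypothetical:

import pandas as pd

dates = pd.DataFrame({
    'date': pd.to_datetime([
        '2017-04-02', '2017-06-05', '2017-12-06', '2018-01-04',
        '2018-08-06', '2018-12-04', '2018-12-29', '2020-03-06',
    ])
})
policies = pd.DataFrame({
    'inception': pd.to_datetime(['2016-12-05', '2017-12-05',
                                 '2018-12-05', '2019-12-05']),
    'policy_year': [2016, 2017, 2018, 2019],
})

# merge_asof (direction='backward' by default) picks, for each date,
# the policy row whose inception is the latest one <= that date
result = pd.merge_asof(dates.sort_values('date'),
                       policies.sort_values('inception'),
                       left_on='date', right_on='inception')
print(result[['date', 'policy_year']])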

Pandas - How do you groupby multiple columns and get the lowest value?

I have a DataFrame with 75+ columns and am trying to keep only the relevant rows for a test, so I created a small sample data set. I know how I would tackle this in SQL with GROUP BY and still get all the columns; how do I do that here? I have posted one of my many tries that made sense to me.
import pandas as pd

u_id = ['A123', 'A123', 'A123', 'A124', 'A124', 'A125']
year = [2016, 2017, 2018, 2018, 1997, 2015]
text = ['text1', 'text2', 'text1', 'text1', 'text56', 'text100']
df = pd.DataFrame({'u_id': u_id, 'year': year, 'text': text})
df
Data Input
u_id year text
0 A123 2016 text1
1 A123 2017 text2
2 A123 2018 text1
3 A124 2018 text1
4 A124 1997 text56
5 A125 2015 text100
Tried:
df[df.groupby(['u_id','year'])['year'].min()]
# error: `KeyError: '[2016 2017 2018 1997 2018 2015] not in index'`
# Key exists here, why is this an error? 'groupby/having' in SQL?
Output Needed:
u_id year text ... col1 col2 ..... col_x
A123 2016 text1 ...
A124 1997 text56 ...
A125 2015 text100 ...
I think what you need is to group by u_id and keep the row with the minimum year. Your attempt fails because df.groupby(['u_id','year'])['year'].min() returns the year values themselves, and df[...] then looks those numbers up as column labels, hence the KeyError. Use idxmin to fetch the index of each group's minimum-year row instead:
df["year"] = pd.to_numeric(df["year"])
newdf = df.loc[df.groupby(['u_id'])['year'].idxmin()].reset_index(drop=True)
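On the sample data, idxmin picks index positions 0, 4 and 5 (the minimum year per u_id), so newdf should come out as:

   u_id  year     text
0  A123  2016    text1
1  A124  1997  text56
2  A125  2015  text100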

Summing a year's worth of data that spans two years in pandas

I have a DataFrame that contains data similar to this:
Name Date A B C
John 19/04/2018 10 11 8
John 20/04/2018 9 7 9
John 21/04/2018 22 15 22
… … … … …
John 16/04/2019 8 8 9
John 17/04/2019 10 11 18
John 18/04/2019 8 9 11
Rich 19/04/2018 18 7 6
… … … … …
Rich 18/04/2019 19 11 17
The data can start on any day and contains at least 365 days of data, sometimes more. What I want to end up with is a DataFrame like this:
Name Date Sum
John April 356
John May 276
John June 209
Rich April 452
I need to sum up all of the months to get a year's worth of data (April - March), handling the fact that part of April's total (in this example) comes from 2018 and part from 2019. I would also like to shift the days so the weekdays stay consecutive and follow on in sequence, so that rather than:
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 10 11 8 Thursday (was 19/04/2018)
John 20/04/2019 9 7 9 Friday (was 20/04/2018)
It becomes
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 9 7 9 Friday (was 20/04/2018)
Prior to summing to get the final DataFrame. Is this possible?
Additional information requested in comments
Here is a link to the initial data set https://github.com/stottp/exampledata/blob/master/SOExample.csv and the required output would be:
Name Month Total
John March 11634
John April 11470
John May 11757
John June 10968
John July 11682
John August 11631
John September 11085
John October 11924
John November 11593
John December 11714
John January 11320
John February 10167
Rich March 11594
Rich April 12383
Rich May 12506
Rich June 11112
Rich July 11636
Rich August 11303
Rich September 10667
Rich October 10992
Rich November 11721
Rich December 11627
Rich January 11669
Rich February 10335
Let's see if I understood correctly. By sum, I suppose you mean summing the values of columns ['A', 'B', 'C'] for each day and getting the monthly total.
If that's right, the first thing to do is parse ['Date'] and set it as the index so that the DataFrame is easier to work with:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)
Next, resample the frame from days to months while summing the values of ['A', 'B', 'C'], and store their combined total in a new ['Sum'] column:
monthly = df[['A', 'B', 'C']].resample('M').sum()
monthly['Sum'] = monthly[['A', 'B', 'C']].sum(axis=1)
monthly['Sum'].head()
Out[37]:
Date
2012-11-30 1956265
2012-12-31 2972076
2013-01-31 2972565
2013-02-28 2696121
2013-03-31 2970687
Freq: M, dtype: int64
The last part, squashing February of 2018 and February of 2019 together as if they were a single month, might yield from an outer merge of the two daily slices (reset_index is needed here because 'Date' is now the index):
df.loc['2019-02'].reset_index().merge(df.loc['2018-02'].reset_index(), how='outer', on=['Date', 'A', 'B', 'C'])
Test this last step and see if it works for you.
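As a further sketch toward the exact Name/Month/Total shape requested above: totalling A + B + C per row and grouping by person and month name folds the partial Aprils from 2018 and 2019 into one bucket. This assumes the linked CSV has been downloaded locally and has the Name/Date/A/B/C columns shown in the question:

import pandas as pd

df = pd.read_csv('SOExample.csv', parse_dates=['Date'], dayfirst=True)
df['Total'] = df[['A', 'B', 'C']].sum(axis=1)
df['Month'] = df['Date'].dt.month_name()

# group by person and calendar month; rows from April 2018 and April 2019
# land in the same 'April' bucket, matching the required output shape
out = df.groupby(['Name', 'Month'], sort=False)['Total'].sum().reset_index()
print(out)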
Cheers

Pandas Top n % of grouped sum

I work for a company and am trying to calculate which products produced the top 80% of Gross Revenue in different years.
Here is a short example of my data:
Part_no Revision Gross_Revenue Year
1 a 1 2014
2 a 2 2014
3 c 2 2014
4 c 2 2014
5 d 2 2014
I've been looking through various answers, and here's the best code I can come up with, but it is not working:
df1 = df[['Year', 'Part_No', 'Revision', 'Gross_Revenue']]
df1 = df1.groupby(['Year', 'Part_No','Revision']).agg({'Gross_Revenue':'sum'})
# print(df1.head())
a = 0.8
df2 = (df1.sort_values('Gross_Revenue', ascending = False)
.groupby(['Year', 'Part_No', 'Revision'], group_keys = False)
.apply(lambda x: x.head(int(len(x) * a )))
.reset_index(drop = True))
print(df2)
I'm trying to have the code return, for each year, all the top products that brought in 80% of our company's revenue.
I suspect it's the old 80/20 rule.
Thank you for your help,
Me
You can use cumsum. Sort by Gross_Revenue descending first, so the cumulative share builds from the largest earners, then keep rows while each year's running share of the total is below 0.8:
df = df.sort_values('Gross_Revenue', ascending=False)
df[df.groupby('Year').Gross_Revenue.cumsum().div(df.groupby('Year').Gross_Revenue.transform('sum')) < 0.8]
Out[589]:
   Part_no Revision  Gross_Revenue  Year
1        2        a              2  2014
2        3        c              2  2014
3        4        c              2  2014
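Put together as a self-contained sketch on the sample data: total revenue is 9, and the running share crosses 0.8 at the fourth sorted row, so only Part_no 2, 3 and 4 stay in:

import pandas as pd

df = pd.DataFrame({
    'Part_no': [1, 2, 3, 4, 5],
    'Revision': ['a', 'a', 'c', 'c', 'd'],
    'Gross_Revenue': [1, 2, 2, 2, 2],
    'Year': [2014, 2014, 2014, 2014, 2014],
})

# sort descending so the cumulative share builds from the biggest earners
df = df.sort_values('Gross_Revenue', ascending=False)
share = (df.groupby('Year')['Gross_Revenue'].cumsum()
           .div(df.groupby('Year')['Gross_Revenue'].transform('sum')))
print(df[share < 0.8])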

Making a list from a pandas column containing multiple values

Let's use this as an example data set:
Year Breeds
0 2009 Collie
1 2010 Shepherd
2 2011 Collie, Shepherd
3 2012 Shepherd, Retriever
4 2013 Shepherd
5 2014 Shepherd, Bulldog
6 2015 Collie, Retriever
7 2016 Retriever, Bulldog
I want to create a list dogs containing the unique dog breeds Collie, Shepherd, Retriever, Bulldog. I know this would be as simple as calling .unique() on the appropriate column, but I am running into the issue that the Breeds column can hold more than one value per cell. Any ideas to circumvent that?
Thanks!
If you need to extract all possible values, use split:
df['new'] = df['Breeds'].str.split(', ')
For unique values, convert to sets:
df['new'] = df['Breeds'].str.split(', ').apply(lambda x: list(set(x)))
Or use a list comprehension:
df['new'] = [list(set(x.split(', '))) for x in df['Breeds']]
Use findall with a regex built from the list (| means OR) if you want to extract only certain values:
L = ["Collie", "Shepherd", "Retriever", "Bulldog"]
df['new'] = df['Breeds'].str.findall('|'.join(L))
If duplicates are possible:
df['new'] = df['Breeds'].str.findall('|'.join(L)).apply(lambda x: list(set(x)))
print (df)
Year Breeds new
0 2009 Collie [Collie]
1 2010 Shepherd [Shepherd]
2 2011 Collie, Shepherd [Collie, Shepherd]
3 2012 Shepherd, Retriever [Shepherd, Retriever]
4 2013 Shepherd [Shepherd]
5 2014 Shepherd, Bulldog [Shepherd, Bulldog]
6 2015 Collie, Retriever [Collie, Retriever]
7 2016 Retriever, Bulldog [Retriever, Bulldog]
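If you only need the flat list of unique breeds rather than a new column, a minimal sketch (assuming pandas 0.25+ for Series.explode) is:

import pandas as pd

df = pd.DataFrame({
    'Year': [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016],
    'Breeds': ['Collie', 'Shepherd', 'Collie, Shepherd',
               'Shepherd, Retriever', 'Shepherd', 'Shepherd, Bulldog',
               'Collie, Retriever', 'Retriever, Bulldog'],
})

# split each cell into a list, explode to one breed per row, deduplicate
dogs = df['Breeds'].str.split(', ').explode().unique().tolist()
print(dogs)  # ['Collie', 'Shepherd', 'Retriever', 'Bulldog']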
