Pandas Identify duplicate records, create a new column and add the ID of first occurrence - python-3.x

I am a newbie in Python, so please have mercy on me :)
Let's say there is a DataFrame like this:
ID B C D E isDuplicated
1 Blue Green Blue Pink false
2 Red Green Red Green false
3 Red Orange Yellow Green false
4 Blue Pink Blue Pink false
5 Blue Orange Pink Green false
6 Blue Orange Pink Green true
7 Red Orange Yellow Green true
8 Red Orange Yellow Green true
If there are duplicate rows over the subset B, C, D, E, I would like to add another column, 'firstOccurred', which should hold the ID of the first occurrence.
My desired dataframe should look like this:
ID B C D E isDuplicated firstOccurred
1 Blue Green Blue Pink false
2 Red Green Red Green false
3 Red Orange Yellow Green false
4 Blue Pink Blue Pink false
5 Blue Orange Pink Green false
6 Blue Orange Pink Green true 5
7 Red Orange Yellow Green true 3
8 Red Orange Yellow Green true 3
I would be grateful for any help!
Thank you in advance!

Use GroupBy.transform with 'first' to get the ID of the first row in each group, then pass it to numpy.where so the value is kept only for rows where isDuplicated is True:
import numpy as np

df['firstOccurred'] = np.where(df['isDuplicated'],
                               df.groupby(['B','C','D','E'])['ID'].transform('first'),
                               np.nan)
print(df)
ID B C D E isDuplicated firstOccurred
0 1 Blue Green Blue Pink False NaN
1 2 Red Green Red Green False NaN
2 3 Red Orange Yellow Green False NaN
3 4 Blue Pink Blue Pink False NaN
4 5 Blue Orange Pink Green False NaN
5 6 Blue Orange Pink Green True 5.0
6 7 Red Orange Yellow Green True 3.0
7 8 Red Orange Yellow Green True 3.0
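If the isDuplicated flag is not already present, it can be derived in the same pass. The following is a minimal sketch, assuming the sample data from the question and assuming a nullable integer column is acceptable for firstOccurred (so the IDs show as 5 and 3 rather than 5.0 and 3.0); the variable names `cols` and `first_id` are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": ["Blue", "Red", "Red", "Blue", "Blue", "Blue", "Red", "Red"],
    "C": ["Green", "Green", "Orange", "Pink", "Orange", "Orange", "Orange", "Orange"],
    "D": ["Blue", "Red", "Yellow", "Blue", "Pink", "Pink", "Yellow", "Yellow"],
    "E": ["Pink", "Green", "Green", "Pink", "Green", "Green", "Green", "Green"],
})

cols = ["B", "C", "D", "E"]

# duplicated() with the default keep='first' marks every repeat except the first occurrence.
df["isDuplicated"] = df.duplicated(subset=cols)

# ID of the first row in each group, broadcast back to every row of that group.
first_id = df.groupby(cols)["ID"].transform("first")

# Keep the ID only on duplicated rows; Int64 is the nullable integer dtype.
df["firstOccurred"] = first_id.where(df["isDuplicated"]).astype("Int64")

print(df)
```

Series.where keeps values where the condition is True and inserts missing values elsewhere, which plays the same role as numpy.where with np.nan in the answer above.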

Related

Grouping By 2 Columns In Pandas Ignoring Order

I have a Dataframe in Pandas where there are 2 columns that are almost identical but not quite and hence sometimes I want to group by both columns ignoring the order.
As an example:
mydf = pd.DataFrame({'Colour1': ['Red', 'Red', 'Blue', 'Green', 'Blue'], 'Colour2': ['Red', 'Blue', 'Red', 'Blue', 'Green'], 'Rating': [4, 5, 7, 8, 2]})
Colour1 Colour2 Rating
0 Red Red 4
1 Red Blue 5
2 Blue Red 7
3 Green Blue 8
4 Blue Green 2
I would like to group by Colour1 and Colour2 whilst ignoring the order and then transforming the Dataframe by taking the mean to produce the following Dataframe:
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4
1 Red Blue 5 6
2 Blue Red 7 6
3 Green Blue 8 5
4 Blue Green 2 5
Is there a good way of doing this? Thanks in advance.
You can first sort Colour1 and Colour2 within each row using np.sort, then group by the resulting tuples:
s = pd.Series(map(tuple,np.sort(mydf[['Colour1','Colour2']],axis=1)),index=mydf.index)
mydf['MeanRating'] = mydf['Rating'].groupby(s).transform('mean')
print(mydf)
Colour1 Colour2 Rating MeanRating
0 Red Red 4 4
1 Red Blue 5 6
2 Blue Red 7 6
3 Green Blue 8 5
4 Blue Green 2 5
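A variant of the same idea without NumPy is sketched below, assuming the sample frame from the question. It builds an order-independent key by sorting each pair of colours with plain Python; the name `key` is illustrative. Note that transform('mean') returns floats in current pandas versions.

```python
import pandas as pd

mydf = pd.DataFrame({
    "Colour1": ["Red", "Red", "Blue", "Green", "Blue"],
    "Colour2": ["Red", "Blue", "Red", "Blue", "Green"],
    "Rating": [4, 5, 7, 8, 2],
})

# Sorting each pair makes ('Red', 'Blue') and ('Blue', 'Red') map to the same key.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(mydf["Colour1"], mydf["Colour2"])],
    index=mydf.index,
)

# Group the ratings by the order-independent key and broadcast the mean back to each row.
mydf["MeanRating"] = mydf["Rating"].groupby(key).transform("mean")

print(mydf)
```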

Pandas create a new data frame from counting rows into columns

I have something like this data frame:
item color
0 A red
1 A red
2 A green
3 B red
4 B green
5 B green
6 C red
7 C green
I want to count how many times each color occurs for each item and pivot the counts into columns like this:
item red green
0 A 2 1
1 B 1 2
2 C 1 1
Any thoughts? Thanks in advance.
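One possible approach, offered as a sketch rather than a definitive answer, is pandas.crosstab, which counts co-occurrences of two columns. The example assumes the sample data above; the column reordering and the removal of the columns name are only there to match the desired layout.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "color": ["red", "red", "green", "red", "green", "green", "red", "green"],
})

# crosstab counts how often each (item, color) pair occurs.
counts = pd.crosstab(df["item"], df["color"])

# crosstab sorts column labels alphabetically, so reorder them to match the desired output,
# then turn the 'item' index back into a regular column.
counts = counts[["red", "green"]].reset_index()
counts.columns.name = None  # drop the leftover 'color' label on the columns axis

print(counts)
```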

Calculating a mean with a condition in pandas, grouping by two columns, and printing only the mean for each category

Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need the mean of Price for rows where Price is between 31 and 40.
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need the row count for rows where Price is between 31 and 40.
Please help.
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If both results can go in one DataFrame, aggregate with multiple functions:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or as two separate DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1
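The same filter-then-aggregate steps can also be written with named aggregation, so the output columns get explicit names. This is a minimal sketch assuming the input table from the question; the output names `Mean` and `Count` and the variable names `in_range` and `summary` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Orange", "Apple", "Apple"],
    "Count": [55, 60, 60, 70],
    "Price": [35, 40, 36, 41],
    "tag": ["red", "orange", "red", "red"],
})

# between() is inclusive on both ends by default, so 40 is kept and 41 is dropped.
in_range = df[df["Price"].between(31, 40)]

# Named aggregation: each keyword becomes an output column, built from (column, function).
summary = in_range.groupby(["Fruit", "tag"], as_index=False).agg(
    Mean=("Price", "mean"),
    Count=("Price", "size"),
)

print(summary)
```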

How to find max and return adjacent cell in Excel

Imagine a table:
Red 8 Black 1
Red 2 Black 3
Red 1 Black 0
Red 7 Black 8
Red 4 Black 5
How do I return "Red" or "Black" in a third column for each row depending on which has a larger value?
It would be:
Red 8 Black 1 Red
Red 2 Black 3 Black
Red 1 Black 0 Red
Red 7 Black 8 Black
Red 4 Black 5 Black
Use:
=INDEX(A2:D2,MATCH(MAX(A2:D2),A2:D2,0)-1)
Edit:
Since there are only two options, a simple IF will work:
=IF(B2>D2,A2,C2)

Excel find string in string in range return data in corresponding column

The Excel thingy again. ;-)
I have columns like this:
A B C
UserID Name Org_Name
1 Brian Green Susan Red
2 Niels Red Susan Blue
3 Susan Yellow Brian Green
4 India Orange Serge Black
I am looking for a formula that can find the Org_Name (column C) in Name (column B) and return the UserID (column A) and Name (column B) of the matching row.
In this case it could look like this:
A B C D E
UserID Name Org_Name FoundID FoundName
1 Brian Green Susan Red N/A N/A
2 Niels Red Susan Blue N/A N/A
3 Susan Yellow Brian Green 1 Brian Green
4 India Orange Serge Black N/A N/A
Anyone?
Formula for FoundID column:
=INDEX($A$2:$A$5,MATCH(C2,$B$2:$B$5,0))
Formula for FoundName column:
=INDEX($B$2:$B$5,MATCH(C2,$B$2:$B$5,0))
Adjust the end row as required (the 5s in both formulas). Rows with no match return #N/A.
