Find duplicate values in a DataFrame column based on other columns, then add a label to each


Given the following data frame:
import pandas as pd
d=pd.DataFrame({'ID':[1,1,1,1,2,2,2,2],
'values':['a','b','a','a','a','a','b','b']})
d
ID values
0 1 a
1 1 b
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 2 b
The data I want to get is:
ID values count label(values + ID)
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
Thank you so much!

It seems you need transform('count') plus cumcount:
d['count']=d.groupby(['ID','values'])['values'].transform('count')
d['label']=d['values']+d.ID.astype(str)+d.groupby(['ID','values']).cumcount().add(1).astype(str)
d
Out[511]:
ID values count label
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
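For reference, here is the same approach as a self-contained sketch that rebuilds the question's frame from scratch. The names `g` and `occurrence` are only illustrative; both new columns are computed before either is assigned, so the grouping always sees the original two columns.

```python
import pandas as pd

d = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                  'values': ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'b']})

# Group once on (ID, values) and reuse the GroupBy object.
g = d.groupby(['ID', 'values'])

# Size of each group, broadcast back to every row of that group.
count = g['values'].transform('count')

# 1-based position of each row inside its group (cumcount starts at 0).
occurrence = g.cumcount() + 1

d['count'] = count
d['label'] = d['values'] + d['ID'].astype(str) + occurrence.astype(str)
print(d)
```

Keeping `occurrence` as a separate Series avoids leaving a helper column in the result.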

You want to group by ID and values. Within each group, you are interested in two things: the number of members in the group (count) and the occurrence within the group (order):
df['order'] = df.groupby(['ID', 'values']).cumcount() + 1
df['count'] = df.groupby(['ID', 'values'])['values'].transform('count')
You can then concatenate their string values, along with the values using sum:
df['label'] = df[['values', 'ID', 'order']].astype(str).sum(axis=1)
Which leads to:
ID values order count label
0 1 a 1 3 a11
1 1 b 1 1 b11
2 1 a 2 3 a12
3 1 a 3 3 a13
4 2 a 1 2 a21
5 2 a 2 2 a22
6 2 b 1 2 b21
7 2 b 2 2 b22

Related

Grouped Sum on complicated calculated fields in other column

I have an Excel sheet with data (Sheet1). The first number is a sequential number representing the month.
Sheet1 <month, year, data1, data2>
[first row: titles]
1 1 data11 data12
2 1 data21 data22
3 1 data31 data32
4 1 data41 data42
5 1 data51 data52
6 1 data61 data62
7 1 data71 data72
8 1 data81 data82
9 1 data91 data92
10 1 data101 data102
11 1 data111 data112
12 1 data121 data122
13 2 data131 data132
14 2 data141 data142
Sheet2
[month, year, formula]
1 1 sheet1!C2-3*sheet1!B1
2 1 sheet1!C3-3*sheet1!B2
3 1 sheet1!C4-3*sheet1!B3
4 1 sheet1!C5-3*sheet1!B4
5 1 sheet1!C6-3*sheet1!B5
6 1 sheet1!C7-3*sheet1!B6
7 1 sheet1!C8-3*sheet1!B7
8 1 sheet1!C9-3*sheet1!B8
9 1 sheet1!C10-3*sheet1!B9
10 1 sheet1!C11-3*sheet1!B10
11 1 sheet1!C12-3*sheet1!B11
12 1 sheet1!C13-3*sheet1!B12
13 2 sheet1!C14-3*sheet1!B13
14 2 sheet1!C15-3*sheet1!B14
Sheet3
[year, Sum of column C in sheet2 grouped by year]
First row <year, formula>
1 =SUMIF(sheet2!B$2:B$15, A2, sheet2!C$2:C$15)
2 =SUMIF(sheet2!B$2:B$15, A3, sheet2!C$2:C$15)
My question: can I remove Sheet2 and do the calculation directly in Sheet3?
I could if column C of Sheet2 were moved to Sheet1, but I don't want to add many columns to Sheet1, because Sheet2 has many columns. If we can remove Sheet2, we remove a lot of formulas (in this example, 14 + 2 formulas become only 2).
Thanks

Solved: the year is in the second column (B) of Sheet1, so:
=SUMPRODUCT((Sheet1!B$2:B$424=Sheet3!B2)*(Formula using $2:$424 in each column of the monthly formula))

Should I stack, pivot, or groupby?

I'm still learning how to work with DataFrames and can't get this to work. I have a DataFrame like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.

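This is a job for pd.wide_to_long. Note that it strips the stub name, so DA holds the numeric suffix (1, 2, 3) rather than D1, D2, D3: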
ydf=pd.wide_to_long(df,'D',i=['A','B','C'],j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
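If the DA column should read D1, D2, D3 as in the question, the prefix can be added back afterwards. A minimal sketch, assuming the one-row frame from the question (the name `long_df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3],
                   'D1': [5], 'D2': [6], 'D3': [7]})

# Columns D1..D3 share the stub 'D'; their numeric suffixes go into 'DA'.
long_df = pd.wide_to_long(df, stubnames='D', i=['A', 'B', 'C'], j='DA').reset_index()

# Restore the 'D' prefix so the labels match the original column names.
long_df['DA'] = 'D' + long_df['DA'].astype(str)
print(long_df)
```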

Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack
df.set_index(['A','B','C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
You can then tidy up by renaming the column headers (level_3 and 0 in the stacked output).
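As a rough sketch of that housekeeping, the stacked version can be given the same column names as the melt version by naming the column axis before stacking and naming the resulting Series before resetting the index. The frame is the one-row example from the question; `melted` and `stacked` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3],
                   'D1': [5], 'D2': [6], 'D3': [7]})

# melt names the new columns directly.
melted = df.melt(id_vars=['A', 'B', 'C'], var_name='DA', value_name='D')

stacked = (df.set_index(['A', 'B', 'C'])
             .rename_axis(columns='DA')  # becomes the name of the stacked index level
             .stack()
             .rename('D')                # becomes the name of the value column
             .reset_index())

print(melted)
print(stacked)
```

Both frames end up with the columns A, B, C, DA, D, so the choice between them is mostly a matter of taste.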

How can I unpivot or stack a pandas dataframe in the way that I asked?

I have a python pandas dataframe.
For example here is my data:
id A_1 A_2 B_1 B_2
0 j2 1 5 10 8
1 j3 2 6 11 9
2 j4 3 7 12 10
I want it to look like this:
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2
Can you help me please. Thank you so much!

Use wide_to_long with DataFrame.sort_values:
df = (pd.wide_to_long(df, ['A','B'], i='id', j='Other', sep='_')
.sort_values('id')
.reset_index())
print (df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
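A self-contained variant of the above, rebuilding the sample frame so it can be run as-is. One swap to note: this sketch uses sort_index() instead of sort_values('id'), which orders by both id and Other and so does not depend on how ties within an id are resolved. The name `long_df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['j2', 'j3', 'j4'],
                   'A_1': [1, 2, 3], 'A_2': [5, 6, 7],
                   'B_1': [10, 11, 12], 'B_2': [8, 9, 10]})

# Stubs 'A' and 'B' are separated from their numeric suffix by '_';
# the suffix (1 or 2) lands in the new 'Other' index level.
long_df = (pd.wide_to_long(df, stubnames=['A', 'B'], i='id', j='Other', sep='_')
             .sort_index()   # order rows by (id, Other)
             .reset_index())
print(long_df)
```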

We can also use DataFrame.melt + Series.str.split and then perform a DataFrame.pivot_table:
df2=df.melt('id')
df2[['columns','Other']]=df2['variable'].str.split('_',expand=True)
new_df= ( df2.pivot_table(columns='columns',index=['id','Other'],values='value')
.reset_index()
.rename_axis(columns=None) )
print(new_df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10

sumproduct in different columns between dates

I'm trying to sum the product of two columns for rows that fall between two dates. I have a start date in Sheet1!F1 and an end date in Sheet1!F2, and I need to multiply column B by column E.
I can do sumproduct(Sheet1!B2:B14,Sheet1!E2:E14), which results in 48 for the example table below. However, I need to include date parameters, so that choosing the dates 2/1/15 to 6/1/15 results in 20.
A B C D E
Date Value1 Value2 Value3 Value4
1/1/2015 1 2 3 4
2/1/2015 1 2 3 4
3/1/2015 1 2 3 4
4/1/2015 1 2 3 4
5/1/2015 1 2 3 4
6/1/2015 1 2 3 4
7/1/2015 1 2 3 4
8/1/2015 1 2 3 4
9/1/2015 1 2 3 4
10/1/2015 1 2 3 4
11/1/2015 1 2 3 4
12/1/2015 1 2 3 4

Try,
=SUMPRODUCT((Sheet1!A2:A14>=Sheet1!F1)*(Sheet1!A2:A14<=Sheet1!F2)*Sheet1!B2:B14*Sheet1!E2:E14)
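The arithmetic behind that formula can be cross-checked outside Excel. The sketch below is not an Excel solution, just the same masked sum of products in pandas, assuming the dates are month/day/year (the first of each month) and using illustrative names (`sheet`, `in_range`):

```python
import pandas as pd

# Twelve monthly rows, 1/1/2015 through 12/1/2015, with Value1 = 1 and Value4 = 4.
dates = pd.date_range('2015-01-01', periods=12, freq='MS')
sheet = pd.DataFrame({'Date': dates, 'Value1': 1, 'Value4': 4})

start = pd.Timestamp('2015-02-01')  # plays the role of Sheet1!F1
end = pd.Timestamp('2015-06-01')    # plays the role of Sheet1!F2

# Unfiltered sum of products: 12 rows * (1 * 4).
total_all = (sheet['Value1'] * sheet['Value4']).sum()

# between() is inclusive on both ends, matching the >= and <= tests.
in_range = sheet['Date'].between(start, end)
total_in_range = (sheet.loc[in_range, 'Value1'] * sheet.loc[in_range, 'Value4']).sum()

print(total_all, total_in_range)
```

Five rows fall inside the range, so the filtered total is 5 * 4 = 20, against 48 for all twelve rows.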

excel:compare 2 columns and copy data on other columns

I need help. I'm trying to compare two columns and copy data into another column.
Columns:
A B C D
1 3 10
2 4 20
3 1 30
4 2 40
5 0 50
I want to compare column A to column B to find duplicates: if a value in column A also appears in column B, copy the column C value from the matching column B row.
Result must be:
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
thanks in advance...

An answer as I understand the question (assuming the change in col B is just a typo):
Input
A B C D
1 3 10
2 4 20
3 6 30
4 2 40
5 0 50
Output
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
Formula in D2 (filled down): =IF(COUNTIF(B$2:B$6, $A2)>0, VLOOKUP($A2,$B$2:$C$6, 2, FALSE), 0).
COUNTIF(B$2:B$6, $A2) returns the number of times the value in A2 appears in the range B2:B6. If this count is greater than 0 (meaning that A2 is in B2:B6), the IF() function looks up A2 in col B and returns the value from the 2nd column of the lookup range (col C); if A2 is not in B2:B6, the formula returns 0.
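The lookup logic can be sanity-checked with a small pandas sketch. This is only a cross-check of the expected column D, not a replacement for the formula; it assumes the corrected input above and that column B has no repeated values (the name `b_to_c` is illustrative).

```python
import pandas as pd

sheet = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                      'B': [3, 4, 6, 2, 0],
                      'C': [10, 20, 30, 40, 50]})

# Lookup table from each B value to the C value on the same row.
b_to_c = sheet.set_index('B')['C']

# For each A value, fetch the matching C; values of A missing from B become 0.
sheet['D'] = sheet['A'].map(b_to_c).fillna(0).astype(int)
print(sheet)
```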
