DataFrame: find duplicate values in a column based on other columns, then add a label to them - python-3.x

Given the following data frame:
import pandas as pd
d=pd.DataFrame({'ID':[1,1,1,1,2,2,2,2],
'values':['a','b','a','a','a','a','b','b']})
d
ID values
0 1 a
1 1 b
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 2 b
The data I want to get is:
ID values count label(values + ID)
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22
Thank you so much!

It seems you need transform('count') plus cumcount:
d['count']=d.groupby(['ID','values'])['values'].transform('count')
d['label']=d['values']+d.ID.astype(str)+d.groupby(['ID','values']).cumcount().add(1).astype(str)
d
Out[511]:
ID values count label
0 1 a 3 a11
1 1 b 1 b11
2 1 a 3 a12
3 1 a 3 a13
4 2 a 2 a21
5 2 a 2 a22
6 2 b 2 b21
7 2 b 2 b22

You want to group by ID and values. Within each group, you are interested in two things: the number of members in the group (count) and the occurrence within the group (order):
df['order'] = df.groupby(['ID', 'values']).cumcount() + 1
df['count'] = df.groupby(['ID', 'values'])['values'].transform('count')
You can then concatenate values, ID and order as strings, using sum across each row:
df['label'] = df[['values', 'ID', 'order']].astype(str).sum(axis=1)
Which leads to:
ID values order count label
0 1 a 1 3 a11
1 1 b 1 1 b11
2 1 a 2 3 a12
3 1 a 3 3 a13
4 2 a 1 2 a21
5 2 a 2 2 a22
6 2 b 1 2 b21
7 2 b 2 2 b22
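To see why transform and cumcount are used here rather than a plain aggregation, it may help to look at what each call returns on the question's data. This is only a minimal sketch; the names per_group, per_row and position are illustrative and do not come from the answers above.

```python
import pandas as pd

d = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "values": ["a", "b", "a", "a", "a", "a", "b", "b"],
})
grouped = d.groupby(["ID", "values"])

# size() collapses the data to one entry per (ID, values) group
per_group = grouped.size()

# transform("count") broadcasts each group's size back onto every original row
per_row = grouped["values"].transform("count")

# cumcount() numbers the rows inside each group, starting at 0
position = grouped.cumcount()

print(per_group)
print(pd.DataFrame({"count": per_row, "position": position}))
```

Because per_row and position keep the original row index, they can be assigned straight back as new columns, which the one-entry-per-group result of size() cannot.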

Related

Grouped Sum on complicated calculated fields in other column

I have an Excel sheet with data (Sheet1). The first number is a sequential number representing the month.
Sheet1 <month, year, data1, data2>
[first row: titles]
1 1 data11 data12
2 1 data21 data22
3 1 data31 data32
4 1 data41 data42
5 1 data51 data52
6 1 data61 data62
7 1 data71 data72
8 1 data81 data82
9 1 data91 data92
10 1 data101 data102
11 1 data111 data112
12 1 data121 data122
13 2 data131 data132
14 2 data141 data142
Sheet2
[month, year, formula]
1 1 sheet1!C2-3*sheet1!B1
2 1 sheet1!C3-3*sheet1!B2
3 1 sheet1!C4-3*sheet1!B3
4 1 sheet1!C5-3*sheet1!B4
5 1 sheet1!C6-3*sheet1!B5
6 1 sheet1!C7-3*sheet1!B6
7 1 sheet1!C8-3*sheet1!B7
8 1 sheet1!C9-3*sheet1!B8
9 1 sheet1!C10-3*sheet1!B9
10 1 sheet1!C11-3*sheet1!B10
11 1 sheet1!C12-3*sheet1!B11
12 1 sheet1!C13-3*sheet1!B12
13 2 sheet1!C14-3*sheet1!B13
14 2 sheet1!C15-3*sheet1!B14
Sheet3
[year, Sum of column C in sheet2 grouped by year]
First row <year, formula>
1 =SUMIF(sheet2!B$2:B$15, A2, sheet2!C$2:C$15)
2 =SUMIF(sheet2!B$2:B$15, A3, sheet2!C$2:C$15)
My question: can I remove Sheet2 and do the calculation directly in Sheet3?
I could if column C of Sheet2 were moved to Sheet1, but I don't want to add many columns to Sheet1, because Sheet2 has many columns. If we can remove Sheet2, we remove a lot of formulas (in this example, 14 + 2 formulas become only 2).
Thanks
Solved: the year is in column 2, so
=SUMPRODUCT((Sheet1!B$2:B$424=Sheet3!B2)*(Formula using $2:$424 in each column of the monthly formula))

Should I stack, pivot, or groupby?

I'm still learning how to work with DataFrames and can't get this to work. I have a DataFrame like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.
This is a job for wide_to_long (note that DA then holds the numeric suffix rather than the full column name):
ydf=pd.wide_to_long(df,'D',i=['A','B','C'],j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack
df.set_index(['A','B','C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
You can then tidy up by renaming the column headers, and so on.
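As one possible way to do the renaming mentioned above inside the same chain, the sketch below names the column level before stacking and names the stacked values afterwards. It assumes the single-row frame from the question; tidy is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D1": [5], "D2": [6], "D3": [7]})

tidy = (
    df.set_index(["A", "B", "C"])
      .rename_axis(columns="DA")  # name the column level so stack() labels it DA
      .stack()                    # move D1, D2, D3 from columns into rows
      .rename("D")                # name the stacked values
      .reset_index()
)
print(tidy)
```

This should give the same columns as the melt version (A, B, C, DA, D) without a separate rename step afterwards.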

How can I unpivot or stack a pandas dataframe in the way that I asked?

I have a python pandas dataframe.
For example here is my data:
id A_1 A_2 B_1 B_2
0 j2 1 5 10 8
1 j3 2 6 11 9
2 j4 3 7 12 10
I want it to look like this:
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2
Can you help me please. Thank you so much!
Use wide_to_long with DataFrame.sort_values:
df = (pd.wide_to_long(df, ['A','B'], i='id', j='Other', sep='_')
.sort_values('id')
.reset_index())
print (df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
We can also use DataFrame.melt + Series.str.split and then perform a DataFrame.pivot_table:
df2=df.melt('id')
df2[['columns','Other']]=df2['variable'].str.split('_',expand=True)
new_df= ( df2.pivot_table(columns='columns',index=['id','Other'],values='value')
.reset_index()
.rename_axis(columns=None) )
print(new_df)
id Other A B
0 j2 1 1 10
1 j2 2 5 8
2 j3 1 2 11
3 j3 2 6 9
4 j4 1 3 12
5 j4 2 7 10
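Since each (id, Other) pair occurs only once per column in this data, nothing actually needs aggregating, so DataFrame.pivot should also work in place of pivot_table. The following is a minimal sketch under that assumption; it also assumes pandas 1.1 or later (where pivot accepts a list for index) and rebuilds the question's frame under the illustrative name wide.

```python
import pandas as pd

wide = pd.DataFrame({
    "id": ["j2", "j3", "j4"],
    "A_1": [1, 2, 3],
    "A_2": [5, 6, 7],
    "B_1": [10, 11, 12],
    "B_2": [8, 9, 10],
})

# One row per (id, original column name)
melted = wide.melt("id")

# Split "A_1" into the target column name "A" and the suffix "1"
melted[["columns", "Other"]] = melted["variable"].str.split("_", expand=True)

# pivot only reshapes; it raises if an (id, Other, columns) combination repeats
result = (
    melted.pivot(index=["id", "Other"], columns="columns", values="value")
          .reset_index()
          .rename_axis(columns=None)
)
print(result)
```

Note that Other comes out of str.split as strings ("1", "2"); cast it with astype(int) if numeric suffixes are needed.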

sumproduct in different columns between dates

I'm trying to sum between two dates across columns. I have a start date in Sheet1!F1 and an end date in Sheet1!F2, and I need to multiply column B by column E.
I can do SUMPRODUCT(Sheet1!B2:B14,Sheet1!E2:E14), which results in 48 for the example table below. However, I need to add date parameters so that I can restrict the sum to a date range, for example 2/1/15 to 6/1/15, which should result in 20.
A B C D E
Date Value1 Value2 Value3 Value4
1/1/2015 1 2 3 4
2/1/2015 1 2 3 4
3/1/2015 1 2 3 4
4/1/2015 1 2 3 4
5/1/2015 1 2 3 4
6/1/2015 1 2 3 4
7/1/2015 1 2 3 4
8/1/2015 1 2 3 4
9/1/2015 1 2 3 4
10/1/2015 1 2 3 4
11/1/2015 1 2 3 4
12/1/2015 1 2 3 4
Try:
=SUMPRODUCT((Sheet1!A2:A14>=Sheet1!F1)*(Sheet1!A2:A14<=Sheet1!F2)*Sheet1!B2:B14*Sheet1!E2:E14)
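For readers who want to check the arithmetic outside Excel, here is a rough pandas analogue of the formula. It is a sketch under assumptions: the dates are read as month/day/year (so the table holds the first of each month of 2015), and start and end are stand-ins for Sheet1!F1 and Sheet1!F2.

```python
import pandas as pd

sheet = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=12, freq="MS"),  # column A
    "Value1": 1,                                                 # column B
    "Value4": 4,                                                 # column E
})
start = pd.Timestamp("2015-02-01")  # stand-in for Sheet1!F1
end = pd.Timestamp("2015-06-01")    # stand-in for Sheet1!F2

# Same role as (A>=F1)*(A<=F2): True for rows inside the range, both ends included
in_range = sheet["Date"].between(start, end)

products = sheet["Value1"] * sheet["Value4"]
total_all = products.sum()              # no date filter
total_range = products[in_range].sum()  # only rows inside the range
print(total_all, total_range)
```

In the Excel formula the two comparisons evaluate to 1 or 0 and are multiplied into each product, which has the same effect as the boolean filter used here.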

Excel: compare 2 columns and copy data to another column

I need help. I'm trying to compare 2 columns and copy data into another column.
Columns:
A B C D
1 3 10
2 4 20
3 1 30
4 2 40
5 0 50
I want to compare column A to column B to find duplicates, and copy the data from column C when the value in column A also appears in column B.
Result must be:
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
Thanks in advance.
An answer as I understand the question (assuming the change in col B is just a typo):
Input
A B C D
1 3 10
2 4 20
3 6 30
4 2 40
5 0 50
Output
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
Formula in D2 (filled down): =IF(COUNTIF(B$2:B$6, $A2)>0, VLOOKUP($A2,$B$2:$C$6, 2, FALSE), 0).
COUNTIF(B$2:B$6, $A2) returns the number of times the value in A2 appears in the range B2:B6. If this count is greater than 0 (meaning that A2 is in B2:B6), the formula uses VLOOKUP to look up A2 in col B and returns the value from the 2nd column of the lookup range (col C); if A2 is not in B2:B6, the formula returns 0.
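The same lookup logic can be sketched in pandas as a cross-check of the expected output. This is only an analogue of the Excel formula, not a replacement for it; the frame reproduces the answer's Input table, and lookup is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [3, 4, 6, 2, 0],
    "C": [10, 20, 30, 40, 50],
})

# B -> C lookup table; keep the first row for a repeated B value,
# which mirrors VLOOKUP returning the first exact match
lookup = df.drop_duplicates("B").set_index("B")["C"]

# Values of A that are missing from B map to NaN, then become 0
# (the role COUNTIF plays in the formula)
df["D"] = df["A"].map(lookup).fillna(0).astype(int)
print(df)
```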
