Add count of elements within a list for each group in pandas - python-3.x

I have a list and a dataframe such as:
The_list=["A","B","D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
And I would like to add, for each group, the number of Values elements that are within The_list, and put this number in a New_column.
Here is what I should then get:
Groups Values New_column
G1 A 3
G1 B 3
G1 C 3
G1 D 3
G2 A 2
G2 B 2
G2 A 2
G3 A 1
G3 D 1
G4 Z 0
G4 D 0
G4 E 0
G4 C 0
Thanks a lot for your help
Here is the table in dict format, if it helps:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G3', 8: 'G3', 9: 'G4', 10: 'G4', 11: 'G4', 12: 'G4'}, 'Values': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'D', 9: 'Z', 10: 'D', 11: 'E', 12: 'C'}}

In your case, do the transform after the isin check:
df['new'] = df['Values'].isin(The_list).groupby(df['Groups']).transform('sum')
Out[37]:
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 2
8 2
9 1
10 1
11 1
12 1
Name: Values, dtype: int64
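For reference, a minimal self-contained sketch of that answer, built from the dict posted above (the New_column name comes from the question; this reproduces the per-group sums 3, 3, 2 and 1 shown in the output above):
```python
import pandas as pd

The_list = ["A", "B", "D"]
df = pd.DataFrame({'Groups': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2',
                              'G3', 'G3', 'G4', 'G4', 'G4', 'G4'],
                   'Values': ['A', 'B', 'C', 'D', 'A', 'B', 'A',
                              'A', 'D', 'Z', 'D', 'E', 'C']})

# Mask of rows whose Values is in The_list, summed per group and broadcast
# back to every row of that group.
df['New_column'] = (df['Values'].isin(The_list)
                      .groupby(df['Groups'])
                      .transform('sum'))
print(df)
```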

Related

Remove duplicate within column depending on 2 lists in python

I have a dataframe such as:
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus B
G1 Canis_lupus F
G1 Cattus_cattus C
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Zebra-fish B
G2 Cattus_cattus D
G2 Cattus_cattus E
and the idea is that, within each Groups value, I would like to keep at most two rows per duplicated NAME: one whose LETTER is in list1=['A','B','C'] and one whose LETTER is in list2=['D','E','F'].
When duplicates fall in the same list, for instance A and B, I keep the A, i.e. the first in alphabetical order.
In the example I should then get:
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus F
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Cattus_cattus D
Here is the dataframe in dict format:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra-fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}
The idea is to first sort by LETTER if necessary, then filter by both lists with Series.isin, remove duplicates with DataFrame.drop_duplicates, and finally join everything together with concat:
# sorting per group
df = df.sort_values(['Groups', 'LETTER'])
# sorting by one column
# df = df.sort_values('LETTER')
list1 = ['A', 'B', 'C']
list2 = ['D', 'E', 'F']
df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups', 'NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups', 'NAME'])
df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print(df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Another idea: map the values to a helper column using merged dictionaries, similar to the other solution, but also remove rows that match neither list with DataFrame.dropna, and finally drop the helper column and sort:
d = {**dict.fromkeys(list1, 'a'),
     **dict.fromkeys(list2, 'b')}

df = (df.assign(new=df.LETTER.map(d))
        .dropna(subset=['new'])
        .drop_duplicates(subset=['Groups', 'NAME', 'new'])
        .sort_index(ignore_index=True)
        .drop(columns='new')
      )
print(df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Create a list_id column to identify which list a particular letter belongs to. Then just drop the duplicates using the subset parameter.
import numpy as np

condlist = [df.LETTER.isin(list1),
            df.LETTER.isin(list2)]
choicelist = ['list1', 'list2']

df['list_id'] = np.select(condlist, choicelist)
df = (df.sort_values('LETTER')
        .drop_duplicates(subset=['Groups', 'NAME', 'list_id'])
        .drop(columns='list_id')
        .sort_values(['Groups', 'NAME']))
OUTPUT:
Groups NAME LETTER
0 G1 Canis_lupus A
2 G1 Canis_lupus F
3 G1 Cattus_cattus C
5 G2 Canis_lupus C
9 G2 Cattus_cattus D
6 G2 Zebra_fish A
7 G2 Zebra_fish D
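One caveat, added here as a note rather than as part of the original answer: with no default given, np.select assigns 0 to letters that appear in neither list, so all such rows would share one list_id and be de-duplicated against each other. A hedged sketch that keeps them out explicitly (it assumes the same df, list1 and list2 as above):
```python
import numpy as np

condlist = [df['LETTER'].isin(list1), df['LETTER'].isin(list2)]
choicelist = ['list1', 'list2']

# Tag unmatched letters with an explicit 'other' marker instead of the
# shared default 0, then exclude them before dropping duplicates.
df['list_id'] = np.select(condlist, choicelist, default='other')
out = (df[df['list_id'] != 'other']
         .sort_values('LETTER')
         .drop_duplicates(subset=['Groups', 'NAME', 'list_id'])
         .drop(columns='list_id')
         .sort_values(['Groups', 'NAME']))
```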

Assigning a single key to all similar products/rows in a data frame based on product description and one other key

Based on 3 keys/columns (uniqueid, uniqueid2 and uniqueid3), I need to generate a column new_key that will tag all associated products/rows with a single key.
```python
df = pd.DataFrame({'uniqueid': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd', 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'm'},
                   'uniqueid2': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd', 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'l'},
                   'uniqueid3': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v', 6: 'x', 7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k', 12: 'n'}})
```
This is the data that I have, based on the columns uniqueid, uniqueid2 and uniqueid3. I need to create new_key as shown. In this dummy data, all rows except the first belong to the same product, based on the associations in columns 1 and 2.
But I am unsure how to proceed further. Quick help needed, please.
Expected Output:
[1]: https://i.stack.imgur.com/yAl56.png
This will give you the correct output, but I'm not sure this is exactly what you want to do in order to generate the new_key column. This solution checks, for each uniqueid group, whether every uniqueid2 value in the group is unique across the entire uniqueid2 column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'uniqueid': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd', 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l'},
                   'uniqueid2': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v', 6: 'x', 7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k'}})

# m1: True where the row's uniqueid2 value occurs exactly once in the whole column
df['m1'] = (df.groupby('uniqueid2')['uniqueid2'].transform('count') == 1)
# m2: number of such globally unique uniqueid2 values within each uniqueid group
df['m2'] = df.groupby('uniqueid')['m1'].transform('sum')
# m3: size of each uniqueid group
df['m3'] = df.groupby('uniqueid')['uniqueid2'].transform('size')
# m4: True where the uniqueid value occurs exactly once
df['m4'] = (df.groupby('uniqueid')['uniqueid'].transform('count') == 1)
df['new_key'] = np.where((df['m2'] == df['m3']) | df['m4'], df['uniqueid'], 'b')
df
Out[13]:
uniqueid uniqueid2 m1 m2 m3 m4 new_key
0 a z True 1.0 1 True a
1 b y False 0.0 2 False b
2 b x False 0.0 2 False b
3 c y False 0.0 1 True c
4 d x False 1.0 2 False b
5 d v True 1.0 2 False b
6 e x False 1.0 2 False b
7 e u True 1.0 2 False b
8 g h True 2.0 2 False g
9 g i True 2.0 2 False g
10 h k False 0.0 1 True h
11 l k False 0.0 1 True l
I kept m1, m2, m3 and m4 so that you could see the progression of the logic. You can drop these columns with:
df = df.drop(['m1', 'm2', 'm3', 'm4'], axis=1)
This looks like a networkx problem, let's try:
import networkx as nx

G = nx.Graph()
# get the first uniqueid seen for each uniqueid2 value
s = df.groupby('uniqueid2')['uniqueid'].transform('first')
# build edges between each uniqueid and that first value, then take the connected components
G.add_edges_from(df[['uniqueid']].assign(k=s).to_numpy().tolist())
cc = list(nx.connected_components(G))
# [{'a'}, {'b', 'c', 'd', 'e'}, {'g'}, {'h', 'l'}]
idx = [dict.fromkeys(y, x) for x, y in enumerate(cc)]
d = {k: v for d in idx for k, v in d.items()}
df['new_key'] = s.groupby(s.map(d)).transform('first')
print(df)
uniqueid uniqueid2 new_key
0 a z a
1 b y b
2 b x b
3 c y b
4 d x b
5 d v b
6 e x b
7 e u b
8 g h g
9 g i g
10 h k h
11 l k h
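If pulling in networkx is not an option, roughly the same grouping can be sketched with a small union-find over the same edges. This is an alternative sketch (assuming the two-column df used in the answers above), not part of the original answer:
```python
# Union-find over uniqueid values, linked through the first uniqueid seen
# for each uniqueid2 value (the same edges the networkx graph is built from).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

s = df.groupby('uniqueid2')['uniqueid'].transform('first')
for a, b in zip(df['uniqueid'], s):
    union(a, b)

# Label each connected component by the first uniqueid that appears in it.
root = df['uniqueid'].map(find)
df['new_key'] = df.groupby(root)['uniqueid'].transform('first')
```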

Increment Count column by 1 based on another column

I have this data frame where I need to create a Count column based on my Distance column. I grouped the result by the Model column. What I anticipate is an increment of 1 on the next Count row each time the Distance is 100. For example, here is what I have so far, but no success yet with the increment:
import pandas as pd
df = pd.DataFrame(
[['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1],['A', '55', 5],
['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1],['B', '0', 1],
['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6],['C', '22', 4]],
columns=['Model', 'Distance', 'v1'])
df = df.groupby(['Model']).apply(lambda row: callback(row) if row['Distance'] is not None else callback(row)+1)
print(df)
import numpy as np

(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                                 .eq('100').replace(False, np.nan)
                                                 .ffill().fillna(0)))
      .reset_index(level=0, drop=True)
)
Result with your code solution
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 6.0
My expected result is:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 7.0
Take a look at the C group: there are two Distance values equal to 100.
Setup:
df = pd.DataFrame({'Model': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
'Distance': {0: '34', 1: '55', 2: '100', 3: '0', 4: '55', 5: '90', 6: '0', 7: '1', 8: '21', 9: '0', 10: '9', 11: '23', 12: '100', 13: '33', 14: '23'},
'v1': {0: 3, 1: 5, 2: 7, 3: 1, 4: 5, 5: 3, 6: 1, 7: 3, 8: 1, 9: 1, 10: 7, 11: 4, 12: 1, 13: 6, 14: 4},
'Count': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5}})
If the logic can be applied across Model groups (i.e. without grouping), you can use a shift, compare and add 1 for eligible rows:
df.loc[df.Distance.shift().eq('100'), 'Count'] += 1
If the logic needs to be applied per Model group, then you can use a groupby:
(
df.groupby('Model')
.apply(lambda x: x.assign(Count=x.Distance.shift().eq('100') + x.Count))
.reset_index(level=0, drop=True)
)
Based on @StringZ's updates, below is the updated solution:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                                 .eq('100').replace(False, np.nan)
                                                 .ffill().fillna(0)))
      .reset_index(level=0, drop=True)
)
Model Distance v1 Count
0 A 34 3 1.0
1 A 55 5 2.0
2 A 100 7 3.0
3 A 0 1 5.0
4 A 55 5 6.0
5 B 90 3 1.0
6 B 0 1 2.0
7 B 1 3 3.0
8 B 21 1 4.0
9 B 0 1 5.0
10 C 9 7 1.0
11 C 23 4 2.0
12 C 100 1 3.0
13 C 33 6 5.0
14 C 23 4 6.0
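Regarding the two 100s in the C group that the question points out: the ffill-based version above can only ever add 1 within a group, because the forward-filled flag never grows past 1. If the intent is for the offset to increase by 1 after every row whose Distance is '100', a cumulative sum of the shifted comparison is one way to sketch it (this assumes that rule and a frame with Model, Distance and Count columns, as in the Setup above; it is not code from the original answers):
```python
# Within each Model, count how many earlier rows had Distance == '100'
# and add that running offset to the original Count.
offset = (df.groupby('Model')['Distance']
            .transform(lambda s: s.shift().eq('100').cumsum()))
df['Count'] = df['Count'] + offset
```
Applied to data like the question's C group (Distances 9, 100, 50, 100, 22 with Counts 1 to 5), this yields 1, 2, 4, 5, 7, matching the expected result.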

Using duplicate values from one column to remove entire rows in a pandas dataframe

I have the data in the .csv file uploaded at the following link:
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage, which contains values from 0 to N (in this case 0 to 4).
I would like to keep one row for each Team at each SimStage value (i.e. the rest will be removed). When removing, the duplicate rows with the lower value in the Points column should be dropped for each Team and SimStage.
Since it is slightly difficult to explain using words alone, I attached a picture here.
In this picture, the rows highlighted in red boxes will be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the highest value from the 'Points' column. Therefore, use the 'first' aggregation function in pandas.
Create the dataframe and call it df
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')
# Group by the necessary columns and aggregate with 'first'
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original column order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value. So, I first group them by Team and SimStage and then sort them by Points.
namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)
This will make my dataframe look like this; notice the change in the SimStage 3 group:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
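For completeness, a shorter variant of the same idea (a sketch that starts from the mini-dataset as originally constructed, before the sort and drop_duplicates steps, and assumes Points is numeric): take the index of the maximum Points per Team and SimStage and select those rows directly.
```python
# One row per (Team, SimStage): the row holding the maximum Points
# (ties resolve to the first occurrence).
best = namesDf.groupby(['Team', 'SimStage'])['Points'].idxmax()
result = namesDf.loc[best].reset_index(drop=True)
```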

Trouble pivoting in pandas (spread in R)

I'm having some issues with the pd.pivot() or pivot_table() functions in pandas.
I have this:
df = pd.DataFrame({'site_id': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'a', 7: 'a', 8: 'b', 9: 'b', 10: 'c', 11: 'c'},
                   'dt': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
                   'eu': {0: 'FGE', 1: 'WSH', 2: 'FGE', 3: 'WSH', 4: 'FGE', 5: 'WSH', 6: 'FGE', 7: 'WSH', 8: 'FGE', 9: 'WSH', 10: 'FGE', 11: 'WSH'},
                   'kw': {0: '8', 1: '5', 2: '3', 3: '7', 4: '1', 5: '5', 6: '2', 7: '3', 8: '5', 9: '7', 10: '2', 11: '5'}})
df
Out[140]:
dt eu kw site_id
0 1 FGE 8 a
1 1 WSH 5 a
2 1 FGE 3 b
3 1 WSH 7 b
4 1 FGE 1 c
5 1 WSH 5 c
6 2 FGE 2 a
7 2 WSH 3 a
8 2 FGE 5 b
9 2 WSH 7 b
10 2 FGE 2 c
11 2 WSH 5 c
I want this:
dt site_id FGE WSH
1 a 8 5
1 b 3 7
1 c 1 5
2 a 2 3
2 b 5 7
2 c 2 5
I've tried everything!
df.pivot_table(index = ['site_id','dt'], values = 'kw', columns = 'eu')
or
df.pivot(index = ['site_id','dt'], values = 'kw', columns = 'eu')
should have worked. I also tried unstack():
df.set_index(['dt','site_id','eu']).unstack(level = -1)
Your last try (with unstack) works fine for me, I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than levels, so I did it like this:
>>> df.set_index(['dt','site_id','eu']).unstack('eu')
kw
eu FGE WSH
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5
But again, your way looks fine to me and is pretty much the same as what @piRSquared did (except their answer adds some more code to get rid of the multi-index).
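For what it's worth, here is one hedged sketch of that extra step for getting rid of the multi-index after the unstack, flattening back to the asker's desired layout (my wording of it, not code from the original answers; it assumes the same df as above):
```python
out = (df.set_index(['dt', 'site_id', 'eu'])
         .unstack('eu')
         .droplevel(0, axis=1)       # drop the outer 'kw' level from the columns
         .rename_axis(None, axis=1)  # remove the 'eu' name on the columns
         .reset_index())
```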
I think the problem with pivot is that you can only pass a single variable, not a list? Anyway, this works for me:
>>> df.set_index(['dt','site_id']).pivot(columns='eu')
For pivot_table, the main issue is that 'kw' is an object/character and pivot_table will attempt to aggregate with numpy.mean by default. You probably got the error message: "DataError: No numeric types to aggregate".
But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command
>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')
Alternatively you could change the aggregation function:
>>> df.pivot_table(index=['dt', 'site_id'], values='kw', columns='eu',
...                aggfunc=sum)
That's using the fact that strings can be summed (concatenated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.
Note, however, that pivot_table's aggfunc requires some sort of reduction operation here even though you only have a single value per cell, so there actually isn't anything to reduce! But there is a check in the code that requires a reduction operation, so you have to do one.
df.set_index(['dt', 'site_id', 'eu']).kw \
    .unstack().rename_axis(None, axis=1).reset_index()
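And, putting the pivot_table route together end to end (a sketch; it assumes converting kw to a numeric type is acceptable and uses 'first' as the aggregation, since there is exactly one value per cell):
```python
# Full pivot_table route: make kw numeric, pivot, then flatten back to plain columns.
out = (df.assign(kw=df['kw'].astype(int))
         .pivot_table(index=['dt', 'site_id'], columns='eu',
                      values='kw', aggfunc='first')
         .rename_axis(None, axis=1)
         .reset_index())
print(out)
```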
