Grouping and Multiindexing a pandas dataframe - python-3.x

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
4    Dragoons     1st     Cooze             3             70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the result to have a multi-index in the shape of ["army", "company"]. How should I proceed?

If I understand correctly:
You can use map to look up values in the dictionary (with a dictionary comprehension to swap the key/value pairs, since the dictionary maps rank to regiment but we need regiment to rank):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
                   regiment      name  preTestScore  postTestScore
army     company
Majors   1st     Nighthawks    Miller             4             25
         1st     Nighthawks  Jacobson            24             94
         2nd     Nighthawks       Ali            31             57
         2nd     Nighthawks    Milner             2             62
Captains 1st       Dragoons     Cooze             3             70
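If the inversion reads more clearly as its own step, the one-liner above can be split in two (regiment_to_rank is just an illustrative name for the swapped dictionary):
regiment_to_rank = {v: k for k, v in army.items()}  # {'Nighthawks': 'Majors', 'Dragoons': 'Captains'}
df.assign(army=df['regiment'].map(regiment_to_rank)).set_index(['army', 'company'])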

Related

Excel RANGE VLOOKUP

STUDENT TIME SCORE WANT
JOHN 1 68 146
JOHN 2 78 146
JOHN 3 77 146
JOHN 4 91 146
JOHN 5 96 146
JAMES 1 66 119
JAMES 2 53 119
JAMES 3 80 119
JAMES 4 96 119
JAMES 5 50 119
JAMES 6 94 119
I have the columns 'STUDENT', 'TIME' and 'SCORE' and wish to create 'WANT'. The rule for which I will need VLOOKUP is this: WANT = the sum of the SCORE values at TIMES 1 and 2. So I wish to use VLOOKUP to find the 'SCORE' values for each 'STUDENT' at TIMES 1 and 2 and take the sum.
You can try SUMIFS() in this way.
=SUM(SUMIFS($C$2:$C$12,$B$2:$B$12,{1,2},$A$2:$A$12,A2))
In older versions of Excel it may need to be array-entered with CTRL+SHIFT+ENTER.
Assuming your dataset is ordered by "student name" (with unique student names), then "time", you could use:
Classical way, in F2:
=IF(AND(B2=1,B3=2,A2=A3),C2+C3,IF(AND(B2=2,B1=1,A2=A1),C2+C1,OFFSET($F$1,MATCH(A2,A$2:A2,0),0)))
Greedy way (Office365 needed), in E2; the trailing -3 removes the TIME values 1 and 2 that the filtered B column adds into the sum:
=SUM(FILTER($A$2:$C$12,($B$2:$B$12<=2)*($A$2:$A$12=A2)))-3
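For readers doing this in pandas rather than Excel, a minimal sketch of the same computation (the sample frame is reconstructed from the table above):
import pandas as pd

df = pd.DataFrame({
    'STUDENT': ['JOHN'] * 5 + ['JAMES'] * 6,
    'TIME': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6],
    'SCORE': [68, 78, 77, 91, 96, 66, 53, 80, 96, 50, 94],
})
# Sum SCORE at TIMEs 1 and 2 per student, then map the totals back onto every row
totals = df[df['TIME'].isin([1, 2])].groupby('STUDENT')['SCORE'].sum()
df['WANT'] = df['STUDENT'].map(totals)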

Analysis on dataframe with python

I want to be able to calculate the average 'goal', 'shot', and 'miss' per shooterName to use for further analysis and visualization.
The code below gives me the count of the 3 attributes (shot, goal, miss) in the 'event' column, grouped by 'shooterName'.
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' attribute as a percentage of the total 'event' count by 'shooterName'. In the desired output above I added a column '%totalshooterNameevents', which is simply each 'goal', 'shot', and 'miss' count divided by the total of the three per 'shooterName'.
Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].transform(lambda x: x / x.sum())
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
Without a sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np

# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
                   'event': np.random.choice(['shot', 'goal', 'miss'], 20)})

# Create an empty dataframe indexed by shooter
dft = pd.DataFrame(index=df['shooterName'].unique())

# Count events per shooter, then join the per-event shares
grp = df.groupby('shooterName')
dft['count'] = grp['event'].count()
dft = dft.join(grp['event'].value_counts().unstack('event')
                  .div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000
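A shorter route to the same shares, assuming the same toy df as above: SeriesGroupBy.value_counts accepts normalize=True, which computes the per-group fractions directly.
shares = df.groupby('shooterName')['event'].value_counts(normalize=True).unstack('event')
dft = df.groupby('shooterName').size().to_frame('count').join(shares)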

duplicating pandas dataframe rows n times where n is conditional upon a cell value [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 1 year ago.
Consider a pandas dataframe where the column value of some rows is a list of values:
import pandas as pd

df1 = {'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
       'ID': [62, 47, 55, 74, 31],
       'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C']}
df1 = pd.DataFrame(df1)
becoming
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 [A, B, C]
3 Todd 74 [A, B]
4 Chris 31 C
I need to transform the rows where df1.Subjects is a list, so that the list is exploded and distributed across row copies, such that the dataframe becomes something like:
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Todd 74 A
4 Chris 31 C
5 Paul 55 B
6 Paul 55 C
7 Todd 74 B
The index is not especially important, but df1.ID should be preserved when making the row copies.
Use explode and merge:
>>> pd.merge(df1.drop(columns='Subjects'),
df1['Subjects'].explode(),
left_index=True, right_index=True, how='outer') \
.reset_index(drop=True)
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Paul 55 B
4 Paul 55 C
5 Todd 74 A
6 Todd 74 B
7 Chris 31 C
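Note that DataFrame.explode (available since pandas 0.25) collapses this into one step, since scalar entries pass through unchanged:
>>> df1.explode('Subjects', ignore_index=True)  # ignore_index needs pandas >= 1.1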

How to compare row by row in a dataframe

I have a data frame that has a name and the URL ID of the name. For example:
Abc 123
Abc.com 123
Def 345
Pqr 123
PQR.com 123
Here, due to a data extraction error, different names sometimes have the same ID. I want to clean the table such that if the names are different but the ID is the same, the records remain unchanged, while if the names are similar and the ID is also the same, the names should be unified into one. To be clear,
The expected output should be:
Abc.com 123
Abc.com 123
Def 345
PQR.com 123
PQR.com 123
That is, the last pair was a data entry error and both were the same name (the first word of the string was the same), so they are changed to one name based on the ID.
But the first and second records, even though they have the same ID as the last ones, did not match them by name and were completely different, so they are kept as separate records.
I am not able to understand how to achieve this.
Request some guidance here. Thanks in advance.
Note: The size of the dataset is almost 16 million of such records.
The idea is to use the fuzzy matching library fuzzywuzzy to get a ratio for all combinations of names: cross join with DataFrame.merge, remove rows with the same name in both columns with DataFrame.query, and add a new column with the string lengths via Series.str.len:
from fuzzywuzzy import fuzz
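# Hypothetical sample frame, reconstructed from the printed outputs below
import pandas as pd

df = pd.DataFrame({'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
                   'ID': [123, 123, 345, 789, 789]})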
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
Then filter rows by the threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text, using DataFrameGroupBy.idxmax with DataFrame.loc and then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Last, Series.map by ID and replace non-matched values with the originals via Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
EDIT: If there are multiple valid strings per ID, it is more complicated:
print (df)
Name ID
0 Air Ordnance 1578013421
1 Air-Ordnance.com 1578013421
2 Garreett 1578013421
3 Garrett 1578013421
First get fuzz.ratio like in the solution before:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
2 Air Ordnance 1578013421 Garreett 30
3 Air Ordnance 1578013421 Garrett 32
4 Air-Ordnance.com 1578013421 Air Ordnance 79
6 Air-Ordnance.com 1578013421 Garreett 25
7 Air-Ordnance.com 1578013421 Garrett 26
8 Garreett 1578013421 Air Ordnance 30
9 Garreett 1578013421 Air-Ordnance.com 25
11 Garreett 1578013421 Garrett 93
12 Garrett 1578013421 Air Ordnance 32
13 Garrett 1578013421 Air-Ordnance.com 26
14 Garrett 1578013421 Garreett 93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
4 Air-Ordnance.com 1578013421 Air Ordnance 79
11 Garreett 1578013421 Garrett 93
14 Garrett 1578013421 Garreett 93
But for more precision it is necessary to specify which strings are valid in a list L, then filter by the list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
Name_x Name ID
4 Air-Ordnance.com Air Ordnance 1578013421
14 Garrett Garreett 1578013421
Last, merge with a left join to the original and replace missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
Name ID
0 Air-Ordnance.com 1578013421
1 Air-Ordnance.com 1578013421
2 Garrett 1578013421
3 Garrett 1578013421
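One caveat: with the ~16 million records mentioned in the question, the pairwise apply above is slow; rapidfuzz, a faster reimplementation of fuzzywuzzy with the same fuzz.ratio API, can be swapped in directly (a sketch):
from rapidfuzz import fuzz  # same fuzz.ratio signature as fuzzywuzzy
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)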

How to do mean centering on a dataframe in pandas?

I want to use two transformation techniques on a data frame, mean centering and standardization. How can I perform the mean centering method on my dataframe?
I have performed standardization using StandardScaler() from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
standard.iloc[:,1:-1] = StandardScaler().fit_transform(standard.iloc[:,1:-1])
I am expecting a transformed data frame which is mean-centered.
import pandas as pd

dataxx = {'Name': ['Tom', 'gik', 'Tom', 'Tom', 'Terry', 'Jerry', 'Abel', 'Dula', 'Abel'],
          'Age': [20, 21, 19, 18, 88, 89, 95, 96, 97],
          'gg': [1, 1, 1, 30, 30, 30, 40, 40, 40]}
dfxx = pd.DataFrame(dataxx)
# Subtract the column mean to center Age around zero
dfxx["meancentered"] = dfxx.Age - dfxx.Age.mean()
Index   Name  Age  gg  meancentered
0        Tom   20   1    -40.333333
1        gik   21   1    -39.333333
2        Tom   19   1    -41.333333
3        Tom   18  30    -42.333333
4      Terry   88  30     27.666667
5      Jerry   89  30     28.666667
6       Abel   95  40     34.666667
7       Dula   96  40     35.666667
8       Abel   97  40     36.666667
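To mean-center every numeric column at once (a minimal sketch mirroring the question's standard.iloc[:, 1:-1] slice):
num = dfxx.select_dtypes('number')
dfxx[num.columns] = num - num.mean()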
