Changing multiindex in a pandas series? - python-3.x

I have a dataframe like this:
mainid pidl pidw score
0 Austria 1 533
1 Canada 2 754
2 Canada 3 267
3 Austria 4 852
4 Taiwan 5 124
5 Slovakia 6 344
6 Spain 7 1556
7 Taiwan 8 127
I want to select the top 5 scores (with their pidw) for each pidl.
After grouping by column 'pidl' and sorting the scores in descending order within each group, I got the following series, s:
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5)
pidl pidl pidw score
Austria Austria 49 948
47 859
48 855
50 807
46 727
Belgium Belgium 15 2339
14 1861
45 1692
16 1626
46 1423
Name: score, dtype: float64
The result looks correct, but I wish I could remove the duplicated 'pidl' level from this series' index.
I have tried
s.reset_index('pidl')
which raises 'ValueError: The name location occurs multiple times, use a level number',
and
s.to_frame().reset_index()
which raises 'ValueError: cannot insert pidl, already exists',
so I am not sure how to proceed.

Use the group_keys=False parameter in DataFrame.groupby:
s= df.set_index(['pidl', 'pidw']).groupby('pidl', group_keys=False)['score'].nlargest(5)
print (s)
pidl pidw
Austria 4 852
1 533
Canada 2 754
3 267
Slovakia 6 344
Spain 7 1556
Taiwan 8 127
5 124
Name: score, dtype: int64
Or add Series.droplevel to remove the first level (pandas counts levels from 0, so 0 is used):
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5).droplevel(0)
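As a minimal, runnable sketch (the small DataFrame below is made up to mirror the question's columns; mainid is omitted), both variants give a Series indexed only by pidl and pidw:
import pandas as pd

# Illustrative data with the question's column names
df = pd.DataFrame({
    'pidl':  ['Austria', 'Canada', 'Canada', 'Austria', 'Taiwan', 'Slovakia', 'Spain', 'Taiwan'],
    'pidw':  [1, 2, 3, 4, 5, 6, 7, 8],
    'score': [533, 754, 267, 852, 124, 344, 1556, 127],
})

# Variant 1: keep the group keys out of the result index
s1 = (df.set_index(['pidl', 'pidw'])
        .groupby('pidl', group_keys=False)['score']
        .nlargest(5))

# Variant 2: drop the duplicated outer level afterwards
s2 = (df.set_index(['pidl', 'pidw'])
        .groupby('pidl')['score']
        .nlargest(5)
        .droplevel(0))

print(s1)
print(s2)  # same values, both indexed by (pidl, pidw) only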

Related

For every row with the same name, find the 5 lowest number Pandas Python

My data frame looks like this:
Location week Number
Austria 1 154
Austria 2 140
Belgium 1 139
Bulgaria 2 110
Bulgaria 1 164
The solution should look like this:
Location week Number
Austria 3 100
Austria 2 101
Austria 1 102
Bulgaria 2 100
Bulgaria 3 101
Bulgaria 1 102
This means that I need to display:
Column 1: the countries, grouped by name
Column 2: the week (every country has 53 weeks assigned to it)
Column 3: the numbers that occurred in each of the 53 weeks, in ascending order
I cannot get my head around this.
Sort the rows in the order you like (here by Location and Number) and take the first 5 rows per group with groupby + head:
df.sort_values(by=['Location', 'Number']).groupby('Location').head(5)
output:
Location week Number
0 Austria 3 100
1 Austria 2 101
2 Austria 1 102
3 Bulgaria 2 100
4 Bulgaria 3 101
5 Bulgaria 1 102
Another way, using .cumcount() and .loc:
con = df.sort_values('Number',ascending=True).groupby('Location').cumcount()
df.loc[con.lt(5)]
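A small, self-contained sketch of both approaches (the numbers below are made up; only the column names come from the question):
import pandas as pd

df = pd.DataFrame({
    'Location': ['Austria'] * 6 + ['Bulgaria'] * 6,
    'week':     list(range(1, 7)) * 2,
    'Number':   [154, 140, 100, 170, 165, 180, 110, 164, 100, 101, 150, 160],
})

# Approach 1: sort, then take the first 5 rows of each group
top5_a = df.sort_values(['Location', 'Number']).groupby('Location').head(5)

# Approach 2: rank rows within each group with cumcount, then filter
order = df.sort_values('Number').groupby('Location').cumcount()
top5_b = df.loc[order.lt(5)]

print(top5_a)
print(top5_b.sort_values(['Location', 'Number']))  # same rows, different row order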

How to find the max value in a column and return that name from another column

I want to find the building with the maximum number of floors and return that building's name.
I use:
dframe.loc[dframe[15].idxmax()] and I get this error: AttributeError: 'str' object has no attribute 'loc'
I also get TypeError: reduction operation 'argmax' not allowed for this dtype
The number of floors is in column 15 and the name of the building is in column 2. Any direction about how to approach this problem is helpful. Thanks!
Expected output would be the row where column 15 holds the max value, showing the building name from column 2.
Sample Data
0 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40 41
42 56 2018 HILTON SEATTLE NonResidential 7802920020 1301 6TH AVE SEATTLE WA 98101 47.60946 ... NaN 2689945 9178092 62538 6253815 0 356.6 2.8 Compliant No Issue
43 57 2018 5TH & PINE NonResidential 1975700200 1513 5TH AVE SEATTLE WA 98101 47.6113 ... 493 2671369 9114711 0 0 0 24.3 0.1 Compliant No Issue
44 58 2018 CENTURY SQUARE RETAIL NonResidential 1975700365 1525 4TH AVE SEATTLE WA 98101 47.61076 ... NaN 195653 667569 3756 375626 0 21.7 0.4 Compliant No Issue
46 60 2018 MANN BUILDING/WILD GINGER/TRIPLE DOOR NonResidential 1975700525 1401 3RD AVE SEATTLE WA 98101 47.60886 ... 5459 1338469 4566856 110816
input:
dframe[14].dtype
output:
dtype('O')
input:
dframe[14] = dframe[14].astype(int)
input:
dframe[14].dtype
output:
dtype('int64')
input:
print(dframe.loc[dframe[14].idxmax()][2])
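The same idea as a minimal sketch, using made-up building names and column positions rather than the asker's actual file:
import pandas as pd

# Columns are positional integers, as in the question: 2 = building name, 14 = floors
dframe = pd.DataFrame({
    2:  ['HILTON SEATTLE', '5TH & PINE', 'CENTURY SQUARE RETAIL'],
    14: ['10', '42', '7'],  # read in as strings, so dtype is object
})

# idxmax needs a numeric column, so convert and assign back first
dframe[14] = dframe[14].astype(int)

# Row label of the max floor count, then the building name in column 2
print(dframe.loc[dframe[14].idxmax(), 2])  # -> 5TH & PINE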

How can I find the sum of certain columns and find avg of other columns in python?

I've combined 10 Excel files, each with 1 year of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but there are others (passer rating and QBR) that I'd like to see the average for.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code, the actual dataset (or a sample of it), etc., so that users can reproduce the error or issue. No one is really going to take the time to reconstruct your dataset from a photo (or I should say rarely, as I did do it here...because I love working with sports data and could grab it relatively quickly).
But to get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
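For example, a minimal sketch of mixing aggregations in a single agg call (column names are taken from the question; the numbers are made up):
import pandas as pd

df3 = pd.DataFrame({
    'Player': ['Tom Brady', 'Tom Brady', 'Drew Brees', 'Drew Brees'],
    'GS':     [16, 16, 15, 16],
    'TD':     [28, 36, 32, 23],
    'Rate':   [102.2, 112.2, 101.0, 115.7],
    'QBR':    [67.0, 77.0, 62.0, 72.0],
})

# Sum the counting stats, average the rating-style stats
df3 = df3.groupby('Player', as_index=False).agg(
    {'GS': 'sum', 'TD': 'sum', 'Rate': 'mean', 'QBR': 'mean'})
print(df3)
The full answer code below pulls each season from pro-football-reference and applies the same idea, with 'QBR' aggregated by 'mean':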
import pandas as pd
df = pd.DataFrame()
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    df = df.append(pd.read_html(url)[0], sort=False)
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '')
df['Player'] = df['Player'].str.strip()
strCols = ['Player','Tm', 'Pos', 'QBrec']
numCols = [ x for x in df.columns if x not in strCols ]
df[['QB_wins','QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)
df3 = df.groupby(['Player'],as_index=False).agg({'GS':'sum', 'TD':'sum', 'QBR':'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000

Create lag features based on multiple columns

I have a time series dataset and I need to extract lag features. I am using the code below but got all NaNs:
df.groupby(['week','id1','id2','id3'],as_index=False)['value'].shift(1)
input
week,id1,id2,id3,value
1,101,123,001,45
1,102,231,004,89
1,203,435,099,65
2,101,123,001,48
2,102,231,004,75
2,203,435,099,90
output
week,id1,id2,id3,value,t-1
1,101,123,001,45,NAN
1,102,231,004,89,NAN
1,203,435,099,65,NAN
2,101,123,001,48,45
2,102,231,004,75,89
2,203,435,099,90,65
You want to shift to the next week, so remove 'week' from the grouping:
df['t-1'] = df.groupby(['id1','id2','id3'],as_index=False)['value'].shift()
# week id1 id2 id3 value t-1
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
That approach is error-prone when weeks are missing. In that case we can merge after incrementing the week, which ensures the lag comes from the actual prior week regardless of missing weeks.
df2 = df.assign(week=df.week+1).rename(columns={'value': 't-1'})
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
Another way to bring in and rename many columns at once is to use the suffixes argument of merge. This renames all overlapping columns (that are not keys) in the right DataFrame.
df.merge(df.assign(week=df.week+1),  # Manually lag
         on=['week', 'id1', 'id2', 'id3'],
         how='left',
         suffixes=['', '_lagged'])   # Right df columns -> _lagged
# week id1 id2 id3 value value_lagged
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
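To illustrate why the merge-based approach is safer, here is a small sketch with made-up data in which week 2 is missing for one id group:
import pandas as pd

df = pd.DataFrame({
    'week':  [1, 3],          # week 2 is missing for this id combination
    'id1':   [101, 101],
    'id2':   [123, 123],
    'id3':   [1, 1],
    'value': [45, 48],
})

# groupby + shift just takes the previous row, so week 3 gets week 1's value
df['t-1_shift'] = df.groupby(['id1', 'id2', 'id3'])['value'].shift()

# merging on week+1 only matches the actual prior week, so week 3 stays NaN
df2 = (df[['week', 'id1', 'id2', 'id3', 'value']]
       .assign(week=df['week'] + 1)
       .rename(columns={'value': 't-1_merge'}))
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
print(df)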

Conditional date join in python Pandas

I have two pandas dataframes: matches with columns (match_id, team_id, date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id and teams_att.date closest to matches.date
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date'),
              on='date', by='team_id',
              direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
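If you also want to see teams_att's own date in the result (as in the desired results above), one option is to copy it into a helper column before merging, since merge_asof keeps only the left frame's 'date'. This is a sketch, assuming both date columns are already datetimes and that the helper name teams_att_date is free to use:
# Copy teams_att's date so it survives the asof merge
result = pd.merge_asof(
    matches.sort_values('date'),
    teams_att.assign(teams_att_date=teams_att['date']).sort_values('date'),
    on='date', by='team_id', direction='nearest',
).rename(columns={'date': 'matches_date'})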
We are using merge_asof per group (please check Scott's answer, that is the right way to solve this type of problem :-) cheers).
g1 = df1.groupby('team_id')
g = df.groupby('team_id')
l = []
for x in [101, 102]:
    l.append(pd.merge_asof(g.get_group(x), g1.get_group(x), on='date', direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44
