Merge matching rows in Excel & summarize matching columns - Excel

Looking to merge some data and summarize the results. I've been poking around Google but haven't found anything that will match up duplicates and summarize them.
The first table below is what I'm starting with; I would like the output shown in the second.
Starting data:
Street          Name            Widgets  Sprockets  Nuts  Bolts
123 Any street  ACB Co          10       248        2     50
123 Any street  Bob's plumbing  25       22         2     7
456 Another st  Bill's cars     55                  5
123 Any street  ACB Co                   54         4     6
456 Another st  Bill's cars     7        878        8     55
789 Ave         Shelley and co  5        2          2     78
123 Any street  ACB Co                   544        4     22
456 Another st  ACB Co          6        50         5
789 Ave         Divers down     6        90         9     4
789 Ave         Divers down     1                   1     7

Desired output (duplicate Street/Name rows merged, numeric columns summed):
Street          Name            Widgets  Sprockets  Nuts  Bolts
123 Any street  ACB Co          10       846        10    78
123 Any street  Bob's plumbing  25       22         2     7
456 Another st  Bill's cars     62       878        13    55
789 Ave         Shelley and co  5        2          2     78
789 Ave         Divers down     7        90         10    11
456 Another st  ACB Co          6        50         5

Use Pivot Tables and set the layout to tabular.
Details can be found here: https://www.youtube.com/watch?v=LkFPBn7sgEc
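If you would rather script it, the same duplicate-matching and summing can be done in pandas. A minimal sketch, assuming the data has been loaded with the column names from the tables above ('widgets.xlsx' is a placeholder file name):

import pandas as pd

# Read the raw rows from the spreadsheet (placeholder file name).
df = pd.read_excel('widgets.xlsx')

# Collapse duplicate (Street, Name) pairs and sum the numeric columns.
out = (df.groupby(['Street', 'Name'], as_index=False)
         [['Widgets', 'Sprockets', 'Nuts', 'Bolts']]
         .sum())
print(out)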

Related

Python: Groupby and sum respective rows and update dataframe column

Input df:
Store Category Item tot_table
11 AA Apple 13.5
11 AA Orange 13.5
11 BB Potato 11.5
11 BB Carrot 11.5
12 AA Apple 10
12 BB Potato 9
12 BB Carrot 9
I need to perform df.groupby('Store')['tot_table'].unique().sum(), but this line of code doesn't work out: it aggregates the column instead of writing each store's total back onto every row.
Expected output df:
Store Category Item split_table tot_table
11 AA Apple 13.5 25
11 AA Orange 13.5 25
11 BB Potato 11.5 25
11 BB Carrot 11.5 25
12 AA Apple 10 19
12 BB Potato 9 19
12 BB Carrot 9 19
You can use groupby.transform with unique/sum:
df['tot_table'] = (df.groupby('Store')['tot_table']
                     .transform(lambda s: s.unique().sum()))
output:
Store Category Item tot_table
0 11 AA Apple 25.0
1 11 AA Orange 25.0
2 11 BB Potato 25.0
3 11 BB Carrot 25.0
4 12 AA Apple 19.0
5 12 BB Potato 19.0
6 12 BB Carrot 19.0
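To reproduce the expected output exactly, keeping the original per-row value as split_table and putting the group total in tot_table, you could rename the column first (a small sketch in the same spirit, not part of the original answer):

df = df.rename(columns={'tot_table': 'split_table'})
df['tot_table'] = (df.groupby('Store')['split_table']
                     .transform(lambda s: s.unique().sum()))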

Changing multiindex in a pandas series?

I have a dataframe like this:
mainid pidl pidw score
0 Austria 1 533
1 Canada 2 754
2 Canada 3 267
3 Austria 4 852
4 Taiwan 5 124
5 Slovakia 6 344
6 Spain 7 1556
7 Taiwan 8 127
I want to select the top 5 pidw values for each pidl.
When I grouped by column 'pidl' and sorted the score in descending order within each group, I got the following series, s:
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5)
pidl pidl pidw score
Austria Austria 49 948
47 859
48 855
50 807
46 727
Belgium Belgium 15 2339
14 1861
45 1692
16 1626
46 1423
Name: score, dtype: float64
The result looks correct, but I wish I could remove the second 'pidl' level from this series.
I have tried
s.reset_index('pidl')
which gives ValueError: The name pidl occurs multiple times, use a level number.
and
s.to_frame().reset_index()
which gives ValueError: cannot insert pidl, already exists.
so I am not sure how to proceed.
Use the group_keys=False parameter in DataFrame.groupby, so the group labels are not added as an extra index level:
s = df.set_index(['pidl', 'pidw']).groupby('pidl', group_keys=False)['score'].nlargest(5)
print (s)
pidl pidw
Austria 4 852
1 533
Canada 2 754
3 267
Slovakia 6 344
Spain 7 1556
Taiwan 8 127
5 124
Name: score, dtype: int64
Or add Series.droplevel to remove the first level (pandas counts from 0, so use 0):
s = df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5).droplevel(0)
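An alternative sketch that avoids the MultiIndex entirely: sort by score and take the first five rows per group, which keeps the original flat columns (a different idiom from the answer above):

top5 = (df.sort_values('score', ascending=False)
          .groupby('pidl')
          .head(5))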

Add a column from one dataframe to another dataframe with the same rows

I have a dataframe (df) that contains 30 000 rows
id Name Age
1 Joey 22
2 Anna 34
3 Jon 33
4 Amy 30
5 Kay 22
And another dataframe (df2) that contains the same columns, but with some ids missing:
id Name Age Sport
Jon 33 Tennis
5 Kay 22 Football
Joey 22 Basketball
4 Amy 30 Running
Anna 42 Dancing
I want the missing ids to appear in df2, filled in from the corresponding name.
df2:
id Name Age Sport
3 Jon 33 Tennis
5 Kay 22 Football
1 Joey 22 Basketball
4 Amy 30 Running
2 Anna 42 Dancing
Can someone help? I am new to pandas and dataframes.
You can use .map with .fillna:
import numpy as np

# Blank ids -> NaN, then fill them by looking up the Name in df1.
df2['id'] = (df2['id'].replace('', np.nan)
                      .fillna(df2['Name'].map(df1.set_index('Name')['id']))
                      .astype(int))
print(df2)
print(df2)
id Name Age Sport
0 3 Jon 33 Tennis
1 5 Kay 22 Football
2 1 Joey 22 Basketball
3 4 Amy 30 Running
4 2 Anna 42 Dancing
First, join the two dataframes with pd.merge based on your keys; I suppose the keys are 'Name' and 'Age' in this case. Then replace the null id values in df2, using np.where and .isnull() to find the null values.
df3 = pd.merge(df2, df1, on=['Name', 'Age'], how='left')
df2['id'] = np.where(df3.id_x.isnull(), df3.id_y, df3.id_x).astype(int)
id name age sport
0 1 Joey 22 Tennis
1 2 Anna 34 Football
2 3 Jon 33 Basketball
3 4 Amy 30 Running
4 5 Kay 22 Dancing
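One caveat grounded in the sample data: Anna's age differs between the two frames (34 in df, 42 in df2), so a merge on both 'Name' and 'Age' will not match that row. Since 'Name' uniquely identifies each person here, a plain lookup sketch sidesteps the problem:

# Map every Name to its id from the complete frame; rows that already
# had an id are simply rewritten with the same value.
df2['id'] = df2['Name'].map(df1.set_index('Name')['id'])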

How to find the max value in a column and return that name from another column

I want to find the building with the maximum number of floors and return that building's name.
I use:
dframe.loc[dframe[15].idxmax()] and I get this error: AttributeError: 'str' object has no attribute 'loc'
I also get TypeError: reduction operation 'argmax' not allowed for this dtype
The number of floors is in column 15 and the name of the building is in column 2. Any direction about how to approach this problem is helpful. Thanks!
Expected output would be the row with building name in column 2 where the max value is in column 15
Sample Data
0 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40 41
42 56 2018 HILTON SEATTLE NonResidential 7802920020 1301 6TH AVE SEATTLE WA 98101 47.60946 ... NaN 2689945 9178092 62538 6253815 0 356.6 2.8 Compliant No Issue
43 57 2018 5TH & PINE NonResidential 1975700200 1513 5TH AVE SEATTLE WA 98101 47.6113 ... 493 2671369 9114711 0 0 0 24.3 0.1 Compliant No Issue
44 58 2018 CENTURY SQUARE RETAIL NonResidential 1975700365 1525 4TH AVE SEATTLE WA 98101 47.61076 ... NaN 195653 667569 3756 375626 0 21.7 0.4 Compliant No Issue
46 60 2018 MANN BUILDING/WILD GINGER/TRIPLE DOOR NonResidential 1975700525 1401 3RD AVE SEATTLE WA 98101 47.60886 ... 5459 1338469 4566856 110816
The TypeError comes from the floors column (dframe[14] here) being stored as strings (dtype object), so idxmax cannot reduce it; cast the column to int and the lookup works:
input:
dframe[14].dtype
output:
dtype('O')
input:
dframe[14] = dframe[14].astype(int)
input:
dframe[14].dtype
output:
dtype('int64')
input:
print(dframe.loc[dframe[14].idxmax()][2])
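If the column contains missing values, astype(int) will raise, so a more forgiving variant is pd.to_numeric with errors='coerce' (a sketch assuming, as above, that the floors are in column 14 and the building name in column 2):

import pandas as pd

# Non-numeric entries become NaN instead of raising, and idxmax skips NaN.
floors = pd.to_numeric(dframe[14], errors='coerce')
print(dframe.loc[floors.idxmax(), 2])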

How can I find the sum of certain columns and find avg of other columns in python?

I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but there are others (passer rating and QBR) that I'd like to see the average for.
df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Cmp%': 'sum', 'Yds': 'sum',
     'TD': 'sum', 'TD%': 'sum', 'Int': 'sum', 'Int%': 'sum', 'Y/A': 'sum',
     'AY/A': 'sum', 'Y/C': 'sum', 'Y/G': 'sum', 'Rate': 'sum', 'QBR': 'sum',
     'Sk': 'sum', 'Yds.1': 'sum', 'NY/A': 'sum', 'ANY/A': 'sum', 'Sk%': 'sum',
     '4QC': 'sum', 'GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the issue. Rarely will anyone take the time to reconstruct your dataset from a photo (I did here only because I love working with sports data and could grab it relatively quickly).
But to get averages instead of sums, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd

df = pd.DataFrame()
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    df = df.append(pd.read_html(url)[0], sort=False)

df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '')
df['Player'] = df['Player'].str.strip()

strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
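If you want to keep the full set of columns from the original agg call, the dictionary can be built programmatically instead of typed out, switching just the rating-style columns to 'mean' (a sketch assuming 'Rate' and 'QBR' are the averaged columns; the percentage columns questioned above could be moved to 'mean' the same way):

# Sum every numeric column except the two rating columns, which get a mean.
aggMap = {col: ('mean' if col in ('Rate', 'QBR') else 'sum') for col in numCols}
df3 = df.groupby(['Player'], as_index=False).agg(aggMap)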
