I would like to add a column with normalized rankings to a pandas dataframe. The process is as follows:
Import the pandas package first.
#import packages
import pandas as pd
Define a pandas dataframe.
# Create dataframe
data = {'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
After the dataframe is created, I want to add an extra column to it. This column contains the rank based on the values in the coverage column, computed separately for every name.
df['coverageRank'] = df.groupby('name')['coverage'].rank()
print (df)
coverage name reports coverageRank
0 25 Jason 4 1.0
1 94 Jason 24 2.0
2 57 Tina 31 1.0
3 62 Tina 2 2.0
4 70 Tina 3 3.0
I now want to normalize the values in the ranking column.
The desired output is
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
Does someone know a way to do this without using an explicit for-loop?
You can use transform('size') to get a Series with the same length as the original df that holds the size of each row's group, and then divide the ranks by it with div:
a = df.groupby('name')['coverage'].transform('size')
print (a)
0 2
1 2
2 3
3 3
4 3
Name: coverage, dtype: int64
df['coverageRank'] = df.groupby('name')['coverage'].rank().div(a)
print (df)
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
Another solution with apply:
# group_keys=False keeps the original row index so the result aligns with df
df['coverageRank'] = df.groupby('name', group_keys=False)['coverage'].apply(lambda x: x.rank() / len(x))
print (df)
coverage name reports coverageRank
0 25 Jason 4 0.500000
1 94 Jason 24 1.000000
2 57 Tina 31 0.333333
3 62 Tina 2 0.666667
4 70 Tina 3 1.000000
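A shorter variant that may be worth trying is the pct flag of rank, which returns each rank as a fraction of the group's count. This is only a sketch on the question's sample data; note that pct divides by the number of non-null values, so it can differ from the size-based solutions above if coverage contains NaN.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
                   'reports': [4, 24, 31, 2, 3],
                   'coverage': [25, 94, 57, 62, 70]})

# pct=True expresses each rank as a fraction of the group's non-null count
df['coverageRank'] = df.groupby('name')['coverage'].rank(pct=True)
print(df)
```

On this sample it gives the same normalized ranks as the transform and apply solutions.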
I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in the interview.
This was the exact question asked:
Create a dataset with 3 columns: store_id, date, sales.
Create 3 store_ids.
Each store_id has 3 consecutive dates.
Sales are recorded for 9 rows.
We are considering the same 9 dates across all stores.
Sales can be any random number.
Write a function that fetches the previous day’s sales as output once we give store_id & date as input
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
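Since the interview question literally asks for a function taking store_id and date, here is a minimal sketch of one. The function name previous_day_sales and the sample values are my own assumptions, and it returns None when there is no row for the previous calendar day.

```python
import pandas as pd

# Hypothetical sample data following the layout of the example above
store_df = pd.DataFrame({
    'store_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime(['2023-01-02', '2023-01-02', '2023-01-02',
                            '2023-01-03', '2023-01-03', '2023-01-03',
                            '2023-01-04', '2023-01-05', '2023-01-04']),
    'sales': [11, 22, 33, 44, 55, 66, 77, 88, 99],
})

def previous_day_sales(frame, store_id, date):
    """Return the sales of store_id on the calendar day before date, or None."""
    prev_day = pd.Timestamp(date) - pd.Timedelta(days=1)
    # Boolean mask: same store and a row dated exactly one day earlier
    mask = (frame['store_id'] == store_id) & (frame['date'] == prev_day)
    match = frame.loc[mask, 'sales']
    return None if match.empty else match.iloc[0]

print(previous_day_sales(store_df, 1, '2023-01-03'))
print(previous_day_sales(store_df, 2, '2023-01-05'))
```

The second call has no data for 2023-01-04 for store 2, so it falls into the None branch. For many lookups the merge above is likely the better fit; this version is just the direct answer to the question as asked.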
I've combined 10 Excel files, each with one year of NFL passing stats. I have summed certain columns (games played, completions, attempts, etc.), but for others (passer rating and QBR) I'd like to see the average instead.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the issue. People will rarely take the time to rebuild your dataset from a photo (I did here, because I love working with sports data and could grab it relatively quickly).
But to get averages instead of sums, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd
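# Collect one table per season, then concatenate (DataFrame.append was removed in pandas 2.0)
frames = []
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    frames.append(pd.read_html(url)[0])
df = pd.concat(frames, sort=False)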
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()
strCols = ['Player','Tm', 'Pos', 'QBrec']
numCols = [ x for x in df.columns if x not in strCols ]
df[['QB_wins','QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)
df3 = df.groupby(['Player'],as_index=False).agg({'GS':'sum', 'TD':'sum', 'QBR':'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
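If you want to try the sum/mean mix without scraping anything, here is a small offline sketch. The two players and all of the numbers are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical two-season sample; the values are invented for illustration
stats = pd.DataFrame({
    'Player': ['A', 'A', 'B', 'B'],
    'GS': [16, 14, 10, 12],
    'TD': [30, 25, 12, 18],
    'Rate': [100.0, 90.0, 80.0, 85.0],
    'QBR': [70.0, 60.0, 50.0, 55.0],
})

# Counting stats are summed, rating-style stats are averaged
totals = stats.groupby('Player', as_index=False).agg(
    {'GS': 'sum', 'TD': 'sum', 'Rate': 'mean', 'QBR': 'mean'}
)
print(totals)
```

Keep in mind that 'mean' here is an unweighted average of the per-season values, so a season with very few attempts counts as much as a full one.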
I have this dataframe (df) in python:
Cumulative sales
0 12
1 28
2 56
3 87
I want to create a new column containing the number of new sales (N-(N-1)), as below:
Cumulative sales New Sales
0 12 12
1 28 16
2 56 28
3 87 31
You can do
df['new sale']=df.Cumulativesales.diff().fillna(df.Cumulativesales)
df
Cumulativesales new sale
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
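If the column name really contains a space, as in the question, attribute access like df.Cumulativesales will not work and bracket indexing is needed. A small sketch of the same idea under that assumption, with an optional cast back to integers (safe here because fillna removes the only NaN):

```python
import pandas as pd

df = pd.DataFrame({'Cumulative sales': [12, 28, 56, 87]})
col = df['Cumulative sales']

# diff() leaves NaN in the first row, so fill it with the first cumulative value
df['New Sales'] = col.diff().fillna(col).astype(int)
print(df)
```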
Do this:
df['New_sales'] = df['Cumlative_sales'].diff()
df.fillna(df.iloc[0]['Cumlative_sales'], inplace=True)
print(df)
Output:
Cumlative_sales New_sales
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
I want to join two data frames on specific indices, as per the map (dictionary) I have created. What is an efficient way to do this?
Data:
df = pd.DataFrame({"a":[10, 34, 24, 40, 56, 44],
"b":[95, 63, 74, 85, 56, 43]})
print(df)
a b
0 10 95
1 34 63
2 24 74
3 40 85
4 56 56
5 44 43
df1 = pd.DataFrame({"c":[1, 2, 3, 4],
"d":[5, 6, 7, 8]})
print(df1)
c d
0 1 5
1 2 6
2 3 7
3 4 8
d = {
(1,0):0.67,
(1,2):0.9,
(2,1):0.2,
(2,3):0.34,
(4,0):0.7,
(4,2):0.5
}
Desired Output:
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.9
...
5 56 56 3 7 0.5
I'm able to achieve this but it takes a lot of time since my original data frames' map has about 4.7M rows to map. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.
My Approach:
matched_rows = []
for key in d.keys():
    # Concatenate the row from df, the row from df1 and the ratio for this pair
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows, columns = df.columns.tolist() + df1.columns.tolist() + ['ratio'])
I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary and convert it to a DataFrame, use DataFrame.join with both data frames, and finally remove the first 2 columns by position:
df = (pd.Series(d).reset_index(name='ratio')
.join(df, on='level_0')
.join(df1, on='level_1')
.iloc[:, 2:])
print (df)
ratio a b c d
0 0.67 34 63 1 5
1 0.90 34 63 3 7
2 0.20 24 74 2 6
3 0.34 24 74 4 8
4 0.70 56 56 1 5
5 0.50 56 56 3 7
And then if necessary reorder columns:
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print (df)
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.90
2 24 74 2 6 0.20
3 24 74 4 8 0.34
4 56 56 1 5 0.70
5 56 56 3 7 0.50
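The join matches on index labels, which is the same as position here because both frames have a default RangeIndex. If your real frames have some other index and the dictionary keys are positions (as the iloc in the question suggests), a purely positional variant could look like this sketch; the names left_pos, right_pos and result are mine:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
df1 = pd.DataFrame({"c": [1, 2, 3, 4],
                    "d": [5, 6, 7, 8]})
d = {(1, 0): 0.67, (1, 2): 0.9, (2, 1): 0.2,
     (2, 3): 0.34, (4, 0): 0.7, (4, 2): 0.5}

# Split the dict keys into two lists of row positions
left_pos = [k[0] for k in d]
right_pos = [k[1] for k in d]

# Take both sets of rows in one shot and glue them side by side
result = pd.concat(
    [df.iloc[left_pos].reset_index(drop=True),
     df1.iloc[right_pos].reset_index(drop=True)],
    axis=1,
)
result['ratio'] = list(d.values())
print(result)
```

This also yields the columns in the a, b, c, d, ratio order directly, so no reordering step is needed. I have not timed it against the join on 4.7M pairs.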
So the data frame is
computer status count
A on 45
off 44
B on 34
off 32
rmt_off 12
C on 23
off 23
rmt_off 2
I performed
df.set_index('status').T
which gave me
status on off on off rmt_off on off rmt_off
computer A B C
Count 45 45 34 32 12 23 23 2
Expected Output:
Computer On off Rmt_off
A 45 45 NaN
B 34 32 12
C 23 23 2
How can I present the values like this?
Are there any built-in functions available?
Use unstack if the DataFrame has a MultiIndex and a single column:
print (df.index)
MultiIndex(levels=[['A', 'B', 'C'], ['off', 'on', 'rmt_off']],
labels=[[0, 0, 1, 1, 1, 2, 2, 2], [1, 0, 1, 0, 2, 1, 0, 2]],
names=['computer', 'status'])
print (df['count'].unstack())
status off on rmt_off
computer
A 44.0 45.0 NaN
B 32.0 34.0 12.0
C 23.0 23.0 2.0
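EDIT: You need to replace the empty strings with NaNs and forward fill them, then use pivot:
df['computer'] = df['computer'].mask(df['computer'] == '').ffill()
# Keyword arguments, since recent pandas versions no longer accept positional ones here
df = df.pivot(index='computer', columns='status', values='count')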
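Try fixing your dataframe with replace and ffill first; then pivot can be applied to your original df to change it from long format to wide:
import numpy as np
df=df.replace('',np.nan).ffill()
# Keyword arguments, since recent pandas versions no longer accept positional ones here
df.pivot(index='computer', columns='status', values='count')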
Out[437]:
status off on rmt_off
computer
A 44.0 45.0 NaN
B 32.0 34.0 12.0
C 23.0 23.0 2.0
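For reference, here is a self-contained sketch of the fill-then-pivot approach. It assumes the question's frame is flat, with blank strings in the computer column for the repeated rows; that layout is my guess from how the table is printed.

```python
import pandas as pd

# Assumed flat layout of the question's table, with blank 'computer' cells
df = pd.DataFrame({
    'computer': ['A', '', 'B', '', '', 'C', '', ''],
    'status': ['on', 'off', 'on', 'off', 'rmt_off', 'on', 'off', 'rmt_off'],
    'count': [45, 44, 34, 32, 12, 23, 23, 2],
})

# mask turns the blank strings into NaN, then ffill copies each label down
df['computer'] = df['computer'].mask(df['computer'] == '').ffill()

# Long to wide: one row per computer, one column per status
wide = df.pivot(index='computer', columns='status', values='count')
print(wide)
```

Computer A has no rmt_off row, so that cell comes out as NaN and the columns become floats.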
Try df.unstack(level=1).
df.unstack(level=1)
Out[84]:
count
off on
A NaN NaN
B NaN NaN
C NaN NaN