How to replace a specific column's data with its z-scores in a data set in Python? - python-3.x

I have a numeric dataset and I want to calculate the z-score for the 'KM' column and replace the original values with the z-score values. I'm new to Python; please help.
KM CC Doors Gears Quarterly_Tax Weight Guarantee_Period
46986 2000 3 5 210 1165 3
72937 2000 3 5 210 1165 3
38500 2000 3 5 210 1170 3
31461 1800 3 6 100 1185 12
32189 1800 3 6 100 1185 3
23000 1800 3 6 100 1185 3
18739 1800 3 6 100 1185 3
34000 1800 3 5 100 1185 3
21716 1600 3 5 85 1105 18
64359 1600 3 5 85 1105 3
67660 1600 3 5 85 1105 3
43905 1600 3 5 100 1170 3

Something like this should do it for you (pass the whole column to stats.zscore; applying it element-wise with Series.apply would not work):
from scipy import stats
df["KM"] = stats.zscore(df["KM"])

Related

how to map two dataframes on condition while having different rows

I have two dataframes that need to be mapped (or joined?) based on some condition. These are the dataframes:
df_1
img_names img_array
0 1_rel 253
1 1_rel_right 255
2 1_rel_top 250
3 4_rel 180
4 4_rel_right 182
5 4_rel_top 189
6 7_rel 217
7 7_rel_right 183
8 7_rel_top 196
df_2
List_No time
0 1 38
1 4 23
2 7 32
After mapping I would like to get the following dataframe:
df_3
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
Basically, each row of df_2 is repeated 3 times to match the number of rows in df_1, and the mapping (if we can call it that) is done by splitting the string in each row of df_1's img_names column. The entries in img_names may have different names, but each of them always starts with some number (1, 4, 7 in this case) followed by an underscore, etc. So I need to split off the corresponding number in each row and map it to the matching row of List_No.
I hope the example above is clear.
Thank you.
Looks like you can just extract the digit parts and merge:
df_1['List_No'] = df_1['img_names'].str.split('_').str[0].astype(int)
df_3 = df_1.merge(df_2, on='List_No')
Output:
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
An alternative to #QuangHoang's answer (which I believe you should pick, as it is more robust). This uses the map method, and assumes every number extracted from df_1's img_names appears in df_2's List_No:
df_1.assign(
    List_No=df_1.img_names.str.extract(r"(\d)", expand=False).astype(int),
    time=lambda x: x.List_No.map(df_2.set_index("List_No")["time"]),
)
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
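A small aside, not part of either answer: if the leading numbers could ever have more than one digit, the split-based line above already handles that, but the extract-based version would need \d+ rather than a single \d, for example:
df_1['List_No'] = df_1['img_names'].str.extract(r"(\d+)", expand=False).astype(int)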

How can I find the sum of certain columns and the average of other columns in Python?

I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (Games played, Completions, Attempts, etc.) that I have summed, but there are others (Passer rating and QBR) for which I'd like to see the average instead.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code, the actual dataset (or a sample of it), etc., so that users can reproduce the error or issue. No one is really going to take the time to manufacture your dataset from a photo (or I should say rarely, as I did do that... because I love working with sports data and could grab it relatively quickly).
But to get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd

# Scrape ten seasons of passing stats and stack them into one frame
frames = []
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    frames.append(pd.read_html(url)[0])
df = pd.concat(frames, sort=False)

# Drop the repeated header rows and clean up player names
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df['Player'].str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()

# Everything except these columns is numeric
strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

# Sum the counting stats, average QBR
df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
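Applied to the full aggregation dictionary from the question, the same sum-vs-mean split might look something like this (a sketch: the counting stats stay as 'sum', while the per-attempt and rate-style columns switch to 'mean'; note that a plain mean of per-season rates is not weighted by attempts):
df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Yds': 'sum', 'TD': 'sum',
     'Int': 'sum', 'Sk': 'sum', 'Yds.1': 'sum', '4QC': 'sum', 'GWD': 'sum',
     'Cmp%': 'mean', 'TD%': 'mean', 'Int%': 'mean', 'Y/A': 'mean', 'AY/A': 'mean',
     'Y/C': 'mean', 'Y/G': 'mean', 'Rate': 'mean', 'QBR': 'mean', 'Sk%': 'mean',
     'NY/A': 'mean', 'ANY/A': 'mean'})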

Comparing 4 graphs in one graph

I have 4 dataframes with value counts of the number of occurrences per month.
I want to compare all 4 value counts in one graph, so I can see the visual difference between the months across these four years.
I'd like the output to look like the example image, with years and months.
newdf2018.Month.value_counts()
Output:
1 3451
2 3895
3 3408
4 3365
5 3833
6 3543
7 3333
8 3219
9 3447
10 2943
11 3296
12 2909
newdf2017.Month.value_counts()
1 2801
2 3048
3 3620
4 3014
5 3226
6 3962
7 3500
8 3707
9 3601
10 3349
11 3743
12 2002
newdf2016.Month.value_counts()
1 3201
2 2034
3 2405
4 3805
5 3308
6 3212
7 3049
8 3777
9 3275
10 3099
11 3775
12 2115
newdf2015.Month.value_counts()
1 2817
2 2604
3 2711
4 2817
5 2670
6 2507
7 3256
8 2195
9 3304
10 3238
11 2005
12 2008
Create a dictionary of DataFrames and concat them together, then use plot:
dfs = {2015:newdf2015, 2016:newdf2016, 2017:newdf2017, 2018:newdf2018}
df = pd.concat({k:v['Month'].value_counts() for k, v in dfs.items()}, axis=1)
df.plot.bar()
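One optional tweak, not part of the answer above: value_counts orders by frequency rather than by month, so sorting the index first keeps the months in calendar order on the x-axis:
df.sort_index().plot.bar()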

Transposing a multi-index dataframe in pandas

HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a DataFrame where HID and gen are the indexes.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
It's called pivoting, i.e.:
df.reset_index().pivot(index='HID', columns='gen', values='views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need HID as a column, add reset_index + rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
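For completeness, a self-contained sketch that reproduces both answers on a small slice of the sample data above (the MultiIndex construction just mimics the question's indexed frame):
import pandas as pd

# A small slice of the question's data, with HID and gen as the index
df = pd.DataFrame({
    'HID':   [1, 1, 1, 2, 2, 3, 3, 3],
    'gen':   [1, 2, 3, 1, 2, 1, 2, 8],
    'views': [20, 2532, 276, 20, 7478, 21, 4012, 22],
}).set_index(['HID', 'gen'])

# Pivot route (HID stays as the index)
print(df.reset_index().pivot(index='HID', columns='gen', values='views'))

# Unstack route, with HID back as a regular column
print(df['views'].unstack().reset_index().rename_axis(None, axis=1))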

Return a median value in excel where there are multiple rows of data

In Excel how do I return a median value for a set of data where there are multiple rows and columns? I have a set of data where the first column contains a reference number and the second column contains a list of readings over a number of days. How do I calculate the median value for each reference number using a formula?
number volume
1 3072
1 2304
1 2016
1 2496
1 2144
1 2528
1 3312
1 3360
1 2976
1 2768
1 2688
1 3040
1 3008
1 2560
2 574
2 574
2 574
2 574
2 576
2 574
2 575
2 574
2 576
2 574
2 574
2 574
2 574
2 574
3 2880
3 2880
3 2912
3 2976
3 1536
3 288
3 2976
3 2944
3 2880
3 1536
3 2976
3 1536
3 2880
3 2880
4 2267
4 2267
4 2267
4 2267
4 2267
4 2267
4 2268
4 2267
4 2267
4 2267
4 2267
4 2267
5 800
5 800
5 1984
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 1984
6 800
6 832
6 832
6 832
6 800
6 832
6 832
6 832
6 832
6 832
6 832
6 832
6 832
6 832
The reference number is Column A and the reading is Column B. In this example I have used just six reference numbers but my real data has several hundred.
Consider the array formula:
=MEDIAN(IF($A$2:$A$83=ROWS($1:1),$B$2:$B$83))
Pick a cell, enter the formula, and copy it down.
Array formulas must be entered with Ctrl + Shift + Enter rather than just the Enter key.
Try this array formula:
=MEDIAN(IF(A:A=1,B:B))
This is an array formula and must be confirmed with Ctrl-Shift-Enter.
For a non-CSE array formula (one entered normally), if you have Excel 2010 or later, use this:
=AGGREGATE(17,6,(B:B/(A:A=1)),2)
Where 1 is the reference number. You can make it dynamic by using a cell reference instead, so that as that cell changes, so will the answer.
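For example, with the reference number typed into a helper cell (D1 here is just an illustration), the dynamic version could be:
=AGGREGATE(17,6,(B:B/(A:A=D1)),2)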
