Pandas Group By Multiple Columns and Calculate Standard Deviation - pandas-groupby

I have a pandas dataframe containing statistics of NBA basketball players across multiple seasons and teams. It looks like this:
Year Team Player PTS/G
2018 Lakers Lebron James 27.6
2018 Lakers Kyle Kuzma 10.3
2019 Rockets James Harden 25.5
2019 Rockets Russel Westbrook 23.2
I want to create a new column called 'PTS Dev' that is the standard deviation of PTS/G for each team and year. Then, I plan on analyzing where a player is according to that deviation. This is my attempt to calculate that column:
final_data['PTS Dev'] = final_data.groupby('Team', 'Year')['PTS/G'].std()

Use groupby with transform. Pass the grouping keys as a single list (in groupby('Team', 'Year') the second argument is interpreted as axis, not a second key), and use transform('std') so each group's standard deviation is broadcast back onto the original rows:
final_data['PTS Dev'] = final_data.groupby(['Team', 'Year'])['PTS/G'].transform('std')
final_data
Out[9]:
Year Team Player PTS/G PTS Dev
0 2018 Lakers Lebron James 27.6 12.232947
1 2018 Lakers Kyle Kuzma 10.3 12.232947
2 2019 Rockets James Harden 25.5 1.626346
3 2019 Rockets Russel Westbrook 23.2 1.626346
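Since the stated goal is to see where each player sits relative to that deviation, a per-group z-score is a natural next step. This is a minimal sketch assuming the columns above; 'PTS Mean' and 'PTS Z' are column names introduced here for illustration:
final_data['PTS Mean'] = final_data.groupby(['Team', 'Year'])['PTS/G'].transform('mean')  # group mean, broadcast per row
final_data['PTS Z'] = (final_data['PTS/G'] - final_data['PTS Mean']) / final_data['PTS Dev']  # standard deviations from the team/year average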

Related

Pivoting a table with duplicate index

I wanted to pivot this table:
Year County Sex rate
0 2006 Alameda Male 45.80
1 2006 Alameda Female 54.20
2 2006 Alpine Male 52.81
3 2006 Alpine Female 47.19
4 2006 Amador Male 49.97
5 2006 Amador Female 50.30
My desired output is:
Year County Male Female
2006 Alameda 45.80 54.20
2006 Alpine 52.81 47.19
2006 Amador 49.97 50.30
I tried doing this:
sex_rate=g.pivot(index="County",columns='Year',values='rate')
But I keep getting this error:
ValueError: Index contains duplicate entries, cannot reshape
Please help; I am new to Python.
I think you want index=['Year', 'County'] and columns='Sex', not index='County' with columns='Year'. And since you are passing two columns to index, you may want to use pivot_table instead of pivot:
df.pivot_table(index=['Year', 'County'],
               columns='Sex', values='rate'
               ).reset_index()
Output:
Sex Year County Female Male
0 2006 Alameda 54.20 45.80
1 2006 Alpine 47.19 52.81
2 2006 Amador 50.30 49.97
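If the leftover 'Sex' label on the columns axis (visible in the output header above) is unwanted, rename_axis can drop it; a small follow-up assuming the same df:
out = (df.pivot_table(index=['Year', 'County'], columns='Sex', values='rate')
         .reset_index()
         .rename_axis(columns=None))  # drop the residual 'Sex' axis name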

Python percentage of 2 columns in new column based on condition

I asked this question earlier and got some feedback, but I am still stuck: I cannot calculate the percentage between two columns based on a condition. The two columns are 'Tested population' and 'Total population'; grouping by 'Year' and 'Gender', I want to show the result in a new column called 'Percentage':
Year Race Gender Tested population Total population
2017 Asian Male 345 567
2017 Hispanic Female 666 67899
2018 Native Male 333 35543
2018 Asian Female 665 78955
2019 Hispanic Female 4444 44356
2020 Native Male 3642 6799
2017 Asian Male 5467 7998
2018 Asian Female 5467 7998
2019 Hispanic Male 456 4567
My code:
import pandas as pd

df = pd.DataFrame(alldata, columns=['Year', 'Gender', 'Tested population', 'Total population'])
df2 = df.groupby(['Year', 'Gender']).agg({'Tested population': 'sum'})
pop_pcts = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
print(pop_pcts)
Output:
Tested population
Year Gender
2017 Female 10.280951
Male 89.719049
2018 Female 94.849188
Male 5.150812
2019 Female 90.693878
Male 9.306122
2020 Male 100.000000
Instead, I want the data in this format, shown alongside the other columns as a new column 'Percentage':
Year Race Gender Tested population Total population Percentage
2017 Asian Male 345 567 60.8466
2017 Hispanic Female 666 67899 0.98087
2018 Native Male 333 35543 0.93689
2018 Asian Female 665 78955 0.84225
2019 Hispanic Female 4444 44356 10.0189
2020 Native Male 3642 6799 53.5667
2019 Hispanic Male 456 4567 9.98467
I have gone through Pandas percentage of total with groupby but was not able to fix my issue. Can someone help with this?
I believe you just need to add a column (note the desired output is in percent, and the column is named 'Total population'):
df['Percentage'] = df['Tested population'] / df['Total population'] * 100
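If you also want the asker's original group-level figure (each gender's share of the year's tested population) as a column aligned with the rows, transform keeps the original index; a sketch assuming the same df, where 'Group share' is a name introduced for illustration:
group_tested = df.groupby(['Year', 'Gender'])['Tested population'].transform('sum')  # tested total per Year/Gender group
year_tested = df.groupby('Year')['Tested population'].transform('sum')  # tested total per Year
df['Group share'] = 100 * group_tested / year_tested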

Summing a year's worth of data that spans two years - pandas

I have a DataFrame that contains data similar to this:
Name Date A B C
John 19/04/2018 10 11 8
John 20/04/2018 9 7 9
John 21/04/2018 22 15 22
… … … … …
John 16/04/2019 8 8 9
John 17/04/2019 10 11 18
John 18/04/2019 8 9 11
Rich 19/04/2018 18 7 6
… … … … …
Rich 18/04/2019 19 11 17
The data can start on any day and contains at least 365 days of data, sometimes more. What I want to end up with is a DataFrame like this:
Name Date Sum
John April 356
John May 276
John June 209
Rich April 452
I need to sum up all of the months to get a year’s worth of data (April - March) but I need to be able to handle taking part of April’s total (in this example) from 2018 and part from 2019. What I would also like to do is shift the days so they are consecutive and follow on in sequence so rather than:
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 10 11 8 Thursday (was 19/04/2018)
John 20/04/2019 9 7 9 Friday (was 20/04/2018)
It becomes
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 9 7 9 Friday (was 20/04/2018)
That would happen prior to summing to get the final DataFrame. Is this possible?
Additional information requested in comments
Here is a link to the initial data set https://github.com/stottp/exampledata/blob/master/SOExample.csv and the required output would be:
Name Month Total
John March 11634
John April 11470
John May 11757
John June 10968
John July 11682
John August 11631
John September 11085
John October 11924
John November 11593
John December 11714
John January 11320
John February 10167
Rich March 11594
Rich April 12383
Rich May 12506
Rich June 11112
Rich July 11636
Rich August 11303
Rich September 10667
Rich October 10992
Rich November 11721
Rich December 11627
Rich January 11669
Rich February 10335
Let's see if I understood correctly. If you want to sum, I suppose you mean summing the values of columns ['A', 'B', 'C'] for each day and getting the monthly total.
If that's right, the first thing to do is to parse ['Date'] and set it as the index so that the data frame is easier to work with:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are DD/MM/YYYY
df.set_index('Date', inplace=True)
Next, resample the data frame from days to months, summing the values of ['A', 'B', 'C'] into a new ['Sum'] column:
monthly = df[['A', 'B', 'C']].resample('M').sum()
monthly['Sum'] = monthly['A'] + monthly['B'] + monthly['C']
monthly['Sum'].head()
Out[37]:
Date
2012-11-30 1956265
2012-12-31 2972076
2013-01-31 2972565
2013-02-28 2696121
2013-03-31 2970687
Freq: M, dtype: int64
The last part about squashing February of 2018 and 2019 together as if they were a single month might yield from (using .loc for row selection by date string, and reset_index so 'Date' is available as a merge key):
df.loc['2019-02'].reset_index().merge(df.loc['2018-02'].reset_index(), how='outer', on=['Date', 'A', 'B', 'C'])
Test this last step and see if it works for you.
Cheers
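For the month-level totals in the required output (merging the partial months from both years, e.g. April 2018 plus April 2019), grouping on the month name may be closer to what the question asks. A minimal sketch, assuming the linked CSV has been downloaded locally as SOExample.csv with the columns shown above:
import pandas as pd

df = pd.read_csv('SOExample.csv', parse_dates=['Date'], dayfirst=True)
df['Sum'] = df[['A', 'B', 'C']].sum(axis=1)  # daily total of A, B and C
# Group by name and calendar month so April 2018 and April 2019 fall in one bucket.
totals = (df.groupby(['Name', df['Date'].dt.month_name()])['Sum']
            .sum()
            .rename_axis(['Name', 'Month'])
            .reset_index(name='Total'))
print(totals)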

How to create Spark datasets from a file without using a file reader

I have a data file that has 4 data sections: header data, summary data, detail data, and footer data. Each section has a fixed number of columns, and the sections are separated by two rows that contain just a single "#" as the row content. However, different sections have different numbers of columns. Is there a way I can avoid creating new files and instead use Spark's TSV (tab-separated format) support, or any other module, to read the file into 4 datasets directly? If I read the file directly, I lose the extra columns in the later data sections: only the columns present in the first row of the file are read.
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
#
#
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
#
#
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
#
#
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
Desired output:
Dataset d1 :
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
Dataset d2 :
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
Dataset d3 :
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
Dataset d4 :
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
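No answer was recorded here, so as a rough starting point: one way is to read the file as plain text on the driver, split it into sections on the '#' separator rows, and build one Spark DataFrame per section with createDataFrame. A minimal sketch, not a distributed reader; the file name sections.tsv is hypothetical, and it assumes tab-separated values, all columns read as strings, and a file small enough to parse on the driver:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect the lines of each section; a line that is just '#' closes a section.
sections, current = [], []
with open('sections.tsv') as f:  # hypothetical file name
    for line in f:
        line = line.rstrip('\n')
        if line == '#':
            if current:
                sections.append(current)
                current = []
        else:
            current.append(line)
if current:
    sections.append(current)

# The first row of every section is its header (leading '#' stripped).
datasets = []
for sec in sections:
    header = sec[0].lstrip('#').split('\t')
    rows = [r.split('\t') for r in sec[1:]]
    datasets.append(spark.createDataFrame(rows, schema=header))

d1, d2, d3, d4 = datasets
d1.show()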

Lookup value in one dataframe and paste it into another dataframe

I have two dataframes in Python, one big (car listings) and one small (car base configuration prices). The small one looks like this:
Make Model MSRP
0 Acura ILX 27990
1 Acura MDX 43015
2 Acura MDX Sport Hybrid 51960
3 Acura NSX 156000
4 Acura RDX 35670
5 Acura RLX 54450
6 Acura TLX 31695
7 Alfa Romeo 4C 55900
8 Alfa Romeo Giulia 37995
… … … …
391 Toyota Yaris 14895
392 Toyota Yaris iA 15950
393 Volkswagen Atlas 33500
394 Volkswagen Beetle 19795
395 Volkswagen CC 34475
396 Volkswagen GTI 24995
397 Volkswagen Golf 19575
398 Volkswagen Golf Alltrack 25850
399 Volkswagen Golf R 37895
400 Volkswagen Golf SportWagen 21580
401 Volkswagen Jetta 17680
402 Volkswagen Passat 22440
403 Volkswagen Tiguan 24890
404 Volkswagen Touareg 42705
405 Volkswagen e-Golf 28995
406 Volvo S60 33950
Now I want to paste the values from the MSRP column (the far-right column) into the big dataframe (car listings), matching on the Make and Model columns. The big dataframe looks like the following:
makeName modelName trimName carYear mileage
0 BMW X5 sDrive35i 2017 0
1 BMW X5 sDrive35i 2017 3
2 BMW X5 sDrive35i 2017 0
3 Audi A4 Premium Plus 2017 0
4 Kia Optima LX 2016 10
5 Kia Optima SX Turbo 2017 15
6 Kia Optima EX 2016 425
7 Rolls-Royce Ghost Series II 2017 15
… … … … … …
In the end I would like to have the following:
makeName modelName trimName carYear mileage MSRP
0 BMW X5 sDrive35i 2017 0 value from the other table
1 BMW X5 sDrive35i 2017 3 value from the other table
2 BMW X5 sDrive35i 2017 0 value from the other table
3 Audi A4 Premium Plus 2017 0 value from the other table
4 Kia Optima LX 2016 10 value from the other table
5 Kia Optima SX Turbo 2017 15 value from the other table
6 Kia Optima EX 2016 425 value from the other table
7 Rolls-Royce Ghost Series II 2017 15 value from the other table
… … … … … …
I read the documentation regarding pd.concat, merge and join but I am not making any progress.
Can you guys help?
Thanks!
You can use merge to join the two dataframes together. Merge the listings frame against the price frame (not the other way around, since left_on names must come from the calling frame), and use how='left' so every listing keeps its row even without a price match:
car_listings.merge(car_base, how='left', left_on=['makeName', 'modelName'], right_on=['Make', 'Model'])
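As a follow-up, the merged frame carries the redundant Make and Model key columns over from the price table; a small sketch of tidying up, reusing the answer's hypothetical frame names:
merged = car_listings.merge(car_base, how='left',
                            left_on=['makeName', 'modelName'],
                            right_on=['Make', 'Model'])
merged = merged.drop(columns=['Make', 'Model'])  # drop the duplicate key columns from the price table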
