Convert column headers into a new column and keep the values of each column - python-3.x

UPD: A dataframe has been pasted as an example at the bottom of the page.
My original xls file looks like the dataframe pasted at the bottom of this question, and I need two actions to reshape it into the long format I want.
Firstly, I need to fill in the empty row values with the value shown in the cell above them. That has been achieved with the following function:
import pandas as pd

def get_csv():
    # read the Excel file and forward-fill the empty cells
    df = pd.read_excel('test.xls')
    df = df.fillna(method='ffill')
    return df
Secondly, I have used stack with set_index:
df = (df.set_index(['Country', 'Gender', 'Direction'])
        .stack()
        .reset_index(name='Value')
        .rename(columns={'level_3': 'Year'}))
and I was wondering whether there is an easier way. Is there a library that transforms a dataframe, Excel sheet, etc. into the wanted format?
Original dataframe after Excel import:
Country Gender Direction 1974 1975 1976
0 Austria Male IN 13728 8754 9695
1 NaN NaN OUT 17977 12271 9899
2 NaN Female IN 8541 6465 6447
3 NaN NaN OUT 8450 7190 6288
4 NaN Total IN 22269 15219 16142
5 NaN NaN OUT 26427 19461 16187
6 Belgium Male IN 2412 2245 2296
7 NaN NaN OUT 2800 2490 2413
8 NaN Female IN 2105 2022 2057
9 NaN NaN OUT 2100 2113 2004
10 NaN Total IN 4517 4267 4353
11 NaN NaN OUT 4900 4603 4417
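For anyone who wants to reproduce this without test.xls, the pasted frame can be rebuilt directly (a sketch; the year columns are written as strings here, adjust if your Excel import yields integers):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Austria'] + [np.nan] * 5 + ['Belgium'] + [np.nan] * 5,
    'Gender': ['Male', np.nan, 'Female', np.nan, 'Total', np.nan] * 2,
    'Direction': ['IN', 'OUT'] * 6,
    '1974': [13728, 17977, 8541, 8450, 22269, 26427, 2412, 2800, 2105, 2100, 4517, 4900],
    '1975': [8754, 12271, 6465, 7190, 15219, 19461, 2245, 2490, 2022, 2113, 4267, 4603],
    '1976': [9695, 9899, 6447, 6288, 16142, 16187, 2296, 2413, 2057, 2004, 4353, 4417],
})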

An alternative solution is to use melt, but if you need the same row ordering as the stacked DataFrame, it is necessary to add sort_values:
df1 = (df.ffill()
         .melt(id_vars=['Country', 'Gender', 'Direction'],
               var_name='Date', value_name='Value'))
print (df1)
Country Gender Direction Date Value
0 Austria Male IN 1974 13728
1 Austria Male OUT 1974 17977
2 Austria Female IN 1974 8541
3 Austria Female OUT 1974 8450
4 Austria Total IN 1974 22269
5 Austria Total OUT 1974 26427
6 Belgium Male IN 1974 2412
7 Belgium Male OUT 1974 2800
8 Belgium Female IN 1974 2105
9 Belgium Female OUT 1974 2100
10 Belgium Total IN 1974 4517
11 Belgium Total OUT 1974 4900
12 Austria Male IN 1975 8754
13 Austria Male OUT 1975 12271
14 Austria Female IN 1975 6465
15 Austria Female OUT 1975 7190
16 Austria Total IN 1975 15219
17 Austria Total OUT 1975 19461
18 Belgium Male IN 1975 2245
19 Belgium Male OUT 1975 2490
20 Belgium Female IN 1975 2022
21 Belgium Female OUT 1975 2113
22 Belgium Total IN 1975 4267
23 Belgium Total OUT 1975 4603
24 Austria Male IN 1976 9695
25 Austria Male OUT 1976 9899
26 Austria Female IN 1976 6447
27 Austria Female OUT 1976 6288
28 Austria Total IN 1976 16142
29 Austria Total OUT 1976 16187
30 Belgium Male IN 1976 2296
...
df1 = (df.ffill()
         .melt(id_vars=['Country', 'Gender', 'Direction'],
               var_name='Date', value_name='Value')
         .sort_values(['Country', 'Gender', 'Direction'])
         .reset_index(drop=True))
print (df1)
Country Gender Direction Date Value
0 Austria Female IN 1974 8541
1 Austria Female IN 1975 6465
2 Austria Female IN 1976 6447
3 Austria Female OUT 1974 8450
4 Austria Female OUT 1975 7190
5 Austria Female OUT 1976 6288
6 Austria Male IN 1974 13728
7 Austria Male IN 1975 8754
8 Austria Male IN 1976 9695
9 Austria Male OUT 1974 17977
10 Austria Male OUT 1975 12271
11 Austria Male OUT 1976 9899
12 Austria Total IN 1974 22269
13 Austria Total IN 1975 15219
14 Austria Total IN 1976 16142
15 Austria Total OUT 1974 26427
16 Austria Total OUT 1975 19461
17 Austria Total OUT 1976 16187
18 Belgium Female IN 1974 2105
19 Belgium Female IN 1975 2022
20 Belgium Female IN 1976 2057
21 Belgium Female OUT 1974 2100
22 Belgium Female OUT 1975 2113
23 Belgium Female OUT 1976 2004
24 Belgium Male IN 1974 2412
25 Belgium Male IN 1975 2245
26 Belgium Male IN 1976 2296
27 Belgium Male OUT 1974 2800
28 Belgium Male OUT 1975 2490
29 Belgium Male OUT 1976 2413
30 Belgium Total IN 1974 4517
...

stack
I like your approach. I'd change it in a couple of ways:
use the specific method for forward filling, ffill
rename the column axis prior to stacking, to avoid renaming the column later (personal preference)
df.ffill().set_index(
    ['Country', 'Gender', 'Direction']
).rename_axis('Year', axis=1).stack().reset_index(name='Value')
Country Gender Direction Year Value
0 Austria Male IN 1974 13728
1 Austria Male IN 1975 8754
2 Austria Male IN 1976 9695
3 Austria Male OUT 1974 17977
4 Austria Male OUT 1975 12271
5 Austria Male OUT 1976 9899
...
Numpy
I wanted to put together a custom approach. This should be very fast.
import numpy as np
import pandas as pd

def cstm_ffill(s):
    # positions of the non-null values
    i = np.flatnonzero(s.notna())
    i = np.concatenate([[0], i, [len(s)]])
    # length of the run each non-null value has to cover
    d = np.diff(i)
    a = s.values[i[:-1].repeat(d)]
    return a

def cstm_melt(df):
    c = cstm_ffill(df.Country)
    g = cstm_ffill(df.Gender)
    d = cstm_ffill(df.Direction)
    y = df.columns[3:].values
    k = len(y)
    i = np.column_stack([c, g, d])
    # values row by row, matching the tiled year labels
    v = np.column_stack([*map(df.get, y)]).ravel()
    df_ = pd.DataFrame(
        np.column_stack([i.repeat(k, axis=0), np.tile(y, len(i)), v]),
        columns=['Country', 'Gender', 'Direction', 'Year', 'Value']
    )
    return df_
cstm_melt(df)
Country Gender Direction Year Value
0 Austria Male IN 1974 13728
1 Austria Male IN 1975 8754
2 Austria Male IN 1976 9695
3 Austria Male OUT 1974 17977
4 Austria Male OUT 1975 12271
5 Austria Male OUT 1976 9899
...
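To see what cstm_ffill does in isolation, here is a small demo (assuming the function defined above):
s = pd.Series(['Austria', None, None, 'Belgium', None])
print(cstm_ffill(s))
# ['Austria' 'Austria' 'Austria' 'Belgium' 'Belgium']
Each non-null value is repeated over the run of rows it covers, which is what lets cstm_melt build the identity columns without a pandas ffill.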

Related

Changing multiindex in a pandas series?

I have a dataframe like this:
mainid pidl pidw score
0 Austria 1 533
1 Canada 2 754
2 Canada 3 267
3 Austria 4 852
4 Taiwan 5 124
5 Slovakia 6 344
6 Spain 7 1556
7 Taiwan 8 127
I want to select the top 5 pidw for each pidl.
When I grouped by column 'pidl' and sorted the score in descending order within each group, I got the following series, s:
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5)
pidl pidl pidw score
Austria Austria 49 948
47 859
48 855
50 807
46 727
Belgium Belgium 15 2339
14 1861
45 1692
16 1626
46 1423
Name: score, dtype: float64
The result looks correct, but I wish I could remove the second 'pidl' from this series.
I have tried
s.reset_index('pidl')
to get 'ValueError: The name location occurs multiple times, use a level number'.
and
s.to_frame().reset_index()
ValueError: cannot insert pidl, already exists.
so I am not sure how to proceed.
Use the group_keys=False parameter in DataFrame.groupby:
s= df.set_index(['pidl', 'pidw']).groupby('pidl', group_keys=False)['score'].nlargest(5)
print (s)
pidl pidw
Austria 4 852
1 533
Canada 2 754
3 267
Slovakia 6 344
Spain 7 1556
Taiwan 8 127
5 124
Name: score, dtype: int64
Or add Series.droplevel to remove the first level (pandas counts from 0, so 0 is used):
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5).droplevel(0)
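If you then want a flat DataFrame rather than a MultiIndexed Series, reset_index now works because the duplicate level is gone (a sketch building on the answer above):
out = (df.set_index(['pidl', 'pidw'])
         .groupby('pidl', group_keys=False)['score']
         .nlargest(5)
         .reset_index())
print(out)
#        pidl  pidw  score
# 0   Austria     4    852
# 1   Austria     1    533
# 2    Canada     2    754
# ...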

How to extract years as an array from pandas df

How can I extract the years as an array from a pandas datetime column? When trying
years = [employees['Hiring Date'].dt.year], "years" comes out as a list rather than an array.
Use pandas.Series.to_numpy
from datetime import datetime
import pandas as pd
# data
data = {'Hiring Date': pd.bdate_range(datetime(1971, 1, 1), freq='2M', periods=40).tolist()}
# dataframe
employees = pd.DataFrame(data)
# extract years
years = employees['Hiring Date'].dt.year.to_numpy()
print(years)
[1971 1971 1971 1971 1971 1971 1972 1972 1972 1972 1972 1972 1973 1973
1973 1973 1973 1973 1974 1974 1974 1974 1974 1974 1975 1975 1975 1975
1975 1975 1976 1976 1976 1976 1976 1976 1977 1977 1977 1977]
print(type(years))
<class 'numpy.ndarray'>
print(years.shape)
(40,)
This might not work for you, since I might have misread your post, but I would try:
years = employees['Hiring Date'].dt.year.values
and then:
years = np.array(years)
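For what it's worth, .values already returns a NumPy array, so the extra np.array call is a no-op copy. A quick check (a sketch, assuming the employees frame built above):
import numpy as np

a = employees['Hiring Date'].dt.year.to_numpy()
b = employees['Hiring Date'].dt.year.values
print(type(b))               # <class 'numpy.ndarray'>
print(np.array_equal(a, b))  # True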

How can I find the sum of certain columns and find avg of other columns in python?

I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but for others (passer rating and QBR) I'd like to see the average instead.
df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Cmp%': 'sum', 'Yds': 'sum',
     'TD': 'sum', 'TD%': 'sum', 'Int': 'sum', 'Int%': 'sum', 'Y/A': 'sum',
     'AY/A': 'sum', 'Y/C': 'sum', 'Y/G': 'sum', 'Rate': 'sum', 'QBR': 'sum',
     'Sk': 'sum', 'Yds.1': 'sum', 'NY/A': 'sum', 'ANY/A': 'sum',
     'Sk%': 'sum', '4QC': 'sum', 'GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the issue. Rarely will anyone take the time to rebuild your dataset from a photo (I did here only because I love working with sports data and could grab it relatively quickly).
To get averages instead of sums, you would use 'mean'. Also, in your code, why are you summing percentages?
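For instance, a minimal sketch of a mixed aggregation, with column names taken from your agg call (counting stats summed, rate stats averaged):
df3 = df3.groupby('Player', as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Yds': 'sum', 'TD': 'sum',  # counting stats
     'Rate': 'mean', 'QBR': 'mean'})                                      # rate stats
A fuller, reproducible version that rebuilds the dataset from pro-football-reference: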
import pandas as pd

# scrape each season's passing table and combine into one frame
frames = []
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    frames.append(pd.read_html(url)[0])
df = pd.concat(frames, sort=False)

# drop the repeated header rows and clean up player names
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()

# everything outside these columns is numeric
strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

# sum the counting stats, average the rate stat
df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000

How do I create a new column in pandas which is the sum of another column based on a condition?

I am trying to get the result column to be the sum of the value column for all rows in the data frame where the country is equal to the country in that row, and the date is on or before the date in that row.
Date Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the code below, but it sums over the entire column:
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()
As your dates are already ordered, you could do:
df['Result'] = df.groupby('Country').Value.cumsum()
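If the dates are not already ordered, parse and sort them first (a sketch, assuming dd/mm/yyyy strings as in the sample):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Country', 'Date'])
df['Result'] = df.groupby('Country')['Value'].cumsum()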

pd.merge is not working as usual

All,
I have two dataframes: allHoldings and Longswap
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working in this example.
Here is my code:
allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       left_on=['prime_broker_id', 'country_name'],
                       right_on=['prime_broker_id', 'country_name'])
My results are:
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense: judging from the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any help is appreciated.
Here is the answer: the key column contains stray whitespace, so stripping it makes the left join succeed. After cleaning, the unique broker ids look right:
allHoldings['prime_broker_id'] = allHoldings['prime_broker_id'].str.strip()
allHoldings['prime_broker_id'].unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
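A fuller sketch strips both key columns in both frames before merging (column names from the question; whether country_name also carries whitespace is an assumption worth checking):
for frame in (allHoldings, Longswap):
    frame['prime_broker_id'] = frame['prime_broker_id'].str.strip()
    frame['country_name'] = frame['country_name'].str.strip()

allHoldings = allHoldings.merge(Longswap, how='left',
                                on=['prime_broker_id', 'country_name'])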
