Make a proper data frame from a pandas crosstab output - python-3.x

I have a multi-indexed output from the pandas crosstab function, shown below:
sports cricket football tennis
nationality
IND 180 18 1
UK 10 30 10
US 5 30 65
From the above, I would like to build the data frame below.
Expected output:
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
I tried the code below, but it gives the wrong data frame:
df_tab.reset_index().iloc[:, 1:]
sports cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65

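If you need the index name and the columns name shown together, move the index name to the columns axis. The first column is still the index and all the others are columns (although the printed output looks the same):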
df = df_tab.rename_axis(index = None, columns= df_tab.index.name)
print (df)
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
print (df.index)
Index(['IND', 'UK', 'US'], dtype='object')
If you need to print the DataFrame without the index:
print (df_tab.reset_index().to_string(index=False))
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
EDIT: A DataFrame always has an index, so if you need nationality as a regular column, use:
df = df_tab.reset_index().rename_axis(columns = None)
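The fix above can be tried end to end with a small sketch. The `raw` frame here is made-up sample data (not the asker's), used only to produce a crosstab with the same axis names. It also suggests why the attempt in the question went wrong: `.iloc[:, 1:]` drops the `nationality` column that `reset_index()` just created, and the leftover `sports` label is the name of the columns axis, not a column.

```python
import pandas as pd

# Hypothetical sample data, only to get a crosstab shaped like the one in the question
raw = pd.DataFrame({
    "nationality": ["IND", "IND", "UK", "US", "US", "UK"],
    "sports": ["cricket", "football", "tennis", "tennis", "football", "cricket"],
})

df_tab = pd.crosstab(raw["nationality"], raw["sports"])

# Move the index into a regular column, then clear the "sports" columns-axis name
df = df_tab.reset_index().rename_axis(columns=None)
print(df)
```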

Related

Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best matching city name from that data frame, but I am testing my code on this simplified scenario.)
I am using fuzzywuzzy token set ratio.
For some reason it calculates the first row correctly, and it seems it assigns the same value for all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz
test_df= pd.DataFrame( {"City" : ["Amsterdam","Amsterdam","Rotterdam","Zurich","Vienna","Prague"]})
test_df = test_df.assign(Score = lambda d: fuzz.token_set_ratio("amsterdam",test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!
I managed to solve it by iterating through the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
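A likely explanation for the original behaviour is that the lambda passes the whole `City` column to the scorer at once, so it returns a single number, and `assign` then broadcasts that one number to every row. Scoring each value separately with `Series.apply` avoids the explicit loop. The sketch below uses a stand-in scorer built on the standard library's `difflib` so that it runs without fuzzywuzzy; `similarity` is a hypothetical helper, and its scores will not match `fuzz.token_set_ratio` exactly.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> int:
    """Hypothetical stand-in scorer (0-100); swap in fuzz.token_set_ratio if fuzzywuzzy is installed."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


test_df = pd.DataFrame(
    {"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]}
)

# apply calls the scorer once per city, so every row gets its own score
test_df["Score"] = test_df["City"].apply(lambda city: similarity("amsterdam", city))
print(test_df)
```

With fuzzywuzzy available, the same pattern would presumably be `test_df["City"].apply(lambda city: fuzz.token_set_ratio("amsterdam", city))`.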

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the groups of the top 3, and everything else is classified under other?
This is what I am looking for:
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column to the df and call it something like "Grouping".
Give the top 3 rows their original names, name the remaining rows "other", and then just group by the "Grouping" column.
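A minimal sketch of the helper-column idea, assuming the groupby result has already been turned into a frame with `grade` and `unique` columns (these column names are an assumption, chosen only for illustration):

```python
import pandas as pd

# Assumed shape of the aggregated data: one row per grade with its unique-student count
df = pd.DataFrame({
    "grade": ["grade_1", "grade_2", "grade_3", "grade_4", "grade_5", "grade_6"],
    "unique": [12, 8, 30, 2, 600, 90],
})

# Grades that belong to the three largest counts
top3 = df.nlargest(3, "unique")["grade"]

# Helper column: keep the grade name for the top 3, label everything else "other"
df["Grouping"] = df["grade"].where(df["grade"].isin(top3), "other")

result = df.groupby("Grouping")["unique"].sum()
print(result)
```

Note that summing per-grade unique counts treats a student who appears in two grades as two students, which matches the 22 in the question but may not be what you want on real data.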
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
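ddf = df.nlargest(3, 'unique')
# DataFrame.append was removed in pandas 2.0, so build the extra row and use pd.concat
other = pd.DataFrame({'grade': ['Other'], 'unique': [df['unique'].sum() - ddf['unique'].sum()]})
ddf = pd.concat([ddf, other], ignore_index=True)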
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22

reshape dataframe time series

I have a dataframe of weather data in a certain shape and I want to transform it, but I am struggling with it.
My dataframe looks like this:
city temp_day1, temp_day2, temp_day3 ...., hum_day1, hum_day2, hum_day4, ..., condition
city_1 12 13 20 44 44.5 good 44
city_1 12 13 20 44 44.5 bad 44
city_2 14 04 33 44 44.5 good 44
I want to transform it to:
city_1 city_2 .....
day. temperature humidity condition ... temperature humidity condition
1 12 44 good . 12 13
20 44 44.5
2 13 44 .5 bad .
3 20 NaN bad .
4 NaN 44 .
Some days don't have temperature values or humidity values.
Thanks for your help.
Use wide_to_long, then DataFrame.unstack, and finally DataFrame.swaplevel and DataFrame.sort_index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp','hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix=r'\w+')
         .unstack(0)
         .swaplevel(1,0, axis=1)
         .sort_index(axis=1))
print (df1)
city city_1
hum temp
day
day1 44.0 12.0
day2 44.5 13.0
day3 NaN 20.0
day4 44.0 NaN
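The chain above can be checked on a small frame. The values below are invented for illustration, and the sketch assumes each city appears only once, since `wide_to_long` needs the `i` column to identify rows uniquely (the question's sample repeats `city_1`, so the real data would need extra id columns, as in the EDIT further down).

```python
import pandas as pd

# Invented sample data: one row per city, columns named <stub>_<day>
df = pd.DataFrame({
    "city": ["city_1", "city_2"],
    "temp_day1": [12, 14],
    "temp_day2": [13, 4],
    "hum_day1": [44.0, 40.0],
    "hum_day2": [44.5, 41.0],
})

df1 = (
    pd.wide_to_long(df, stubnames=["temp", "hum"], i="city", j="day",
                    sep="_", suffix=r"\w+")   # long format, indexed by (city, day)
    .unstack(0)                               # move city into the columns
    .swaplevel(1, 0, axis=1)                  # put city above temp/hum
    .sort_index(axis=1)                       # group each city's columns together
)
print(df1)
```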
Alternative solution:
df1 = df.set_index('city')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,1])
If you need to extract the numbers from the index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp','hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix=r'\w+')
         .unstack(0)
         .swaplevel(1,0, axis=1)
         .sort_index(axis=1))
df1.index = df1.index.str.extract(r'(\d+)', expand=False)
print (df1)
city city_1
hum temp
day
1 44.0 12.0
2 44.5 13.0
3 NaN 20.0
4 44.0 NaN
EDIT:
Solution with real data:
df1 = df.set_index(['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], append=True)
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,-2])
If you need to remove the unnecessary levels from the MultiIndex:
df1 = df1.reset_index(level=['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], drop=True)
You can use the pandas transpose method like this: df.T
This swaps the rows and columns of your dataframe. If you create multiple columns, you can slice the result with indexing and assign each slice to independent columns.

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics,
what percentage of the medals was won by each of them?
I have combined all the Excel files into one pandas DataFrame, but I am now stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, data])
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it, and I want to show it as a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)  # round returns a new DataFrame, so assign the result
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
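As a cross-check of the `Total_percent` column, here is a self-contained sketch that uses only the `Country` and `Total` values from the combined sheet in the question. The variable names are illustrative.

```python
import pandas as pd

# Country and Total columns copied from the combined sheet shown in the question
data = pd.DataFrame({
    "Country": ["USA", "China", "UK", "Germany", "Australia"] * 3,
    "Total": [34, 8, 2, 36, 2, 25, 11, 1, 29, 2, 37, 11, 1, 30, 3],
})

# Sum the medals per country, then express each sum as a share of the grand total
totals = data.groupby("Country")["Total"].sum()
percent = (totals / totals.sum() * 100).round(2)
print(percent)
```

Assuming matplotlib is installed, `percent.plot.pie(autopct='%.1f%%')` should draw these shares as the pie chart the question asks about.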
I don't have your exact dataset, so I am explaining with a similar one. Try adding up the values across each row, then find the percentage by dividing each row's sum by the grand total of those columns.
I am posting this as a model; check it:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = (df.ExshowroomPrice + df.RTOPrice) * 100 / (df.ExshowroomPrice.sum() + df.RTOPrice.sum())
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the dataframe to have a multi-index of the form ["army","company"] only.
How will I proceed?
If I understand correctly:
You can use map to find values in a dictionary (using dictionary comprehension to swap key/value pairs since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70
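For reference, a runnable version of the same approach, with the five rows from the question typed in by hand. `regiment_to_army` is just an illustrative name for the inverted dictionary.

```python
import pandas as pd

# The five rows shown in the question
df = pd.DataFrame({
    "regiment": ["Nighthawks", "Nighthawks", "Nighthawks", "Nighthawks", "Dragoons"],
    "company": ["1st", "1st", "2nd", "2nd", "1st"],
    "name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})

army = {"Majors": "Nighthawks", "Captains": "Dragoons"}

# Invert the dictionary so that it maps regiment -> army
regiment_to_army = {regiment: rank for rank, regiment in army.items()}

# Add the army column, then use army and company as the two index levels
result = df.assign(army=df["regiment"].map(regiment_to_army)).set_index(["army", "company"])
print(result)
```

If the `regiment` column is no longer needed once `army` exists, it could be dropped afterwards with `result.drop(columns="regiment")`.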
