I have a pandas DataFrame with three columns: the name of the country, the year, and a value. The years run from 1960 to 2020 for each country.
The data looks like this:
Country Name  Year  value
USA           1960     12
Italy         1960      8
Spain         1960      5
Italy         1961     35
USA           1961     50
I would like to gather rows with the same country name together. How can I do it? I could not get groupby() to do this; groupby() always seems to require an aggregation function like sum().
Country Name  Year  value
USA           1960     12
USA           1961     50
Italy         1960      8
Italy         1961     35
Spain         1960      5
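A minimal sketch of one way to get this grouping, assuming the frame is named df with the columns shown above: no aggregation function is needed, you can either sort by country or concatenate the groups in order of first appearance:

import pandas as pd

# Alphabetical grouping: rows with the same country end up next to each other
df_sorted = df.sort_values(['Country Name', 'Year'])

# Or keep the countries in order of first appearance, as in the expected output
df_grouped = pd.concat(group for _, group in df.groupby('Country Name', sort=False))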
I have a dataset as below. I want to count the number of fruits in each country and output the count as a column in the dataset.
I tried to use groupby,
df = df.groupby('Country')['Fruits'].count()
but in this case I am not getting the expected result, as the groupby just outputs the counts and not the entire dataframe/dataset.
It would be helpful if someone can suggest a better way to do this.
Dataset
Country Fruits Price Sold Weather
India Mango 200 Market sunny
India Apple 250 Shops sunny
India Banana 50 Market winter
India Grapes 150 Road sunny
Germany Apple 350 Supermarket Autumn
Germany Mango 500 Supermarket Rainy
Germany Kiwi 200 Online Spring
Japan Kaki 300 Online sunny
Japan melon 200 Supermarket sunny
Expected Output
Country Fruits Price Sold Weather Number
India Mango 200 Market sunny 4
India Apple 250 Shops sunny 4
India Banana 50 Market winter 4
India Grapes 150 Road sunny 4
Germany Apple 350 Supermarket Autumn 3
Germany Mango 500 Supermarket Rainy 3
Germany Kiwi 200 Online Spring 3
Japan Kaki 300 Online sunny 2
Japan melon 200 Supermarket sunny 2
Thank you:)
You are looking for transform:
df['count'] = df.groupby('Country')['Fruits'].transform('size')
Country Fruits Price Sold Weather count
0 India Mango 200 Market sunny 4
1 India Apple 250 Shops sunny 4
2 India Banana 50 Market winter 4
3 India Grapes 150 Road sunny 4
4 Germany Apple 350 Supermarket Autumn 3
5 Germany Mango 500 Supermarket Rainy 3
6 Germany Kiwi 200 Online Spring 3
7 Japan Kaki 300 Online sunny 2
8 Japan melon 200 Supermarket sunny 2
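One small note (my addition, not part of the answer above): transform('size') counts every row in the group, including NaNs, whereas transform('count') only counts non-null values of the selected column. If you want the column to be called Number as in the expected output, just assign to that name:

# 'size' counts all rows per country; 'count' would skip NaNs in 'Fruits'
df['Number'] = df.groupby('Country')['Fruits'].transform('size')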
I have two data frames as shown below
df1:
Sports Expected_%
Cricket 70
Football 20
Tennis 10
df2:
Region Sports Count Percentage
North Cricket 800 75
North Football 50 5
North Tennis 150 20
South Cricket 1300 65
South Football 550 27.5
South Tennis 150 7.5
Expected Output:
Region Sports Count Percentage Expected_% Expected_count
North Cricket 800 75 70 700
North Football 50 5 20 200
North Tennis 150 20 10 100
South Cricket 1300 65 70 1400
South Football 550 27.5 20 400
South Tennis 150 7.5 10 200
Explanation:
Expected_% for Cricket = 70
Total Count for North = 1000
Expected_Count for North = 1000*70/100 = 700
Use DataFrame.merge with a left join to add the new column, then GroupBy.transform with sum to get per-region totals, multiply by the new column, and divide by 100:
df = df2.merge(df1, on='Sports', how='left')
summed = df.groupby('Region')['Count'].transform('sum')
df['Expected_count'] = summed.mul(df['Expected_%']).div(100)
print (df)
Region Sports Count Percentage Expected_% Expected_count
0 North Cricket 800 75.0 70 700.0
1 North Football 50 5.0 20 200.0
2 North Tennis 150 20.0 10 100.0
3 South Cricket 1300 65.0 70 1400.0
4 South Football 550 27.5 20 400.0
5 South Tennis 150 7.5 10 200.0
Or use Series.map for new column:
df2['Expected_%']= df2['Sports'].map(df1.set_index('Sports')['Expected_%'])
summed = df2.groupby('Region')['Count'].transform('sum')
df2['Expected_count'] = summed.mul(df2['Expected_%']).div(100)
print (df2)
Region Sports Count Percentage Expected_% Expected_count
0 North Cricket 800 75.0 70 700.0
1 North Football 50 5.0 20 200.0
2 North Tennis 150 20.0 10 100.0
3 South Cricket 1300 65.0 70 1400.0
4 South Football 550 27.5 20 400.0
5 South Tennis 150 7.5 10 200.0
Another way:
map_dict = dict(df1.values)
df2['Expected_count'] = (df2.groupby('Region')
                            .apply(lambda x: x['Count'].sum() * x['Sports'].map(map_dict))
                            .div(100).values)
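If you prefer not to rely on the order of the groupby().apply() result lining up with df2's rows, a fully index-aligned sketch (assuming df1 and df2 as above) combines the transform and map ideas from the earlier approaches:

map_dict = dict(df1.values)  # {'Cricket': 70, 'Football': 20, 'Tennis': 10}

# transform('sum') and map both align on df2's index, so no reordering is needed
df2['Expected_count'] = (df2.groupby('Region')['Count'].transform('sum')
                         * df2['Sports'].map(map_dict) / 100)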
I'd like to fill missing values with conditions depending on the country.
For example, I'd want to replace China's missing values with the mean of Age, and for USA with the median of Age. For now, I don't want to touch EU's missing values.
How could I do it?
Below is the dataframe:
import pandas as pd
data = [['USA', None], ['EU', 15], ['China', 35],
        ['USA', 45], ['EU', 30], ['China', None],
        ['USA', 28], ['EU', 26], ['China', 78],
        ['USA', 65], ['EU', 53], ['China', 66],
        ['USA', 32], ['EU', None], ['China', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Country', 'Age'])
df.head(11)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
Not sure if this is the best way to do it, but it is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the age column beforehand so that the median is taken from the original age series, not from the series after China's missing value has already been filled with the mean. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this:
import numpy as np

df['Age'] = np.where((df['Country'] == 'China') & (df['Age'].isnull()), df['Age'].mean(),
                     np.where((df['Country'] == 'USA') & (df['Age'].isnull()), df['Age'].median(),
                              df['Age'])).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name and fill type (i.e. mean or median); you can extend the function to add your own agg types.
It returns a modified dataframe, so you can assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country & fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    fill_dict = (dataframe.loc[dataframe['Country'] == country]
                 .groupby('Country')['Age']
                 .agg(['mean', 'median'])
                 .to_dict(orient='index'))
    dataframe.loc[dataframe['Country'] == country, 'Age'] = \
        dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe
print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
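For completeness, here is a minimal lookup-driven sketch (my own generalisation, not part of the answers above): keep a dict of country -> statistic and fill each country's NaNs from that country's own ages, leaving EU untouched. Like the function answer, this uses per-country statistics rather than the whole-column mean/median.

# Country -> statistic used to fill that country's missing ages
fills = {'China': 'mean', 'USA': 'median'}

for country, how in fills.items():
    mask = df['Country'] == country
    # Fill only that country's NaNs, using that country's own ages
    df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(df.loc[mask, 'Age'].agg(how))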
Hello all,
I have two dataframes: allHoldings and Longswap
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working for this example.
Here is my code:
allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       left_on=['prime_broker_id', 'country_name'],
                       right_on=['prime_broker_id', 'country_name'])
my results are
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense. From the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any help is appreciated.
Here is the answer: the join key contained extra whitespace, so stripping it (and assigning the result back) makes the left join succeed:
allHoldings['prime_broker_id'] = allHoldings['prime_broker_id'].str.strip()

allHoldings.prime_broker_id.unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
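More generally, if either join key might carry stray whitespace, a minimal sketch (assuming the frames are named allHoldings and Longswap as above) is to clean both key columns on both frames before merging:

# Strip whitespace from every join key on both frames, then merge
for frame in (allHoldings, Longswap):
    for key in ('prime_broker_id', 'country_name'):
        frame[key] = frame[key].str.strip()

allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       on=['prime_broker_id', 'country_name'])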