How can I use "groupby()" for gathering country names? - pandas-groupby

I have a pandas dataframe with three columns: the name of the country, the year and a value. The years run from 1960 to 2020 for each country.
The data looks like this:
Country Name  Year  value
USA           1960  12
Italy         1960  8
Spain         1960  5
Italy         1961  35
USA           1961  50
I would like to gather the rows of the same country together. How can I do it? I could not manage it with groupby(); groupby() always seems to require an aggregation function like sum(). This is what I would like to get:
Country Name  Year  value
USA           1960  12
USA           1961  50
Italy         1960  8
Italy         1961  35
Spain         1960  5
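Since no aggregation is needed here, a minimal sketch (assuming the goal is only to reorder the rows so each country's rows sit together) could look like this:

import pandas as pd

df = pd.DataFrame({
    "Country Name": ["USA", "Italy", "Spain", "Italy", "USA"],
    "Year": [1960, 1960, 1960, 1961, 1961],
    "value": [12, 8, 5, 35, 50],
})

# Option 1: sort by country (alphabetical) and year; no aggregation involved
out = df.sort_values(["Country Name", "Year"]).reset_index(drop=True)

# Option 2: keep the countries in their order of first appearance, as in the
# expected output, by iterating over the groups and concatenating them back
out = pd.concat(group for _, group in df.groupby("Country Name", sort=False))
print(out)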

Related

Groupby and print entire dataframe in Pandas

I have a dataset as below. I want to count the number of fruits in each country and add that count as a column in the dataset.
I tried to use groupby,
df = df.groupby('Country')['Fruits'].count()
but I am not getting the expected result, as groupby outputs only the counts and not the entire dataframe/dataset.
It would be helpful if someone can suggest a better way to do this.
Dataset
Country Fruits Price Sold Weather
India Mango 200 Market sunny
India Apple 250 Shops sunny
India Banana 50 Market winter
India Grapes 150 Road sunny
Germany Apple 350 Supermarket Autumn
Germany Mango 500 Supermarket Rainy
Germany Kiwi 200 Online Spring
Japan Kaki 300 Online sunny
Japan melon 200 Supermarket sunny
Expected Output
Country Fruits Price Sold Weather Number
India Mango 200 Market sunny 4
India Apple 250 Shops sunny 4
India Banana 50 Market winter 4
India Grapes 150 Road sunny 4
Germany Apple 350 Supermarket Autumn 3
Germany Mango 500 Supermarket Rainy 3
Germany Kiwi 200 Online Spring 3
Japan Kaki 300 Online sunny 2
Japan melon 200 Supermarket sunny 2
Thank you:)
You are looking for transform:
df['count'] = df.groupby('Country')['Fruits'].transform('size')
Country Fruits Price Sold Weather count
0 India Mango 200 Market sunny 4
1 India Apple 250 Shops sunny 4
2 India Banana 50 Market winter 4
3 India Grapes 150 Road sunny 4
4 Germany Apple 350 Supermarket Autumn 3
5 Germany Mango 500 Supermarket Rainy 3
6 Germany Kiwi 200 Online Spring 3
7 Japan Kaki 300 Online sunny 2
8 Japan melon 200 Supermarket sunny 2
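If the new column should be called Number as in the expected output, the same transform can simply be assigned under that name. Note that 'size' counts every row while 'count' ignores missing Fruits values; assuming there are no missing fruits in this data, both give the same result:

df['Number'] = df.groupby('Country')['Fruits'].transform('count')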

Function that returns one value from two columns

I have a DataFrame df:
Country Currency
1 China YEN
2 USA USD
3 Russia USD
4 Germany EUR
5 Nigeria NGN
6 Nigeria USD
7 China CNY
8 USA EUR
9 Nigeria EUR
10 Sweden SEK
I want to make a function that reads both of these columns, by column name, and returns a value that indicates if the currency is a local currency or not.
Result would look like this:
Country Currency LCY?
1 China YEN 0
2 USA USD 1
3 Russia USD 0
4 Germany EUR 1
5 Nigeria NGN 1
6 Nigeria USD 0
7 China CNY 1
8 USA EUR 0
9 Nigeria EUR 0
10 Sweden SEK 1
I tried this, but it didn't work:
LOCAL_CURRENCY = {'China': 'CNY',
                  'USA': 'USD',
                  'Russia': 'RUB',
                  'Germany': 'EUR',
                  'Nigeria': 'NGN',
                  'Sweden': 'SEK'}

def f(x, y):
    if x in LOCAL_CURRENCY and y in LOCAL_CURRENCY:
        return 1
    else:
        return 0
Any thoughts?
You can use map and compare:
df['LCY'] = df['Country'].map(LOCAL_CURRENCY).eq(df['Currency']).astype(int)
Output:
Country Currency LCY
1 China YEN 0
2 USA USD 1
3 Russia USD 0
4 Germany EUR 1
5 Nigeria NGN 1
6 Nigeria USD 0
7 China CNY 1
8 USA EUR 0
9 Nigeria EUR 0
10 Sweden SEK 1
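For completeness, a row-wise sketch closer to the original function-based attempt (assuming LOCAL_CURRENCY and df are defined as in the question; this is slower than the vectorized map but reads similarly to f):

def is_local(row):
    # Look up the country's local currency in the dict and compare it to the row's currency
    return int(LOCAL_CURRENCY.get(row['Country']) == row['Currency'])

df['LCY'] = df.apply(is_local, axis=1)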

How to fill missing values relative to a value from another column

I'd like to fill missing values with a condition that depends on the country:
For example, I want to replace China's missing values with the mean of Age, and the USA's with the median of Age. For now, I don't want to touch EU's missing values.
How could I do it?
Below the dataframe
import pandas as pd
data = [['USA', ], ['EU', 15], ['China', 35],
['USA', 45], ['EU', 30], ['China', ],
['USA', 28], ['EU', 26], ['China', 78],
['USA', 65], ['EU', 53], ['China', 66],
['USA', 32], ['EU', ], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Country', 'Age'])
df.head(10)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
Not sure if this is the best way to do it, but it is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the Age column beforehand, so that the USA median is computed from the original Age series, not from the series after China's missing values have been filled with the mean. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this:
import numpy as np

df['Age'] = np.where((df['Country'] == 'China') & (df['Age'].isnull()), df['Age'].mean(),
                     np.where((df['Country'] == 'USA') & (df['Age'].isnull()), df['Age'].median(),
                              df['Age'])).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name and the fill type (i.e. mean or median); you can extend the function to add your own agg types.
It returns a dataframe with your modifications applied, so you can assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country & fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    fill_dict = dataframe.loc[dataframe['Country'] == country]\
        .groupby("Country")["Age"].agg(["mean", "median"]).to_dict(orient='index')
    dataframe.loc[dataframe['Country'] == country, 'Age'] \
        = dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe

print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
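A dict-driven sketch of the same idea, using per-country statistics like the last answer above and leaving EU untouched (the country-to-statistic mapping is an assumption for illustration):

fill_how = {'China': 'mean', 'USA': 'median'}  # country -> statistic used for filling

for country, how in fill_how.items():
    mask = df['Country'] == country
    df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(df.loc[mask, 'Age'].agg(how))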

How do I create a new column in pandas which is the sum of another column based on a condition?

I am trying to get the result column to be the sum of the value column for all rows in the data frame where the country is equal to the country in that row, and the date is on or before the date in that row.
Date Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the code below, but it sums up the entire column:
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()
As your dates are ordered you could do:
df['Result'] = df.groupby('Country').Value.cumsum()
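If the dates are not guaranteed to be ordered, a sketch that sorts first (assuming the Date column is day-first text as shown) would be:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Country', 'Date'])
df['Result'] = df.groupby('Country')['Value'].cumsum()
df = df.sort_index()  # restore the original row order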

pd.merge is not working as usual

Hi all,
I have two dataframes: allHoldings and Longswap
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working for this example.
Here is my code:
allHoldings=pd.merge(allHoldings, Longswap, how='left', left_on = ['prime_broker_id','country_name'], right_on=['prime_broker_id','country_name'])
my results are
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense: based on the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any Help is appreciated.
Here is the answer: the join key contains stray whitespace, so stripping it makes the left join succeed. After stripping, the key values look clean:
allHoldings.prime_broker_id.str.strip().unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
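Note that str.strip() only returns a cleaned copy, so a sketch of the full fix (assuming the stray whitespace lives in the key columns of allHoldings, and possibly Longswap too) would assign the stripped keys back before merging:

for col in ['prime_broker_id', 'country_name']:
    allHoldings[col] = allHoldings[col].str.strip()
    Longswap[col] = Longswap[col].str.strip()

allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       on=['prime_broker_id', 'country_name'])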
