I'd like to fill missing values with conditions relative to the country:
For example, I'd want to replace China's missing values with the mean of Age, and USA's with the median of Age. For now, I don't want to touch EU's missing values.
How could I achieve this?
Below is the dataframe:
import pandas as pd
data = [['USA', None], ['EU', 15], ['China', 35],
        ['USA', 45], ['EU', 30], ['China', None],
        ['USA', 28], ['EU', 26], ['China', 78],
        ['USA', 65], ['EU', 53], ['China', 66],
        ['USA', 32], ['EU', None], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Country', 'Age'])
df.head(11)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
Thank you
Not sure if this is the best way to do it, but here is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the Age column beforehand so that you get the median of the original Age series, not the median after filling in the mean for USA. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
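If you prefer to compute all fill values up front, the same logic can be written as a small loop; a minimal sketch, using the same whole-column statistics as the answer above:
# fill values computed from the original (uncopied) column first
fills = {'China': age_series.mean(), 'USA': age_series.median()}
for country, value in fills.items():
    df.loc[(df['Country'] == country) & (df['Age'].isnull()), 'Age'] = value
EU is simply absent from the dict, so its missing values stay untouched.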
Maybe you can try this. Since the whole np.where expression is evaluated before the assignment happens, both the mean and the median are computed from the original column, so no copy is needed:
import numpy as np
df['Age'] = np.where((df['Country'] == 'China') & (df['Age'].isnull()), df['Age'].mean(),
            np.where((df['Country'] == 'USA') & (df['Age'].isnull()), df['Age'].median(), df['Age'])).round()
Output:
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name and fill type (i.e. mean, median); you can extend the function to add your own agg types.
It returns a dataframe that modifies yours, so you can assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country & fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    # per-country mean/median, keyed by country name
    fill_dict = dataframe.loc[dataframe['Country'] == country]\
        .groupby("Country")["Age"].agg(["mean", "median"]).to_dict(orient='index')
    # fill only that country's rows with the requested statistic
    dataframe.loc[dataframe['Country'] == country, 'Age'] \
        = dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe
print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
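A slightly leaner variant of the same idea (a sketch, not part of the original answer): pass fill_type straight to Series.agg, which accepts any aggregation name pandas understands, so no extension of the function is needed:
def missing_values(dataframe, country, fill_type):
    # compute the requested statistic over that country's Age values only
    mask = dataframe['Country'] == country
    fill_value = dataframe.loc[mask, 'Age'].agg(fill_type)
    dataframe.loc[mask, 'Age'] = dataframe.loc[mask, 'Age'].fillna(fill_value)
    return dataframe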
Related
I have a dataframe:
import pandas as pd
d = {
'Country': ["Austria", "Austria", "Belgium", "USA", "USA", "USA", "USA"],
'Number2020': [15, None, 18, 20, 22, None, 30],
'Number2021': [20, 25, 18, None, None, None, 32],
}
df = pd.DataFrame(data=d)
df
Country Number2020 Number2021
0 Austria 15.0 20.0
1 Austria NaN 25.0
2 Belgium 18.0 18.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
6 USA 30.0 32.0
and I want to count the NaN values for each country, e.g.:
Country Count_nans
Austria 1
USA 4
I have filtered the dataframe to leave only the rows with NaNs:
df_nan = df[df.Number2021.isna() | df.Number2020.isna()]
Country Number2020 Number2021
1 Austria NaN 25.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
So it looks like a groupby operation? I have tried this.
nasum2021 = df_nan['Number2021'].isna().sum()
df_nan['countNames2021'] = df_nan.groupby(['Number2021'])['Number2021'].transform('count').fillna(nasum2021)
df_nan
It gives me 1 NaN for Austria but 3 for the United States, while it should be 4, so that is not right.
In my real dataframe, I have some 10 years and around 30 countries. Thank you!
Solution for processing all columns except Country: first convert Country to the index, test for missing values, aggregate per country with sum, and finally sum across the columns:
s = df.set_index('Country').isna().groupby('Country').sum().sum(axis=1)
print (s)
Country
Austria 1
Belgium 0
USA 4
dtype: int64
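If you also want the per-column breakdown, stop the chain before the final sum; this is just the intermediate result of the same expression:
per_col = df.set_index('Country').isna().groupby('Country').sum()
print (per_col)
         Number2020  Number2021
Country
Austria           1           0
Belgium           0           0
USA               1           3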
If you need to remove 0 values, add boolean indexing:
s = s[s.ne(0)]
You could use:
df.filter(like='Number').isna().sum(axis=1).groupby(df['Country']).sum()
output:
Country
Austria 1
Belgium 0
USA 4
dtype: int64
or, filtering the rows with NaN first to only count the countries with at least 1 NaN:
df[df.filter(like='Number').isna().any(axis=1)].groupby('Country')['Country'].count()
output:
Country
Austria 1
USA 3
Name: Country, dtype: int64
You could use pandas.DataFrame.agg along with pandas.DataFrame.isna:
>>> df.groupby('Country').agg(lambda x: x.isna().sum()).sum(axis=1)
Country
Austria 1
Belgium 0
USA 4
dtype: int64
Use:
df.groupby('Country').apply(lambda x: x.isna().sum().sum())
Output:
Country
Austria    1
Belgium    0
USA        4
dtype: int64
Suppose that I have a dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> rand = np.random.RandomState(42)
>>> data_points = 10
>>> dates = pd.date_range('2020-01-01', periods=data_points, freq='D')
>>> state_city = [('USA', 'Washington'), ('France', 'Paris'), ('Germany', 'Berlin')]
>>>
>>> df = pd.DataFrame()
>>> for _ in range(data_points):
... state, city = state_city[rand.choice(len(state_city))]
... df_row = pd.DataFrame(
... {
... 'time' : rand.choice(dates),
... 'state': state,
... 'city': city,
... 'val1': rand.randint(0, data_points),
... 'val2': rand.randint(0, data_points)
... }, index=[0]
... )
...
... df = pd.concat([df, df_row], ignore_index=True)
...
>>> df = df.sort_values(['time', 'state', 'city']).reset_index(drop=True)
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['state']] = pd.NA
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['city']] = pd.NA
>>> df.val1 = df.val1.where(df.val1 < 5, pd.NA)
>>> df.val2 = df.val2.where(df.val2 < 5, pd.NA)
>>>
>>> df
time state city val1 val2
0 2020-01-03 USA Washington 4 2
1 2020-01-04 France <NA> <NA> 1
2 2020-01-04 Germany Berlin <NA> 4
3 2020-01-05 Germany Berlin <NA> <NA>
4 2020-01-06 France Paris 1 4
5 2020-01-06 Germany Berlin 4 1
6 2020-01-08 Germany Berlin 4 3
7 2020-01-10 Germany Berlin 2 <NA>
8 2020-01-10 <NA> Washington <NA> <NA>
9 2020-01-10 <NA> Washington 2 <NA>
>>>
As you can see, there are some missing values. I would like to impute values for state/city as much as I can. In order to do so, I will generate a dataframe that can help:
>>> known_state_city = df[['state', 'city']].dropna().drop_duplicates()
>>> known_state_city
state city
0 USA Washington
2 Germany Berlin
4 France Paris
OK, now we have all state/city combinations.
How can I use the known_state_city dataframe to fill in an empty state when the city is known?
I can find empty states that have populated city with:
>>> df.loc[df.state.isna() & df.city.notna(), 'city']
8 Washington
9 Washington
Name: city, dtype: object
But how do I replace Washington with the state from known_state_city without breaking the index values (8 and 9), so that I can fill in df.state?
And if I don't have all combinations in known_state_city, how do I update state in df with what I do have?
We can do fillna with map twice:
# fill empty state
df['state'] = df['state'].fillna(df['city'].map(known_state_city.set_index('city')['state']))
# fill empty city
df['city'] = df['city'].fillna(df['state'].map(known_state_city.set_index('state')['city']))
Output:
time state city val1 val2
0 2020-01-03 USA Washington 4.0 2.0
1 2020-01-04 France Paris NaN 1.0
2 2020-01-04 Germany Berlin NaN 4.0
3 2020-01-05 Germany Berlin NaN NaN
4 2020-01-06 France Paris 1.0 4.0
5 2020-01-06 Germany Berlin 4.0 1.0
6 2020-01-08 Germany Berlin 4.0 3.0
7 2020-01-10 Germany Berlin 2.0 NaN
8 2020-01-10 USA Washington NaN NaN
9 2020-01-10 USA Washington 2.0 NaN
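The lookup Series can also be built once and reused; this is only a refactor of the answer above with identical behavior:
# Series mapping city -> state and state -> city, built from known_state_city
city_to_state = known_state_city.set_index('city')['state']
state_to_city = known_state_city.set_index('state')['city']

df['state'] = df['state'].fillna(df['city'].map(city_to_state))
df['city'] = df['city'].fillna(df['state'].map(state_to_city))
Index alignment is what keeps rows 8 and 9 intact: map returns a Series carrying df's own index, so fillna fills exactly the originally missing slots.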
I have a DataFrame df:
Country Currency
1 China YEN
2 USA USD
3 Russia USD
4 Germany EUR
5 Nigeria NGN
6 Nigeria USD
7 China CNY
8 USA EUR
9 Nigeria EUR
10 Sweden SEK
I want to make a function that reads both of these columns, by column name, and returns a value that indicates if the currency is a local currency or not.
Result would look like this:
Country Currency LCY?
1 China YEN 0
2 USA USD 1
3 Russia USD 0
4 Germany EUR 1
5 Nigeria NGN 1
6 Nigeria USD 0
7 China CNY 1
8 USA EUR 0
9 Nigeria EUR 0
10 Sweden SEK 1
I tried this, but it didn't work:
LOCAL_CURRENCY = {'China':'CNY',
'USA':'USD',
'Russia':'RUB',
'Germany':'EUR',
'Nigeria':'NGN',
'Sweden':'SEK'}
def f(x, y):
    if x in LOCAL_CURRENCY and y in LOCAL_CURRENCY:
        return (1)
    else:
        return (0)
Any thoughts?
You can use map and compare:
df['LCY'] = df['Country'].map(LOCAL_CURRENCY).eq(df['Currency']).astype(int)
Output:
Country Currency LCY
1 China YEN 0
2 USA USD 1
3 Russia USD 0
4 Germany EUR 1
5 Nigeria NGN 1
6 Nigeria USD 0
7 China CNY 1
8 USA EUR 0
9 Nigeria EUR 0
10 Sweden SEK 1
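If you would rather keep a row-wise function like the one in the question, the fix is to compare the mapped currency instead of testing dict membership twice; a sketch reusing the question's LOCAL_CURRENCY dict:
def f(country, currency):
    # local iff the country's expected local currency matches the actual one
    return 1 if LOCAL_CURRENCY.get(country) == currency else 0

df['LCY'] = df.apply(lambda row: f(row['Country'], row['Currency']), axis=1)
Note that the map/eq one-liner above is preferable on large frames, since apply calls the Python function once per row.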
I am trying to get the result column to be the sum of the value column for all rows in the data frame where the country is equal to the country in that row, and the date is on or before the date in that row.
Date Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the below code, but it sums up the entire column:
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()
As your dates are ordered, you could do:
df['Result'] = df.groupby('Country').Value.cumsum()
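If the dates might not already be ordered, parse and sort them first; a minimal sketch, assuming the Date column uses the day/month/year format shown:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')  # parse dd/mm/yyyy
df = df.sort_values(['Country', 'Date'])
df['Result'] = df.groupby('Country')['Value'].cumsum()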
All,
I have two dataframes: allHoldings and Longswap.
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working for this example.
Here is my code:
allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       left_on=['prime_broker_id', 'country_name'],
                       right_on=['prime_broker_id', 'country_name'])
my results are
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense: based on the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any help is appreciated.
Here is the answer: the join key contained extra whitespace, so strip it to make the left join successful:
allHoldings['prime_broker_id'] = allHoldings['prime_broker_id'].str.strip()
allHoldings.prime_broker_id.unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
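With the key columns cleaned on both frames, the merge from the question works as expected; a sketch, assuming country_name may carry the same stray spaces:
# strip whitespace from both join keys on both dataframes
for col in ['prime_broker_id', 'country_name']:
    allHoldings[col] = allHoldings[col].str.strip()
    Longswap[col] = Longswap[col].str.strip()

allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       on=['prime_broker_id', 'country_name'])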