All,
I have two dataframes: allHoldings and Longswap
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working in this example.
Here is my code:
allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       on=['prime_broker_id', 'country_name'])
My results are:
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense: from the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any help is appreciated.
Here is the answer: the join keys contain stray whitespace, and stripping it makes the left join succeed. Checking the stripped broker ids confirms the expected values:
allHoldings.prime_broker_id.str.strip().unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
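Note that str.strip() returns a new Series, so the stripped values have to be assigned back before merging. A minimal sketch, assuming whitespace could be hiding in either key column of either frame:

allHoldings['prime_broker_id'] = allHoldings['prime_broker_id'].str.strip()
allHoldings['country_name'] = allHoldings['country_name'].str.strip()
Longswap['prime_broker_id'] = Longswap['prime_broker_id'].str.strip()
Longswap['country_name'] = Longswap['country_name'].str.strip()

allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       on=['prime_broker_id', 'country_name'])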
I have a pandas dataframe with three columns: country name, year and value. The years run from 1960 to 2020 for each country.
The data looks like this:
Country Name  Year  value
USA           1960     12
Italy         1960      8
Spain         1960      5
Italy         1961     35
USA           1961     50
I would like to gather rows with the same country name together. How can I do it? I could not get there with groupby(); groupby() always requires an aggregation function like sum(). The expected output is:
Country Name  Year  value
USA           1960     12
USA           1961     50
Italy         1960      8
Italy         1961     35
Spain         1960      5
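One way to get that layout (a minimal sketch, not from the thread): sorting gathers rows with the same country name without any aggregation, though alphabetically rather than in the order of first appearance shown above:

import pandas as pd

df = pd.DataFrame({'Country Name': ['USA', 'Italy', 'Spain', 'Italy', 'USA'],
                   'Year': [1960, 1960, 1960, 1961, 1961],
                   'value': [12, 8, 5, 35, 50]})

# sort_values groups equal country names together; no sum() required
print(df.sort_values(['Country Name', 'Year']).reset_index(drop=True))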
I'd like to fill missing values with conditions based on the country:
For example, I'd want to replace China's missing values with the mean of Age, and for the USA with the median of Age. For now, I don't want to touch EU's missing values.
How could I do it?
Below is the dataframe:
import pandas as pd
data = [['USA', ], ['EU', 15], ['China', 35],
['USA', 45], ['EU', 30], ['China', ],
['USA', 28], ['EU', 26], ['China', 78],
['USA', 65], ['EU', 53], ['China', 66],
['USA', 32], ['EU', ], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Country', 'Age'])
df.head(11)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
Not sure if this is the best way to do it, but it is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the Age column beforehand so that the median for the USA is computed on the original series, not on the series after China's missing values were filled with the mean. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this:
import numpy as np

df['Age'] = np.where(
    (df['Country'] == 'China') & (df['Age'].isnull()), df['Age'].mean(),
    np.where((df['Country'] == 'USA') & (df['Age'].isnull()), df['Age'].median(), df['Age'])
).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name and fill type (i.e. mean or median); you can extend the function to add your own agg types.
It returns a dataframe that modifies yours, so you can assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country & fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    fill_dict = (dataframe.loc[dataframe['Country'] == country]
                 .groupby('Country')['Age']
                 .agg(['mean', 'median'])
                 .to_dict(orient='index'))
    dataframe.loc[dataframe['Country'] == country, 'Age'] = \
        dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe
print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
I downloaded data from the web and stored it in a df. I am new to Python, so some terms may be incorrectly stated.
The df is below:
0 1 2 3
0 United States (105) United States (105) United States (105) United States (105)
1 NaN Alabama (0) Louisiana (2) Ohio (4)
2 NaN Alaska (0) Maine (0) Oklahoma (0)
3 NaN Arizona (0) Maryland (2) Oregon (0)
4 NaN Arkansas (0) Massachusetts (9) Pennsylvania (28)
5 NaN California (0) Michigan (1) Rhode Island (0)
6 NaN Colorado (0) Minnesota (0) South Carolina (0)
7 NaN Connecticut (3) Mississippi (0) South Dakota (0)
8 NaN Delaware (1) Missouri (1) Tennessee (0)
9 NaN Florida (0) Montana (0) Texas (0)
10 NaN Georgia (0) Nebraska (0) Utah (0)
11 NaN Hawaii (0) Nevada (0) Vermont (0)
12 NaN Idaho (0) New Hampshire (0) Virginia (1)
13 NaN Illinois (2) New Jersey (7) Washington (0)
14 NaN Indiana (0) New Mexico (0) Washington, D.C. (3)
15 NaN Iowa (2) New York (36) West Virginia (0)
16 NaN Kansas (0) North Carolina (1) Wisconsin (0)
17 NaN Kentucky (2) North Dakota (0) Wyoming (0)
18 Additional Countries / Territories Additional Countries / Territories Additional Countries / Territories Additional Countries / Territories
19 NaN Canada (1) Germany (1) Unknown (3)
20 NaN England (5) Ireland (6) NaN
As you can see, the data is in a list and very unstructured. I want to turn it into two columns: one with the header 'location' that holds the names of the states and countries, and one named 'number' that holds the number within the parentheses. I also want to remove duplicate and NaN values, but I believe I can do that myself given proper direction on the rest.
I am lost as to how to start.
Thank you!
Code used so far:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.baseball-almanac.com/players/birthplace.php?y=1876"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[6]
df = pd.read_html(str(table))[0]
You can do this with str.extract and dropna() followed by drop_duplicates:
pattern = r'(?P<Country>[\w\s.,]*)\s+\((?P<value>\d+)\)'
(df.stack()
.str.extract(pattern, expand=True)
.dropna()
.drop_duplicates()
)
Gives (only head):
Country value
0 0 United States 105
1 1 Alabama 0
2 Louisiana 2
3 Ohio 4
2 1 Alaska 0
2 Maine 0
3 Oklahoma 0
3 1 Arizona 0
2 Maryland 2
3 Oregon 0
4 1 Arkansas 0
2 Massachusetts 9
3 Pennsylvania 28
5 1 California 0
For details on the regex, paste the value of pattern into an online regex tester such as regex101.com.
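To end with exactly the two flat columns the question asks for ('location' and 'number'), a possible follow-up to the snippet above (the column renames are my reading of the question, not part of the original answer):

out = (df.stack()
         .str.extract(pattern, expand=True)
         .dropna()
         .drop_duplicates()
         .reset_index(drop=True)
         .rename(columns={'Country': 'location', 'value': 'number'}))
out['number'] = out['number'].astype(int)  # extracted digits arrive as strings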
I am trying to get the Result column to be the sum of the Value column for all rows in the dataframe where the country equals the country in that row and the date is on or before the date in that row.
Date Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the code below, but it sums up the entire column:
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()
As your dates are ordered, you could do:
df['Result'] = df.groupby('Country').Value.cumsum()
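If the rows were not already in chronological order, a sketch of the same idea with an explicit sort first (assuming Date holds dd/mm/yyyy strings, as in the sample):

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values('Date')  # cumsum relies on chronological order
df['Result'] = df.groupby('Country')['Value'].cumsum()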
I have a pivot table which I have created using:
df = df[["Ref", # int64
"REGION", # object
"COUNTRY", # object
"Value_1", # float
"Value_2", # float
"Value_3", # float
"Type", # object
"Date", # float64 (may need to convert to date)
]]
table = pd.pivot_table(df, index=["REGION", "COUNTRY"],
values=["Value_1",
"Value_2",
"Value_3"],
columns=["Type"], aggfunc=[np.mean, np.sum, np.count_nonzero],
fill_value=0)
What I would like to do is add three columns to show mean, sum and nonzero of Value_1, Value_2 and Value_3 between these date ranges - <=1999, 2000-2005 and >=2006.
Is there a good way to do this using a pandas pivot table, or should I be using another method?
Df:
Ref REGION COUNTRY Type Value_2 Value_3 Value_1 Year
0 2 Yorkshire & The Humber England Private 25.0 NaN 25.0 1987
1 7 Yorkshire & The Humber England Voluntary/Charity 30.0 NaN 30.0 1990
2 9 Yorkshire & The Humber England Private 17.0 2.0 21.0 1991
3 10 Yorkshire & The Humber England Private 18.0 5.0 28.0 1992
4 14 Yorkshire & The Humber England Private 32.0 0.0 32.0 1990
5 17 Yorkshire & The Humber England Private 22.0 5.0 32.0 1987
6 18 Yorkshire & The Humber England Private 19.0 3.0 25.0 1987
7 19 Yorkshire & The Humber England Private 35.0 3.0 41.0 1990
8 23 Yorkshire & The Humber England Voluntary/Charity 25.0 NaN 25.0 1987
9 24 Yorkshire & The Humber England Private 31.0 2.0 35.0 1988
10 25 Yorkshire & The Humber England Voluntary/Charity 32.0 NaN 32.0 1987
11 29 Yorkshire & The Humber England Private 21.0 2.0 25.0 1987
12 30 Yorkshire & The Humber England Voluntary/Charity 17.0 1.0 19.0 1987
13 31 Yorkshire & The Humber England Private 27.0 3.0 33.0 2000
14 49 Yorkshire & The Humber England Private 12.0 3.0 18.0 1992
15 51 Yorkshire & The Humber England Private 19.0 4.0 27.0 1989
16 52 Yorkshire & The Humber England Private 11.0 NaN 11.0 1988
17 57 Yorkshire & The Humber England Private 28.0 2.0 32.0 1988
18 61 Yorkshire & The Humber England Private 20.0 5.0 30.0 1987
19 62 Yorkshire & The Humber England Private 36.0 2.0 40.0 1987
20 65 Yorkshire & The Humber England Voluntary/Charity 16.0 NaN 16.0 1988
First use pd.cut on the Year column, then aggregate with DataFrameGroupBy.agg:
import numpy as np

lab = ['<=1999', '2000-2005', '>=2006']
s = pd.cut(df['Year'], bins=[-np.inf, 1999, 2005, np.inf], labels=lab)
# if only a date column exists:
# s = pd.cut(df['Date'].dt.year, bins=[-np.inf, 1999, 2005, np.inf], labels=lab)

f = lambda x: np.count_nonzero(x)
table = (df.groupby(["REGION", "COUNTRY", s])
           .agg({'Value_1': 'mean', 'Value_2': 'sum', 'Value_3': f})
           .reset_index())
print(table)
REGION COUNTRY Year Value_1 Value_2 Value_3
0 Yorkshire & The Humber England <=1999 27.2 466.0 19.0
1 Yorkshire & The Humber England 2000-2005 33.0 27.0 1.0
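If you want all three aggregations (mean, sum, count_nonzero) for every value column, closer to the original pivot layout, a hedged variant of the same groupby (reusing s and f from above; the lambda shows up as '<lambda>' in the resulting column MultiIndex):

table_all = (df.groupby(['REGION', 'COUNTRY', s])[['Value_1', 'Value_2', 'Value_3']]
               .agg(['mean', 'sum', f])
               .reset_index())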