Python and pandas pivot table sum between dates - python-3.x

I have a pivot table which I have created using:
import numpy as np
import pandas as pd

df = df[["Ref",      # int64
         "REGION",   # object
         "COUNTRY",  # object
         "Value_1",  # float
         "Value_2",  # float
         "Value_3",  # float
         "Type",     # object
         "Date",     # float64 (may need to convert to date)
         ]]
table = pd.pivot_table(df, index=["REGION", "COUNTRY"],
                       values=["Value_1",
                               "Value_2",
                               "Value_3"],
                       columns=["Type"],
                       aggfunc=[np.mean, np.sum, np.count_nonzero],
                       fill_value=0)
What I would like to do is add columns showing the mean, sum and nonzero count of Value_1, Value_2 and Value_3 within each of these date ranges: <=1999, 2000-2005 and >=2006.
Is there a good way to do this using a pandas pivot table, or should I be using another method?
Df:
Ref REGION COUNTRY Type Value_2 Value_3 Value_1 Year
0 2 Yorkshire & The Humber England Private 25.0 NaN 25.0 1987
1 7 Yorkshire & The Humber England Voluntary/Charity 30.0 NaN 30.0 1990
2 9 Yorkshire & The Humber England Private 17.0 2.0 21.0 1991
3 10 Yorkshire & The Humber England Private 18.0 5.0 28.0 1992
4 14 Yorkshire & The Humber England Private 32.0 0.0 32.0 1990
5 17 Yorkshire & The Humber England Private 22.0 5.0 32.0 1987
6 18 Yorkshire & The Humber England Private 19.0 3.0 25.0 1987
7 19 Yorkshire & The Humber England Private 35.0 3.0 41.0 1990
8 23 Yorkshire & The Humber England Voluntary/Charity 25.0 NaN 25.0 1987
9 24 Yorkshire & The Humber England Private 31.0 2.0 35.0 1988
10 25 Yorkshire & The Humber England Voluntary/Charity 32.0 NaN 32.0 1987
11 29 Yorkshire & The Humber England Private 21.0 2.0 25.0 1987
12 30 Yorkshire & The Humber England Voluntary/Charity 17.0 1.0 19.0 1987
13 31 Yorkshire & The Humber England Private 27.0 3.0 33.0 2000
14 49 Yorkshire & The Humber England Private 12.0 3.0 18.0 1992
15 51 Yorkshire & The Humber England Private 19.0 4.0 27.0 1989
16 52 Yorkshire & The Humber England Private 11.0 NaN 11.0 1988
17 57 Yorkshire & The Humber England Private 28.0 2.0 32.0 1988
18 61 Yorkshire & The Humber England Private 20.0 5.0 30.0 1987
19 62 Yorkshire & The Humber England Private 36.0 2.0 40.0 1987
20 65 Yorkshire & The Humber England Voluntary/Charity 16.0 NaN 16.0 1988

First use cut on column Year and then aggregate with DataFrameGroupBy.agg:
lab = ['<=1999', '2000-2005', '>=2006']
s = pd.cut(df['Year'], bins=[-np.inf, 1999, 2005, np.inf], labels=lab)
# if only a Date column exists:
# s = pd.cut(df['Date'].dt.year, bins=[-np.inf, 1999, 2005, np.inf], labels=lab)
f = lambda x: np.count_nonzero(x)
table = (df.groupby(["REGION", "COUNTRY", s])
           .agg({'Value_1': 'mean', 'Value_2': 'sum', 'Value_3': f})
           .reset_index())
print(table)
REGION COUNTRY Year Value_1 Value_2 Value_3
0 Yorkshire & The Humber England <=1999 27.2 466.0 19.0
1 Yorkshire & The Humber England 2000-2005 33.0 27.0 1.0
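
If you'd rather keep the pivot-table layout, the binned years can be passed straight into pd.pivot_table as an extra index level, since pivot_table accepts a Series alongside column names. A minimal sketch, assuming df and the bins above:
lab = ['<=1999', '2000-2005', '>=2006']
s = pd.cut(df['Year'], bins=[-np.inf, 1999, 2005, np.inf], labels=lab)

# The binned Series acts as an additional grouping key.
table = pd.pivot_table(df,
                       index=["REGION", "COUNTRY", s],
                       values=["Value_1", "Value_2", "Value_3"],
                       columns=["Type"],
                       aggfunc=[np.mean, np.sum, np.count_nonzero],
                       fill_value=0)
Each (mean, sum, count_nonzero) block then appears once per date range instead of once overall.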

Related

Cumulative sum of rows in Python Pandas

I'm working on a dataframe in which I have a value for each year and state:
0 State 1965 1966 1967 1968
1 Alabama 20.2 40 60.3 80
2 Alaska 10 15 18 20
3 Arizona 5 5 10 12
I need each value to be the sum of the previous value and the current one:
0 State 1965 1966 1967 1968
1 Alabama 20.2 60.2 120.5 200.5
2 Alaska 10 25 43 63
3 Arizona 5 10 20 32
I tried df['sum'] = df.sum(axis=1) and .cumsum, but I don't know how to apply them to my problem, as I don't want a new column with the total sum.
Use DataFrame.cumsum with axis=1 and convert the non-numeric column State to the index:
df = df.set_index('State').cumsum(axis=1)
print (df)
1965 1966 1967 1968
State
Alabama 20.2 60.2 120.5 200.5
Alaska 10.0 25.0 43.0 63.0
Arizona 5.0 10.0 20.0 32.0
Or select all columns without first and assign back:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
State 1965 1966 1967 1968
0
1 Alabama 20.2 60.2 120.5 200.5
2 Alaska 10.0 25.0 43.0 63.0
3 Arizona 5.0 10.0 20.0 32.0
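
If State should end up as an ordinary column again afterwards, the first approach round-trips cleanly; a small sketch, same data assumed:
df = df.set_index('State').cumsum(axis=1).reset_index()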

Merge 'column attributes' of a single column into separate columns, to lower the number of dummy variables of that single column

If a column has, for example, 14 unique values (as listed by value_counts()) and some of them have something in common - in this example, when we group by 'Loan.Purpose' and compute the mean of 'Interest.Rate' for each unique value, several values come out with the same average rate. For instance, 'car', 'educational' and 'major_purchase' all have mean = 11.0, so I want to merge those three values under the name "LP_cem" because they share a mean, and likewise for the other values.
That way I can reduce the number of dummy variables from 14 to 4.
Basically, I want to merge the 14 different values into 3/4 groups based on their mean() and then create dummies out of those groups, like the example given below:
LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0 0 1 0 0
1 0 0 1 0 0
2 0 0 1 0 0
3 0 0 1 0 0
4 0 1 0 0 0
raw_data['Loan.Purpose'].value_counts()
debt_consolidation 1306
credit_card 443
other 200
home_improvement 151
major_purchase 101
small_business 86
car 50
wedding 39
medical 30
moving 28
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
I have grouped the values in Loan.Purpose based on the mean of Interest.Rate:
raw_data_8 = round(raw_data_5.groupby('Loan.Purpose')['Interest.Rate'].mean())
raw_data_8
Loan.Purpose
CHOS 15.0
DM 12.0
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
Now I want to club the values with the same means together. I tried the code below, but it gives an error:
for i in range(len(raw_data_5.index)):
    if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
    if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
    if raw_data_5['Loan.Purpose'][i] in ['credit_card','house','other','small_business']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'chos'
    if raw_data_5['Loan.Purpose'][i] in ['debt_consolidation','moving']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'dcm'
The error:
TypeError                                 Traceback (most recent call last)
<ipython-input-51-cf7ef2ae1efd> in <module>
----> 1 for i in range(raw_data_5.index):
      2     if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
      3         raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
      4     if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
      5         raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
TypeError: 'Int64Index' object cannot be interpreted as an integer
Interest.Rate Loan.Length Loan.Purpose
0 8.90 36.0 debt_consolidation
1 12.12 36.0 debt_consolidation
2 21.98 60.0 debt_consolidation
3 9.99 36.0 debt_consolidation
4 11.71 36.0 credit_card
5 15.31 36.0 other
6 7.90 36.0 debt_consolidation
7 17.14 60.0 credit_card
8 14.33 36.0 credit_card
10 19.72 36.0 moving
11 14.27 36.0 debt_consolidation
12 21.67 60.0 debt_consolidation
13 8.90 36.0 debt_consolidation
14 7.62 36.0 debt_consolidation
15 15.65 60.0 debt_consolidation
16 12.12 36.0 debt_consolidation
17 10.37 60.0 debt_consolidation
18 9.76 36.0 credit_card
19 9.99 60.0 debt_consolidation
20 21.98 36.0 debt_consolidation
21 19.05 60.0 credit_card
22 17.99 60.0 car
23 11.99 36.0 credit_card
24 16.82 60.0 vacation
25 7.90 36.0 debt_consolidation
26 14.42 36.0 debt_consolidation
27 15.31 36.0 debt_consolidation
28 8.59 36.0 other
29 7.90 36.0 debt_consolidation
30 21.00 60.0 debt_consolidation
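
Two things stand out in the failing loop: range() needs an integer, not an Int64Index (hence the traceback), and .iloc indexes by integer position, so .iloc[i, 'Loan.Purpose'] mixes a position with a label (.loc is the label-based indexer). The loop can also be replaced wholesale by a vectorised mapping; a minimal sketch, assuming raw_data_5 as above and using the group names the loop intends:
# Hypothetical grouping map built from the shared means listed above.
group_map = {
    'car': 'cem', 'educational': 'cem', 'major_purchase': 'cem',
    'home_improvement': 'hmvw', 'medical': 'hmvw',
    'vacation': 'hmvw', 'wedding': 'hmvw',
    'credit_card': 'chos', 'house': 'chos',
    'other': 'chos', 'small_business': 'chos',
    'debt_consolidation': 'dcm', 'moving': 'dcm',
}
raw_data_5['Loan.Purpose'] = raw_data_5['Loan.Purpose'].replace(group_map)

# Dummies then come out per group instead of per original value.
dummies = pd.get_dummies(raw_data_5['Loan.Purpose'], prefix='LP')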

How to fill missing values relative to a value from another column

I'd like to fill missing values with conditions relative to the country:
For example, I'd like to replace China's missing values with the mean of Age, and USA's with the median of Age. For now, I don't want to touch EU's missing values.
How could I do it?
Below the dataframe
import pandas as pd
data = [['USA', None], ['EU', 15], ['China', 35],
        ['USA', 45], ['EU', 30], ['China', None],
        ['USA', 28], ['EU', 26], ['China', 78],
        ['USA', 65], ['EU', 53], ['China', 66],
        ['USA', 32], ['EU', None], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Country', 'Age'])
df.head(11)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
Not sure if this is the best way to do it, but it is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the Age column beforehand so that the USA median comes from the original Age series, not from one that already contains the mean filled in for China. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this:
import numpy as np

df['Age'] = np.where((df['Country'] == 'China') & (df['Age'].isnull()),
                     df['Age'].mean(),
                     np.where((df['Country'] == 'USA') & (df['Age'].isnull()),
                              df['Age'].median(),
                              df['Age'])).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name & fill type (i.e. mean, median); you can extend the function to add your own agg types.
It returns the modified data frame, so you can use it to assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments, dataframe, country & fill_type:
    fill_type is the method used to fill `NA` values, mean, median, etc.
    """
    fill_dict = dataframe.loc[dataframe['Country'] == country]\
        .groupby("Country")["Age"].agg(
            ["mean", "median"]).to_dict(orient='index')
    dataframe.loc[dataframe['Country'] == country, 'Age'] \
        = dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe

print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
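
The helper function isn't strictly required; the same per-country fill can be written with groupby().transform, which broadcasts each country's statistic back onto its rows. A minimal sketch, assuming the same df:
means = df.groupby('Country')['Age'].transform('mean')
medians = df.groupby('Country')['Age'].transform('median')

# Assign only on the rows of the relevant country; EU stays untouched.
df.loc[df['Country'] == 'China', 'Age'] = df['Age'].fillna(means)
df.loc[df['Country'] == 'USA', 'Age'] = df['Age'].fillna(medians)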

How to filter out values from a pandas data frame for which only one occurrence exists

I have a Pandas data frame with the following columns and values
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
3 187.6 25.0 9.8
4 227.0 25.0 14.2
5 227.0 30.0 16.2
6 118.5 25.0 8.7
The data frame, assigned to the variable name df, has four distinct Temp values, which are 335.0, 187.6, 227.0, and 118.5; however, the values 187.6 and 118.5 occur only once. I would like to filter the data frame so that it gets rid of values that occur only once, so the final data frame looks like:
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
Obviously, in this simple case I know which values occur only once and can simply use a filtering function to weed them out. However, I would like to automate the process so that Python determines which values occur only once and filters them autonomously. How can I do this?
Using duplicated
df[df.Temp.duplicated(keep=False)]
Out[630]:
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
Try this:
df['count'] = df.groupby('Temp')['Temp'].transform('count')
df = df[df['count'] > 1]
df = df.drop(['count'], axis=1)
dict
This is a dict approach to the same thing done by WeNYoBen:
seen = {}
for t in df.Temp:
    seen[t] = t in seen  # ends up True only for values seen more than once
df[df.Temp.map(seen)]
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
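
Another common idiom filters on value_counts() directly; a minimal sketch, same df assumed:
counts = df['Temp'].value_counts()
df[df['Temp'].isin(counts[counts > 1].index)]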

pd.merge is not working as usual

Hi all,
I have two dataframes: allHoldings and Longswap
allHoldings
prime_broker_id country_name position_type
0 CS UNITED STATES LONG
1 ML UNITED STATES LONG
2 CS AUSTRIA SHORT
3 HSBC FRANCE LONG
4 CITI UNITED STATES SHORT
11 DB UNITED STATES SHORT
12 JPM UNITED STATES SHORT
13 CS ITALY SHORT
14 CITI TAIWAN SHORT
15 CITI UNITED KINGDOM LONG
16 DB FRANCE LONG
17 ML SOUTH KOREA LONG
18 CS AUSTRIA SHORT
19 CS JAPAN LONG
26 HSBC FRANCE SHORT
and Longswap
prime_broker_id country_name longSpread
0 ML AUSTRALIA 30.0
1 ML AUSTRIA 30.0
2 ML BELGIUM 30.0
3 ML BRAZIL 50.0
4 ML CANADA 20.0
5 ML CHILE 50.0
6 ML CHINA - A 75.0
7 ML CZECH REPUBLIC 45.0
8 ML DENMARK 30.0
9 ML EGYPT 45.0
10 ML FINLAND 30.0
11 ML FRANCE 30.0
12 ML GERMANY 30.0
13 ML HONG KONG 30.0
14 ML HUNGARY 45.0
15 ML INDIA 75.0
16 ML INDONESIA 75.0
17 ML IRELAND 30.0
18 ML ISRAEL 45.0
19 ML ITALY 30.0
20 ML JAPAN 30.0
21 ML SOUTH KOREA 50.0
22 ML LUXEMBOURG 30.0
23 ML MALAYSIA 75.0
24 ML MEXICO 50.0
25 ML NETHERLANDS 30.0
26 ML NEW ZEALAND 30.0
27 ML NORWAY 30.0
28 ML PHILIPPINES 75.0
I have left joined many dataframes before, but I am still puzzled as to why it is not working for this example.
Here is my code:
allHoldings = pd.merge(allHoldings, Longswap, how='left',
                       left_on=['prime_broker_id', 'country_name'],
                       right_on=['prime_broker_id', 'country_name'])
My results are:
prime_broker_id country_name position_type longSpread
0 CS UNITED STATES LONG NaN
1 ML UNITED STATES LONG NaN
2 CS AUSTRIA SHORT NaN
3 HSBC FRANCE LONG NaN
4 CITI UNITED STATES SHORT NaN
5 DB UNITED STATES SHORT NaN
6 JPM UNITED STATES SHORT NaN
7 CS ITALY SHORT NaN
As you can see, the longSpread column is all NaN, which does not make any sense; from the Longswap dataframe, this column should be populated.
I am not sure why the left join is not working here.
Any Help is appreciated.
Here is the answer: strip the whitespace from the join key, and the left join succeeds.
allHoldings['prime_broker_id'] = allHoldings.prime_broker_id.str.strip()
allHoldings.prime_broker_id.unique()
array(['CS', 'ML', 'HSBC', 'CITI', 'DB', 'JPM', 'WFPBS'], dtype=object)
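
More generally, every key column in both frames can be normalised before merging; a minimal sketch, assuming the frames above:
for frame in (allHoldings, Longswap):
    for col in ('prime_broker_id', 'country_name'):
        frame[col] = frame[col].str.strip()

allHoldings = allHoldings.merge(Longswap, how='left',
                                on=['prime_broker_id', 'country_name'])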
