Pandas create percentile field based on groupby with level 1 - python-3.x

Given the following data frame:
import pandas as pd
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})
I'd like to create a new field that calculates the percentile of each value of "sum" per group in "group". The trouble is, I have two header rows (a MultiIndex on the columns) and cannot figure out how to avoid getting the error:
ValueError: level > 0 only valid with MultiIndex
when I run this:
df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)
I need to keep the headers in the same structure.
Thanks in advance!

To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column, use:
In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
In [107]: df
Out[107]:
  Group  sum      rank
  group  sum
0     a  234  0.833333
1     a  234  0.833333
2     a  544  0.333333
3     b    7  1.000000
4     b  332  0.666667
5     b  766  0.333333
Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.
import scipy.stats as stats
df['percentile'] = (df[('sum', 'sum')]
                    .groupby(df[('Group', 'group')])
                    .apply(lambda ser: 100 - pd.Series(
                        [stats.percentileofscore(ser, x, kind='rank') for x in ser],
                        index=ser.index)))
yields
  Group  sum      rank  percentile
  group  sum
0     a  234  0.833333   50.000000
1     a  234  0.833333   50.000000
2     a  544  0.333333    0.000000
3     b    7  1.000000   66.666667
4     b  332  0.666667   33.333333
5     b  766  0.333333    0.000000
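If the new column also needs to sit under the two-level header the question wants to keep, one hedged option is to assign it with a two-level key so the column MultiIndex stays intact; the ('rank', 'rank') label below is just an illustrative choice, not something from the original post.
import pandas as pd

df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})

# Assigning with a tuple key keeps the two-level column structure.
# ('rank', 'rank') is a hypothetical label; any (upper, lower) pair works.
df[('rank', 'rank')] = (df[('sum', 'sum')]
                        .groupby(df[('Group', 'group')])
                        .rank(pct=True, ascending=False))
print(df)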

Related

How to convert values of panda dataframe to columns

I have a dataset given below:
weekid  type  amount
1       A     10
1       B     20
1       C     30
1       D     40
1       F     50
2       A     70
2       E     80
2       B     100
I am trying to convert it into another pandas frame based on the unique type values, starting from:
import pandas as pd
import numpy as np

df = pd.read_csv(INPUT_FILE)
for type in df["type"].unique():
    # todo
My aim is to get the data given below:
weekid  type_A  type_B  type_C  type_D  type_E  type_F
1       10      20      30      40      0       50
2       70      100     0       0       80      0
Is there any specific function that converts the unique values into columns and fills the missing values with 0 for each weekid group? I am wondering how this conversion can be done efficiently.
You can use the following:
dfp = df.pivot(index='weekid', columns=['type'], values=['amount'])
dfp = dfp.fillna(0)
dfp.columns = dfp.columns.droplevel(0)
Given your input this yields:
type       A      B     C     D     E     F
weekid
1       10.0   20.0  30.0  40.0   0.0  50.0
2       70.0  100.0   0.0   0.0  80.0   0.0
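If you want to match the exact layout from the question (columns prefixed with type_, missing combinations filled with 0, weekid kept as a column), a hedged sketch along these lines should work; the inline sample frame simply stands in for the CSV the question reads.
import pandas as pd

# Sample data mirroring the question's table.
df = pd.DataFrame({
    'weekid': [1, 1, 1, 1, 1, 2, 2, 2],
    'type':   ['A', 'B', 'C', 'D', 'F', 'A', 'E', 'B'],
    'amount': [10, 20, 30, 40, 50, 70, 80, 100],
})

# pivot_table fills absent (weekid, type) pairs with 0; add_prefix builds the type_X names.
out = (df.pivot_table(index='weekid', columns='type',
                      values='amount', fill_value=0)
         .add_prefix('type_')
         .reset_index())
print(out)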

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
# Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing,
                              'missing_ratio': ratio_missing},
                        index=data.columns.values)

find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
   A    B    C    D
 123  ABC    X    Y
 123  ABC    X    Y
 NaN  ABC  NaN  NaN
 123  ABC  NaN  NaN
 245  ABC  NaN  NaN
 345  ABC  NaN  NaN
In the output I would just want to see:
   missing_count  missing_ratio
C              4           0.66
D              4           0.66
and not the columns A and B, as there are no missing values there.
Use DataFrame.isna with DataFrame.sum to count the missing values per column. DataFrame.isnull can be used instead of DataFrame.isna.
new_df = (df.isna()
            .sum()
            .to_frame('missing_count')
            .assign(missing_ratio=lambda x: x['missing_count'] / len(df))
            .loc[df.isna().any()])
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
                     count.div(len(df)).rename('missing_ratio')],
                    axis=1)
            .loc[count.ne(0)])
Output
   missing_count  missing_ratio
A              1       0.166667
C              4       0.666667
D              4       0.666667
IIUC, we can assign the missing and total counts to two variables, do some basic math, and assign the result back to a DataFrame.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0), 2)

missing_df = pd.DataFrame({'missing_vals': a,
                           'missing_ratio': b})
print(missing_df)
   missing_vals  missing_ratio
A             1           0.17
B             0           0.00
C             4           0.67
D             4           0.67
You can filter out columns that don't have any missing values:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
   missing_vals  missing_ratio
A             1           0.17
C             4           0.67
D             4           0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s / len(df)], axis=1)
result.columns = ["missing_count", "missing_ratio"]
print(result)
   missing_count  missing_ratio
A              1       0.166667
B              0       0.000000
C              4       0.666667
D              4       0.666667
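As a compact variant (a sketch of my own, not taken from the answers above): isna().sum() and isna().mean() give the count and ratio in one pass, and a boolean filter keeps only the columns that actually have missing data.
import numpy as np
import pandas as pd

# Small frame mirroring the question's sample data.
df = pd.DataFrame({'A': [123, 123, np.nan, 123, 245, 345],
                   'B': ['ABC'] * 6,
                   'C': ['X', 'X', np.nan, np.nan, np.nan, np.nan],
                   'D': ['Y', 'Y', np.nan, np.nan, np.nan, np.nan]})

na = df.isna()
summary = pd.DataFrame({'missing_count': na.sum(),
                        'missing_ratio': na.mean()})
# keep only columns with at least one missing value
print(summary[summary['missing_count'] > 0])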

How to compare strings in two data frames in pandas (python 3.x)?

I have two DFs like so:
df1:
ProjectCode  ProjectName
1            project1
2            project2
3            projc3
4            prj4
5            prjct5
and df2 as
VillageName
v1
proj3
pro1
prjc3
project1
What I have to do is compare each ProjectName with each VillageName and also add the percentage of matching. The percentage is to be calculated as:
No. of matching characters / total characters * 100
The village data, i.e. df2, has more than 10 million records, and the project data, i.e. df1, contains around 1200 records.
What I have done so far:
import pandas as pd

df1 = pd.read_excel("C:\\Users\\Desktop\\distinctVillage.xlsx")
df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")

for idx, row in df.iterrows():
    for idx1, row1 in df1.iterrows():
        pass  # comparison logic to be filled in
I don't know how to proceed with this. How do I find the matching substrings and build a third df with the percentage match for each string? I think it may not be feasible, since each Project record will be compared against each Village value, which will produce a huge result.
Is there any better way to find out which project names match which village names, and how good each match is?
Expected output:
ProjectName  VillageName  charactersMatching  PercentageMatch
project1     v1           1                   whateverPercent
project1     proj3        4                   whateverPercent
The expected output can be changed depending on the feasibility and solution.
The following code assumes you don't care about repeated characters (since it's taking the set on both sides).
percentage_match = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(
        lambda y: len(set(y).intersection(set(x))) / len(set(x + y))))
Output:
                    0         1         2         3         4
ProjectCode
1            0.111111  0.444444  0.500000  0.444444  1.000000
2            0.000000  0.444444  0.333333  0.444444  0.777778
3            0.000000  0.833333  0.428571  0.833333  0.555556
4            0.000000  0.500000  0.333333  0.500000  0.333333
5            0.000000  0.375000  0.250000  0.571429  0.555556
If you want the 'best match' for each Project:
percentage_match.idxmax(axis = 1)
Output:
1 4
2 4
3 1
4 1
5 3
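The question's formula (number of matching characters / total characters * 100) is closer to a sequence-similarity ratio than to the set intersection above. Here is a hedged sketch using the standard-library difflib, with small inline frames standing in for the real Excel files; SequenceMatcher.ratio() is 2*M/T (M = matching characters, T = combined length), a close stand-in for the requested percentage rather than an exact reproduction of it.
import difflib
import pandas as pd

df1 = pd.DataFrame({'ProjectName': ['project1', 'project2', 'projc3', 'prj4', 'prjct5']})
df2 = pd.DataFrame({'VillageName': ['v1', 'proj3', 'pro1', 'prjc3', 'project1']})

rows = []
for proj in df1['ProjectName']:
    for village in df2['VillageName']:
        matcher = difflib.SequenceMatcher(None, proj, village)
        # total size of all matching blocks = number of matching characters
        matching = sum(block.size for block in matcher.get_matching_blocks())
        rows.append({'ProjectName': proj,
                     'VillageName': village,
                     'charactersMatching': matching,
                     'PercentageMatch': matcher.ratio() * 100})

matches = pd.DataFrame(rows)
print(matches.head())
Note that a plain nested loop like this is only for illustration; with 10 million villages and 1200 projects you would want a vectorized or dedicated fuzzy-matching approach.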

pandas apply custom function to every row of one column grouped by another column

I have a dataframe containing two columns: id and val.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 'val': np.random.randn(10)})
   id       val
0   1  2.644347
1   1  0.378770
2   1 -2.107230
3   2 -0.043051
4   2  0.115948
5   2  0.054485
6   3  0.574845
7   3 -0.228612
8   3 -2.648036
9   3  0.569929
And I want to apply a custom function to every val according to id. Let's say I want to apply min-max scaling. This is how I would do it using a for loop:
df['scaled'] = 0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    df1 = df[df.id == ids.iloc[i]]
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']
And the result is:
   id       val    scaled
0   1  0.457713  1.000000
1   1 -0.464513  0.000000
2   1  0.216352  0.738285
3   2  0.633652  0.990656
4   2 -1.099065  0.000000
5   2  0.649995  1.000000
6   3 -0.251099  0.306631
7   3 -1.003295  0.081387
8   3  2.064389  1.000000
9   3 -1.275086  0.000000
How can I do this faster without a loop?
You can do this with groupby:
In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())
In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]:
0 0.000000
1 1.000000
2 0.654490
3 1.000000
4 0.524256
5 0.000000
6 0.000000
7 0.100238
8 0.014697
9 1.000000
Name: val, dtype: float64
(Note that np.ptp() / peak-to-peak can be used in place of s.max() - s.min().)
This applies the function minmaxscale() to each smaller-sized Series of val, grouped by id.
Taking the first group, for example:
In [11]: s = df[df.id == 1]['val']
In [12]: s
Out[12]:
0 0.002722
1 0.656233
2 0.430438
Name: val, dtype: float64
In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447
In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]:
0 0.00000
1 1.00000
2 0.65449
Name: val, dtype: float64
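A hedged aside, reusing the df and minmaxscale() defined above: groupby().transform returns a result aligned to df's original index, so the scaled values can be assigned straight back as a new column.
# transform preserves the original row index, so the result lines up with df row for row
df['scaled'] = df.groupby('id')['val'].transform(minmaxscale)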
A solution using sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['new'] = np.concatenate([scaler.fit_transform(x.values.reshape(-1, 1))
                            for y, x in df.groupby('id').val])
df
Out[271]:
   id       val    scaled       new
0   1  0.457713  1.000000  1.000000
1   1 -0.464513  0.000000  0.000000
2   1  0.216352  0.738285  0.738284
3   2  0.633652  0.990656  0.990656
4   2 -1.099065  0.000000  0.000000
5   2  0.649995  1.000000  1.000000
6   3 -0.251099  0.306631  0.306631
7   3 -1.003295  0.081387  0.081387
8   3  2.064389  1.000000  1.000000
9   3 -1.275086  0.000000  0.000000

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, keyed by date and asset Id, like this:
[Date]  [Asset]  [1DRet]
jan     asset1   0.52
jan     asset2   0.12
jan     asset3   0.07
feb     asset1   0.41
feb     asset2   0.33
feb     asset3   0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2 ; A1,A3 ; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef, and SciPy, but I am not able to do it for my n-variable dataframe.
I see some examples, but they are always for a dataframe with one asset per column and one row per day.
This my code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head())
        Date  Id_RiskFactor         1DReturn
0 2010-09-02            149            0E-12
1 2010-09-02            150  -0.004242875148
2 2010-09-02             33   0.000590000011
3 2010-09-02             28   0.000099999997
4 2010-09-02             34  -0.000010000000
print(df.head())
        Date  Id_RiskFactor           Value         1DReturn         5DReturn
0 2010-09-02            149  0.040096000000            0E-12            0E-12
1 2010-09-02            150  1.736700000000  -0.004242875148  -0.013014321215
2 2010-09-02             33  2.283000000000   0.000590000011   0.001260000048
3 2010-09-02             28  2.113000000000   0.000099999997   0.000469999999
4 2010-09-02             34  0.615000000000  -0.000010000000   0.000079999998
print(corr.columns)
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
    daily_return       date symbol
0       0.011467 2015-01-01      A
1       0.613518 2015-01-02      A
2       0.334343 2015-01-03      A
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
5       0.431729 2015-01-01      B
6       0.474905 2015-01-02      B
7       0.372366 2015-01-03      B
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
10      0.946504 2015-01-01      C
11      0.337204 2015-01-02      C
12      0.798704 2015-01-03      C
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2         A         B         C
symbol_1
A         1.000000  0.188065 -0.745115
B         0.188065  1.000000 -0.688808
C        -0.745115 -0.688808  1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
    daily_return       date symbol
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
  symbol_1 symbol_2         0
0        A        A  1.000000
1        A        B  0.188065
2        A        C -0.745115
3        B        A  0.188065
4        B        B  1.000000
5        B        C -0.688808
6        C        A -0.745115
7        C        B -0.688808
8        C        C  1.000000
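Since the question asks for each asset combination only once (A1,A2 ; A1,A3 ; A2,A3), here is a hedged follow-up sketch reusing the corr matrix built above: masking with the strictly upper triangle before stacking keeps each pair once and drops the self-correlations.
import numpy as np

# keep only the strictly upper triangle (each pair once, no diagonal)
pairs = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
             .stack()
             .reset_index(name='correlation'))
print(pairs)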
