pandas apply custom function to every row of one column grouped by another column - python-3.x

I have a dataframe containing two columns: id and val.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 'val': np.random.randn(10)})
   id       val
0   1  2.644347
1   1  0.378770
2   1 -2.107230
3   2 -0.043051
4   2  0.115948
5   2  0.054485
6   3  0.574845
7   3 -0.228612
8   3 -2.648036
9   3  0.569929
And I want to apply a custom function to every val according to id. Let's say I want to apply min-max scaling. This is how I would do it using a for loop:
df['scaled'] = 0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    df1 = df[df.id == ids.iloc[i]]
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']
And the result is:
   id       val    scaled
0   1  0.457713  1.000000
1   1 -0.464513  0.000000
2   1  0.216352  0.738285
3   2  0.633652  0.990656
4   2 -1.099065  0.000000
5   2  0.649995  1.000000
6   3 -0.251099  0.306631
7   3 -1.003295  0.081387
8   3  2.064389  1.000000
9   3 -1.275086  0.000000
How can I do this faster without a loop?

You can do this with groupby:
In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())
In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]:
0 0.000000
1 1.000000
2 0.654490
3 1.000000
4 0.524256
5 0.000000
6 0.000000
7 0.100238
8 0.014697
9 1.000000
Name: val, dtype: float64
(Note that np.ptp() / peak-to-peak can be used in place of s.max() - s.min().)
This applies the function minmaxscale() to each smaller-sized Series of val, grouped by id.
Taking the first group, for example:
In [11]: s = df[df.id == 1]['val']
In [12]: s
Out[12]:
0 0.002722
1 0.656233
2 0.430438
Name: val, dtype: float64
In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447
In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]:
0 0.00000
1 1.00000
2 0.65449
Name: val, dtype: float64
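If you want the scaled values written back into the original frame as a new column (as in the loop version), groupby().transform() returns a result aligned to df's index, so it can be assigned directly. A minimal sketch reusing minmaxscale() from above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'val': np.random.randn(10)})

def minmaxscale(s):
    # scale a Series to the [0, 1] range
    return (s - s.min()) / (s.max() - s.min())

# transform keeps the original row order and index, so the result
# can be assigned straight back as a column
df['scaled'] = df.groupby('id')['val'].transform(minmaxscale)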

Solution using sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['new'] = np.concatenate([scaler.fit_transform(x.values.reshape(-1, 1))
                            for y, x in df.groupby('id').val])
df
Out[271]:
   id       val    scaled       new
0   1  0.457713  1.000000  1.000000
1   1 -0.464513  0.000000  0.000000
2   1  0.216352  0.738285  0.738284
3   2  0.633652  0.990656  0.990656
4   2 -1.099065  0.000000  0.000000
5   2  0.649995  1.000000  1.000000
6   3 -0.251099  0.306631  0.306631
7   3 -1.003295  0.081387  0.081387
8   3  2.064389  1.000000  1.000000
9   3 -1.275086  0.000000  0.000000
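Note that the list comprehension above only lines up because the rows of df are already sorted by id, so the concatenated per-group results land back in the original row order. If that isn't guaranteed, a transform keeps the index alignment explicit. A sketch of that variant, using the same columns as above:

from sklearn.preprocessing import MinMaxScaler

# transform aligns the result to df's index regardless of row order
df['new'] = df.groupby('id')['val'].transform(
    lambda s: MinMaxScaler().fit_transform(s.values.reshape(-1, 1)).ravel())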

Related

How to plot median value on boxplot?

I have this boxplot, but it is hard to read the value of the green (median) line. How can I plot that value? Or is there another way?
One thing you can do is plot the median values directly, computed from the dataframe:
In [204]: df
Out[204]:
          a         b         c         d
0  0.807540  0.719843  0.291329  0.928670
1  0.449082  0.000000  0.575919  0.299698
2  0.703734  0.626004  0.582303  0.243273
3  0.363013  0.539557  0.000000  0.743613
4  0.185610  0.526161  0.795284  0.929223
5  0.000000  0.323683  0.966577  0.259640
6  0.000000  0.386281  0.000000  0.000000
7  0.500604  0.131910  0.413131  0.936908
8  0.992779  0.672049  0.108021  0.558684
9  0.797385  0.199847  0.329550  0.605690
In [205]: df.boxplot()
Out[205]: <matplotlib.axes._subplots.AxesSubplot at 0x103404d6a0>
In [206]: for s, x, y in zip(df.median().map('{:0.2f}'.format), np.arange(df.shape[1]) + 1, df.median()):
     ...:     plt.text(x, y, s, va='center', ha='center')
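Outside an IPython session, the same idea as a small self-contained sketch (the random data and column names are just placeholders):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))

ax = df.boxplot()
# one label per column, centred on each box's median line
for s, x, y in zip(df.median().map('{:0.2f}'.format),
                   np.arange(df.shape[1]) + 1,
                   df.median()):
    ax.text(x, y, s, va='center', ha='center')
plt.show()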

how to compare strings in two data frames in pandas (python 3.x)?

I have two DFs like so:
df1:
ProjectCode ProjectName
1 project1
2 project2
3 projc3
4 prj4
5 prjct5
and df2 as
VillageName
v1
proj3
pro1
prjc3
project1
What I have to do is compare each ProjectName with each VillageName and also add the percentage of matching. The percentage is to be calculated as:
No. of matching characters/total characters * 100
The Village data i.e. df2 has more than 10 million records and the Project data i.e. df1 contains around 1200 records.
What I have done so far:
import pandas as pd
df1 = pd.read_excel("C:\\Users\\Desktop\\distinctVillage.xlsx")
df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
for idx, row in df.iterrows():
    for idx1, row1 in df1.iterrows():
I don't know how to proceed from here. How do I find the matching substrings and build a third df with the percentage match for each pair of strings? I suspect a naive approach is not feasible, since matching every Project record against every Village value will produce a huge result.
Is there any better way to find out which project names are matching with which village names and how good is the match?
Expected output:
ProjectName VillageName charactersMatching PercentageMatch
project1 v1 1 whateverPercent
project1 proj3 4 whateverPercent
The expected output can be changed depending on the feasibility and solution.
The following code assumes you don't care about repeated characters (since it's taking the set on both sides).
percentage_match = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(
        lambda y: len(set(y).intersection(set(x))) / len(set(x + y))))
Output:
                    0         1         2         3         4
ProjectCode
1            0.111111  0.444444  0.500000  0.444444  1.000000
2            0.000000  0.444444  0.333333  0.444444  0.777778
3            0.000000  0.833333  0.428571  0.833333  0.555556
4            0.000000  0.500000  0.333333  0.500000  0.333333
5            0.000000  0.375000  0.250000  0.571429  0.555556
If you want the 'best match' for each Project:
percentage_match.idxmax(axis = 1)
Output:
1 4
2 4
3 1
4 1
5 3
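If "matching characters" should take order and repeats into account rather than just the set of characters, difflib from the standard library gives a similarity ratio. A hedged sketch, assuming the same df1/df2 as above:

import difflib

def match_percentage(a, b):
    # 2 * matches / (len(a) + len(b)), expressed as a percentage
    return difflib.SequenceMatcher(None, a, b).ratio() * 100

similarity = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(lambda y: match_percentage(x, y)))

With roughly 10 million village names this is still an all-pairs comparison, so it will be slow; pre-filtering candidates (for example by length or leading characters) before scoring helps.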

Merge distance matrix results and original indices with Python Pandas

I have a pandas df with a list of bus stops and their geolocations:
   stop_id   stop_lat   stop_lon
0        1  32.183939  34.917812
1        2  31.870034  34.819541
2        3  31.984553  34.782828
3        4  31.888550  34.790904
4        6  31.956576  34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But I can't manage to easily connect the two. I want a df that contains a row for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, converting it to a vector, and all sorts of things, but I feel I'm just over-complicating it, with no success.
Hope this helps!
import io
import pandas as pd
from sklearn.metrics.pairwise import manhattan_distances

d = ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
df = pd.read_csv(io.StringIO(d), sep=r'\s+')

# note: this computes the distance over every column, including stop_id,
# which reproduces the matrix shown in the question; drop stop_id first if
# you only want the lat/lon distance
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)
output:
          1         2         3         4         6
1  0.000000  1.412176  2.334370  3.422297  5.247050
2  1.412176  0.000000  1.151232  2.047153  4.165126
3  2.334370  1.151232  0.000000  1.104079  3.143274
4  3.422297  2.047153  1.104079  0.000000  2.175247
6  5.247050  4.165126  3.143274  2.175247  0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist=distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
   stop_id_1  stop_id_2  distance
0          1          1  0.000000
1          1          2  1.412176
2          1          3  2.334370
3          1          4  3.422297
4          1          6  5.247050
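Because the matrix is symmetric, the long format repeats every pair in both directions and also keeps the zero self-distances. If you only want each pair of stops once, keep the strictly upper triangle before stacking. A minimal sketch continuing from distance_df above:

import numpy as np

# mask everything on or below the diagonal, then drop the resulting NaNs
upper = distance_df.where(np.triu(np.ones(distance_df.shape, dtype=bool), k=1))
pairs = upper.stack().reset_index()
pairs.columns = ['stop_id_1', 'stop_id_2', 'distance']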
Alternatively, build a DataFrame straight from the distance matrix and merge it back onto the original frame by index:
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)

Getting exponential values when using describe() in Python

I am getting exponential output when I use the describe function below:
error_df = pandas.DataFrame({'reconstruction_error': mse,
                             'true_class': dummy_y_test_o})
print(error_df)
error_df.describe()
I tried printing the values in error_df and the count is 3150. But why do I get an exponential value then?
      reconstruction_error  true_class
0                 0.000003           0
1                 0.000003           1
...                    ...         ...
3148              0.000003           2
3149              0.000003           0
[3150 rows x 2 columns]
Out[17]:
       reconstruction_error   true_class
count          3.150000e+03  3150.000000
mean           3.199467e-06     1.000000
std            1.764693e-07     0.816626
min            2.914150e-06     0.000000
25%            3.058539e-06     0.000000
50%            3.092931e-06     1.000000
75%            3.423379e-06     2.000000
max            3.737767e-06     2.000000
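For what it's worth, the e-06 values are just scientific notation: describe() formats the whole reconstruction_error column consistently, and since the errors are around 0.000003 pandas switches to exponent form. If you prefer fixed-point display, one option (an illustration, not the only way) is:

import pandas as pd

# show floats with six decimal places instead of scientific notation
pd.set_option('display.float_format', '{:.6f}'.format)
print(error_df.describe())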

Pandas create percentile field based on groupby with level 1

Given the following data frame:
import pandas as pd
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})
I'd like to create a new field which calculates the percentile of each value of "sum" per group in "group". The trouble is, I have 2 header columns and cannot figure out how to avoid getting the error:
ValueError: level > 0 only valid with MultiIndex
when I run this:
df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)
I need to keep the headers in the same structure.
Thanks in advance!
To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column use:
In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
In [107]: df
Out[107]:
  Group  sum      rank
  group  sum
0     a  234  0.833333
1     a  234  0.833333
2     a  544  0.333333
3     b    7  1.000000
4     b  332  0.666667
5     b  766  0.333333
Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.
import scipy.stats as stats

df['percentile'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')])
                    .apply(lambda ser: 100 - pd.Series([stats.percentileofscore(ser, x, kind='rank')
                                                        for x in ser], index=ser.index)))
yields
  Group  sum      rank  percentile
  group  sum
0     a  234  0.833333   50.000000
1     a  234  0.833333   50.000000
2     a  544  0.333333    0.000000
3     b    7  1.000000   66.666667
4     b  332  0.666667   33.333333
5     b  766  0.333333    0.000000
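As a quick sanity check on the rank column: group a holds the values [234, 234, 544]. With ascending=False, 544 ranks first, so its percentage rank is 1/3 = 0.333333, while the two tied 234s share the average of positions 2 and 3, i.e. 2.5/3 = 0.833333, which matches the table above.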
