How to plot median value on boxplot? - python-3.x

I have this boxplot, but it is hard to read the value of the green (median) line. How can I plot that value on the chart? Or is there another way?

One thing you can do is annotate the plot directly with the medians calculated from the DataFrame:
In [204]: df
Out[204]:
a b c d
0 0.807540 0.719843 0.291329 0.928670
1 0.449082 0.000000 0.575919 0.299698
2 0.703734 0.626004 0.582303 0.243273
3 0.363013 0.539557 0.000000 0.743613
4 0.185610 0.526161 0.795284 0.929223
5 0.000000 0.323683 0.966577 0.259640
6 0.000000 0.386281 0.000000 0.000000
7 0.500604 0.131910 0.413131 0.936908
8 0.992779 0.672049 0.108021 0.558684
9 0.797385 0.199847 0.329550 0.605690
In [205]: df.boxplot()
Out[205]: <matplotlib.axes._subplots.AxesSubplot at 0x103404d6a0>
In [206]: for s, x, y in zip(df.median().map('{:0.2f}'.format), np.arange(df.shape[1]) + 1, df.median()):
...: plt.text(x, y, s, va='center', ha='center')
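For a self-contained version outside the IPython session (a minimal sketch; the random data merely stands in for your own DataFrame):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))
ax = df.boxplot()
# pandas draws one box per column at x = 1, 2, 3, ...
for x, (col, med) in enumerate(df.median().items(), start=1):
    ax.text(x, med, '{:0.2f}'.format(med), va='center', ha='center')
plt.show()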

Related

Assigning ID to column name in merging distance matrix to dataframe

I have this issue I haven't been able to solve and I was hoping to get some insights here.
I have this geopandas dataframe:
GEO =
id geometry_zone \
0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
5 A001DS03 POLYGON ((57.07498 49.66937, 56.79702 47.84722...
6 A001DS04 POLYGON ((56.13302 49.80835, 55.83962 48.00164...
7 A001DS05 POLYGON ((55.16017 49.93189, 54.89766 48.18694...
8 A001DS06 POLYGON ((54.14099 50.05542, 53.86304 48.27959...
9 A001DS08 POLYGON ((52.22678 50.36050, 51.94821 48.52985...
10 A001DS09 POLYGON ((50.93339 48.70894, 51.96811 48.52985...
11 A001DS10 POLYGON ((50.23695 50.67887, 49.91857 48.84823...
12 A001DS11 POLYGON ((50.23695 50.67887, 49.60020 50.75847...
13 A001FS01 POLYGON ((46.47617 48.94772, 46.47617 47.63443...
14 A001FS02 POLYGON ((46.49606 50.04213, 46.47617 48.94772...
centroid
0 POINT (48.75295 49.98494)
1 POINT (60.27696 48.21993)
2 POINT (53.49869 49.22928)
3 POINT (59.29040 48.38586)
4 POINT (58.42620 48.49535)
5 POINT (57.43469 48.68996)
6 POINT (56.46528 48.82210)
7 POINT (55.50608 48.98701)
8 POINT (54.51093 49.10232)
9 POINT (52.52668 49.40021)
10 POINT (51.59314 49.51614)
11 POINT (50.57522 49.68396)
12 POINT (49.74105 49.81923)
13 POINT (47.00679 48.58955)
14 POINT (47.23437 49.55921)
where the points are the geometry_zone centroids. Now, I know how to calculate the distance between every point, i.e. compute the distance matrix:
GEO_distances
0 1 2 3 4 5 6 \
0 0.000000 11.063874 4.299228 10.275246 9.312075 8.274448 7.312941
1 10.983097 0.000000 6.348082 0.616036 1.399226 2.373198 3.374784
2 4.132203 6.259105 0.000000 5.469828 4.507633 3.469029 2.507443
3 9.982697 0.409114 5.348195 0.000000 0.399280 1.373252 2.374671
4 9.112541 1.279148 4.477119 0.487986 0.000000 0.504366 1.503677
5 8.102334 2.289412 3.468492 1.497509 0.538514 0.000000 0.494605
6 7.124643 3.266993 2.490125 2.475753 1.515950 0.474954 0.000000
7 6.151367 4.240258 1.517485 3.448859 2.489192 1.448060 0.487174
8 5.151208 5.240246 0.515855 4.450013 3.488962 2.449214 1.487936
9 3.145284 7.246023 0.481768 6.456493 5.494540 4.455695 3.494278
10 2.205711 8.185458 1.420986 7.396838 6.433798 5.396039 4.434327
11 1.174092 9.217045 2.452510 8.428427 7.465334 6.427628 5.465988
12 0.329081 10.062023 3.297427 9.273461 8.310263 7.272662 6.311059
13 1.235000 12.579303 5.838504 11.812993 10.830385 9.818336 8.852372
14 0.853558 12.484730 5.717153 11.712257 10.730567 9.711458 8.743639
7 8 9 10 11 12 13 \
0 6.343811 5.312333 3.377798 2.368462 1.343153 0.675055 1.051959
1 4.353762 5.318769 7.388784 8.269175 9.305375 10.325337 12.247130
2 1.538467 0.506829 0.544190 1.416284 2.454398 3.479383 5.430826
3 3.353424 4.318400 6.388838 7.269062 8.304972 9.325272 11.250890
4 2.482659 3.447704 5.519952 6.398068 7.434796 8.456133 10.381205
5 1.473030 2.437971 4.509526 5.388997 6.424600 7.445701 9.379809
6 0.494829 1.459821 3.533033 4.410650 5.446892 6.468964 8.405156
7 0.000000 0.486633 2.560113 3.437762 4.473614 5.495941 7.440721
8 0.518599 0.000000 1.561677 2.436310 3.473427 4.497171 6.443451
9 2.525085 1.493574 0.000000 0.429875 1.467480 2.492644 4.463771
10 3.465481 2.433809 0.499402 0.000000 0.527884 1.554218 3.540493
11 4.497042 3.465439 1.530986 0.521601 0.000000 0.523013 2.556065
12 5.342058 4.310497 2.376017 1.366597 0.341276 0.000000 1.788666
13 7.901132 6.863255 4.941781 3.928417 2.923256 2.273971 0.000000
14 7.782154 6.746808 4.815043 3.790372 2.766326 2.077512 0.492253
14
0 0.703212
1 12.250335
2 5.430658
3 11.253792
4 10.383930
5 9.382000
6 8.406976
7 7.441895
8 6.444094
9 4.461567
10 3.531133
11 2.517604
12 1.686975
13 0.444277
14 0.000000
(So the first row contains the distances from the first point to every point in the centroid column, including itself.)
What I actually want is to merge this matrix into the dataframe AND have the column names be the ids from GEO.
Now, I know how to merge:
new = GEO.merge(GEO_distances, on=['index'])
which returns:
index id geometry_zone \
0 0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
5 5 A001DS03 POLYGON ((57.07498 49.66937, 56.79702 47.84722...
6 6 A001DS04 POLYGON ((56.13302 49.80835, 55.83962 48.00164...
7 7 A001DS05 POLYGON ((55.16017 49.93189, 54.89766 48.18694...
8 8 A001DS06 POLYGON ((54.14099 50.05542, 53.86304 48.27959...
9 9 A001DS08 POLYGON ((52.22678 50.36050, 51.94821 48.52985...
10 10 A001DS09 POLYGON ((50.93339 48.70894, 51.96811 48.52985...
11 11 A001DS10 POLYGON ((50.23695 50.67887, 49.91857 48.84823...
12 12 A001DS11 POLYGON ((50.23695 50.67887, 49.60020 50.75847...
13 13 A001FS01 POLYGON ((46.47617 48.94772, 46.47617 47.63443...
14 14 A001FS02 POLYGON ((46.49606 50.04213, 46.47617 48.94772...
centroid 0 1 2 3 \
0 POINT (48.75295 49.98494) 0.000000 11.063874 4.299228 10.275246
1 POINT (60.27696 48.21993) 10.983097 0.000000 6.348082 0.616036
2 POINT (53.49869 49.22928) 4.132203 6.259105 0.000000 5.469828
3 POINT (59.29040 48.38586) 9.982697 0.409114 5.348195 0.000000
4 POINT (58.42620 48.49535) 9.112541 1.279148 4.477119 0.487986
5 POINT (57.43469 48.68996) 8.102334 2.289412 3.468492 1.497509
6 POINT (56.46528 48.82210) 7.124643 3.266993 2.490125 2.475753
7 POINT (55.50608 48.98701) 6.151367 4.240258 1.517485 3.448859
8 POINT (54.51093 49.10232) 5.151208 5.240246 0.515855 4.450013
9 POINT (52.52668 49.40021) 3.145284 7.246023 0.481768 6.456493
10 POINT (51.59314 49.51614) 2.205711 8.185458 1.420986 7.396838
11 POINT (50.57522 49.68396) 1.174092 9.217045 2.452510 8.428427
12 POINT (49.74105 49.81923) 0.329081 10.062023 3.297427 9.273461
13 POINT (47.00679 48.58955) 1.235000 12.579303 5.838504 11.812993
14 POINT (47.23437 49.55921) 0.853558 12.484730 5.717153 11.712257
4 5 6 7 8 9 10 \
0 9.312075 8.274448 7.312941 6.343811 5.312333 3.377798 2.368462
1 1.399226 2.373198 3.374784 4.353762 5.318769 7.388784 8.269175
2 4.507633 3.469029 2.507443 1.538467 0.506829 0.544190 1.416284
3 0.399280 1.373252 2.374671 3.353424 4.318400 6.388838 7.269062
4 0.000000 0.504366 1.503677 2.482659 3.447704 5.519952 6.398068
5 0.538514 0.000000 0.494605 1.473030 2.437971 4.509526 5.388997
6 1.515950 0.474954 0.000000 0.494829 1.459821 3.533033 4.410650
7 2.489192 1.448060 0.487174 0.000000 0.486633 2.560113 3.437762
8 3.488962 2.449214 1.487936 0.518599 0.000000 1.561677 2.436310
9 5.494540 4.455695 3.494278 2.525085 1.493574 0.000000 0.429875
10 6.433798 5.396039 4.434327 3.465481 2.433809 0.499402 0.000000
11 7.465334 6.427628 5.465988 4.497042 3.465439 1.530986 0.521601
12 8.310263 7.272662 6.311059 5.342058 4.310497 2.376017 1.366597
13 10.830385 9.818336 8.852372 7.901132 6.863255 4.941781 3.928417
14 10.730567 9.711458 8.743639 7.782154 6.746808 4.815043 3.790372
11 12 13 14
0 1.343153 0.675055 1.051959 0.703212
1 9.305375 10.325337 12.247130 12.250335
2 2.454398 3.479383 5.430826 5.430658
3 8.304972 9.325272 11.250890 11.253792
4 7.434796 8.456133 10.381205 10.383930
5 6.424600 7.445701 9.379809 9.382000
6 5.446892 6.468964 8.405156 8.406976
7 4.473614 5.495941 7.440721 7.441895
8 3.473427 4.497171 6.443451 6.444094
9 1.467480 2.492644 4.463771 4.461567
10 0.527884 1.554218 3.540493 3.531133
11 0.000000 0.523013 2.556065 2.517604
12 0.341276 0.000000 1.788666 1.686975
13 2.923256 2.273971 0.000000 0.444277
14 2.766326 2.077512 0.492253 0.000000
But how do I give the columns the id names in a simple way? Manually renaming 18,000 columns is not my idea of a fun afternoon.
I have found an answer to my question, but I still wonder if there is a better and more elegant way to do this (e.g. on the fly). What I did was this:
# the ids from GEO become the new column names
new_column_names = GEO.id.to_list()
# drop the non-distance columns from the rename candidates
columnlist = GEO_distances.columns.to_list()
cols_remove = ['index', 'id', 'geometry_zone', 'centroid']
old_column_names = [x for x in columnlist if x not in cols_remove]
# map each old (integer) column name to its id and rename in place
col_rename_dict = {i: j for i, j in zip(old_column_names, new_column_names)}
GEO_distances.rename(columns=col_rename_dict, inplace=True)
which gives:
index id geometry_zone \
0 0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
... ... ... ...
1790 1790 R13C1G POLYGON ((63.72846 54.07087, 61.04155 54.02454...
1791 1791 R13D1A POLYGON ((63.03727 60.43190, 65.27641 57.78312...
1792 1792 R13D1D POLYGON ((68.90781 67.16844, 68.95414 60.51294...
1793 1793 R13D1F POLYGON ((61.42043 67.16403, 75.48019 67.22166...
1794 1794 R13D1G POLYGON ((61.40300 67.15300, 61.43388 63.43148...
centroid A001DFD A001DG A001DS007 A001DS01 \
0 POINT (48.75295 49.98494) 0.000000 11.063874 4.299228 10.275246
1 POINT (60.27696 48.21993) 10.983097 0.000000 6.348082 0.616036
2 POINT (53.49869 49.22928) 4.132203 6.259105 0.000000 5.469828
3 POINT (59.29040 48.38586) 9.982697 0.409114 5.348195 0.000000
4 POINT (58.42620 48.49535) 9.112541 1.279148 4.477119 0.487986
... ... ... ... ... ...
1790 POINT (62.36165 51.28081) 12.814471 2.630419 8.337061 3.267367
1791 POINT (69.85889 59.16021) 21.991462 13.464194 18.191827 14.124815
1792 POINT (72.22137 63.86261) 26.206982 18.602918 22.776510 19.187716
1793 POINT (68.46954 68.61039) 26.045757 20.948750 23.468535 21.237352
1794 POINT (65.33358 63.93216) 20.589210 15.508162 17.853344 15.717912
A001DS02 A001DS03 A001DS04 A001DS05 A001DS06 A001DS08 \
0 9.312075 8.274448 7.312941 6.343811 5.312333 3.377798
1 1.399226 2.373198 3.374784 4.353762 5.318769 7.388784
2 4.507633 3.469029 2.507443 1.538467 0.506829 0.544190
3 0.399280 1.373252 2.374671 3.353424 4.318400 6.388838
4 0.000000 0.504366 1.503677 2.482659 3.447704 5.519952
... ... ... ... ... ... ...
1790 3.862726 4.616091 5.526811 6.415345 7.329588 9.280236
1791 14.623171 15.220831 15.921820 16.621702 17.363742 18.956701
1792 19.622822 20.146582 20.757196 21.374153 22.035882 23.454339
1793 21.458123 21.751888 22.104256 22.496392 22.947829 23.939341
1794 15.894837 16.153433 16.481252 16.864660 17.318735 18.347265
Any other more efficient solution is welcome.
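For what it's worth, if GEO_distances holds only the raw distance columns (0..n-1) in the same row order as GEO, the whole renaming collapses to a couple of lines (a sketch under that assumption):
# relabel the integer distance columns with the ids from GEO
GEO_distances.columns = GEO['id'].to_list()
# then align the two frames on their shared row index
new = GEO.join(GEO_distances)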

Merge distance matrix results and original indices with Python Pandas

I have a panda df with list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But now I can't manage to easily connect the two. I want a df that contains a row for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, convert to vector and all sorts but I feel I just over-complicate things with no success.
Hope this helps!
d= ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
df = pd.read_csv(StringIO(d), sep=r'\s+')
from sklearn.metrics.pairwise import manhattan_distances
# note: passing the full df makes stop_id part of the distance, which is
# what reproduces the matrix in the question; use df[['stop_lat', 'stop_lon']]
# for purely geographic distances
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)
output:
1 2 3 4 6
1 0.000000 1.412176 2.334370 3.422297 5.247050
2 1.412176 0.000000 1.151232 2.047153 4.165126
3 2.334370 1.151232 0.000000 1.104079 3.143274
4 3.422297 2.047153 1.104079 0.000000 2.175247
6 5.247050 4.165126 3.143274 2.175247 0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist=distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
stop_id_1 stop_id_2 distance
0 1 1 0.000000
1 1 2 1.412176
2 1 3 2.334370
3 1 4 3.422297
4 1 6 5.247050
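If you only want each pair once (no diagonal, no mirrored duplicates), you can filter the long format afterwards; a small sketch:
pairs = long_frmt_dist[long_frmt_dist['stop_id_1'] < long_frmt_dist['stop_id_2']]
print(pairs.head())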
Alternatively, wrap the raw distance array in a DataFrame and merge it back onto the original frame by row index (dist_matrix here is the array computed in the question):
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)

pandas apply custom function to every row of one column grouped by another column

I have a dataframe containing two columns: id and val.
df = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3,3], 'val': np.random.randn(10)})
id val
0 1 2.644347
1 1 0.378770
2 1 -2.107230
3 2 -0.043051
4 2 0.115948
5 2 0.054485
6 3 0.574845
7 3 -0.228612
8 3 -2.648036
9 3 0.569929
And I want to apply a custom function to every val according to id. Let's say I want to apply min-max scaling. This is how I would do it using a for loop:
df['scaled'] = 0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    df1 = df[df.id == ids.iloc[i]]
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']
And the result is:
id val scaled
0 1 0.457713 1.000000
1 1 -0.464513 0.000000
2 1 0.216352 0.738285
3 2 0.633652 0.990656
4 2 -1.099065 0.000000
5 2 0.649995 1.000000
6 3 -0.251099 0.306631
7 3 -1.003295 0.081387
8 3 2.064389 1.000000
9 3 -1.275086 0.000000
How can I do this faster without a loop?
You can do this with groupby:
In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())
In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]:
0 0.000000
1 1.000000
2 0.654490
3 1.000000
4 0.524256
5 0.000000
6 0.000000
7 0.100238
8 0.014697
9 1.000000
Name: val, dtype: float64
(Note that np.ptp() / peak-to-peak can be used in place of s.max() - s.min().)
This applies the function minmaxscale() to each smaller-sized Series of val, grouped by id.
Taking the first group, for example:
In [11]: s = df[df.id == 1]['val']
In [12]: s
Out[12]:
0 0.002722
1 0.656233
2 0.430438
Name: val, dtype: float64
In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447
In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]:
0 0.00000
1 1.00000
2 0.65449
Name: val, dtype: float64
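If you want the scaled values written straight back as a column aligned with the original rows, groupby().transform does the same per-group computation (a minimal sketch):
df['scaled'] = df.groupby('id')['val'].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))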
An alternative solution uses sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# scale each id group's values separately, then stitch the pieces back together
df['new'] = np.concatenate([scaler.fit_transform(x.values.reshape(-1, 1))
                            for y, x in df.groupby('id').val])
df
Out[271]:
id val scaled new
0 1 0.457713 1.000000 1.000000
1 1 -0.464513 0.000000 0.000000
2 1 0.216352 0.738285 0.738284
3 2 0.633652 0.990656 0.990656
4 2 -1.099065 0.000000 0.000000
5 2 0.649995 1.000000 1.000000
6 3 -0.251099 0.306631 0.306631
7 3 -1.003295 0.081387 0.081387
8 3 2.064389 1.000000 1.000000
9 3 -1.275086 0.000000 0.000000

Getting exponential values when using describe() in Python

I am getting exponential output when I use the describe function below:
error_df = pandas.DataFrame({'reconstruction_error': mse,
                             'true_class': dummy_y_test_o})
print(error_df)
error_df.describe()
I tried printing the values in error_df and the count is 3150. But why do I get an exponential value then?
reconstruction_error true_class
0 0.000003 0
1 0.000003 1
... ... ...
3148 0.000003 2
3149 0.000003 0
[3150 rows x 2 columns]
Out[17]:
reconstruction_error true_class
count 3.150000e+03 3150.000000
mean 3.199467e-06 1.000000
std 1.764693e-07 0.816626
min 2.914150e-06 0.000000
25% 3.058539e-06 0.000000
50% 3.092931e-06 1.000000
75% 3.423379e-06 2.000000
max 3.737767e-06 2.000000
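That is pandas' scientific-notation display kicking in for very small floats; describe() is not changing the underlying values. If you prefer plain decimals, you can set a display format (a sketch, not from the original thread):
pd.set_option('display.float_format', '{:.8f}'.format)
print(error_df.describe())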

Pandas create percentile field based on groupby with level 1

Given the following data frame:
import pandas as pd
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766],
})
I'd like to create a new field which calculates the percentile of each value of "sum" per group in "group". The trouble is, I have two header rows (MultiIndex columns) and cannot figure out how to avoid getting the error:
ValueError: level > 0 only valid with MultiIndex
when I run this:
df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)
I need to keep the headers in the same structure.
Thanks in advance!
To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column, use:
In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
In [107]: df
Out[107]:
Group sum rank
group sum
0 a 234 0.833333
1 a 234 0.833333
2 a 544 0.333333
3 b 7 1.000000
4 b 332 0.666667
5 b 766 0.333333
Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.
import scipy.stats as stats
df['percentile'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')])
.apply(lambda ser: 100-pd.Series([stats.percentileofscore(ser, x, kind='rank')
for x in ser], index=ser.index)))
yields
Group sum rank percentile
group sum
0 a 234 0.833333 50.000000
1 a 234 0.833333 50.000000
2 a 544 0.333333 0.000000
3 b 7 1.000000 66.666667
4 b 332 0.666667 33.333333
5 b 766 0.333333 0.000000
