Getting exponential values when using describe() in Python - python-3.x

I am getting output in exponential (scientific) notation when I use describe() as shown below:
error_df = pandas.DataFrame({'reconstruction_error': mse,
                             'true_class': dummy_y_test_o})
print(error_df)
error_df.describe()
I tried printing the values in error_df, and the count is 3150. So why do I get an exponential value?
reconstruction_error true_class
0 0.000003 0
1 0.000003 1
... ... ...
3148 0.000003 2
3149 0.000003 0
[3150 rows x 2 columns]
Out[17]:
reconstruction_error true_class
count 3.150000e+03 3150.000000
mean 3.199467e-06 1.000000
std 1.764693e-07 0.816626
min 2.914150e-06 0.000000
25% 3.058539e-06 0.000000
50% 3.092931e-06 1.000000
75% 3.423379e-06 2.000000
max 3.737767e-06 2.000000
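The numbers themselves are unchanged; describe() returns plain floats, and pandas switches a column's display to scientific notation when its values are very small (here the errors are around 3e-06), which drags that column's count along with it. A minimal sketch, assuming the goal is only to change the display, using pandas' display.float_format option:
import pandas as pd
# Show floats with fixed 6-decimal formatting instead of scientific notation;
# this changes only how values are printed, not the stored values.
pd.set_option('display.float_format', '{:.6f}'.format)
print(error_df.describe())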

Related

Decimal date manipulation with Pandas in Python

This may be a little silly to ask, but I've tried to search for examples of manipulating dates in a DataFrame using pandas. What confuses me is that my dates have this format:
Time A B C D
1.000347257 626.9966431 0 0 -99.98999786
1.001041651 626.9967651 0 0 -99.98999786
1.001736164 627.0130005 0 0 -99.98999786
1.002430558 627.0130005 0 0 -99.98999786
1.003124952 627.0455933 0 0 -99.98999786
1.003819466 627.0618286 0 0 -99.98999786
...
1.998263836 627.7052002 0.3417936265 0.2321419418 0.07069379836
1.998958349 627.7216187 0.3260073066 0.2284916639 0.073251158
1.999652743 627.6726074 0.3180454969 0.2164463699 0.07418025285
2.000347137 627.7371826 0.3161731362 0.2277853489 0.07479456067
2.001041651 627.7365723 0.301556468 0.2394933105 0.07920494676
2.001736164 627.7686157 0.3718534708 0.2506033182 0.07810453326
...
366.996887 625.413574 3.168393 2.114161 2.119713
366.997559 625.413391 3.163851 2.104703 2.117746
366.998261 625.461792 3.184296 2.113827 2.117964
366.998962 625.449463 3.163331 2.117869 2.116489
366.999664 625.510681 3.166895 2.126145 2.110077
This is an extract of the file where I have the data stored. Is there a way to convert this format using the datetime library to something like 2010-10-23? The year here is 2011 but is not specified in the data.
Thank you!
I looked into the pandas documentation, and although I don't understand it very well, it worked. The time was in decimal days, so I set the unit and used a Timestamp to declare the year that I already knew.
df['Time'] = pd.to_datetime(
    df['Time'], unit='D', origin=pd.Timestamp('2011-01-01')
)
With this, the result is what I wanted, and it spans 366 days, as shown below:
Time A B C D
2016-01-02 00:00:30.003004800 626.996643 0.000000 0.000000 -99.989998
2016-01-02 00:01:29.998646400 626.996765 0.000000 0.000000 -99.989998
2016-01-02 00:02:30.004569600 627.013000 0.000000 0.000000 -99.989998
2016-01-02 00:03:30.000211200 627.013000 0.000000 0.000000 -99.989998
2016-01-02 00:04:29.995852800 627.045593 0.000000 0.000000 -99.989998
... ... ... ... ...
2017-01-01 23:55:31.054080000 625.413574 2.706322 2.086675 2.094654
2017-01-01 23:56:29.063040000 625.413391 2.738388 2.082261 2.092784
2017-01-01 23:57:29.707200000 625.461792 2.762815 2.097127 2.091273
2017-01-01 23:58:30.351360000 625.449463 2.698989 2.105750 2.090060
2017-01-01 23:59:30.995520000 625.510681 2.751848 2.109448 2.090664
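In that call, unit='D' makes to_datetime read the numbers as days, and origin anchors day 0 at the given Timestamp. A quick check with the first Time value (assuming only that pandas is imported as pd):
pd.to_datetime(1.000347257, unit='D', origin=pd.Timestamp('2011-01-01'))
# -> 2011-01-02 00:00:30.003... (one day plus roughly 30 seconds after the origin)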
Your Time column seems to contain fractional days. If you know the year, you can convert it to a datetime column using
# 1 - convert the year to nanoseconds since the epoch
# 2 - add the day fraction, after you convert that to nanoseconds as well
# 3 - convert the resulting nanoseconds since the epoch to datetime
year = '2011'
df['datetime'] = pd.to_datetime(pd.to_datetime(year).value + df['Time']*86400*1e9)
which will give you e.g.
df
Time A B C D datetime
0 1.000347 626.996643 0 0 -99.989998 2011-01-02 00:00:30.003004928
1 1.001042 626.996765 0 0 -99.989998 2011-01-02 00:01:29.998646272
2 1.001736 627.013000 0 0 -99.989998 2011-01-02 00:02:30.004569600
3 1.002431 627.013000 0 0 -99.989998 2011-01-02 00:03:30.000211200
4 1.003125 627.045593 0 0 -99.989998 2011-01-02 00:04:29.995852800
5 1.003819 627.061829 0 0 -99.989998 2011-01-02 00:05:30.001862400
You can cast the column to datetime with pd.to_datetime():
df.Time = pd.to_datetime(df.Time)
df.head(2)
Time A B C D
0 1970-01-01 00:00:01.000347257 626.996643 0 0 -99.989998
1 1970-01-01 00:00:01.001041651 626.996765 0 0 -99.989998
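Note that without unit and origin, the numeric values end up being measured from the Unix epoch, which is why the result above starts at 1970-01-01. A minimal sketch of the same call with the day unit and a known origin (the same approach as the accepted answer):
df.Time = pd.to_datetime(df.Time, unit='D', origin=pd.Timestamp('2011-01-01'))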

How to plot median value on boxplot?

I have this boxplot, but it is hard to read the value of the green line (the median). How can I plot that value? Or is there another way?
One thing you could do is annotate the plot directly with the medians calculated from the DataFrame:
In [204]: df
Out[204]:
a b c d
0 0.807540 0.719843 0.291329 0.928670
1 0.449082 0.000000 0.575919 0.299698
2 0.703734 0.626004 0.582303 0.243273
3 0.363013 0.539557 0.000000 0.743613
4 0.185610 0.526161 0.795284 0.929223
5 0.000000 0.323683 0.966577 0.259640
6 0.000000 0.386281 0.000000 0.000000
7 0.500604 0.131910 0.413131 0.936908
8 0.992779 0.672049 0.108021 0.558684
9 0.797385 0.199847 0.329550 0.605690
In [205]: df.boxplot()
Out[205]: <matplotlib.axes._subplots.AxesSubplot at 0x103404d6a0>
In [206]: for s, x, y in zip(df.median().map('{:0.2f}'.format), np.arange(df.shape[1]) + 1, df.median()):
     ...:     plt.text(x, y, s, va='center', ha='center')
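A self-contained sketch of the same idea, with made-up data and assumed column names, so it can be run as a script:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))

ax = df.boxplot()
# Boxes sit at x = 1..n_columns; write each column's median at the box centre.
for x, med in enumerate(df.median(), start=1):
    ax.text(x, med, '{:0.2f}'.format(med), va='center', ha='center',
            backgroundcolor='white')
plt.show()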

Merge distance matrix results and original indices with Python Pandas

I have a panda df with list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But now I can't manage to easily connect the two. I want a df that contains a row for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, converting it to a vector, and all sorts of things, but I feel I'm just over-complicating it, with no success.
Hope this helps!
d= ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
from io import StringIO  # pd.compat.StringIO is gone in modern pandas
df = pd.read_csv(StringIO(d), sep=r'\s+')
from sklearn.metrics.pairwise import manhattan_distances
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)
output:
1 2 3 4 6
1 0.000000 1.412176 2.334370 3.422297 5.247050
2 1.412176 0.000000 1.151232 2.047153 4.165126
3 2.334370 1.151232 0.000000 1.104079 3.143274
4 3.422297 2.047153 1.104079 0.000000 2.175247
6 5.247050 4.165126 3.143274 2.175247 0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist = distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
stop_id_1 stop_id_2 distance
0 1 1 0.000000
1 1 2 1.412176
2 1 3 2.334370
3 1 4 3.422297
4 1 6 5.247050
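The long format above still contains the zero diagonal and both orderings of each pair. If every unordered pair should appear exactly once, a minimal sketch using the upper triangle of distance_df (same variable names as above):
import numpy as np
# Indices of the strict upper triangle; k=1 skips the zero diagonal.
iu = np.triu_indices(len(distance_df), k=1)
pairs = pd.DataFrame({
    'stop_id_1': distance_df.index.values[iu[0]],
    'stop_id_2': distance_df.columns.values[iu[1]],
    'distance': distance_df.values[iu],
})
print(pairs.head())
Also note that manhattan_distances(df) is fed the whole frame, so the stop_id column contributes to the distances (as it does in the matrix shown in the question); if only the coordinates should count, pass df[['stop_lat', 'stop_lon']] instead.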
# dist_matrix is the symmetric distance array from the question
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)

pandas apply custom function to every row of one column grouped by another column

I have a dataframe containing two columns: id and val.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 'val': np.random.randn(10)})
id val
0 1 2.644347
1 1 0.378770
2 1 -2.107230
3 2 -0.043051
4 2 0.115948
5 2 0.054485
6 3 0.574845
7 3 -0.228612
8 3 -2.648036
9 3 0.569929
I want to apply a custom function to val within each id group. Let's say I want to apply min-max scaling. This is how I would do it using a for loop:
df['scaled'] = 0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    df1 = df[df.id == ids.iloc[i]]
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']
And the result is:
id val scaled
0 1 0.457713 1.000000
1 1 -0.464513 0.000000
2 1 0.216352 0.738285
3 2 0.633652 0.990656
4 2 -1.099065 0.000000
5 2 0.649995 1.000000
6 3 -0.251099 0.306631
7 3 -1.003295 0.081387
8 3 2.064389 1.000000
9 3 -1.275086 0.000000
How can I do this faster without a loop?
You can do this with groupby:
In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())
In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]:
0 0.000000
1 1.000000
2 0.654490
3 1.000000
4 0.524256
5 0.000000
6 0.000000
7 0.100238
8 0.014697
9 1.000000
Name: val, dtype: float64
(Note that np.ptp() / peak-to-peak can be used in place of s.max() - s.min().)
This applies the function minmaxscale() to each smaller-sized Series of val, grouped by id.
Taking the first group, for example:
In [11]: s = df[df.id == 1]['val']
In [12]: s
Out[12]:
0 0.002722
1 0.656233
2 0.430438
Name: val, dtype: float64
In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447
In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]:
0 0.00000
1 1.00000
2 0.65449
Name: val, dtype: float64
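As an extra sketch (standard pandas, not part of the answer above): groupby().transform() returns a result aligned with the original index, so it can be assigned straight back as a new column:
df['scaled'] = df.groupby('id')['val'].transform(minmaxscale)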
A solution using sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['new']=np.concatenate([scaler.fit_transform(x.values.reshape(-1,1)) for y, x in df.groupby('id').val])
df
Out[271]:
id val scaled new
0 1 0.457713 1.000000 1.000000
1 1 -0.464513 0.000000 0.000000
2 1 0.216352 0.738285 0.738284
3 2 0.633652 0.990656 0.990656
4 2 -1.099065 0.000000 0.000000
5 2 0.649995 1.000000 1.000000
6 3 -0.251099 0.306631 0.306631
7 3 -1.003295 0.081387 0.081387
8 3 2.064389 1.000000 1.000000
9 3 -1.275086 0.000000 0.000000

Pandas create percentile field based on groupby with level 1

Given the following data frame:
import pandas as pd
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})
I'd like to create a new field which calculates the percentile of each value of "sum" within its "group". The trouble is, my columns have a two-level header and I cannot figure out how to avoid getting the error:
ValueError: level > 0 only valid with MultiIndex
when I run this:
df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)
I need to keep the headers in the same structure.
Thanks in advance!
To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column use:
In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
In [107]: df
Out[107]:
Group sum rank
group sum
0 a 234 0.833333
1 a 234 0.833333
2 a 544 0.333333
3 b 7 1.000000
4 b 332 0.666667
5 b 766 0.333333
Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.
import scipy.stats as stats
df['percentile'] = (df[('sum', 'sum')]
                    .groupby(df[('Group', 'group')])
                    .apply(lambda ser: 100 - pd.Series(
                        [stats.percentileofscore(ser, x, kind='rank') for x in ser],
                        index=ser.index)))
yields
Group sum rank percentile
group sum
0 a 234 0.833333 50.000000
1 a 234 0.833333 50.000000
2 a 544 0.333333 0.000000
3 b 7 1.000000 66.666667
4 b 332 0.666667 33.333333
5 b 766 0.333333 0.000000
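To make the difference concrete, a small sketch for group 'a' (values 234, 234, 544), reproducing the numbers in the table above:
import pandas as pd
import scipy.stats as stats

ser = pd.Series([234, 234, 544])
# Descending ranks are 2.5, 2.5, 1; divided by 3 -> 0.8333, 0.8333, 0.3333
print(ser.rank(pct=True, ascending=False).tolist())
# percentileofscore gives 50, 50, 100; subtracting from 100 -> 50, 50, 0
print([100 - stats.percentileofscore(ser, x, kind='rank') for x in ser])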
