How to write values over matplotlib bar charts without distorted figures - python-3.x

NOTE
The answers here produce a distorted figure; here is the bad result:
Here is the modified code that produces the bad result:
def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    for i, v in enumerate(frame[col1].values):
        ax.text(v + 3, i + .25, str(v), color='red', fontweight='bold')
    for i, v in enumerate(frame[col2].values):
        ax.text(v + 3, i + .25, str(v), color='blue', fontweight='bold')
    ax.legend()
The original question:
import matplotlib.pyplot as plt
import numpy as np
def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    ax.legend()
frame is a pandas DataFrame that looks like the following:
        Class Name  Average Precision  Actual  Detections  True Positives  False Positives  Combined
2        Palm Tree          91.152760    1379        1428            1292              141      1433
5   Traffic Lights          71.026533    1269        1036             948               88      1036
3      Street Lamp          63.871910     995         848             727              121       848
0              Car          57.535491    3153        1955            1881              104      1985
1      Street Sign          56.925982    1109         704             658               46       704
6       Pedestrian          55.243564    1418         887             835               73       908
13      Road Block          52.182602     363         215             203               14       217
4          Minivan          51.786659      68          41              38                3        41
11             Bus          36.805556      43          18              16                2        18
9        Trash Can          14.444444      90          13              13                0        13
10         Bicycle           5.882353      17           1               1                0         1
8             Flag           5.000000     124          10               7                3        10
7     Fire Hydrant           1.923077      52           1               1                0         1
12    Pickup Truck           0.000000      20           0               0                0         0
14  Delivery Truck           0.000000       4           0               0                0         0
15      Motorcycle           0.000000       3           0               0                0         0
The function I defined above produces the following plot:
I need every bar's value written next to it, which might look like this:
How can I modify the function above to do this?

You should not hard-code the text positions. Instead, extract the values from the bar patches:
def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(10,10))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    # annotation here
    for patch in ax.patches:
        # extract information from patch
        pw = patch.get_width()
        _,y = patch.get_xy()
        color = patch.get_facecolor()
        ax.text(pw + 3, y + width/2, str(pw),
                color=color,verticalalignment='center')
    ax.legend(loc='lower right')
Output:
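On Matplotlib 3.4 or newer, Axes.bar_label can place the values for you, so the text offsets do not have to be computed by hand. The following is a minimal sketch under that version assumption. The three-row frame and the column choice (Actual, Detections) are sample data taken from the question's table, not the full dataset, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small sample taken from the question's table (assumption: two count columns)
frame = pd.DataFrame({
    'Class Name': ['Car', 'Palm Tree', 'Bus'],
    'Actual': [3153, 1379, 43],
    'Detections': [1955, 1428, 18],
})

ind = np.arange(len(frame))
width = 0.4
fig, ax = plt.subplots(figsize=(9, 5))

# barh returns a BarContainer, which bar_label accepts directly
bars1 = ax.barh(ind, frame['Actual'], width, color='red', label='Actual')
bars2 = ax.barh(ind + width, frame['Detections'], width, color='blue',
                label='Detections')
ax.set(yticks=ind + width / 2, yticklabels=frame['Class Name'])

# padding is in points, so labels stay next to the bars at any data scale
labels1 = ax.bar_label(bars1, padding=3, color='red')
labels2 = ax.bar_label(bars2, padding=3, color='blue')

ax.margins(x=0.1)  # leave room on the right for the longest label
ax.legend(loc='lower right')
fig.savefig('compare_bar.png')
```

Because the padding is given in points rather than data units, the label offset does not depend on the magnitude of the plotted values.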

Related

define range in pandas column based on define input from list

I have a data frame in which I need to bin one column into ranges based on a provided list.
I can get the result with fixed values, but the input values will arrive dynamically as a list, and the ranges must be derived from that input.
My data frame looks like this:
import pandas as pd
rangelist=[90,70,50]
data = {'Result': [75,85,95,45,76,8,10,44,22,65,35,67]}
sampledf=pd.DataFrame(data)
rangelist is my list; from it I need to create ranges such as 100-90, 90-70 and 70-50. These ranges may differ from time to time. Until now I have been getting results with the function below.
def cat(value):
    cat=''
    if (value>90):
        cat='90-100'
    if (value<90 and value>70 ):
        cat='90-70'
    else:
        cat='< 50'
    return cat
sampledf['category']=sampledf['Result'].apply(cat)
How can I pass dynamic values to the function "cat" based on the range list? I would be grateful if someone could help me achieve the result below.
Result category
0 75 90-70
1 85 90-70
2 95 < 50
3 45 < 50
4 76 90-70
5 8 < 50
6 10 < 50
7 44 < 50
8 22 < 50
9 65 < 50
10 35 < 50
11 67 < 50
I would recommend pd.cut for this:
import numpy as np
sampledf['Category'] = pd.cut(sampledf['Result'],
                              [-np.inf] + sorted(rangelist) + [np.inf])
Output:
Result Category
0 75 (70.0, 90.0]
1 85 (70.0, 90.0]
2 95 (90.0, inf]
3 45 (-inf, 50.0]
4 76 (70.0, 90.0]
5 8 (-inf, 50.0]
6 10 (-inf, 50.0]
7 44 (-inf, 50.0]
8 22 (-inf, 50.0]
9 65 (50.0, 70.0]
10 35 (-inf, 50.0]
11 67 (50.0, 70.0]
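If readable range names are preferred over interval objects, pd.cut also accepts a labels argument. The sketch below builds the labels from rangelist, so they follow the input list. The label wording ('<= 50', '50-70', and so on) is an assumption chosen for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd

rangelist = [90, 70, 50]
data = {'Result': [75, 85, 95, 45, 76, 8, 10, 44, 22, 65, 35, 67]}
sampledf = pd.DataFrame(data)

inner = sorted(rangelist)                    # [50, 70, 90]
edges = [-np.inf] + inner + [np.inf]         # 5 edges -> 4 bins

# One label per bin; bins are right-closed, e.g. (50, 70]
labels = ([f'<= {inner[0]}']
          + [f'{lo}-{hi}' for lo, hi in zip(inner[:-1], inner[1:])]
          + [f'> {inner[-1]}'])

sampledf['Category'] = pd.cut(sampledf['Result'], edges, labels=labels)
print(sampledf)
```

Changing rangelist changes both the edges and the labels, so no thresholds are hard-coded.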
import numpy as np
breaks = pd.Series([100, 90, 75, 50, 45, 20, 0])
sampledf["ind"] = sampledf.Result.apply(lambda x: np.where(x >= breaks)[0][0])
sampledf["category"] = sampledf.ind.apply(lambda i: (breaks[i], breaks[i-1]))
sampledf
# Result ind category
# 0 75 2 (75, 90)
# 1 85 2 (75, 90)
# 2 95 1 (90, 100)
# 3 45 4 (45, 50)
# 4 76 2 (75, 90)
# 5 8 6 (0, 20)
# 6 10 6 (0, 20)
# 7 44 5 (20, 45)
# 8 22 5 (20, 45)
# 9 65 3 (50, 75)
# 10 35 5 (20, 45)
# 11 67 3 (50, 75)

Random forest prediction values

Having a dataset like this:
y x size type total_neighbours res
113040 29 1204 15 3 2 0
66281 52 402 9 3 3 0
32296 21 1377 35 0 3 0
48367 3 379 139 0 4 0
33501 1 66 17 0 3 0
... ... ... ... ... ... ...
131230 39 1002 439 3 4 6
131237 40 1301 70 1 2 1
131673 26 1124 365 1 2 1
131678 27 1002 629 3 3 6
131684 28 1301 67 1 2 1
I would like to use a random forest algorithm to predict the value of the res column (res can only take integer values between [0-6]).
I'm doing it like this:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

labels = np.array(features['res'])
features = features.drop('res', axis=1)
features = np.array(features)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The predictions I get are the following:
array([1.045e+00, 4.824e+00, 4.608e+00, 1.200e-01, 5.982e+00, 3.660e-01,
4.659e+00, 5.239e+00, 5.982e+00, 1.524e+00])
I have no experience in this field, so I don't quite understand the predictions.
How do I interpret them?
Is there any way to limit the predictions to the res column values (integers between [0-6])?
Thanks
As @MaxNoe said, I had a misconception about the model: I was using a regressor to predict a discrete variable.
RandomForestClassifier is giving the expected output.
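A minimal sketch of the classifier version, assuming scikit-learn is installed. The data here is synthetic (random integer features with labels 0 to 6 derived from the first feature) and only stands in for the question's dataset, which is not reproduced in full.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 5 integer features (assumption)
rng = np.random.default_rng(42)
features = rng.integers(0, 100, size=(200, 5))
labels = features[:, 0] // 15          # integer classes in 0..6

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)

# A classifier predicts one of the class labels seen during training,
# so the output stays within the discrete values of the target.
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
print(predictions[:10])
```

A regressor averages the trees' numeric outputs, which is why it returned values such as 4.824; a classifier takes a vote over classes instead.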

Setting specific bin length in python list

I have a straightforward question but I'm facing issues for conversion.
I have a pandas dataframe column which I converted to a list. It has both positive and negative values:
bin_length = 5
list = [-200, -112, -115, 0, 50, 120, 250]
I need to group these numbers into a bin of length 5.
For example:
-100 to -95 should have a value of -100
-95 to -90 should have a value of -95
Similarly for positive values:
0 to 5 should be 5
5 to 10 should be 10
What I have tried until now:
df = pd.DataFrame(dataframe['rd2'].values.tolist(), columns = ['values'])
bins = np.arange(0, df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
But this doesn't account for negative values, and I then have trouble converting the pandas intervals into separate columns or a list.
Any help would be amazing.
Setting up the correct lower limit with np.arange:
bins = np.arange(df["values"].min(), df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
print(df)
values bins
0 -200 (-200.001, -195.0]
1 -112 (-115.0, -110.0]
2 -115 (-120.0, -115.0]
3 0 (-5.0, 0.0]
4 50 (45.0, 50.0]
5 120 (115.0, 120.0]
6 250 (245.0, 250.0]
Convert the intervals back to a list:
s = pd.IntervalIndex(df["bins"])
print([[x, y] for x, y in zip(s.left, s.right)])
[[-200.001, -195.0], [-115.0, -110.0], [-120.0, -115.0], [-5.0, 0.0], [45.0, 50.0], [115.0, 120.0], [245.0, 250.0]]
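If a single representative value per bin is wanted instead of an interval, rounding away from zero to the next multiple of bin_length is one possible reading of the examples in the question (-100 to -95 gives -100, 0 to 5 gives 5). The sketch below assumes that reading; how values that lie exactly on a bin boundary should be treated is not specified in the question, and here they map to themselves.

```python
import numpy as np
import pandas as pd

bin_length = 5
values = pd.Series([-200, -112, -115, 0, 50, 120, 250])

scaled = values / bin_length
# Negative values round down (away from zero), positive values round up
binned = np.where(values < 0, np.floor(scaled), np.ceil(scaled)) * bin_length
binned = pd.Series(binned.astype(int), index=values.index)

print(binned.tolist())  # [-200, -115, -115, 0, 50, 120, 250]
```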

How do I unstack these three histogram columns? Position Argument Fails

I am using the code below to create a histogram plot of three different variables. I would like to separate the three bars at each data point to obtain a better visualization. I tried adding the "position" argument for each feature, but it doesn't work.
count, bin_edges = np.histogram(df['age'])
fig = plt.figure(figsize=(7,3))
ax = fig.add_subplot(111) # Create matplotlib axes
df['age'].plot(kind = 'hist', figsize=(10,5), xticks = bin_edges,
               width = 2, color = 'blue', alpha=0.4)
df[df['y'] == 1]['age'].plot(kind = 'hist', figsize=(10,5), xticks = bin_edges,
                             width = 2, color='red', alpha=0.4)
df[(df['y'] == 1)&(df['new_customer'] == 1)]['age'].plot(kind = 'hist', figsize=(10,5), xticks = bin_edges,
                                                         width = 2, color='green', alpha=0.4)
plt.title("Age")
plt.xlabel("Age Bins")
plt.ylabel("Number of Contacts")
plt.legend(loc='upper right')
plt.show()
EDIT: this is what my df looks like:
df[['age', 'y', 'new_customer']]
age y new_customer
0 56 0 1
1 57 0 1
2 37 0 1
3 40 0 1
4 56 0 1
5 45 0 1
6 59 0 1
7 41 0 1
8 24 0 1
9 25 0 1
10 41 0 1
11 25 0 1
12 29 0 1
The pandas plotting api isn't nearly as flexible as the underlying Matplotlib library it uses to make the actual plots. Just use Matplotlib directly:
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

csv = '''    age  y  new_customer
0   56  0  1
1   57  1  1
2   37  0  1
3   40  0  1
4   56  1  1
5   45  0  0
6   59  0  1
7   41  1  1
8   24  0  0
9   25  0  1
10  41  1  1
11  25  0  0
12  29  0  1'''
# pd.compat.StringIO no longer exists in current pandas; use io.StringIO
df = pd.read_csv(io.StringIO(csv), sep=r'\s+')

bin_edges = np.histogram_bin_edges(df['age'])
fig = plt.figure(figsize=(7,3))
ax = fig.add_subplot(111) # Create matplotlib axes

# Passing a list of datasets to hist draws their bars side by side per bin
data = [df['age'],
        df[df['y'] == 1]['age'],
        df[(df['y'] == 1)&(df['new_customer'] == 1)]['age']]
plt.hist(data, bins=bin_edges, label=['age', 'age_y', 'age_y_newcustomer'])

# Put the ticks at the bin centers
bin_cens = (bin_edges[:-1] + bin_edges[1:])/2
plt.xticks(bin_cens)
plt.title("Age")
plt.xlabel("Age Bins (center)")
plt.ylabel("Number of Contacts")
plt.legend()
plt.show()
Output:

Using distance matrix to find coordinate points of set of points

Given a distance matrix and a set of points, how do you figure out the coordinates of these points?
Edit: This is on a plane.
This question was answered here, but when I tried different distance matrices I couldn't use that answer, because the M matrix had negative values, as did my eigenvectors. So when taking the square root, the program (in R) outputs "NaN" for the associated entries.
I'm guessing this will happen every time D(i,j)^2 is greater than D(1,j)^2 + D(i,1)^2.
For example, say I have a distance matrix:
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0
Using the equation M(i,j) = (0.5)(D(1,j)^2+D(i,1)^2-D(i,j)^2), I get (which already has negative entries):
0 0.0 0.0 0.0 0.0 0.0
0 5329.0 -38038.0 48840.5 928.5 -7552.0
0 -38038.0 10404.0 61232.0 77089.5 -40174.5
0 48840.5 61232.0 246016.0 201528.0 134631.5
0 928.5 77089.5 201528.0 186624.0 48288.0
0 -7552.0 -40174.5 134631.5 48288.0 33856.0
Then I get the nonzero eigenvalues and their eigenvectors:
477718.27 101845.63 16474.30 -13116.72 -100692.49
[,1] [,2] [,3] [,4] [,5]
0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
-0.05928626 0.3205747 0.84148945 0.04869546 -0.42806691
-0.16650486 -0.5670946 -0.04507520 -0.58222690 -0.55647098
-0.73371713 0.2827320 0.07386302 -0.45957443 0.40627254
-0.59727407 -0.4623603 0.07806418 0.64968004 -0.03617241
-0.27144823 0.5309625 -0.52755471 0.15920983 -0.58372335
Since there are negative entries among both the eigenvalues and the eigenvectors, computing
sqrt(eigenvector(i)*eigenvalue(i)) takes the square root of negative values.
Here is my final output:
[,1] [,2] [,3] [,4] [,5]
0 0.0000 0.00000 0.00000 0.00000
NaN 180.6907 117.74103 NaN 207.61291
NaN NaN NaN 87.38939 236.71174
NaN 169.6910 34.88326 77.64089 NaN
NaN NaN 35.86158 NaN 60.35139
NaN 232.5429 NaN NaN 242.43877
Is this the only clear way of computing the coordinate points without using angles?
If it is, do we have to fix the distance matrix so that D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2?
Thanks.
Your data is inconsistent
Your distances are not consistent with positions of points in ℝ⁴, let alone a space of lower dimension. You can tell by computing the Menger determinant of your squared distance matrix:
D <- as.matrix(read.table(textConnection("\
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0")))
n <- nrow(D)
det(rbind(cbind(D^2, 1), c(rep(1, n), 0)))
# Result: 3.38761e+25
If your distances really came from points in a space of dimension less than five, then this determinant would have to be zero. As it is not, your distances are inconsistent, or the points form a simplex in a space of sufficiently high dimension.
But no matter the dimension, your data is still inconsistent, since it violates the triangle inequality in several cases:
a b c   ac   abc    ab    bc
1 2 4: 496 > 465 =  73 + 392
1 3 4: 496 > 468 = 102 + 366
1 3 5: 432 > 309 = 102 + 207
1 6 4: 496 > 287 = 184 + 103
2 1 3: 303 > 175 =  73 + 102
2 6 4: 392 > 336 = 233 + 103
3 1 6: 353 > 286 = 102 + 184
5 4 6: 352 > 275 = 172 + 103
Going from a to c directly can never take longer than going via b, but according to your data it does.
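The check behind the table above can be sketched in Python with NumPy. The function name triangle_violations is a hypothetical helper introduced here for illustration; point numbers are reported 1-based to match the table.

```python
import itertools
import numpy as np

D = np.array([
    [0, 73, 102, 496, 432, 184],
    [73, 0, 303, 392, 436, 233],
    [102, 303, 0, 366, 207, 353],
    [496, 392, 366, 0, 172, 103],
    [432, 436, 207, 172, 0, 352],
    [184, 233, 353, 103, 352, 0],
])

def triangle_violations(D):
    """Return (a, b, c) triples, 1-based, where D[a,c] > D[a,b] + D[b,c]."""
    n = len(D)
    found = []
    for a, c in itertools.combinations(range(n), 2):
        for b in range(n):
            if b in (a, c):
                continue
            # The direct route a -> c must not be longer than the detour via b
            if D[a, c] > D[a, b] + D[b, c]:
                found.append((a + 1, b + 1, c + 1))
    return found

violations = triangle_violations(D)
for a, b, c in violations:
    print(a, b, c)
```

Each unordered pair (a, c) is tested once against every intermediate point b, so a symmetric matrix does not produce duplicate reports.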
Simple planar approach
If you had data consistent with points in the plane (i.e. all Menger determinants for combinations of four points evaluate to zero), you could use the following to obtain coordinates:
distance2coordinates <- function(D) {
  n <- nrow(D)
  maxDist <- which.max(D)
  p1 <- ((maxDist - 1) %% n) + 1
  p2 <- ((maxDist - 1) %/% n) + 1
  x2 <- D[p1, p2]
  r1sq <- D[p1,]^2
  r2sq <- D[p2,]^2
  x <- (r1sq - r2sq + x2^2)/(2*x2)
  y <- sqrt(r1sq - x^2)
  p3 <- which.max(y)
  x3 <- x[p3]
  y3 <- y[p3]
  plus <- abs(D[p3,]^2 - (x3 - x)^2 - (y3 - y)^2)
  minus <- abs(D[p3,]^2 - (x3 - x)^2 - (y3 + y)^2)
  y[minus < plus] <- -y[minus < plus]
  coords <- data.frame(x = x, y = y)
  return(coords)
}
The idea is to choose two points with maximal distance as starting points. You place one at the origin and the other on the positive x axis. Then you can compute all other x coordinates from this, as the intersection of two circles, following the equations
I: x² + y² = r₁²
II: (x - x₂)² + y² = r₂²
I-II: 2*x*x₂ = r₁² - r₂² + x₂²
Given these x coordinates, you can obtain y coordinates as well, up to sign. You then choose a third point, sufficiently far away from either of these two starting points, to decide on the sign.
This approach makes no attempt at all to handle imprecise input. It assumes exact data, and will only use part of the distance matrix to find the points. It will not find the point set most closely matching all of the input data.
On your data, this will fail, since some arguments to the square root will be negative. This means that the two circles involved don't intersect at all, hence the triangle inequality is violated.
You asked: "If it is, do we have to fix the distance matrix so D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2?"
Requiring D(i,j) ≤ D(i,k) + D(k,j) for all triples, without squares, would help. This would ensure that the triangle inequality holds everywhere. The result still need not be planar; for that you'd have to fix all those Menger determinants.
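For distances that are consistent with a Euclidean point set, the eigendecomposition method from the question can be sketched in NumPy as follows. Note that the coordinates come from scaling each eigenvector by the square root of its eigenvalue, not from taking the square root of their product. The function name coords_from_distances and the tolerance value are assumptions made for this sketch; it refuses inconsistent input rather than trying to repair it.

```python
import numpy as np

def coords_from_distances(D, tol=1e-9):
    """Recover coordinates (up to rotation/reflection) from a distance matrix.

    Uses M[i,j] = (D[0,j]^2 + D[i,0]^2 - D[i,j]^2) / 2, which places the
    first point at the origin.
    """
    D = np.asarray(D, dtype=float)
    sq = D ** 2
    M = 0.5 * (sq[0][None, :] + sq[:, 0][:, None] - sq)
    vals, vecs = np.linalg.eigh(M)          # M is symmetric
    scale = max(1.0, np.abs(vals).max())
    if vals.min() < -tol * scale:
        raise ValueError("negative eigenvalue: distances are not Euclidean")
    keep = vals > tol * scale               # drop (numerically) zero eigenvalues
    # Scale each kept eigenvector column by sqrt of its eigenvalue
    return vecs[:, keep] * np.sqrt(vals[keep])

# A 3-4-5 right triangle embeds in the plane
X = coords_from_distances([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
print(X.shape)  # (3, 2)
```

The number of kept eigenvalues is the dimension needed to embed the points. On the distance matrix from the question this sketch raises ValueError, because M has negative eigenvalues there.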
Here is a simple Python function that computes the coordinates by solving for intersections of hyperspheres.
import sympy
import numpy as np


def give_coords(distances):
    """Give coordinates of points for which the distances are given.

    Coordinates are relative: the 1st point is at the origin, the 2nd on the
    x-axis, the 3rd in the x-y plane, and so on. At most n-1 dimensions are
    used, where n is the number of points.

    Args:
        distances (list): an n x n 2-D array where distances[i][j] gives the
            distance from i to j; assumes distances[i][j] == distances[j][i]

    Returns:
        numpy.ndarray: coordinates, one row per point, n-1 columns

    Examples:
        >>> a = sympy.sqrt(2)
        >>> distances = [[0,1,1,1,1,1],
        ...              [1,0,a,a,a,a],
        ...              [1,a,0,a,a,a],
        ...              [1,a,a,0,a,a],
        ...              [1,a,a,a,0,a],
        ...              [1,a,a,a,a,0]]
        >>> give_coords(distances)
        array([[0, 0, 0, 0, 0],
               [1, 0, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [0, 0, 0, 0, 1]], dtype=object)
        >>> give_coords([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
        array([[0, 0],
               [3, 0],
               [0, 4]], dtype=object)
    """
    distances = np.array(distances)
    n = len(distances)
    # One symbolic unknown per coordinate; point `row` only uses its first
    # `row` coordinates, the rest are fixed to zero
    X = sympy.symarray('x', (n, n - 1))
    for row in range(n):
        X[row, row:] = [0] * (n - 1 - row)
    for point2 in range(1, n):
        # One sphere equation per already-placed point
        expressions = []
        for point1 in range(point2):
            expression = np.sum((X[point1] - X[point2]) ** 2)
            expression -= distances[point1,point2] ** 2
            expressions.append(expression)
        X[point2,:point2] = sympy.solve(expressions, list(X[point2,:point2]))[1]
    return X
This is Solvable
If you would like to see Cartesian-type coordinates that satisfy the distance matrix you provided in your question, they are listed below.
[Image: distance matrix and coordinates]
Your input matrix gives the distances between 6 nodes, which we shall call a, b, c, d, e, and f. A total of 5 dimensions is required to assign coordinates to all six nodes such that they satisfy your distance matrix. Two of these dimensions are imaginary-valued, which is a consequence of violating the triangle inequality. The results were arrived at by using the law of cosines and a bit of number crunching.
a (0, 0, 0, 0, 0)
b (73, 0, 0, 0, 0)
c (-521.07, 510.99i, 0, 0, 0)
d (669.05, -802.08i, 664.62, 0, 0)
e (12.72, -163.83i, 488.13, 158.01i, 0)
f (-103.45, 184.11i, 84.52, 138.06i, 262.62)

Resources