Matplotlib fix y-axis - python-3.x

I am trying to create a horizontal bar chart with matplotlib. My data points are the following two arrays:
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
My code to generate the horizontal bar chart is as follows:
plt.barh(distance, value, height=75)
plt.savefig(fig_name, dpi=300)
plt.close()
The problem is that my image comes out like this:
https://imgur.com/a/Q8dvHKR
Is there a way to ensure all blocks are the same width and to skip the empty space between 500 and 3000?

You can do this by making sure Matplotlib treats your labels as labels, not as numbers. To do that, convert them to strings:
import matplotlib.pyplot as plt
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
distance = [str(number) for number in distance]
plt.barh(distance, value, height=0.75)
Note that you have to change the height as well: the bars are now one unit apart, so height=75 would make them overlap.
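To see why the height has to change, here is a small sketch that reads back the numeric positions Matplotlib assigns to the string labels. The Agg backend line and the names fig, ax, bars and centers are my own additions, not part of the answer. As far as I know, string labels are placed at consecutive integers in order of appearance, which is what the sketch checks:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
labels = [str(d) for d in distance]

fig, ax = plt.subplots()
bars = ax.barh(labels, value, height=0.75)

# Each bar is centered on its label position, so the center is
# the lower edge plus half the bar height.
centers = [bar.get_y() + bar.get_height() / 2 for bar in bars]
print(centers)
```

With the bars one unit apart, any height below 1 leaves a gap between neighbouring bars.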

Alternatively, you can position the horizontal bars at a range of integer y-values, using range(), and then set the tick labels with plt.yticks(), whose first argument is the tick positions and whose second argument is the tick labels.
import matplotlib.pyplot as plt
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
plt.barh(range(len(distance)), value, height=0.6)
plt.yticks(range(len(distance)), distance)
plt.show()

Related

Python: Plot histograms with customized bins

I am using matplotlib.pyplot to make a histogram. Due to the distribution of the data, I want to set up the bins manually. The details are as follows:
Any value = 0 goes in one bin;
Any value > 60 goes in the last bin;
Any value > 0 and <= 60 goes in the bins between the two described above, with a bin size of 5.
Could you please give me some help? Thank you.

I'm not sure what you mean by "the bin size is 5". You can plot a histogram by specifying the bins as a sequence:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
plt.hist(data, bins=[0, 0.5, 60, max(data)])
plt.show()
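But the bar widths will match the corresponding intervals, meaning (in this example) that the "0" case will be barely visible.
(Note that 60 lands in the last bin when the bins are specified as this sequence, because every bin except the last is half-open, i.e. [left, right). Changing the sequence to [0, 0.5, 60.5, max(data)] would fix that. The value -5 lies outside all bins and is not counted.)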
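A quick way to check where a value on a bin edge ends up is to count with numpy.histogram, which plt.hist uses for the binning. This is only a sketch; the variable names counts_60 and counts_605 are mine. It compares an edge at 60 with an edge at 60.5:

```python
import numpy as np

data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5]

# Edge at 60: the value 60 is counted in the last bin [60, 82]
counts_60, _ = np.histogram(data, bins=[0, 0.5, 60, max(data)])

# Edge at 60.5: the value 60 stays in the middle bin [0.5, 60.5)
counts_605, _ = np.histogram(data, bins=[0, 0.5, 60.5, max(data)])

print(counts_60.tolist())   # [2, 7, 3]
print(counts_605.tolist())  # [2, 8, 2]
```

In both cases -5 is outside the bin range and is silently dropped, so the counts add up to 12 rather than 13.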
What you (probably) need is first to categorize your data and then plot a bar chart of the categories:
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
df = pd.DataFrame()
df['data'] = data
def find_cat(x):
    if x == 0:
        return "0"
    elif x > 60:
        return "> 60"
    elif x > 0:
        return "> 0 and <= 60"
df['category'] = df['data'].apply(find_cat)
df.groupby('category', as_index=False).count().plot.bar(x='category', y='data', rot=0, width=0.8)
plt.show()

Building off Tranbi's answer, you could specify the bin edges explicitly, as detailed in the link they shared.
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -6] # your data here
df = pd.DataFrame()
df['data'] = data
bin_edges = [-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
bin_edges_offset = [x+0.000001 for x in bin_edges]
plt.figure()
plt.hist(df['data'], bins=bin_edges_offset)
plt.show()

IIUC, you want a classic histogram for values between 0 (not included) and 60 (included), plus two extra bins on the sides for 0 and > 60.
In that case I would recommend plotting the 3 regions separately:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
fig, axes = plt.subplots(1,3, sharey=True, width_ratios=[1, 12, 1])
fig.subplots_adjust(wspace=0)
# counting 0 values and drawing a bar between -5 and 0
axes[0].bar(-5, data.count(0), width=5, align='edge')
axes[0].xaxis.set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].set_xlim((-5, 0))
# histogram between (0, 60]
axes[1].hist(data, bins=12, range=(0.0001, 60.0001))
axes[1].yaxis.set_visible(False)
axes[1].spines['left'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].set_xlim((0, 60))
# counting values > 60 and drawing a bar between 60 and 65
axes[2].bar(60, len([x for x in data if x > 60]), width=5, align='edge')
axes[2].xaxis.set_visible(False)
axes[2].yaxis.set_visible(False)
axes[2].spines['left'].set_visible(False)
axes[2].set_xlim((60, 65))
plt.show()
Edit: If you want to plot a probability density, I would edit the data and simply use hist:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
data2 = []
for el in data:
    if el < 0:
        pass
    elif el > 60:
        data2.append(61)
    else:
        data2.append(el)
plt.hist(data2, bins=14, density=True, range=(-4.99,65.01))
plt.show()
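The categorize-then-bar-chart idea from the first answer can also be extended to the 5-wide bins asked for in the question. The following is a minimal sketch, not taken from any of the answers: the helper bin_label and the label format are hypothetical, and negative values are assumed to be discarded.

```python
import math
from collections import Counter

def bin_label(x, width=5, upper=60):
    """Return the category label for x, or None for negative values."""
    if x < 0:
        return None
    if x == 0:
        return "0"
    if x > upper:
        return f"> {upper}"
    hi = math.ceil(x / width) * width  # upper edge of the bin containing x
    return f"({hi - width}, {hi}]"

data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5]

# Fixed category order, so that empty bins still get a slot
order = ["0"] + [f"({lo}, {lo + 5}]" for lo in range(0, 60, 5)] + ["> 60"]
counts = Counter(bin_label(x) for x in data)
heights = [counts.get(label, 0) for label in order]

print(list(zip(order, heights)))
```

Passing order and heights to plt.bar(order, heights) should then draw one equally wide bar per category, including the "0" and "> 60" bins.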

Computer Vision: How to find how many rows(lines) of bounding boxes in an invoice?

I have multiple invoices, and I have already found the coordinates of the bounding boxes in each invoice.
Here are the y-coordinates (each small list holds the ymin and ymax of one bounding box):
[[4, 43],
[9, 47],
[76, 122],
[30, 74],
[10, 47],
[81, 125],
[84, 124],
[47, 90],
[1, 38]]
I want to determine which bounding box is on the first line, which is on the second, and which is on the third, depending on the y-coordinates. More generally, how can I find the range of the first row, second, or third?
The number of rows differs between invoices: some have more, some have fewer.

This solution is sensitive to the threshold, which you might need to adjust depending on the amount of text in each line!
First, segment the lines based on the presence and amount of text (black pixels).
Second, find the line borders to compare your bounding boxes with.
Finally, compare your bounding box indices with the segmented line indices.
Output:
[17, 40, 53, 79, 95, 117]
box [4,43] belongs to line 1
box [9,47] belongs to line 1
box [76,122] belongs to line 3
box [10,47] belongs to line 1
box [81,125] belongs to line 3
box [84,124] belongs to line 3
box [47,90] belongs to line 2
Code:
import cv2
# Read the image
orig = cv2.imread('input.jpg', 0)[:,15:]
# The detected boxes
boxes = [[4, 43],
[9, 47],
[76, 122],
[30, 74],
[10, 47],
[81, 125],
[84, 124],
[47, 90],
[1, 38]]
# make a deep copy
img = orig.copy()
# quantify the black pixels in each line
summ = img.sum(axis=1)
# Threshold
th = summ.mean()
img[summ>th, :] = 0
img[summ<=th,:] = 1
rows = []
# record every y where the row label flips between text (1) and blank (0)
for y in range(img.shape[0]-1):
    if img[y,0] != img[y+1,0]:
        rows.append(y)
# sort lines indices.
rows.sort()
print(rows)
# compare the indices
for box in boxes:
    # rows holds (start, end) pairs; stop before an unpaired trailing border
    for idx in range(0, len(rows)-1, 2):
        if box[0] < rows[idx] and box[1] > rows[idx+1]:
            print("box [{},{}] belongs to line {}".format(box[0], box[1], idx//2+1))
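Note that the containment test above leaves two boxes ([30, 74] and [1, 38]) without a line. A possible variation, which is my own assumption and not part of the answer, is to assign each box to the text band it overlaps most. The sketch below takes the borders printed in the output as given, so it runs without the image; assign_rows is a hypothetical helper name.

```python
def assign_rows(boxes, borders):
    """Map each (ymin, ymax) box to a 1-based row number by largest overlap.

    borders is a flat list [start1, end1, start2, end2, ...] of text bands.
    Boxes that overlap no band get None.
    """
    bands = list(zip(borders[0::2], borders[1::2]))
    result = []
    for ymin, ymax in boxes:
        # Length of the vertical overlap between the box and each band
        overlaps = [max(0, min(ymax, bottom) - max(ymin, top))
                    for top, bottom in bands]
        best = max(range(len(bands)), key=overlaps.__getitem__)
        result.append(best + 1 if overlaps[best] > 0 else None)
    return result

boxes = [[4, 43], [9, 47], [76, 122], [30, 74], [10, 47],
         [81, 125], [84, 124], [47, 90], [1, 38]]
borders = [17, 40, 53, 79, 95, 117]  # taken from the output above

print(assign_rows(boxes, borders))  # [1, 1, 3, 2, 1, 3, 3, 2, 1]
```

For the seven boxes the original code reports, this gives the same lines; the two remaining boxes end up on lines 2 and 1 respectively. Whether that is the right call depends on how tight the detected boxes are.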

Hue, colorbar, or scatterplot colors do not match in seaborn.scatterplot

Using an example from another post, I'm adding a color bar to a scatter plot. The idea is that both dot hue, and colorbar hue, should conform to the maximum and minimum possible, so that the colorbar can reflect the range of values in the hue:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb
x= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pd.DataFrame(list(zip(x, y, z)), columns =['x', 'y', 'z'])
colormap=matplotlib.cm.viridis
#A continuous color bar needs to be added independently
norm = plt.Normalize(df.z.min(), df.z.max())
sm = plt.cm.ScalarMappable(cmap=colormap, norm=norm)
sm.set_array([])
fig = plt.figure(figsize = (10,8), dpi=300)
ax = fig.add_subplot(1,1,1)
sb.scatterplot(x="x", y="y",
               hue="z",
               hue_norm=(0,255),
               data=df,
               palette=colormap,
               ax=ax
               )
ax.legend(bbox_to_anchor=(0, 1), loc=2, borderaxespad=0., title='hue from sb.scatterplot')
ax.figure.colorbar(sm).set_label('hue from sm')
plt.xlim(0,255)
plt.ylim(0,255)
plt.show()
Note how the hue from the scatterplot, even with hue_norm, ranges up to 300. In turn, the hue from the colorbar ranges from 0 to 255. From experimenting with values in hue_norm, it seems that matplotlib always rounds it off so that you have a "good" (even?) number of intervals.
My questions are:
Which one is showing an incorrect range: the scatterplot, the scatterplot legend, or the colorbar? And how can I correct it?
How can I retrieve the min and max hue from the scatterplot (in this case 0 and 300, respectively), in order to set them as the minimum and maximum of the colorbar?

Do you really need to use seaborn's scatterplot()? Using a numerical hue is always quite messy.
The following code is much simpler and yields an unambiguous output:
fig, ax = plt.subplots()
g = ax.scatter(df['x'],df['y'], c=df['z'], cmap=colormap)
fig.colorbar(g)
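If the color scale should cover a fixed range rather than the range of the data, one option is to pass vmin and vmax to ax.scatter; the colorbar built from the returned collection then uses the same limits. This is a minimal sketch, assuming the 0 to 255 range from the hue_norm in the question; the Agg backend line and the names points and cbar are my additions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]

fig, ax = plt.subplots()
# vmin/vmax pin the color scale, independent of the values in z
points = ax.scatter(x, y, c=z, cmap="viridis", vmin=0, vmax=255)
cbar = fig.colorbar(points, ax=ax)
cbar.set_label("z")

# The limits shared by the points and the colorbar
print(points.get_clim())
```

Because points and colorbar come from the same object, there is no second scale that could drift out of sync.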

how to plot a single line with different types of line dash using bokeh?

I am trying to plot a line through a set of points. Currently, I have the points in a data frame with the columns X, Y and Type. Whenever the type is 1, I would like to plot the points as a dashed line, and whenever the type is 2, I would like to plot the points as a solid line.
Currently, I am using a for loop to iterate over all points and plot each point using plt.dash. However, this is slowing down my run time, since I want to plot more than 40000 points.
So, is there an easy way to plot the line through all points with different line dash types?

You could do it by drawing multiple line segments, like this (Bokeh v1.1.0):
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, Range1d, LinearAxis
line_style = {1: 'solid', 2: 'dashed'}
data = {'name': [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
        'counter': [1, 2, 3, 3, 4, 5, 5, 6, 7, 8],
        'score': [150, 150, 150, 150, 150, 150, 150, 150, 150, 150],
        'age': [20, 21, 22, 22, 23, 24, 24, 25, 26, 27]}
df = pd.DataFrame(data)
plot = figure(y_range = (100, 200))
plot.extra_y_ranges = {"Age": Range1d(19, 28)}
plot.add_layout(LinearAxis(y_range_name = "Age"), 'right')
for i, g in df.groupby([(df.name != df.name.shift()).cumsum()]):
    source = ColumnDataSource(g)
    plot.line(x = 'counter', y = 'score', line_dash = line_style[g.name.unique()[0]], source = source)
    plot.circle(x = 'counter', y = 'age', color = "blue", size = 10, y_range_name = "Age", source = source)
show(plot)
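The grouping key in the loop, (df.name != df.name.shift()).cumsum(), is what splits the data into consecutive runs of the same type, so that only one line is drawn per run instead of one per point. A small sketch of that step on its own, using the name column from the example (the names changed and run_id are mine):

```python
import pandas as pd

name = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 1, 1], name="name")

# True wherever the value differs from the previous row (the first row always does)
changed = name != name.shift()
# The running count of changes gives every consecutive run its own id
run_id = changed.cumsum()

print(run_id.tolist())                       # [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(name.groupby(run_id).size().tolist())  # [3, 3, 4]
```

Grouping by name alone would merge the first and the last run into one group, which would connect points that are not neighbours.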

How to label bubble chart/scatter plot with column from pandas dataframe?

I am trying to label a scatter/bubble chart I create from matplotlib with entries from a column in a pandas data frame. I have seen plenty of examples and questions related (see e.g. here and here). Hence I tried to annotate the plot accordingly. Here is what I do:
import matplotlib.pyplot as plt
import pandas as pd
#example data frame
x = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
y = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
s = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
users =['mark', 'mark', 'mark', 'rachel', 'rachel', 'rachel', 'jeff', 'jeff', 'jeff', 'lauren', 'lauren', 'lauren']
df = pd.DataFrame(dict(x=x, y=y, users=users))
#my attempt to plot things
plt.scatter(x_axis, y_axis, s=area, alpha=0.5)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.annotate(df.users, xy=(x,y))
plt.show()
I use a pandas data frame and I somehow get a KeyError, so I guess a dict() object is expected? Is there any other way to label the data using entries from a pandas data frame?

You can use DataFrame.plot.scatter and then select the coordinates in a loop with DataFrame.iat:
ax = df.plot.scatter(x='x', y='y', alpha=0.5)
for i, txt in enumerate(df.users):
    ax.annotate(txt, (df.x.iat[i],df.y.iat[i]))
plt.show()

Jezreal's answer is fine, but I will post this just to show what I meant with df.iterrows in the other thread.
If you want a dynamic size, one option is to put the scatter (or plot) call in the loop as well.
df = pd.DataFrame(dict(x=x, y=y, s=s, users=users))
fig, ax = plt.subplots(facecolor='w')
for key, row in df.iterrows():
    ax.scatter(row['x'], row['y'], s=row['s']*5, alpha=.5)
    ax.annotate(row['users'], xy=(row['x'], row['y']))
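As a possible variation on the two answers above: ax.scatter accepts a sequence for s, so the points can be drawn in one call and only the annotations need a loop. This is a sketch under that assumption, using the example data from the question; the Agg backend line and the use of itertuples are my choices, not something either answer prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

x = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
y = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
s = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
users = ['mark', 'mark', 'mark', 'rachel', 'rachel', 'rachel',
         'jeff', 'jeff', 'jeff', 'lauren', 'lauren', 'lauren']
df = pd.DataFrame(dict(x=x, y=y, s=s, users=users))

fig, ax = plt.subplots()
# One scatter call; the marker size varies per point through the s column
ax.scatter(df["x"], df["y"], s=df["s"] * 5, alpha=0.5)

# annotate places a single label per call, so the labels still need a loop
for row in df.itertuples(index=False):
    ax.annotate(row.users, xy=(row.x, row.y))
```

Call plt.show() or fig.savefig(...) afterwards to display or store the figure.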
