Does Vega/Vega-Lite/Altair have a builtin method to draw a special mark for empty bars? When x == x2 no mark is currently shown. Perhaps a vertical rule mark of the same expected bar color as derived from a third encoding? Or perhaps a semi-transparent bar mark covering an expanded region with a red border?
import altair as alt
import pandas as pd

data =\
[ ("a", 1, 100, 123, 4.5)
, ("a", 2, 140, 190, 5.6)
, ("a", 3, 402, 402, 1.6)
, ("b", 1, 100, 123, 5.7)
, ("b", 2, 134, 456, 6.7)
, ("b", 3, 503, 504, 8.2)
, ("b", 4, 602, 765, 1.1)
, ("c", 1, 95, 95, 0.1)
, ("c", 2, 140, 145, 7.5)
, ("c", 3, 190, 190, 9.9)
]
data = pd.DataFrame(data, columns=["k","ki","min","max","other"])
chart =\
( alt.Chart(data)
. mark_bar()
. encode
( x="min"
, x2="max"
, y="k:O"
, color="ki:N"
, tooltip=["min", "max", "other"]
)
. interactive()
. properties
( width="container"
, height=300
)
)
A calculate transform on min/max could produce an expanded region, with a conditional opacity driven by the original min/max fields. I'm not sure about the red border. The downside of this approach is that the bar grows with zoom, unlike a rule mark.
A rule mark has the advantage of not misrepresenting the data, but it might be hard to spot. It would require a filter transform, though I'm not sure whether I have to build from the initial bar chart or whether I can chain mark_bar -> transform_filter -> mark_rule; a sketch of the layered alternative follows.
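In Altair, a second mark is added by layering two charts rather than by chaining; a minimal sketch of that idea, swapping in mark_tick for mark_rule so the mark spans just the row band and keeps a fixed pixel width under zoom (thickness=2 is an arbitrary choice):
bars = (
    alt.Chart(data)
    .mark_bar()
    .encode(x="min", x2="max", y="k:O", color="ki:N")
)
# hypothetical second layer: a tick wherever the bar would be empty
empty = (
    alt.Chart(data)
    .mark_tick(thickness=2)
    .encode(x="min", y="k:O", color="ki:N")
    .transform_filter(alt.datum.min == alt.datum.max)
)
chart = (bars + empty).properties(width="container", height=300)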
Anyway, a solution is complicated both technically and in terms of data representation, and I was wondering whether Vega/Altair has a builtin solution to make either easier.
You can set the stroke color for the outlines of the bars using something like mark_bar(stroke='gray') (it defaults to transparent); then empty bars will be shown by their outline.
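For example, applied to the chart from the question:
chart = (
    alt.Chart(data)
    .mark_bar(stroke="gray")  # outline every bar; a zero-width bar still shows as a thin gray line
    .encode(
        x="min",
        x2="max",
        y="k:O",
        color="ki:N",
        tooltip=["min", "max", "other"],
    )
    .interactive()
    .properties(width="container", height=300)
)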
I have multiple invoices, and I have already found the coordinates of the bounding boxes in each invoice.
Here are the y-coordinates (each inner list is the [ymin, ymax] of one bounding box):
[[4, 43],
[9, 47],
[76, 122],
[30, 74],
[10, 47],
[81, 125],
[84, 124],
[47, 90],
[1, 38]]
I want to determine which bounding box is on the first line, which is on the second, and which is on the third, depending on the y-coordinates. More generally, how can I find the y range of the first, second, or third row?
There are multiple invoices, and they can have more or fewer rows.
This solution is sensitive to the threshold, which you might need to adjust depending on the amount of text in each line!
Firstly, segment the lines based on the presence and amount of text (black pixels).
Secondly, find the line borders to compare your bounding boxes against.
Finally, compare your bounding box indices with the segmented line indices.
Output:
[17, 40, 53, 79, 95, 117]
box [4,43] belongs to line 1
box [9,47] belongs to line 1
box [76,122] belongs to line 3
box [10,47] belongs to line 1
box [81,125] belongs to line 3
box [84,124] belongs to line 3
box [47,90] belongs to line 2
Code:
import cv2
# Read the image
orig = cv2.imread('input.jpg', 0)[:,15:]
# The detected boxes
boxes = [[4, 43],
[9, 47],
[76, 122],
[30, 74],
[10, 47],
[81, 125],
[84, 124],
[47, 90],
[1, 38]]
# make a copy of the image
img = orig.copy()
# sum the pixel intensities in each row (rows containing text have lower sums)
summ = img.sum(axis=1)
# threshold on the mean row sum
th = summ.mean()
# binarize the rows: 0 = background row, 1 = text row
img[summ > th, :] = 0
img[summ <= th, :] = 1
# collect the row indices where the binary profile changes, i.e. the line borders
rows = []
for y in range(img.shape[0] - 1):
    if img[y, 0] != img[y + 1, 0]:
        rows.append(y)
# sort the line border indices
rows.sort()
print(rows)
# a box belongs to line k if it encloses that line's [start, end] border pair
for box in boxes:
    for idx in range(0, len(rows), 2):
        if box[0] < rows[idx] and box[1] > rows[idx + 1]:
            print("box [{},{}]".format(box[0], box[1]), " belongs to line {}".format(idx // 2 + 1))
How can I create an altair/vega line chart from data and metadata tables where the data table is too large to fit in memory? AKA Can I use lookup to select a row that's not a join field
For simplicity, I'm showing it as if it were Altair+Pandas, but the actual data table is enormous, so I expect to load the tables from JSON or CSV.
I have a data table something like:
data = pd.DataFrame([
('A', 0.5, 0.45, 0.2, 0.25, 0.55, 0.45, 0.4, 0),
('B', 0.2, 0.3, 0.1, 0, 0.15, 0.25, 0.1, 0),
('C', 0.3, 0.25, 0.7, 0.75, 0.3, 0.3, 0.5, 1),
],
columns = ('gene', 's1r1t1', 's1r2t1', 's1r1t2', 's1r2t2', 's2r1t1', 's2r2t1', 's2r1t2', 's2r2t2'),
).set_index('gene')
(in reality, with 50K rows)
with corresponding metadata like:
md = pd.DataFrame([
('s1r1t1', 1, 1, 1),
('s1r2t1', 1, 2, 1),
('s1r1t2', 1, 1, 2),
('s1r2t2', 1, 2, 2),
('s2r1t1', 2, 1, 1),
('s2r2t1', 2, 2, 1),
('s2r1t2', 2, 1, 2),
('s2r2t2', 2, 2, 2),
],
columns = ('sample', 'subject', 'replicate', 'timepoint'),
).set_index('sample')
(in reality, with 4 replicates, 6 timepoints, and 5 experimental conditions)
and want to show graphs of the expression levels of a single gene by timepoint.
For a tiny set like this, I can graph it like:
data_melt = data.reset_index().melt(
id_vars='gene',
var_name='sample',
)
merged = pd.merge(
left=md.reset_index(),
right=data_melt,
left_on='sample',
right_on='sample',
)
dropdown = alt.selection_single(
fields=['gene'],
bind=alt.binding_select(options=data.index.to_list()),
name='gene',
init={'gene': data.index[0]}
)
alt.Chart(
merged
).mark_line(
).encode(
x='timepoint:O',
y='mean(value):Q'
).add_selection(
dropdown
).transform_filter(
dropdown
)
However, I'd like some way to only load a small fraction (ideally a single gene) from the data table.
I have been trying methods like:
# Create files and serve them via proxy
data_fname = 'data.csv'
data.to_csv(data_fname)
md_fname = 'metadata.csv'
md.to_csv('metadata.csv')
alt.data_transformers.enable('data_server')
# Build chart
lookup = alt.LookupData(
data=data_fname,
key='gene',
fields=md.index.to_list(),
)
c = alt.Chart(md_fname).mark_point().encode(
x='timepoint:O',
y='gene:Q'
).add_selection(
dropdown
).transform_filter(
dropdown
).transform_lookup(
from_=lookup,
lookup='gene',
)
But I have clearly messed up the lookup, since the chart doesn't show anything, whether in my development notebook or in an exported HTML on my web server, with the URLs fixed in the JSON to point to the actual files.
Is this possible, and if so, what am I doing wrong?
There's currently no built-in way to do dynamic data loading with Altair or Vega-Lite, but there are efforts underway to handle larger datasets and to push computations into a database backend. See https://github.com/vega/scalable-vega for details.
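In the meantime, one workaround (not built into Altair, and it gives up the client-side dropdown) is to pre-filter to a single gene in pandas and hand the chart only that small frame. A minimal sketch using the frames defined above, with the gene chosen server-side:
gene = 'A'  # hypothetical server-side selection
subset = (
    data.loc[[gene]]
    .reset_index()
    .melt(id_vars='gene', var_name='sample')
    .merge(md.reset_index(), on='sample')
)
alt.Chart(subset).mark_line().encode(
    x='timepoint:O',
    y='mean(value):Q',
)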
Using an example from another post, I'm adding a color bar to a scatter plot. The idea is that both the dot hue and the colorbar hue should conform to the same maximum and minimum, so that the colorbar reflects the range of values in the hue:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pd.DataFrame(list(zip(x, y, z)), columns=['x', 'y', 'z'])
colormap = matplotlib.cm.viridis
#A continuous color bar needs to be added independently
norm = plt.Normalize(df.z.min(), df.z.max())
sm = plt.cm.ScalarMappable(cmap=colormap, norm=norm)
sm.set_array([])
fig = plt.figure(figsize = (10,8), dpi=300)
ax = fig.add_subplot(1,1,1)
sb.scatterplot(x="x", y="y",
hue="z",
hue_norm=(0,255),
data=df,
palette=colormap,
ax=ax
)
ax.legend(bbox_to_anchor=(0, 1), loc=2, borderaxespad=0., title='hue from sb.scatterplot')
ax.figure.colorbar(sm).set_label('hue from sm')
plt.xlim(0,255)
plt.ylim(0,255)
plt.show()
Note how the hue from the scatterplot, even with hue_norm, ranges up to 300. In turn, the hue from the colorbar ranges from 0 to 255. From experimenting with values in hue_norm, it seems that matplotlib always rounds it off so that you have a "good" (even?) number of intervals.
My questions are:
Which one is showing an incorrect range: the scatterplot, the scatterplot legend, or the colorbar? And how can I correct it?
How could you retrieve min and max hue from the scatterplot (in this case 0 and 300, respectively), in order to set them as maximum and minimum of the colorbar?
Do you really need to use seaborn's scatterplot()? Using a numerical hue there is always quite messy.
The following code is much simpler and yields an unambiguous output:
fig, ax = plt.subplots()
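# Let matplotlib normalize z itself; the colorbar built from the returned
# mappable then shares exactly the same scale as the dot colors.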
g = ax.scatter(df['x'],df['y'], c=df['z'], cmap=colormap)
fig.colorbar(g)
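If instead you want the colors pinned to the full 0-255 range rather than the data's own min/max, matplotlib's scatter also accepts vmin and vmax:
g = ax.scatter(df['x'], df['y'], c=df['z'], cmap=colormap, vmin=0, vmax=255)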
As you can see below, I manually define the range for each yaxis and set the autorange option to False.
However, if you graph this, you will still find the yaxis1 range is 0 to 20 rather than 0 to 25. As a result, one of the bars sticks out of the chart.
How do I make it so that I can be certain every value will be contained within the yaxis range?
Edit: Additionally, the top grid line in the second row is not showing. If I rescale slightly, it will appear again. So the issue seems to be purely graphical. Any ideas are appreciated.
import plotly.graph_objs as go
import plotly.offline as py  # assuming offline plotting, for py.iplot below
py.init_notebook_mode()
from plotly import tools
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=['A', 'B'], shared_xaxes=False, shared_yaxes=True)
data = [[10, 4, 15, 20.5], [3, 12, 22.2], [6.5, 12, 26.2], [18, 4.2, 22.2]]
traces = [go.Bar(x=['Type A', 'Type B', 'Type C'], y=d) for d in data]
fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig['layout']['yaxis1'].update(title='', range=[0, 25], autorange=False)
fig['layout']['yaxis2'].update(title='', range=[0, 30], autorange=False)
py.iplot(fig)
So I tried your code and was able to replicate the issue.
Reason:
The cause is that, if you look at the top-left graph's yaxis, you can see there are 3 values [0, 10, 20], so there is a difference of 10 between each of the ticks. So when you set the range to [0, 25], the difference of 10 is not met, and hence we are not able to see 25 on the yaxis.
If we look at the bottom-left graph's yaxis, we can see that the value 30 obeys the difference of 10 between each of the ticks. Thus we are able to see 30 on the yaxis!
Solution:
If you look at the plotly documentation, we can use a particular property of the yaxis object, called dtick, to set the increment between ticks. (P.S.: a personal thank you to Maximilian Peters for helping to find the solution!) Plotly defines dtick as:
dtick (number or categorical coordinate string) Sets the step in-between ticks on this axis. Use with tick0. Must be a positive number, or special strings available to "log" and "date" axes. If the axis type is "log", then ticks are set every 10^(n*dtick) where n is the tick number. For example, to set a tick mark at 1, 10, 100, 1000, ... set dtick to 1. To set tick marks at 1, 100, 10000, ... set dtick to 2. To set tick marks at 1, 5, 25, 125, 625, 3125, ... set dtick to log_10(5), or 0.69897000433. "log" has several special values; "L<f>", where f is a positive number, gives ticks linearly spaced in value (but not position). For example tick0 = 0.1, dtick = "L0.5" will put ticks at 0.1, 0.6, 1.1, 1.6 etc. To show powers of 10 plus small digits between, use "D1" (all digits) or "D2" (only 2 and 5). tick0 is ignored for "D1" and "D2". If the axis type is "date", then you must convert the time to milliseconds. For example, to set the interval between ticks to one day, set dtick to 86400000.0. "date" also has the special value "M<n>", which gives ticks spaced by a number of months; n must be a positive integer. To set ticks on the 15th of every third month, set tick0 to "2000-01-15" and dtick to "M3". To set ticks every 4 years, set dtick to "M48".
So, when we set dtick to 5 and the range to [0, 25], we get the expected result!
Please try out the code below and let me know if your issue is resolved completely!
import pandas as pd
import plotly.offline as py_offline
import plotly.graph_objs as go
py_offline.init_notebook_mode()
from plotly import tools
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=['A', 'B'], shared_xaxes=False, shared_yaxes=True)
data = [[10, 4, 15, 20.5], [3, 12, 22.2], [6.5, 12, 26.2], [18, 4.2, 22.2]]
traces = [go.Bar(x=['Type A', 'Type B', 'Type C'], y=d) for d in data]
fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig['layout']['yaxis1'].update(title='', range=[0, 25], dtick=5, autorange=False)
fig['layout']['yaxis2'].update(title='', range=[0, 30], autorange=False)
py_offline.iplot(fig)
Suppose one has an array of observation times ts, each of which corresponds to some observed value in vs. The observation times are taken to be the number of elapsed hours (starting from zero) and can contain duplicates. I would like to find the indices that correspond to the maximum observed value per unique observation time. I am asking for the indices as opposed to the values, unlike a similar question I asked several months ago. This way, I can apply the same indices to various arrays. Below is a sample dataset that I would like to use to develop code for a much larger dataset.
import numpy as np
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])
My current approach is to split the array of values at each point where the time changes.
condition = np.where(np.diff(ts) != 0)[0]+1
ts_spl = np.split(ts, condition)
vs_spl = np.split(vs, condition)
print(ts_spl)
>> [array([0, 0]), array([1]), array([2]), array([3, 3, 3]), array([4, 4]), array([5]), array([6]), array([7]), array([8, 8]), array([9]), array([10])]
print(vs_spl)
>> [array([500, 600]), array([550]), array([700]), array([500, 500, 450]), array([800, 900]), array([700]), array([600]), array([850]), array([850, 900]), array([900]), array([900])]
In this case, duplicate max values at any duplicate times should be counted. Given this example, the returned indices would be:
[1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
# indices = 4,5,6 correspond to values = 500, 500, 450 ==> count indices 4,5
# I might modify this part of the algorithm to return either 4 or 5 instead of 4,5 at some future time
Though I have not yet been able to adapt this algorithm for my purpose, I think it must be possible to exploit the size of each previously-split array in vs_spl to keep an index counter. Is this approach feasible for a large dataset (10,000 elements per array before padding; 70,000 elements per array after padding)? If so, how can I adapt it? If not, what are some other approaches that may be useful here?
70,000 isn't that insanely large, so yes, it should be feasible. It is, however, faster to avoid the splitting and use the .reduceat method of the relevant ufuncs. reduceat is like reduce applied to chunks, but you don't have to provide the chunks; just tell reduceat where you would have cut to get them. For example, like so:
import numpy as np
N = 10**6
ts = np.cumsum(np.random.rand(N) < 0.1)
vs = 50*np.random.randint(10, 20, (N,))
#ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
#vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])
# flatnonzero is a bit faster than where
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
maxat = maxima == vs
indices = np.flatnonzero(maxat)
# if you want to know how many maxima at each hour
nmax = np.add.reduceat(maxat, condition[:-1])
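Applied to the sample arrays from the question, this reproduces the expected indices:
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
print(np.flatnonzero(maxima == vs))
# [ 1  2  3  4  5  8  9 10 11 13 14 15]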