Creating wordclouds with Altair

How do I create a wordcloud with Altair?
Vega and vega-lite provide wordcloud functionality, which I have used successfully in the past.
Therefore it should be possible to access it from Altair, if I understand correctly, and
I would prefer to express the visualizations in Python rather than embedded JSON.
All the examples for Altair I have seen involve standard chart types like
scatter plots and bar graphs.
I have not seen any involving wordclouds, networks, treemaps, etc.
More specifically how would I express or at least approximate the following Vega visualization in Altair?
def wc(pages, width=2**10.5, height=2**9.5):
    return {
        "$schema": "https://vega.github.io/schema/vega/v3.json",
        "name": "wordcloud",
        "width": width,
        "height": height,
        "padding": 0,
        "data": [
            {
                'name': 'table',
                # one record per page: the word, its tooltip text, and its weight
                'values': [{'text': pg.title, 'definition': pg.defn, 'count': pg.count} for pg in pages]
            }
        ],
        "scales": [
            {
                "name": "color",
                "type": "ordinal",
                "range": ["#d5a928", "#652c90", "#939597"]
            }
        ],
        "marks": [
            {
                "type": "text",
                "from": {"data": "table"},
                "encode": {
                    "enter": {
                        "text": {"field": "text"},
                        "align": {"value": "center"},
                        "baseline": {"value": "alphabetic"},
                        "fill": {"scale": "color", "field": "text"},
                        "tooltip": {"field": "definition", "type": "nominal", 'fontSize': 32}
                    },
                    "update": {
                        "fillOpacity": {"value": 1}
                    },
                },
                "transform": [
                    {
                        "type": "wordcloud",
                        "size": [width, height],
                        "text": {"field": "text"},
                        #"rotate": {"field": "datum.angle"},
                        "font": "Helvetica Neue, Arial",
                        "fontSize": {"field": "datum.count"},
                        #"fontWeight": {"field": "datum.weight"},
                        "fontSizeRange": [2**4, 2**6],
                        "padding": 2**4
                    }
                ]
            }
        ],
    }
Vega(wc(pages))

Altair's API is built on the Vega-Lite grammar, which includes only a subset of the plot types available in Vega. Word clouds cannot be created in Vega-Lite, so they cannot be created in Altair.

With mad respect to @jakevdp, you can construct a word cloud (or something word cloud-like) in Altair by recognizing that a word cloud chart consists of:
a dataset of words and their respective quantities
text marks encoded with each word, and optionally sized and/or colored by quantity
a "random" distribution of those text marks in 2D space.
One simple option to distribute marks is to add an additional 'x' and 'y' column to data, each element being a random sample from the range of your chosen x and y domain:
import random
def shuffled_range(n): return random.sample(range(n), k=n)
n = len(words_and_counts) # words_and_counts: a pandas data frame
x = shuffled_range(n)
y = shuffled_range(n)
data = words_and_counts.assign(x=x, y=y)
This isn't perfect as it doesn't explicitly prevent word overlap, but you can play with n and do a few runs of random number generation until you find a layout that's pleasing.
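One way to make those repeated runs reproducible is to seed the random generator, so that a layout you like can be recreated later. The following is only a sketch under assumptions: the seed parameter and the sample word list are hypothetical, and plain dictionaries stand in for the pandas data frame used above.

```python
import random


def shuffled_range(n, seed=None):
    """Return the integers 0..n-1 in random order; a fixed seed repeats the order."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return rng.sample(range(n), k=n)


# Hypothetical stand-in for words_and_counts (the answer uses a pandas data frame).
words_and_counts = [
    {"word": "vega", "count": 12},
    {"word": "altair", "count": 9},
    {"word": "chart", "count": 5},
    {"word": "mark", "count": 3},
]

n = len(words_and_counts)
xs = shuffled_range(n, seed=1)
ys = shuffled_range(n, seed=2)

# Attach one x slot and one y slot to every word, mirroring assign(x=x, y=y).
data = [dict(row, x=x, y=y) for row, x, y in zip(words_and_counts, xs, ys)]
```

Trying a handful of seed pairs and keeping the one that looks best gives the same "reroll until pleasing" workflow, with the difference that the chosen layout can be regenerated exactly.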
Having thus prepared your data you may specify the word cloud elements like so:
import altair as alt
base = alt.Chart(data).encode(
    x=alt.X('x:O', axis=None),
    y=alt.Y('y:O', axis=None)
).configure_view(strokeWidth=0)  # remove border
# one text mark per word, sized and colored by its count
word_cloud = base.mark_text(baseline='middle').encode(
    text='word:N',
    color=alt.Color('count:Q', scale=alt.Scale(scheme='goldred')),
    size=alt.Size('count:Q', legend=None)
)
Here's the result applied to the same dataset used in the Vega docs:

Related

Why do groups not respect the parent's padding, except at the top level?

What exactly is the behavior of padding for group marks in Vega? At the top-most level, the child groups respect the top-level padding, but this doesn't seem to be the case for the children's children: they don't respect their parent's padding.
For example, here I would expect to get a rectangle centered in a rectangle centered in another rectangle:
Open the Chart in the Vega Editor
Instead each rectangle seems to be anchored at the origin of the top-level coordinate system.
Note that replacing "padding": {"signal": "level_2_padding"} with "padding": {"value": 0} doesn't seem to have any effect, so I'm not even sure if inner groups can have padding?
How can I best implement nested groups that respect the parent's padding?
There is no padding property on a Group mark. Instead, you can access group properties using Field Values. Something like the following should work.
{
"$schema": "https://vega.github.io/schema/vega/v5.json",
"autosize": "none",
"config": {"group": {"stroke": "black"}},
"signals": [
{"name": "target_height", "value": 400},
{"name": "target_width", "value": 300},
{"name": "level_0_padding", "value": 64},
{"name": "level_1_padding", "update": "1/2 * level_0_padding"},
{"name": "level_2_padding", "update": "1/4 * level_0_padding"},
{"name": "level_0_height", "update": "target_height - 2*level_0_padding"},
{"name": "level_0_width", "update": "target_width - 2*level_0_padding"},
{"name": "level_1_width", "update": "level_0_width - 2*level_1_padding"},
{"name": "level_1_height", "update": "level_0_height - 2*level_1_padding"}
],
"width": {"signal": "level_0_width"},
"height": {"signal": "level_0_height"},
"padding": {"signal": "level_0_padding"},
"marks": [
{
"type": "group",
"signals": [
{
"name": "level_2_width",
"update": "level_1_width - 2*level_2_padding"
},
{
"name": "level_2_height",
"update": "level_1_height - 2*level_2_padding"
}
],
"encode": {
"update": {
"width": {"signal": "level_1_width"},
"height": {"signal": "level_1_height"},
"x": {"signal": "level_0_width-level_1_width - level_1_padding"},
"y": {"signal": "level_0_height-level_1_height - level_1_padding"},
"stroke": {"value": "red"},
"strokeOpacity": {"value": 0.5}
}
},
"marks": [
{
"type": "group",
"encode": {
"update": {
"width": {"signal": "level_2_width"},
"height": {"signal": "level_2_height"},
"x": {
"field": {"group": "width"},
"mult": 0.5,
"offset": {"signal": "-level_2_width/2"}
},
"y": {
"field": {"group": "height"},
"mult": 0.5,
"offset": {"signal": "-level_2_height/2"}
},
"stroke": {"value": "blue"},
"strokeOpacity": {"value": 0.5}
}
}
}
]
}
]
}
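The centering used for the inner group generalizes: a child's x is half of the parent's width minus half of the child's own width, and likewise for y and height. As a rough sketch, assuming the spec is being assembled in Python, a hypothetical helper (centered_group is not part of Vega or of any library) could build such a mark:

```python
import json


def centered_group(width_signal, height_signal, marks=None):
    """Build a Vega group mark centered inside its parent group.

    Hypothetical helper: it only assembles a dictionary that follows the
    pattern of the spec above, where {"field": {"group": ...}} reads a
    property of the enclosing group.
    """
    def centered(parent_prop, own_signal):
        return {
            "field": {"group": parent_prop},            # parent's width or height
            "mult": 0.5,                                # move to the parent's midpoint
            "offset": {"signal": f"-{own_signal}/2"},   # back up by half of own size
        }

    return {
        "type": "group",
        "encode": {
            "update": {
                "width": {"signal": width_signal},
                "height": {"signal": height_signal},
                "x": centered("width", width_signal),
                "y": centered("height", height_signal),
            }
        },
        "marks": marks or [],
    }


inner = centered_group("level_2_width", "level_2_height")
print(json.dumps(inner, indent=2))
```

Passing the result of one call as the marks list of another nests the groups, with each level centered relative to its own parent rather than to the top-level origin.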
I'll accept David's answer, but also post my own to complement David's.
Here's an alternative specification. Like David's spec, it uses the "x" and "y" group properties, but I think it's a bit simpler and closer to what I need: Open the Chart in the Vega Editor
One important caveat: using layout prevents x and y from working. That is, groups directly contained in a layout/grid cannot be offset using x or y.

Vega-Lite: Excessive padding when generating a chart with `vl2svg`

I'm using vega-lite's vl2svg CLI helper to headlessly generate chart SVGs on a Heroku server. This works great for the most part, but I'm noticing that the generated SVGs have way too much padding on the left side, to the left of the labels, and the amount of "excess padding" grows as the labels get longer.
For example, here's a simple bar chart with 3 bars and 3 labels. In the online Vega-Lite editor, the chart renders correctly as pictured below (i.e. no excessive padding). But when I generate the same chart via vl2svg on my local computer (using the same package versions: Vega 5.21.0, Vega-Lite 5.2.0), the chart has around 100px of excess padding on the left side before the labels begin.
The chart's VL spec:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"values": [{"category":"Male","count":112},{"category":"Female","count":122},{"category":"Other (eg. non-binary)","count":31}]
},
"encoding": {
"y": {"field": "category", "sort": null, "title": "", "type": "nominal", "axis": {}},
"x": {"field": "count", "title": "", "type": "quantitative", "axis": {}}
},
"layer": [
{"mark": "bar"},
{
"mark": {"type": "text", "dx": 3, "dy": 0, "xOffset": 0, "yOffset": 0, "align": "left", "baseline": "middle", "limit": 20, "fontSize": 8},
"encoding": {"text": {"field": "count"}}
}
]
}
Steps I used to generate the svg headlessly (on MacOS 10.15.7, in case the OS matters):
npm install vega-lite@5.2.0
write the above spec to test.json
node_modules/vega-lite/bin/vl2svg test.json > test.svg
The SVG chart rendered by the online editor (I added the red border to highlight the difference in padding):
The SVG chart rendered by vl2svg run locally:
Why does this difference in padding / spacing exist? What can I do to make the headlessly generated SVG render identically to what the online editor generates? Is this a bug, or user error on my end?
NOTE: The excessive spacing mostly disappears when the labels are uniformly short. For example if I reword that third label to "Other", then re-render the svg using vl2svg, there's only ~5px of excessive left padding, as shown below:

Can't get transform_filter to work in Altair

For my teaching notes I am trying to implement this vega-lite example in Altair:
{
"data": {"url": "data/seattle-weather.csv"},
"layer": [{
"params": [{
"name": "brush",
"select": {"type": "interval", "encodings": ["x"]}
}],
"mark": "bar",
"encoding": {
"x": {
"timeUnit": "month",
"field": "date",
"type": "ordinal"
},
"y": {
"aggregate": "mean",
"field": "precipitation",
"type": "quantitative"
},
"opacity": {
"condition": {
"param": "brush", "value": 1
},
"value": 0.7
}
}
}, {
"transform": [{
"filter": {"param": "brush"}
}],
"mark": "rule",
"encoding": {
"y": {
"aggregate": "mean",
"field": "precipitation",
"type": "quantitative"
},
"color": {"value": "firebrick"},
"size": {"value": 3}
}
}]
}
Getting the separate charts (bar and rule) to work was easy, but I ran into issues making mark_rule interactive.
import altair as alt
from vega_datasets import data
df = data.seattle_weather()
selection = alt.selection_interval(encodings=['x'])
base = alt.Chart(df).add_selection(selection)
bar_i = base.mark_bar().encode(
x="month(date):T",
y="mean(precipitation):Q",
opacity=alt.condition(selection, alt.value(1.0), alt.value(0.7)))
rule_i = base.mark_rule().transform_filter(selection).encode(y="mean(precipitation):Q")
(bar_i + rule_i).properties(width=600)
The error reads
Javascript Error: Duplicate signal name: "selector013_scale_trigger"
This usually means there's a typo in your chart specification. See the javascript console for the full traceback.
It looks like the chart you're interested in creating is part of Altair's example gallery: https://altair-viz.github.io/gallery/selection_layer_bar_month.html
import altair as alt
from vega_datasets import data
source = data.seattle_weather()
brush = alt.selection(type='interval', encodings=['x'])
bars = alt.Chart(source).mark_bar().encode(
    x='month(date):O',
    y='mean(precipitation):Q',
    opacity=alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7)),
).add_selection(
    brush
)
line = alt.Chart(source).mark_rule(color='firebrick').encode(
    y='mean(precipitation):Q',
    size=alt.SizeValue(3)
).transform_filter(
    brush
)
bars + line
The error you're seeing comes from the fact that base includes the selection, and both layers are derived from base, so the same selection is declared twice within the single chart.
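To make the "declared twice" diagnosis concrete, here is a small sketch that scans a layered spec dictionary for selection names that appear in more than one layer. Everything in it is an assumption for illustration: duplicated_selections is a hypothetical helper, and the two dictionaries are hand-written approximations of the shape of the generated specs, not actual Altair output.

```python
from collections import Counter


def duplicated_selections(spec):
    """Return selection/param names declared in more than one layer."""
    counts = Counter()
    for layer in spec.get("layer", []):
        # Vega-Lite v4 style: {"selection": {"name": {...}}}
        counts.update(layer.get("selection", {}).keys())
        # Vega-Lite v5 style: {"params": [{"name": "...", "select": {...}}]}
        counts.update(p["name"] for p in layer.get("params", []) if "select" in p)
    return sorted(name for name, c in counts.items() if c > 1)


interval = {"type": "interval", "encodings": ["x"]}

# Shape of the failing chart: both layers inherit the selection from base.
broken = {
    "layer": [
        {"mark": "bar", "selection": {"selector013": interval}},
        {"mark": "rule", "selection": {"selector013": interval},
         "transform": [{"filter": {"selection": "selector013"}}]},
    ]
}

# Shape of the gallery version: only the bar layer declares the selection.
fixed = {
    "layer": [
        {"mark": "bar", "selection": {"selector013": interval}},
        {"mark": "rule", "transform": [{"filter": {"selection": "selector013"}}]},
    ]
}

print(duplicated_selections(broken))  # ['selector013']
print(duplicated_selections(fixed))   # []
```

In the fixed shape the rule layer still filters on the selection; it just no longer declares it a second time.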

Data is getting embedded via a local json file

I'm trying to plot some data that is in a pandas dataframe, cdfs:
alt.Chart(cdfs).mark_line().encode(
x = alt.X('latency:Q', scale=alt.Scale(type='log'), axis=alt.Axis(format="", title='Response_time (ms)')),
y = alt.Y('percentile:Q', axis=alt.Axis(format="", title='Cumulative Fraction')),
color='write_size:N',
)
The issue is that when viewing the source of the resultant plot there is just a url to a json file. That json file can't be found and hence the plots are appearing to be blank (no data).
{
"config": {"view": {"continuousWidth": 400, "continuousHeight": 300}},
"data": {
"url": "altair-data-78b044f23db74f7d4408fba9f31b9ea9.json",
"format": {"type": "json"}
},
"mark": "line",
"encoding": {
"color": {"type": "nominal", "field": "write_size"},
"x": {
"type": "quantitative",
"axis": {"format": "", "title": "Response_time (ms)"},
"field": "latency",
"scale": {"type": "log"}
},
"y": {
"type": "quantitative",
"axis": {"format": "", "title": "Cumulative Fraction"},
"field": "percentile"
}
},
"$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json"
}
This code was previously working (displaying the data on the chart); however, I restarted the JupyterLab server it's running on between now and then.
Hence I'm wondering why the data is getting embedded via a url rather than directly all of a sudden?
At some point in your session, you must have run
alt.data_transformers.enable('json')
If you want to restore the default data transformer which embeds data directly into the chart, run
alt.data_transformers.enable('default')
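A quick way to confirm which mode is in effect is to look at the "data" entry of the generated spec, for example the dictionary returned by chart.to_dict(). The helper below is a hypothetical sketch (data_source_kind is not an Altair function) and is applied here to a trimmed copy of the spec from the question:

```python
def data_source_kind(spec):
    """Classify how a Vega-Lite spec dictionary refers to its data."""
    data = spec.get("data", {})
    if "url" in data:
        return "url"      # data lives in a separate file that must be reachable
    if "values" in data:
        return "inline"   # records are embedded directly in the "data" entry
    if "name" in data and data["name"] in spec.get("datasets", {}):
        return "named"    # records are embedded under the top-level "datasets" key
    return "unknown"


# Trimmed copy of the spec from the question.
spec_from_question = {
    "data": {
        "url": "altair-data-78b044f23db74f7d4408fba9f31b9ea9.json",
        "format": {"type": "json"},
    },
    "mark": "line",
}

print(data_source_kind(spec_from_question))  # url
```

A result of "url" means the chart depends on an external file; after switching back to the default transformer the records should travel inside the spec itself.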

Vega lite use one data field for the axis and another for the label

In Vega Lite, is it possible to use one field of the data values as the axis value, and another field as the label?
If this is my vega lite spec, then the graph works correctly, but shows the dates on the x-axis. How can I show the day names on the x-axis instead?
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"description": "basic line graph",
"data": {
"values": [
{"date":"2017-08-15", "dayName":"Tue","item":"foo","count":"0"},
{"date":"2017-08-16", "dayName":"Wed","item":"foo","count":"11"},
{"date":"2017-08-17", "dayName":"Thu","item":"foo","count":"7"},
{"date":"2017-08-18", "dayName":"Fri","item":"foo","count":"28"},
{"date":"2017-08-19", "dayName":"Sat","item":"foo","count":"0"},
{"date":"2017-08-20", "dayName":"Sun","item":"foo","count":"0"},
{"date":"2017-08-21", "dayName":"Mon","item":"foo","count":"0"}
]},
"mark": {
"type": "line",
"interpolate": "monotone"
},
"encoding": {
"x": {"field": "date", "type": "temporal"},
"y": {"field": "count", "type": "quantitative"}
}
}
It shows the date field, August 16, August 17 on the x-axis. How can I make it show the dayName field instead? It should show Tue, Wed, and so on.
You can use timeUnit.
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"description": "basic line graph",
"data": {
"values": [
{"date":"2017-08-15", "dayName":"Tue","item":"foo","count":"0"},
{"date":"2017-08-16", "dayName":"Wed","item":"foo","count":"11"},
{"date":"2017-08-17", "dayName":"Thu","item":"foo","count":"7"},
{"date":"2017-08-18", "dayName":"Fri","item":"foo","count":"28"},
{"date":"2017-08-19", "dayName":"Sat","item":"foo","count":"0"},
{"date":"2017-08-20", "dayName":"Sun","item":"foo","count":"0"},
{"date":"2017-08-21", "dayName":"Mon","item":"foo","count":"0"}
]},
"mark": {
"type": "line",
"interpolate": "monotone"
},
"encoding": {
"x": {
"timeUnit": "day",
"field": "date",
"type": "temporal"
},
"y": {"field": "count", "type": "quantitative"}
}
}
If you want to customize the label format, you can also add a format to the axis.
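As an illustration of the axis format option, the x encoding could look like the dictionary below, written as Python so it can be dumped to JSON and pasted into the spec. This is a sketch under assumptions: "%a" is the d3-time-format code for the abbreviated weekday name, and the axis title is an arbitrary choice that does not come from the original answer.

```python
import datetime
import json

# Encoding from the answer, plus an explicit axis label format.
x_encoding = {
    "timeUnit": "day",
    "field": "date",
    "type": "temporal",
    "axis": {"format": "%a", "title": "Day"},  # the title is an illustrative choice
}

print(json.dumps({"x": x_encoding}, indent=2))

# Cross-check against the data: 2017-08-15 is listed with dayName "Tue".
first_day = datetime.date(2017, 8, 15)
print(first_day.weekday())  # Monday is 0, so Tuesday prints 1
```

Using "%A" instead would request full weekday names.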
