How to make an altair plot within an IF statement? - python-3.x

The situation seems to be quite simple: I am working in a Jupyter Lab file with several Altair plots, which eventually make the file too large to run and to save. Since I don't need to see these plots every single time, I figured I could avoid this by specifying something like plotAltair = True at the beginning of the script and then nesting each Altair plot in if statements. As simple as this may sound, for some reason it doesn't appear to work. Am I missing out on something obvious? [edit: turns out I was]
For instance:
import altair as alt
import os
import pandas as pd
import numpy as np
lengths = np.random.randint(0,100,200)
lengths_list = lengths.tolist()
labels = [str(i) for i in lengths_list]
peak_lengths = pd.DataFrame.from_dict({'coords': labels,
                                       'lengths': lengths_list},
                                      orient='columns')
What works:
alt.Chart(peak_lengths).mark_bar().encode(
    x = alt.X('lengths:Q', bin=True),
    y='count(*):Q'
)
What doesn't work:
plotAltair = True
if plotAltair:
    alt.Chart(peak_lengths).mark_bar().encode(
        x = alt.X('lengths:Q', bin=True),
        y='count(*):Q'
    )
** Obs.: I have already tried alt.data_transformers.enable('json') as a way of reducing file size, and it is not working either, but let's please focus on the simpler question rather than on this.

Short answer: use chart.display()
Long answer: Jupyter notebooks in general will only display things if you tell them to. For example, this code will not result in any output:
if x:
    x + 1
You are telling the notebook to evaluate x + 1, but not to do anything with it. What you need to do is tell the notebook to print the result, either implicitly by putting it as the last line in the main block of the cell, or explicitly by asking for it to be printed when the statement appears anywhere else:
if x:
    print(x + 1)
It is similar for Altair charts, which are just normal Python objects. If you put the chart at the end of the cell, you are implicitly asking for the result to be displayed, and Jupyter will display it as it will any variable. If you want it to be displayed from any other location in the cell, you need to explicitly ask that it be displayed using the IPython.display.display() function:
from IPython.display import display
if plotChart:
    chart = alt.Chart(data).mark_point().encode(x='x', y='y')
    display(chart)
Because this extra import is a bit verbose, Altair provides a .display() method as a convenience function to do the same thing:
if plotChart:
    chart = alt.Chart(data).mark_point().encode(x='x', y='y')
    chart.display()
Note that calling .display() on multiple charts is the way that you can display multiple charts in a single cell.
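One way to apply this to the flag pattern from the question is a small helper that builds and displays a chart only when a switch is on. This is a sketch, not part of Altair: show_if and its parameters are hypothetical names, and it assumes only that the object returned by the factory has a .display() method, as Altair charts do.

```python
def show_if(enabled, make_chart):
    """Build and display a chart only when `enabled` is true.

    `make_chart` is a zero-argument callable (hypothetical convention), so
    nothing is constructed or embedded in the notebook when the switch is off.
    """
    if not enabled:
        return None
    chart = make_chart()
    chart.display()  # explicit display works anywhere in a cell, including inside `if`
    return chart


# Hypothetical notebook usage, assuming `alt` and `peak_lengths` from the question:
# plotAltair = False
# show_if(plotAltair, lambda: alt.Chart(peak_lengths).mark_bar().encode(
#     x=alt.X('lengths:Q', bin=True), y='count(*):Q'))
```

Passing a callable rather than a finished chart means the chart specification is never created when the flag is off, which is the point when notebook size is the concern.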

Related

Is there a way to select or highlight last or first "n" data points in Altair?

One of the things I have found wanting lately is the ability to highlight or select just the last n data points in Altair. For example, for a daily updated time series data, selecting/highlighting the last 7 days data window.
The issue with condition is that you have to explicitly specify the date or a value, from which the selection/highlight happens. One drawback of this is that in case of a time series data that updates fairly frequently, it becomes a manual task.
One possible solution is to just use native Python and if the x axis is datetime data, then write the code in such a way that it programmatically takes care of things perhaps using f-strings.
I was wondering, despite these two solutions above, is there a way natively built into Altair/Vega-Lite to select the last/first n data points?
A contrived example using f-strings -
index = 7 #a perhaps bad way to highlight last 2 data points
data = pd.DataFrame({'time':[0,1,2,3,4,5,6,7,8,9], 'value':[1,2,4,8,16,15,14,13,12,11]})
bar = alt.Chart(data).mark_bar(opacity=1, width=15).encode(
x='time:T',
y='value:Q',
color = alt.condition(alt.datum.time>f'{index}', alt.value('red'), alt.value('steelblue'))
)
text = bar.mark_text(align='center', dy=-10).encode(
text='value:Q'
)
bar+text
You can do this using a window transform, in a similar way to the Top-K Items example:
import altair as alt
import pandas as pd
data = pd.DataFrame({'time':[0,1,2,3,4,5,6,7,8,9], 'value':[1,2,4,8,16,15,14,13,12,11]})
num_items = 2
base = alt.Chart(data).transform_window(
rank='rank()',
sort=[alt.SortField('time', order='descending')]
)
bar = base.mark_bar(opacity=1, width=15).encode(
x='time:T',
y='value:Q',
color = alt.condition(alt.datum.rank<=num_items, alt.value('red'), alt.value('steelblue'))
)
text = bar.mark_text(align='center', dy=-10).encode(
text='value:Q'
)
bar+text

How to add traces in plotly.express

I am very new to python and plotly.express, and I find it very confusing...
I am trying to use the principle of adding different traces to my figure, using example code shown here https://plotly.com/python/line-charts/, Line Plot Modes, #Create traces.
BUT I get my data from a .CSV file.
import plotly.express as px
import plotly as plotly
import plotly.graph_objs as go
import pandas as pd
data = pd.read_csv(r"C:\Users\x.csv")
fig = px.scatter(data, x="Time", y="OD", color="C-source", size="C:A 1 ratio")
fig = px.line(data, x="Time", y="OD", color="C-source")
fig.show()
The above lines produce scatter/line plots with the correct data, but the data is mixed together. I have data from 2 different sources, marked by a column named "Strain" in my .csv file, that I would like the chart to reflect.
Is the traces option a possible way to do it, or is there another way?
You can add traces using an Express plot by using .select_traces(). Something like:
fig.add_traces(
list(px.line(...).select_traces())
)
Note the need to convert to list, since .select_traces() returns a generator.
It looks like you probably want the lines with the scatter dots as well on a single plot?
You're setting fig to equal px.scatter() and then setting (changing) it to equal px.line(). When set to line, the scatter plot is overwritten.
You're already importing graph objects so you can use add_trace with go, something like this:
fig.add_trace(go.Scatter(x=data["Time"], y=data["OD"], mode='markers', marker=dict(color=data["C-source"], size=data["C:A 1 ratio"])))
Depending on how your data is set up, you may need to add each C-source separately doing something like:
x=data.query("`C-source` == 'Term'")["Time"], ... , name='Term'
(The backticks inside the query string are needed because the column name contains a hyphen.)
Here are a few references with examples and options you can use to set up your scatter:
Scatter plot examples  
Marker styles  
Scatter arguments and attributes
You can use the approach described in "Plotly: How to combine scatter and line plots using Plotly Express?":
fig3 = go.Figure(data=fig1.data + fig2.data)
fig1.data and fig2.data are ordinary tuples that hold all the info needed for a plot, and the + just concatenates them.
A more convenient and scalable approach for many figures:
import operator
import functools
# this will hold all figures until they are combined
all_figures = []
# data_collection: dictionary with Pandas dataframes
for df_label in data_collection:
    df = data_collection[df_label]
    fig = px.line(df, x='Date', y=['Value'])
    all_figures.append(fig)
# now you can concatenate all the data tuples
# by using the programmatic add operator
fig3 = go.Figure(data=functools.reduce(operator.add, [_.data for _ in all_figures]))
fig3.show()
Thanks for taking the time to help me out. I ended up with two solutions that worked, of which using "facet_col" to divide the plot into two subplots (one for each strain) was the simplest.
https://plotly.com/python/axes/
Thanks. This also worked for me, where Fig_Set_B is a list of scatter plots:
# create a tuple of the first line plot from each of the first 6 plots in plot set Fig_Set_B
fig_combined = go.Figure(data= tuple(Fig_Set_B[x].data[0] for x in range(6)) )
fig_combined.show()

How to render seaborn objects repeatedly?

I am using Python 3.7, and I tried this in both Spyder and Jupyter Notebook.
I used a seaborn sample dataset as an example.
When I run the following code, the figure is rendered automatically in the IPython console without calling plt.show(), which differs from the instructions in some previous posts.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
df = sns.load_dataset('iris')
g = sns.pairplot(df, hue = 'species', height = 2.5)
However, I want to repeatedly show the seaborn object. How can I render g?
I've tried
plt.show(g)
g.show()
etc...
but none of them works. I do not want to have to re-plot the figure every time I want to show it.
As long as you put the previously created figure object as the last line of a new cell, the figure will be rendered again, including any elements added since it was first drawn.
In your case, if g = ... is in your first cell, add f = plt.gcf() there to get the Figure object, then put f as the last line of any later cell.
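A minimal sketch of the idea with plain Matplotlib, assuming nothing beyond a Figure that was created earlier and kept in a variable. For a seaborn grid such as the one returned by pairplot, the underlying Figure should be reachable as g.fig (g.figure in newer seaborn releases). The Agg backend line is only there so the sketch also runs as a script; in a notebook it can be dropped.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# "First cell": create the figure once and keep a reference to it
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# "Later cell": modify the same Figure and render it again without re-plotting
ax.set_title("added later")
buf = io.BytesIO()
fig.savefig(buf, format="png")  # in a notebook, a last line of just `fig` displays it
png_bytes = buf.getvalue()
```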

Need help in creating a function to plot a Matplotlib GridSpec

I have a dataset with 80 variables. I am interested in creating a function that will automate the creation of a 20 X 4 GridSpec in Matplotlib. Each subplot would either contain a histogram or a barplot for each of the 80 variables in the data. As a first step, I successfully created two functions (I call them 'counts' and 'histogram') that contain the layout of the plot that I want. Both of them work when tested on individual variables. As a next step, I attempted to create a function that would take the column names, loop through a conditional to test whether the data type is an object or otherwise and call the right function based on the datatype as a new subplot. Here is the code that I have so far:
#Create list of coordinates we will need for subplot specification:
A = np.arange(21)
B = np.arange(4)
coords = []
for i in A:
    for j in B:
        coords.append([A[i], B[j]])
#Create the gridspec and layout the figure
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12,6))
gs = gridspec.GridSpec(2,4)
#Function that relies on what we've done above:
def grid(cols=['MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley']):
    for i in cols:
        for vals in coords:
            if str(train[i].dtype) == 'object':
                plt.subplot('gs'+str(vals))
                counts(cols)
            else:
                plt.subplot('gs'+str(vals))
                histogram(cols)
When attempted, this code returns an error:
ValueError: Single argument to subplot must be a 3-digit integer
To help you visualize what I am hoping to achieve, I attach the screenshot below, which was produced by the line-by-line coding (with my helper functions) that I am trying to avoid:
Can anyone help me figure out where I am going wrong? I would appreciate any advice. Thank you!
The line plt.subplot('gs'+str(vals)) cannot work, which is also what the error tells you: it passes a string such as 'gs[0, 0]' to subplot instead of indexing the GridSpec object gs.
As can be seen from the matplotlib GridSpec tutorial, it needs to be
ax = plt.subplot(gs[0, 0])
So in your case you may use the values from the list as
ax = plt.subplot(gs[vals[0], vals[1]])
Mind that the coords list must contain exactly the n*m positions of the grid if the gridspec is defined as gs = gridspec.GridSpec(n,m); any coordinate outside that range raises an IndexError.
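Putting the pieces together, a grid function along these lines might look like the sketch below. It is not the asker's code: plot_grid and its parameters are hypothetical, the train frame is made-up sample data, and the asker's counts and histogram helpers (not shown in the question) are replaced by plain ax.bar and ax.hist calls. Each column is paired with exactly one grid cell instead of looping over all coordinates for every column.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_grid(df, cols, n_rows, n_cols, figsize=(12, 6)):
    """One subplot per column: histogram if numeric, bar chart of counts otherwise."""
    if len(cols) > n_rows * n_cols:
        raise ValueError("grid is too small for the requested columns")
    fig = plt.figure(figsize=figsize)
    gs = gridspec.GridSpec(n_rows, n_cols)
    # (row, col) positions in reading order, paired one-to-one with the columns
    cells = itertools.product(range(n_rows), range(n_cols))
    for col, (r, c) in zip(cols, cells):
        ax = fig.add_subplot(gs[r, c])  # index the GridSpec object, not a string
        if pd.api.types.is_numeric_dtype(df[col]):
            ax.hist(df[col].dropna())
        else:
            counts = df[col].value_counts()
            ax.bar([str(k) for k in counts.index], counts.to_numpy())
        ax.set_title(col)
    fig.tight_layout()
    return fig


# Made-up sample data using column names from the question
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "MSZoning": rng.choice(["RL", "RM", "FV"], size=50),
    "LotFrontage": rng.uniform(20, 120, size=50),
    "LotArea": rng.normal(10000, 2000, size=50),
    "Street": rng.choice(["Pave", "Grvl"], size=50),
})
fig = plot_grid(train, list(train.columns), n_rows=2, n_cols=2)
```

For the 80-variable case the same call would use n_rows=20 and n_cols=4 with a much taller figsize.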

Matplotlib - Stacked Bar Chart with ~1000 Bars

Background:
I'm working on a program to show a 2d cross section of 3d data. The data is stored in a simple text csv file in the format x, y, z1, z2, z3, etc. I take a start and end point and flick through the dataset (~110,000 lines) to create a line of points between these two locations, and dump them into an array. This works fine, and fairly quickly (takes about 0.3 seconds). To then display this line, I've been creating a matplotlib stacked bar chart. However, the total run time of the program is about 5.5 seconds. I've narrowed the bulk of it (3 seconds worth) down to the code below.
'values' is an array with the x, y and z values plus a leading identifier, which isn't used in this part of the code. The first plt.bar is plotting the bar sections, and the second is used to create an arbitrary floor of -2000. In order to generate a continuous looking section, I'm using an interval between each bar of zero.
import matplotlib.pyplot as plt
for values in crossSection:
    prevNum = None
    layerColour = None
    if values != None:
        for i in range(3, len(values)):
            if values[i] != 'n':
                num = float(values[i].strip())
                if prevNum != None:
                    plt.bar(spacing, prevNum-num, width=interval, \
                            bottom=num, color=layerColour, \
                            edgecolor=None, linewidth=0)
                prevNum = num
                layerColour = layerParams[i].strip()
        if prevNum != None:
            plt.bar(spacing, prevNum+2000, width=interval, bottom=-2000, \
                    color=layerColour, linewidth=0)
    spacing += interval
I'm sure there's a more efficient way to do this, but I'm new to Matplotlib and still unfamiliar with its capabilities. The other main use of time in the code is:
plt.savefig('output.png')
which takes about a second, but I figure this is to be expected to save the file and I can't do anything about it.
Question:
Is there a faster way of generating the same output (a stacked bar chart or something that looks like one) by using plt.bar() better, or a different Matplotlib function?
EDIT:
I forgot to mention in the original post that I'm using Python 3.2.3 and Matplotlib 1.2.0
Leaving this here in case someone runs into the same problem...
While not exactly the same as using bar(), with a sufficiently large dataset (large enough that using bar() takes a few seconds) the results are indistinguishable from stackplot(). If I sort the data into layers using the method given by tcaswell and feed it into stackplot() the chart is created in 0.2 seconds, rather than 3 seconds.
EDIT
Code provided by tcaswell to turn the data into layers:
import numpy as np
accum_values = []
for values in crossSection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)
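A rough sketch of the stackplot() variant follows. The data here is random and only illustrates the shape of the call: it assumes a 2-D array with one row per layer holding layer thicknesses at each position, so elevation values like those in the question would first need converting to thicknesses. The names layers, colours and polys are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np

interval = 1.0
rng = np.random.default_rng(0)
# Made-up thicknesses: 3 layers at 1000 positions along the section
layers = rng.uniform(1.0, 5.0, size=(3, 1000))
colours = ["saddlebrown", "goldenrod", "slategrey"]

x = interval * np.arange(layers.shape[1])
fig, ax = plt.subplots()
# One filled polygon per layer instead of one rectangle per bar
polys = ax.stackplot(x, layers, colors=colours, linewidth=0)
# fig.savefig('output.png')
```

The speed-up comes from the artist count: three polygons are drawn here, against several thousand rectangles in the per-bar loop.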
It looks like you are drawing each bar individually; you can pass sequences to bar instead (see this example).
I think something like:
import numpy as np
import matplotlib.pyplot as plt
accum_values = []
for values in crossSection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)
ax = plt.gca()
spacing = interval*np.arange(len(accum_values[0]))
for data, color in zip(accum_values, layer_params):
    ax.bar(spacing, data, bottom=bottom, color=color, linewidth=0, width=interval)
    bottom += data
will be faster (because each call to bar creates one BarContainer and I suspect the source of your issues is you were creating one for each bar, instead of one for each layer).
I don't really understand what you are doing with the bars that have tops below their bottoms, so I didn't try to implement that, so you will have to adapt this a bit.
