Vega-Lite/Altair: extend regression line to the edges of the graph

I'm trying to find a way to extend regression lines in vega-lite/altair charts to the edge of the chart. As of now, applying a regression transform to a dataset results in data points that only stretch to the bounding box of the original dataset. Is it possible to extend this range to the x/y extents of the chart? In the picture below, the black line is what vega-lite calculates by default. Extending the line to the edges, as shown in yellow, is what I'm trying to achieve.
EDIT
When specifying the extent property on the transform_regression call, it seems to adjust the y variable instead of the x variable. Maybe I'm grossly misunderstanding something, but perhaps it has to do with the fact that my x variable holds dates, which might behave differently?
When I specify the extent like so:
CDR_base.transform_regression(
    'per_capita',
    'year',
    groupby=['region'],
    extent=[2000, 2100]
).mark_line()
I would expect the regression lines to extend from 2000 to 2100. For some reason, the extent seems to be applied to the y axis instead.

You can use the extent argument of the regression transform to control the extent of the line. For example, here is a dataset with a default line:
import altair as alt
import pandas as pd
import numpy as np
np.random.seed(2)
df = pd.DataFrame({
    'x': np.random.randint(0, 100, 10),
    'y': np.random.randint(0, 100, 10)
})
points = alt.Chart(df).mark_point().encode(
    x='x:Q',
    y='y:Q'
)
points + points.transform_regression('x', 'y').mark_line()
And here it is with extent set:
points + points.transform_regression('x', 'y', extent=[0, 90]).mark_line()

Related

Is there a way to apply 3d-like appearance (like bevel) to 2d matplotlib plots?

I've been working for a while with the matplotlib package in Python, and I know that you can do 2D graphs (usually involving two "dimensions", x and y) or 3D graphs (with functions like plot3D). However, I am unable to find documentation about giving a '3D aesthetic' to a 2D plot.
That is, giving the plot a bit of volume, some shadows, etc.
To give an example, let's say I wanted to create a donut chart in matplotlib. A first draft could be something like this:
import matplotlib.pyplot as plt
#Given an array of values 'values' and,
#optionally, an array of colors 'colors'
#and an array of labels 'labels':
ax = plt.subplot()
ax.pie(
    x = values,
    labels = labels,
    colors = colors
)
center_circle = plt.Circle((0,0), radius = 0.5, fc = "white")
ax.add_artist(center_circle)
plt.show()
However, a quick graph with Excel can give a much more appealing result:
Looking at the documentation of plt.pie, I was not able to find anything significant, apart from the parameter shadow, which when set to True, gives an underwhelming result:
Also, I would like to add effects such as a bevel (like the 3D look of the borders of each wedge of the pie) and other styling. How could I improve the look of my graph with matplotlib? Is it even possible to accomplish this with this library?
One solution might be to use a different library. I am not familiar with seaborn, but I know it is also a powerful visualisation library. The same goes for plotly. Do any of these libraries allow this kind of customisation?
There are a whole bunch of options on the matplotlib website for pie charts here: https://matplotlib.org/stable/gallery/pie_and_polar_charts/index.html
Matplotlib does not have a built-in option to add a bevel to a 2D pie chart or to any other type of chart directly.
But you could do this (a raised shadow) for a 3D effect:
import matplotlib.pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0) # only "explode" the 2nd slice (i.e. 'Hogs')
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
This gives a pie chart with a drop shadow under each wedge and the 'Hogs' slice pulled out.
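For the donut shape from the question, one option that stays within matplotlib is to give the wedges a width through wedgeprops, instead of covering the centre with a white circle, and to separate the wedges with a white edge. This is only a sketch with made-up values; it gives a cleaner flat look with a shadow, not a true bevel.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration
labels = ["Frogs", "Hogs", "Dogs", "Logs"]
sizes = [15, 30, 45, 10]

fig, ax = plt.subplots()
wedges, texts = ax.pie(
    sizes,
    labels=labels,
    startangle=90,
    shadow=True,  # built-in drop shadow
    # A width smaller than the radius (1 by default) leaves a hole in the middle
    wedgeprops={"width": 0.5, "edgecolor": "white", "linewidth": 2},
)
ax.set_aspect("equal")  # keep the donut circular
# Call plt.show() or fig.savefig("donut.png") to view the result
```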

How to plot a histogram with plot.hist for continuous data in a dataframe in pandas?

In this data set, I need to plot pH, which holds continuous data, on the x axis, group the pH values by the quality value, and plot the histogram. Most of the resources I found only show solutions that use randomly generated data. I tried this piece of code:
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
where I removed the randomly generated data, but I received an error:
File "<ipython-input-16-9afc718b5558>", line 1
plt.hist(, density=True, bins=1)
^
SyntaxError: invalid syntax
What is the proper way to plot my data? I want to feed the histogram with the data found in the data set, not with randomly generated data.
Your Error
The immediate problem in your code is the missing data argument in the plt.hist() call.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't get the plot broken down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn which works with "melted" data like you have. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')
There are several ways to represent multiple histograms. What they all have in common is that the data have to be transformed from long to wide format, meaning that each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.
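To see why a single bin is not informative, it can help to look at the numbers a histogram would draw. The sketch below uses np.histogram, which as far as I know performs the same binning as plt.hist, on made-up pH values: with bins=1 and density=True the result is one bar whose height is 1 divided by the data range, whatever the data look like.

```python
import numpy as np

rng = np.random.default_rng(123)
ph = rng.uniform(2.8, 3.8, size=300)  # made-up pH values

# One bin: a single bar spanning the whole data range
dens1, edges1 = np.histogram(ph, bins=1, density=True)
width = edges1[1] - edges1[0]
print(dens1[0], 1 / width)  # the two values agree

# Ten bins: the bar areas still sum to 1, but the shape becomes visible
dens10, edges10 = np.histogram(ph, bins=10, density=True)
area = np.sum(dens10 * np.diff(edges10))
print(area)
```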

Unable to make 3d Plots with legend for dataframes

I am trying to make a 3d plot from a Pandas.DataFrame object.
Requirements
The number of columns to be plotted for z may vary, so I am using a loop over the z values with fixed x and y values. The code is shown in Code 1.
Code 1
import matplotlib.pyplot as plt
import urllib, base64
from mpl_toolkits.mplot3d import axes3d
import numpy as np
import pandas as pd
column_names = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(columns=column_names)
fig2 = plt.figure(figsize=(15,15))
ax2 = fig2.add_subplot(111, projection='3d')
for x in df.columns:
    if x != 'A' and x != 'B':
        ax2.plot_surface(df['A'].values, df['B'].values, df[x].values, linewidth=0, antialiased=False)
ax2.legend()
Problem:
When I execute Code 1, I get an error -
Argument Z must be 2-dimensional.
I solved this by using plot_trisurf instead, as shown in Code 2.
Code 2
for x in df.columns:
    if x != 'A' and x != 'B':
        ax2.plot_trisurf(df['A'].values, df['B'].values, df[x].values, linewidth=0, antialiased=False)
ax2.legend()
But now I am getting a different error -
Error in qhull Delaunay triangulation calculation: singular input data (exitcode=2); use python verbose option (-v) to see original qhull error.
Question
How can I make 3d plots, with a legend, for a pandas.DataFrame with a varying number of columns for Z?
Note
The data provided above is just for experimentation and may not be uniform and can have decimals.
The qhull error you're getting is probably because your 'A' and 'B' columns contain data that is linearly dependent, i.e., the points are geometrically on a line (you did not post the data so I cannot verify this). The plot_trisurf function tries to construct a Delaunay triangulation from the X, Y parameters you pass it (the df['A'].values, df['B'].values in your code).
A singular/degenerate configuration invokes the error from Qhull, which is the underlying library that is used to construct the Delaunay triangulation (see also my answer here).
If your data is singular/degenerate you can use scatter plots or line plots instead.
If you insist on a surface plot, and your data is singular, you might try to "joggle" the X, Y data so the underlying triangulation will not fail.
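The failure and the "joggle" workaround can be reproduced with matplotlib.tri.Triangulation, which plot_trisurf builds internally when it is given x and y arrays. This is a sketch on made-up collinear points; the jitter size is an arbitrary assumption and should be small relative to the scale of your data.

```python
import numpy as np
from matplotlib.tri import Triangulation

# Made-up points that all lie on the line y = x
x = np.arange(10, dtype=float)
y = x.copy()

try:
    Triangulation(x, y)
    collinear_ok = True
except RuntimeError as err:
    # Qhull cannot triangulate points that all lie on one line
    collinear_ok = False
    print("triangulation failed:", err)

# Add a small random offset so the points no longer lie exactly on a line
rng = np.random.default_rng(0)
y_jittered = y + rng.uniform(-0.01, 0.01, size=y.size)
tri = Triangulation(x, y_jittered)
print(tri.triangles.shape)
```

The jittered arrays could then be passed to ax2.plot_trisurf(x, y_jittered, z) in place of the original ones.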

Interpolating using a cubic function gives a negative value for probability

I have a set of data which correspond to ages (in steps of 0.1) along the x axis, and probabilities along the y axis. I'm trying to interpolate the data so I can find the maximum and a range of ages which covers 95% of the probability.
I've tried a simple interpolation using the code below, taken from the SciPy help pages, and it produces good results (I change the x and y variables to read my data), except for one feature.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
x = np.linspace(72, 100, num=29, endpoint=True)
y = df.iloc[:,0].values
f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')
# evaluate inside the data range; interp1d raises an error outside it by default
xnew = np.linspace(72, 100, num=281, endpoint=True)
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
The problem is, the cubic function works best, with the smoothest fit. However, it gives negative values for some parts of the probability curve, which is obviously not acceptable. Is there some way of setting a floor at y=0? I thought maybe switching to a quadratic kind would fix it, but it doesn't seem to. The linear fit does, but it's not smoothed, so is not a very good match.
I'm also not sure how to perform the second part of what I'm trying to do. It's probably very simple, but I don't know how to find the mean when I don't have a frequency table, but a grid of interpolated points which form a function. If I knew the function, I could integrate it, but I'm not sure how to do that in Python.
EDIT to include some data:
This is what my y data looks like:
array([3.41528917e-08, 7.81041275e-05, 9.60711716e-04, 5.75868934e-05,
6.50260297e-05, 2.95556411e-05, 2.37331370e-05, 9.11990619e-05,
1.08003254e-04, 4.16800419e-05, 6.63673113e-05, 2.57934035e-04,
3.42235937e-03, 5.07534495e-03, 1.76603165e-02, 1.69535370e-01,
2.67624254e-01, 4.29420872e-01, 8.25165926e-02, 2.08367339e-02,
2.01227453e-03, 1.15405995e-04, 5.40163098e-07, 1.66905537e-10,
8.31862858e-18, 4.14093219e-23, 8.32103362e-29, 5.65637769e-34,
7.93547444e-40])
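One possible approach, sketched under assumptions: a shape-preserving interpolant such as scipy.interpolate.PchipInterpolator does not overshoot between data points, so it should stay non-negative for non-negative data, at the cost of being less smooth than a cubic spline. The interval computed below is an equal-tailed 95% interval read off a numerically integrated CDF, which is only one of several reasonable definitions. The x grid (72 to 100 in steps of 1) is taken from the code above.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

x = np.linspace(72, 100, num=29)
y = np.array([
    3.41528917e-08, 7.81041275e-05, 9.60711716e-04, 5.75868934e-05,
    6.50260297e-05, 2.95556411e-05, 2.37331370e-05, 9.11990619e-05,
    1.08003254e-04, 4.16800419e-05, 6.63673113e-05, 2.57934035e-04,
    3.42235937e-03, 5.07534495e-03, 1.76603165e-02, 1.69535370e-01,
    2.67624254e-01, 4.29420872e-01, 8.25165926e-02, 2.08367339e-02,
    2.01227453e-03, 1.15405995e-04, 5.40163098e-07, 1.66905537e-10,
    8.31862858e-18, 4.14093219e-23, 8.32103362e-29, 5.65637769e-34,
    7.93547444e-40,
])

# Shape-preserving interpolation on a fine grid (step 0.01)
pchip = PchipInterpolator(x, y)
grid = np.linspace(72, 100, num=2801)
dens = np.clip(pchip(grid), 0.0, None)  # guard against rounding below zero

# Age at which the interpolated curve peaks
mode = grid[np.argmax(dens)]

# Cumulative trapezoidal integral, normalised so that it ends at 1
steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)
cdf = np.concatenate([[0.0], np.cumsum(steps)])
cdf /= cdf[-1]

# Ages that cut off 2.5% of the probability in each tail
lo, hi = np.interp([0.025, 0.975], cdf, grid)
print(mode, lo, hi)
```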

Matplotlib: personalize imshow axis

I have the results of a (H,ranges) = numpy.histogram2d() computation and I'm trying to plot it.
Given H I can easily put it into plt.imshow(H) to get the corresponding image. (see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow )
My problem is that the axes of the produced image show the cell indices of H and are completely unrelated to the values of ranges.
I know I can use the keyword extent (as pointed out in: Change values on matplotlib imshow() graph axis). But this solution does not work for me: the values in ranges do not grow linearly (they actually grow exponentially).
My question is: how can I use the values of ranges in plt.imshow()? Or, at least, can I manually set the tick label values of the object that plt.imshow returns?
Editing the extent is not a good solution.
You can just change the tick labels to something more appropriate for your data.
For example, here we'll set every 5th pixel to an exponential function:
import numpy as np
import matplotlib.pyplot as plt
im = np.random.rand(21,21)
fig,(ax1,ax2) = plt.subplots(1,2)
ax1.imshow(im)
ax2.imshow(im)
# Where we want the ticks, in pixel locations
ticks = np.linspace(0,20,5)
# What those pixel locations correspond to in data coordinates.
# Also set the float format here
ticklabels = ["{:6.2f}".format(i) for i in np.exp(ticks/5)]
ax2.set_xticks(ticks)
ax2.set_xticklabels(ticklabels)
ax2.set_yticks(ticks)
ax2.set_yticklabels(ticklabels)
plt.show()
Expanding a bit on @thomas's answer:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mi
im = np.random.rand(20, 20)
ticks = np.exp(np.linspace(0, 10, 20))
fig, ax = plt.subplots()
ax.pcolor(ticks, ticks, im, cmap='viridis')
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlim([1, np.exp(10)])
ax.set_ylim([1, np.exp(10)])
By letting mpl take care of the non-linear mapping you can now accurately over-plot other artists. There is a performance hit for this (as pcolor is more expensive to draw than AxesImage), but getting accurate ticks is worth it.
imshow is for displaying images, so it does not support x and y bins.
You could either use pcolor instead,
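H, xedges, yedges = np.histogram2d(x, y)  # x, y: placeholders for your sample arrays
plt.pcolor(xedges, yedges, H.T)  # histogram2d puts x along the first axis, so transpose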
or use plt.hist2d which directly plots your histogram.
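Here is a sketch of the pcolor-style approach with non-linear bins, using made-up log-normal samples and hypothetical log-spaced bin edges. np.histogram2d returns H with x along the first axis, so it is transposed before plotting; pcolormesh is used here as a faster relative of pcolor.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up samples spanning several orders of magnitude
rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=1.0, size=5000)
y = rng.lognormal(mean=2.0, sigma=1.0, size=5000)

# Hypothetical bin edges: 20 log-spaced bins from 0.1 to 1000
edges = np.logspace(-1, 3, num=21)
H, xedges, yedges = np.histogram2d(x, y, bins=[edges, edges])

fig, ax = plt.subplots()
# Passing the edges lets matplotlib place each cell at its true coordinates
mesh = ax.pcolormesh(xedges, yedges, H.T)
ax.set_xscale("log")
ax.set_yscale("log")
fig.colorbar(mesh, ax=ax, label="count")
# Call plt.show() to view the result
```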
