Plot several boxplots in one figure - python-3.x

I am using python-3.x and I would like to plot several boxplots in one figure, all the data from one numpy array where the shape of this array is (100, 301)
If I use the code below it will plot them all (I will have 301 boxplots in one figure which is too much)
fig, ax = plt.subplots()
ax.boxplot(my_data)
plt.show()
I don't want to plot all the data, I just want to plot 10, 15 or 20 (variable number) of the data by using for loop or any method that work best.
for example, I want to plot boxplots every 50 number of data that mean I will have around 6 boxplots from 301 in my figure, I tried to use for loop but no luck
Any advice would be much appreciated

You can just use indexing to plot every 50th data points using a variable step. To have separate box plots and avoid overlapping, you can specify the positions of individual box plot using the positions parameter. my_data[:, ::step] gives you the desired data to plot. Below is an example using some random data.
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
my_data = np.random.randint(0, 20, (100, 301))
step = 50
posit = range(my_data[:, ::step].shape[1])
ax.boxplot(my_data[:, ::step], positions=posit)
plt.show()

Related

How to plot a histogram with plot.hist for continous data in a dataframe in pandas?

In this data set I need to plot,pH as the x-column which is having continuous data and need to group it together the pH axis as per the quality value and plot the histogram. In many of the resources I referred I found solutions for using random data generated. I tried this piece of code.
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
Where I eliminated the random generated data, but I received and error
File "<ipython-input-16-9afc718b5558>", line 1
plt.hist(, density=True, bins=1)
^
SyntaxError: invalid syntax
What is the proper way to plot my data?I want to feed into the histogram not randomly generated data, but data found in the data set.
Your Error
The immediate problem in your code is the missing data to the plt.hist() command.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't get the plot broken down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn which works with "melted" data like you have. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
figsize=[6,12],
sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
ax.hist(qual_data['pH'], bins=10)
ax.set_title(f"quality = {quality}")
ax.set_xlabel('pH')
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
alt.X("pH:Q", bin=True),
y='count()',
).facet(row='quality')
Several possibilities here to represent multiple histograms. All have in common that the data have to be transformed from long to wide format - meaning, each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.

Legend overwritten by plot - matplotlib

I have a plot that looks as follows:
I want to put labels for both the lineplot and the markers in red. However the legend is not appearning because its the plot is taking out its space.
Update
it turns out I cannot put several strings in plt.legend()
I made the figure bigger by using the following:
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
However now I have only one label in the legend, with the marker appearing on the lineplot while I rather want two: one for the marker alone and another for the line alone:
Updated code:
plt.plot(range(len(y)), y, '-bD', c='blue', markerfacecolor='red', markeredgecolor='k', markevery=rare_cases, label='%s' % target_var_name)
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
# changed this over here
plt.legend()
plt.savefig(output_folder + fig_name)
plt.close()
What you want to do (have two labels for a single object) is not completely impossible but it's MUCH easier to plot separately the line and the rare values, e.g.
# boilerplate
import numpy as np
import matplotlib.pyplot as plt
# synthesize some data
N = 501
t = np.linspace(0, 10, N)
s = np.sin(np.pi*t)
rare = np.zeros(N, dtype=bool); rare[:20]=True; np.random.shuffle(rare)
plt.plot(t, s, label='Curve')
plt.scatter(t[rare], s[rare], label='rare')
plt.legend()
plt.show()
Update
[...] it turns out I cannot put several strings in plt.legend()
Well, you can, as long as ① the several strings are in an iterable (a tuple or a list) and ② the number of strings (i.e., labels) equals the number of artists (i.e., thingies) in the plot.
plt.legend(('a', 'b', 'c'))

Using "hue" for a Seaborn visual: how to get legend in one graph?

I created a scatter plot in seaborn using seaborn.relplot, but am having trouble putting the legend all in one graph.
When I do this simple way, everything works fine:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
df2 = df[df.ln_amt_000s < 700]
sns.relplot(x='ln_amt_000s', y='hud_med_fm_inc', hue='outcome', size='outcome', legend='brief', ax=ax, data=df2)
The result is a scatter plot as desired, with the legend on the right hand side.
However, when I try to generate a matplotlib figure and axes objects ahead of time to specify the figure dimensions I run into problems:
a4_dims = (10, 10) # generating a matplotlib figure and axes objects ahead of time to specify figure dimensions
df2 = df[df.ln_amt_000s < 700]
fig, ax = plt.subplots(figsize = a4_dims)
sns.relplot(x='ln_amt_000s', y='hud_med_fm_inc', hue='outcome', size='outcome', legend='brief', ax=ax, data=df2)
The result is two graphs -- one that has the scatter plots as expected but missing the legend, and another one below it that is all blank except for the legend on the right hand side.
How do I fix this such? My desired result is one graph where I can specify the figure dimensions and have the legend at the bottom in two rows, below the x-axis (if that is too difficult, or not supported, then the default legend position to the right on the same graph would work too)? I know the problem lies with "ax=ax", and in the way I am specifying the dimensions as matplotlib figure, but I'd like to know specifically why this causes a problem so I can learn from this.
Thank you for your time.
The issue is that sns.relplot is a "Figure-level interface for drawing relational plots onto a FacetGrid" (see the API page). With a simple sns.scatterplot (the default type of plot used by sns.relplot), your code works (changed to use reproducible data):
df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv", index_col=0)
fig, ax = plt.subplots(figsize = (5,5))
sns.scatterplot(x = 'Sepal.Length', y = 'Sepal.Width',
hue = 'Species', legend = 'brief',
ax=ax, data = df)
plt.show()
Further edits to legend
Seaborn's legends are a bit finicky. Some tweaks you may want to employ:
Remove the default seaborn title, which is actually a legend entry, by getting and slicing the handles and labels
Set a new title that is actually a title
Move the location and make use of bbox_to_anchor to move outside the plot area (note that the bbox parameters need some tweaking depending on your plot size)
Specify the number of columns
fig, ax = plt.subplots(figsize = (5,5))
sns.scatterplot(x = 'Sepal.Length', y = 'Sepal.Width',
hue = 'Species', legend = 'brief',
ax=ax, data = df)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[1:], labels=labels[1:], loc=8,
ncol=2, bbox_to_anchor=[0.5,-.3,0,0])
plt.show()

Append matplotlib histogram plots to a list, without overwriting

I am trying to write a function, such that it creates multiple histograms(using matplotlib.pyplot) using a for loop on some data. I append these plots to a list, and the list is returned.
But when I try to use show() on each of the plots, they are all the same plot, and all the plots are being overwritten.
I have tried using plt.clf() at the end of the for loop, but it does not work for me.
for data in data_list:
n, bins, patches = plt.hist(x=data, bins='auto', color=color,
alpha=0.7, rwidth=0.85)
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.55)
plots.append(plt)
First, maybe we should question the need: why do you want to keep all plots at hand? Why not save them after plotting plt.savefig('xxxxx')? Are you planning to animate all of the figures?
Second, if we want to keep all figures, subplots might be a handy tool. Note that you are still plotting in one plot, but calling subplots creates multiple axes and allow them to not interfere with each other. Try this:
import matplotlib.pyplot as plt
np.random.seed(0)
data_list = [np.random.randn(10) for _ in range(4)]
plots=[]
fig, axes = plt.subplots(4,1)
for i,data in enumerate(data_list):
n, bins, patches = axes[i].hist(x=data, bins='auto',
alpha=0.7, rwidth=0.85)
axes[i].set_xlabel('Values')
axes[i].set_ylabel('Frequency')
axes[i].grid(axis='y', alpha=0.55)
For more info on subplots
For more info on multiple figures

Matplotlib: personalize imshow axis

I have the results of a (H,ranges) = numpy.histogram2d() computation and I'm trying to plot it.
Given H I can easily put it into plt.imshow(H) to get the corresponding image. (see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow )
My problem is that the axis of the produced image are the "cell counting" of H and are completely unrelated to the values of ranges.
I know I can use the keyword extent (as pointed in: Change values on matplotlib imshow() graph axis ). But this solution does not work for me: my values on range are not growing linearly (actually they are going exponentially)
My question is: How can I put the value of range in plt.imshow()? Or at least, or can I manually set the label values of the plt.imshow resulting object?
Editing the extent is not a good solution.
You can just change the tick labels to something more appropriate for your data.
For example, here we'll set every 5th pixel to an exponential function:
import numpy as np
import matplotlib.pyplot as plt
im = np.random.rand(21,21)
fig,(ax1,ax2) = plt.subplots(1,2)
ax1.imshow(im)
ax2.imshow(im)
# Where we want the ticks, in pixel locations
ticks = np.linspace(0,20,5)
# What those pixel locations correspond to in data coordinates.
# Also set the float format here
ticklabels = ["{:6.2f}".format(i) for i in np.exp(ticks/5)]
ax2.set_xticks(ticks)
ax2.set_xticklabels(ticklabels)
ax2.set_yticks(ticks)
ax2.set_yticklabels(ticklabels)
plt.show()
Expanding a bit on #thomas answer
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mi
im = np.random.rand(20, 20)
ticks = np.exp(np.linspace(0, 10, 20))
fig, ax = plt.subplots()
ax.pcolor(ticks, ticks, im, cmap='viridis')
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlim([1, np.exp(10)])
ax.set_ylim([1, np.exp(10)])
By letting mpl take care of the non-linear mapping you can now accurately over-plot other artists. There is a performance hit for this (as pcolor is more expensive to draw than AxesImage), but getting accurate ticks is worth it.
imshow is for displaying images, so it does not support x and y bins.
You could either use pcolor instead,
H,xedges,yedges = np.histogram2d()
plt.pcolor(xedges,yedges,H)
or use plt.hist2d which directly plots your histogram.

Resources