log x axis on matplotlib histogram with imshow() - python-3.x

I have some x data following a log-normal distribution, and I would like to plot a histogram of it. I have this data for different parameters, so I would like those parameters on my y-axis and the histogram bins on my x-axis (in log scale with reasonable ticks such as 10^0 to 10^3), exactly like ax.hist(), only using imshow() to get a more attractive and compact result.
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 3., 2.
img = []
for i in range(20):
    dat = np.random.lognormal(mu, sigma + 1/10, 10000)
    hist = np.histogram(dat, bins = 10**(np.arange(0, 3, step = 0.1)))[0]
    img.append(hist)
plt.imshow(img)
plt.show()
The result looks like this, but I would like the x-axis to be logarithmic and to match the bins.
I also have data for y, but that is not so much of a problem.
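One possible approach, offered as a sketch rather than a definitive answer: imshow() draws equally sized pixels, so it cannot follow logarithmic bins directly, whereas pcolormesh() accepts the bin edges themselves and works with ax.set_xscale("log"). The seeded random generator, the use of np.logspace for the edges (which extends the bins up to 10^3), and the row index standing in for the real parameter values are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0

# Logarithmically spaced bin edges from 10^0 to 10^3 (31 edges -> 30 bins).
edges = np.logspace(0, 3, 31)

# One histogram row per parameter value; the row index merely stands in
# for the real parameter (an assumption).
params = np.arange(20)
img = np.array([
    np.histogram(rng.lognormal(mu, sigma, 10000), bins=edges)[0]
    for _ in params
])

# pcolormesh wants cell edges on both axes: 21 y-edges for 20 rows.
y_edges = np.arange(len(params) + 1) - 0.5

fig, ax = plt.subplots()
mesh = ax.pcolormesh(edges, y_edges, img)
ax.set_xscale("log")  # ticks now fall on powers of ten
ax.set_xlabel("x")
ax.set_ylabel("parameter index")
fig.colorbar(mesh, ax=ax, label="counts")
```

Call plt.show() or fig.savefig(...) afterwards to view the result. If real y values are available, their cell edges can replace y_edges.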

Related

Time series anomaly detection plot

I have a pandas dataframe of size (1280,2). The head of the data looks as follows:
I'm using a clustering-based anomaly detection method built on k-means. It creates 'k' clusters of similar data points. Data points that fall outside of these groups are marked as anomalies.
def getDistanceByPoint(data, model):
    # distance between each point and the centroid of its assigned cluster
    distance = pd.Series(dtype=float)
    for i in range(0, len(data)):
        Xa = np.array(data.loc[i])
        # labels_ already holds 0-based centroid indices, so no "- 1" here
        Xb = model.cluster_centers_[model.labels_[i]]
        # Series.set_value() was removed from pandas; .loc does the same job
        distance.loc[i] = np.linalg.norm(Xa - Xb)
    return distance

kmeans = KMeans(n_clusters=9).fit(data)
outliers_fraction = 0.01
distance = getDistanceByPoint(data, kmeans)
number_of_outliers = int(outliers_fraction*len(distance))
threshold = distance.nlargest(number_of_outliers).min()
# (0:normal, 1:anomaly)
df['anomaly1'] = (distance >= threshold).astype(int)
I want to plot data frame with the x-axis as time elapsed and the y-axis as value. I would like to plot the normal data values in blue and the anomaly values in red. How could I plot this?
This is what you need. Remember to change time and value to your own column names.
fig, ax = plt.subplots()
# rows flagged as anomalies in the question's 'anomaly1' column
a = df.loc[df['anomaly1'] == 1, ['time', 'value']]
ax.plot(df['time'], df['value'], color='blue')
ax.scatter(a['time'], a['value'], color='red')
plt.show()
Check this notebook out for more information.
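As a rough sketch of the distance-and-threshold step described in the question, the per-point loop can be replaced by one vectorised NumPy expression. The helper names distance_to_assigned_center and flag_anomalies and the tiny data set are hypothetical and only serve as illustration; with scikit-learn one would presumably pass model.cluster_centers_ and model.labels_ as centers and labels.

```python
import numpy as np

def distance_to_assigned_center(points, centers, labels):
    """Euclidean distance from each point to the centroid it is assigned to."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Fancy indexing picks one centroid row per point; the labels are
    # already 0-based indices into `centers`.
    return np.linalg.norm(points - centers[labels], axis=1)

def flag_anomalies(distance, outliers_fraction=0.01):
    """Return 1 for the largest `outliers_fraction` of distances, else 0."""
    n_outliers = max(1, int(outliers_fraction * len(distance)))
    threshold = np.sort(distance)[-n_outliers]
    return (distance >= threshold).astype(int)

# Tiny hand-made example (hypothetical data): two clusters and one stray point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [9.0, 0.0]])
centers = np.array([[0.05, 0.0], [5.0, 5.05]])
labels = np.array([0, 0, 1, 1, 1])

distance = distance_to_assigned_center(points, centers, labels)
anomaly = flag_anomalies(distance, outliers_fraction=0.2)
```

The resulting 0/1 array can be stored in a data frame column and used to select the rows that are drawn in red.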

Matplotlib how to plot 1 colorbar for four 2d histogram

Before I start, I want to say that I've tried following this and this post on the same problem; however, they use imshow heatmaps rather than 2d histograms as I do.
Here is my code (the actual data has been replaced by randomly generated data, but the gist is the same):
import matplotlib.pyplot as plt
import numpy as np

def subplots_hist_2d(x_data, y_data, x_labels, y_labels, titles):
    fig, a = plt.subplots(2, 2)
    a = a.ravel()
    for idx, ax in enumerate(a):
        image = ax.hist2d(x_data[idx], y_data[idx], bins=50, range=[[-2, 2],[-2, 2]])
        ax.set_title(titles[idx], fontsize=12)
        ax.set_xlabel(x_labels[idx])
        ax.set_ylabel(y_labels[idx])
        ax.set_aspect("equal")
    cb = fig.colorbar(image[idx])
    cb.set_label("Intensity", rotation=270)
    # pad = how big overall pic is
    # w_pad = how separate they're left to right
    # h_pad = how separate they're top to bottom
    plt.tight_layout(pad=-1, w_pad=-10, h_pad=0.5)

x1, y1 = np.random.uniform(-2, 2, 10000), np.random.uniform(-2, 2, 10000)
x2, y2 = np.random.uniform(-2, 2, 10000), np.random.uniform(-2, 2, 10000)
x3, y3 = np.random.uniform(-2, 2, 10000), np.random.uniform(-2, 2, 10000)
x4, y4 = np.random.uniform(-2, 2, 10000), np.random.uniform(-2, 2, 10000)
x_data = [x1, x2, x3, x4]
y_data = [y1, y2, y3, y4]
x_labels = ["x1", "x2", "x3", "x4"]
y_labels = ["y1", "y2", "y3", "y4"]
titles = ["1", "2", "3", "4"]
subplots_hist_2d(x_data, y_data, x_labels, y_labels, titles)
And this is what it's generating:
So now my problem is that I could not for the life of me make the colorbar apply for all 4 of the histograms. Also for some reason the bottom right histogram seems to behave weirdly compared with the others. In the links that I've posted their methods don't seem to use a = a.ravel() and I'm only using it here because it's the only way that allows me to plot my 4 histograms as subplots. Help?
EDIT:
Thomas Kuhn your new method actually solved all of my problem until I put my labels down and tried to use plt.tight_layout() to sort out the overlaps. It seems that if I put down the specific parameters in plt.tight_layout(pad=i, w_pad=0, h_pad=0) then the colorbar starts to misbehave. I'll now explain my problem.
I have made some changes to your new method so that it suits what I want, like this
def test_hist_2d(x_data, y_data, x_labels, y_labels, titles):
    nrows, ncols = 2, 2
    fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=True)
    ##produce the actual data and compute the histograms
    mappables=[]
    for (i, j), ax in np.ndenumerate(axes):
        H, xedges, yedges = np.histogram2d(x_data[i][j], y_data[i][j], bins=50, range=[[-2, 2],[-2, 2]])
        ax.set_title(titles[i][j], fontsize=12)
        ax.set_xlabel(x_labels[i][j])
        ax.set_ylabel(y_labels[i][j])
        ax.set_aspect("equal")
        mappables.append(H)
    ##the min and max values of all histograms
    vmin = np.min(mappables)
    vmax = np.max(mappables)
    ##second loop for visualisation
    for ax, H in zip(axes.ravel(), mappables):
        im = ax.imshow(H,vmin=vmin, vmax=vmax, extent=[-2,2,-2,2])
    ##colorbar using solution from linked question
    fig.colorbar(im,ax=axes.ravel())
    plt.show()
    # plt.tight_layout
    # plt.tight_layout(pad=i, w_pad=0, h_pad=0)
Now if I try to generate my data, in this case:
phi, cos_theta = get_angles(runs)
detector_x1, detector_y1, smeared_x1, smeared_y1 = detection_vectorised(1.5, cos_theta, phi)
detector_x2, detector_y2, smeared_x2, smeared_y2 = detection_vectorised(1, cos_theta, phi)
detector_x3, detector_y3, smeared_x3, smeared_y3 = detection_vectorised(0.5, cos_theta, phi)
detector_x4, detector_y4, smeared_x4, smeared_y4 = detection_vectorised(0, cos_theta, phi)
Here detector_x, detector_y, smeared_x, smeared_y are all lists of data points.
So now I put them into 2x2 lists so that they can be unpacked suitably by my plotting function, as such:
data_x = [[detector_x1, detector_x2], [detector_x3, detector_x4]]
data_y = [[detector_y1, detector_y2], [detector_y3, detector_y4]]
x_labels = [["x positions(m)", "x positions(m)"], ["x positions(m)", "x positions(m)"]]
y_labels = [["y positions(m)", "y positions(m)"], ["y positions(m)", "y positions(m)"]]
titles = [["0.5m from detector", "1.0m from detector"], ["1.5m from detector", "2.0m from detector"]]
I now run my code with
test_hist_2d(data_x, data_y, x_labels, y_labels, titles)
With just plt.show() turned on, it gives this:
which is great because, in terms of both data and visuals, it is exactly what I want, i.e. the colormap corresponds to all 4 histograms. However, since the labels overlap with the titles, I thought I would run the same thing with plt.tight_layout(pad=a, w_pad=b, h_pad=c), hoping to fix the overlapping labels. This time, however, no matter how I change the numbers a, b and c, the colorbar always ends up lying on the second column of graphs, like this:
Changing a only makes the overall subplots bigger or smaller, and the best I could do was to adjust it with plt.tight_layout(pad=-10, w_pad=-15, h_pad=0), which looks like this
So it seems that whatever your new method is doing, it made the whole plot lose its adjustability. Your solution, as wonderful as it is at solving one problem, created another in return. So what would be the best thing to do here?
Edit 2:
Using fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=True, constrained_layout=True) along with plt.show() gives
As you can see, there's still a vertical gap between the columns of subplots that not even plt.subplots_adjust() can get rid of.
Edit:
As has been noted in the comments, the biggest problem here is actually making the colorbar meaningful for many histograms, as ax.hist2d will always scale the histogram data it receives from numpy. It may therefore be best to first calculate the 2d histogram data using numpy and then use imshow again to visualise it. This way, the solutions of the linked question can also be applied. To make the normalisation problem more visible, I put some effort into producing some qualitatively different 2d histograms using scipy.stats.multivariate_normal, which shows how the height of the histogram can change quite dramatically even though the number of samples is the same in each figure.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec as gs
from scipy.stats import multivariate_normal

##opening figure and axes
nrows=3
ncols=3
fig, axes = plt.subplots(nrows,ncols)

##generate some random data for the distributions
means = np.random.rand(nrows,ncols,2)
sigmas = np.random.rand(nrows,ncols,2)
thetas = np.random.rand(nrows,ncols)*np.pi*2

##produce the actual data and compute the histograms
mappables=[]
for mean,sigma,theta in zip( means.reshape(-1,2), sigmas.reshape(-1,2), thetas.reshape(-1)):
    ##the data (only cosmetics):
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array(((c,-s), (s, c)))
    ##rotated covariance matrix (matrix products via @)
    cov = rot@np.diag(sigma)@rot.T
    rv = multivariate_normal(mean,cov)
    data = rv.rvs(size = 10000)
    ##the 2d histogram from numpy
    H,xedges,yedges = np.histogram2d(data[:,0], data[:,1], bins=50, range=[[-2, 2],[-2, 2]])
    mappables.append(H)

##the min and max values of all histograms
vmin = np.min(mappables)
vmax = np.max(mappables)

##second loop for visualisation
for ax,H in zip(axes.ravel(),mappables):
    im = ax.imshow(H,vmin=vmin, vmax=vmax, extent=[-2,2,-2,2])

##colorbar using solution from linked question
fig.colorbar(im,ax=axes.ravel())
plt.show()
This code produces a figure like this:
Old Answer:
One way to solve your problem is to generate the space for your colorbar explicitly. You can use a GridSpec instance to define how wide your colorbar should be. Below is your subplots_hist_2d() function with a few modifications. Note that your use of tight_layout() shifted the colorbar into a funny place, hence the replacement. If you want the plots closer to each other, I'd rather recommend playing with the aspect ratio of the figure.
def subplots_hist_2d(x_data, y_data, x_labels, y_labels, titles):
    ## fig, a = plt.subplots(2, 2)
    fig = plt.figure()
    g = gs.GridSpec(nrows=2, ncols=3, width_ratios=[1,1,0.05])
    a = [fig.add_subplot(g[n,m]) for n in range(2) for m in range(2)]
    cax = fig.add_subplot(g[:,2])
    ## a = a.ravel()
    for idx, ax in enumerate(a):
        image = ax.hist2d(x_data[idx], y_data[idx], bins=50, range=[[-2, 2],[-2, 2]])
        ax.set_title(titles[idx], fontsize=12)
        ax.set_xlabel(x_labels[idx])
        ax.set_ylabel(y_labels[idx])
        ax.set_aspect("equal")
    ## cb = fig.colorbar(image[-1],ax=a)
    cb = fig.colorbar(image[-1], cax=cax)
    cb.set_label("Intensity", rotation=270)
    # pad = how big overall pic is
    # w_pad = how separate they're left to right
    # h_pad = how separate they're top to bottom
    ## plt.tight_layout(pad=-1, w_pad=-10, h_pad=0.5)
    fig.tight_layout()
Using this modified function, I get the following output:
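For the layout trouble described in the question's edits, a variant worth trying is to combine the shared vmin/vmax idea with constrained_layout=True instead of tight_layout(). The following is only a sketch under assumptions: the Gaussian test data, the widths list, and the choice of pcolormesh (with H transposed, since np.histogram2d puts x along the first axis) are illustrative and not taken from either post.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
nrows, ncols = 2, 2
bins, lim = 50, [[-2, 2], [-2, 2]]

# Hypothetical data: four Gaussian clouds of different widths.
widths = [0.3, 0.6, 0.9, 1.2]
hists = []
for w in widths:
    x, y = rng.normal(0, w, 10000), rng.normal(0, w, 10000)
    H, xedges, yedges = np.histogram2d(x, y, bins=bins, range=lim)
    hists.append(H)

# One colour scale for every panel.
vmin = min(H.min() for H in hists)
vmax = max(H.max() for H in hists)

fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=True,
                         constrained_layout=True)
meshes = []
for ax, H, w in zip(axes.ravel(), hists, widths):
    # histogram2d puts x along the first axis, so transpose for plotting.
    mesh = ax.pcolormesh(xedges, yedges, H.T, vmin=vmin, vmax=vmax)
    ax.set_title(f"width {w}", fontsize=12)
    meshes.append(mesh)

# A single colorbar that takes its space from all four axes.
cb = fig.colorbar(meshes[-1], ax=axes.ravel().tolist())
cb.set_label("Counts", rotation=270, labelpad=12)
```

Because every panel uses the same vmin and vmax, the colorbar built from the last mesh describes all four. Constrained layout and tight_layout() should not be mixed on the same figure.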

Matplotlib compute values when plotting - python3

I want to plot only positive values when plotting a graph (like the RELU function in ML)
This may well be a dumb question. I hope not.
In the code below I iterate and change the underlying list data. I really want to only change the values when it's plot time and not change the source list data. Is that possible?
#create two lists in range -10 to 10
x = list(range(-10, 11))
y = list(range(-10, 11))
#this function changes the underlying data to remove negative values
#I really want to do this at plot time
#I don't want to change the source list. Can it be done?
for idx, val in enumerate(y):
    y[idx] = max(0, val)
#a bunch of formatting to make the plot look nice
plt.figure(figsize=(6, 6))
plt.axhline(y=0, color='silver')
plt.axvline(x=0, color='silver')
plt.grid(True)
plt.plot(x, y, 'rx')
plt.show()
I'd suggest using numpy and filtering the data when plotting:
import numpy as np
import matplotlib.pyplot as plt
#create two lists in range -10 to 10
x = list(range(-10, 11))
y = list(range(-10, 11))
x = np.array(x)
y = np.array(y)
#a bunch of formatting to make the plot look nice
plt.figure(figsize=(6, 6))
plt.axhline(y=0, color='silver')
plt.axvline(x=0, color='silver')
plt.grid(True)
# plot only those values where y is positive
plt.plot(x[y>0], y[y>0], 'rx')
plt.show()
This will not plot points with y < 0 at all. If instead, you want to replace any negative value by zero, you can do so as follows
plt.plot(x, np.maximum(0,y), 'rx')
It may look a bit complicated, but you can also filter the data on the fly without numpy:
plt.plot(*zip(*[(x1, y1) for (x1, y1) in zip(x, y) if y1 > 0]), 'rx')
Explanation: it is safer to handle the data as pairs so that (x, y) stay in sync; the outer zip(*...) then converts the pairs back into separate x and y sequences, which are unpacked as the first two arguments of plt.plot.
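To check that neither numpy approach touches the source data, a small sketch along these lines may help; the variable names y_relu and mask are made up for illustration.

```python
import numpy as np

y = list(range(-10, 11))

# np.maximum builds a new array; the source list is left untouched.
y_relu = np.maximum(0, y)

# Boolean mask for the alternative of dropping non-positive points.
mask = np.asarray(y) > 0
```

Either y_relu or the masked values can be handed to plt.plot while y itself keeps its negative entries.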

How to plot cdf on histogram in matplotlib

I currently have a script that will plot a histogram of relative frequency, given a pandas series. The code is:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter

def to_percent3(y, position):
    s = str(100 * y)
    if matplotlib.rcParams['text.usetex'] is True:
        return s + r'$\%$'
    else:
        return s + '%'

df = pd.read_csv('mycsv.csv')
waypointfreq = df['Waypoint Frequency(Secs)']
cumfreq = df['Waypoint Frequency(Secs)']
perctile = np.percentile(waypointfreq, 95) # calculates the 95th percentile
bins = np.arange(0,perctile+1,1) # bin edges in steps of 1 up to the 95th percentile
plt.hist(waypointfreq, bins = bins, density=True) # density replaces the removed normed argument
formatter = FuncFormatter(to_percent3) #changes y axis to percent
plt.gca().yaxis.set_major_formatter(formatter)
plt.axis([0, perctile, 0, 0.03]) # x-axis up to the 95th percentile, y-axis up to 3% relative frequency
plt.xlabel('Waypoint Frequency(Secs)')
plt.xticks(np.arange(0, perctile, 15.0))
plt.title('Relative Frequency of Average Waypoint Frequency')
plt.grid(True)
plt.show()
It produces a plot that looks like this:
What I'd like is to overlay this plot with a line showing the cdf, plotted against a secondary axis. I know that I can create the cumulative graph with the command:
waypointfreq = df['Waypoint Frequency(Secs)']
perctile = np.percentile(waypointfreq, 95) # calculates the 95th percentile
bins = np.arange(0,perctile+5,1) # bin edges in steps of 1 to just past the 95th percentile
plt.hist(waypointfreq, bins = bins, density=True, histtype='stepfilled',cumulative=True)
formatter = FuncFormatter(to_percent3) #changes y axis to percent
plt.gca().yaxis.set_major_formatter(formatter)
plt.axis([0, perctile, 0, 1]) # x-axis up to the 95th percentile, y-axis up to 100% cumulative frequency
plt.xlabel('Waypoint Frequency(Secs)')
plt.xticks(np.arange(0, perctile, 15.0))
plt.title('Cumulative Frequency of Average Waypoint Frequency')
plt.grid(True)
plt.savefig(r'output\4 Cumulative Frequency of Waypoint Frequency.png', bbox_inches='tight')
plt.show()
However, this is plotted on a separate graph, instead of over the previous one. Any help or insight would be appreciated.
Maybe this code snippet helps:
import numpy as np
# cumtrapz has been renamed cumulative_trapezoid in recent SciPy versions
from scipy.integrate import cumulative_trapezoid
from scipy.stats import norm
from matplotlib import pyplot as plt
n = 1000
x = np.linspace(-3,3, n)
data = norm.rvs(size=n)
data = data + abs(min(data))
data = np.sort(data)
cdf = cumulative_trapezoid(y=data, x=x)
cdf = cdf / max(cdf)
fig, ax = plt.subplots(ncols=1)
ax1 = ax.twinx()
# density replaces the removed normed argument
ax.hist(data, density=True, histtype='stepfilled', alpha=0.2)
ax1.plot(data[1:],cdf)
plt.show()
If your CDF is not smooth, you could fit a distribution
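An alternative that avoids numerical integration is the empirical CDF, drawn on a secondary axis created with twinx(). This is a sketch under assumptions: the exponential sample merely stands in for df['Waypoint Frequency(Secs)'], and the empirical CDF is a different technique from the integration used in the snippet above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical stand-in for df['Waypoint Frequency(Secs)'].
waypointfreq = rng.exponential(scale=30.0, size=2000)

perctile = np.percentile(waypointfreq, 95)
bins = np.arange(0, perctile + 1, 1)

# Empirical CDF: sorted values against their cumulative share.
xs = np.sort(waypointfreq)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

fig, ax = plt.subplots()
ax.hist(waypointfreq, bins=bins, density=True, alpha=0.4)
ax.set_xlim(0, perctile)
ax.set_xlabel("Waypoint Frequency(Secs)")
ax.set_ylabel("relative frequency")

ax2 = ax.twinx()  # secondary y-axis sharing the same x-axis
ax2.plot(xs, ecdf, color="red")
ax2.set_ylim(0, 1)
ax2.set_ylabel("cumulative frequency")
```

The histogram keeps its own scale on the left axis while the CDF runs from 0 to 1 on the right. The percent formatter from the question could be applied to either axis.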

Normalizing CDF in Python

I want to calculate and plot the cumulative distribution function (CDF) of a given sample, new_dO18, and then overlay the CDF of a normal distribution with a given mean and standard deviation on the same plot. I am having problems normalizing the CDF. I should have values ranging between 0 and 1 on the x axis. Can someone guide me as to where I went wrong? I'm sure it's a simple fix but I'm very new to Python. I've included my steps so far. Thanks!
# Use np.histogram to get counts in each bin. See the help page or
# documentation on how to use this function, and what it returns.
# normalize the data new_dO18 using a for loop
norm_newdO18 = []
for element in new_dO18:
    x = element
    y = (x - np.mean(new_dO18))/np.std(new_dO18)
    norm_newdO18.append(y)
print ('normalized dO18 values, excluding outliers:', norm_newdO18)
print()
# Use the histogram function to bin the data
num_bins = 20
counts, bin_edges = np.histogram(norm_newdO18, bins=num_bins, normed=0)
# Calculate and plot CDF of sample
cdf = np.cumsum(counts)
scale = 1.0/cdf[-1]
norm_cdf = scale * cdf
plt.plot(bin_edges[1:], norm_cdf, label = 'dO18 values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.xlabel('normalized dO18 data')
plt.ylabel('frequency')
# Calculate and overlay the CDF of a normal distribution with sample mean and std
# as parameters.
# specific normally distributed function with mean and st. dev
mu, sigma = np.mean(new_dO18), np.std(new_dO18)
norm_theoretical = np.random.normal(mu, sigma, 1000)
# Calculate and plot CDF of theoretical sample
counts1, bin_edges1 = np.histogram(norm_theoretical, bins=20, normed=0)
cdft= np.cumsum(counts1)
scale = 1.0/cdft[-1]
norm_cdft = scale * cdf
plt.plot(bin_edges[1:], norm_cdft, label = 'theoretical values')
plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
plt.show()
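A few things in the code above look unintended: the second curve reuses cdf and bin_edges from the first sample instead of cdft and bin_edges1, and the theoretical sample is drawn with the mean and standard deviation of the raw data while the sample curve uses standardised values. Note also that a CDF takes values between 0 and 1 on the y axis; the x axis carries the standardised data. The following is a hedged sketch of one way to compare the two, assuming a made-up stand-in for new_dO18 and using the closed-form standard normal CDF (via math.erf) rather than a second random sample.

```python
import math
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical stand-in for the new_dO18 sample.
new_dO18 = rng.normal(-5.0, 1.5, 200)

# Standardise in one vectorised step instead of a loop.
z = (new_dO18 - np.mean(new_dO18)) / np.std(new_dO18)

# Sample CDF from histogram counts, scaled so the last value is 1.
counts, bin_edges = np.histogram(z, bins=20)
sample_cdf = np.cumsum(counts) / counts.sum()

# Standard normal CDF evaluated at the same right-hand bin edges.
normal_cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
                       for v in bin_edges[1:]])

fig, ax = plt.subplots()
ax.plot(bin_edges[1:], sample_cdf, label="dO18 values")
ax.plot(bin_edges[1:], normal_cdf, label="standard normal")
ax.set_xlabel("normalized dO18 data")
ax.set_ylabel("cumulative frequency")
ax.legend(loc="upper left")
```

After standardising, the sample has mean 0 and standard deviation 1, so the standard normal CDF is the matching reference curve.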
