Distributing plots across a grid of variable axes length in python - python-3.x

I have written a few lines of Python 3 code to assist me in the automated analysis of data generated using a technique called calorimetry (for radiation dosimetry). In the enclosed example, the analysis of the input file returned eight 'heating regions' (top panel), and in each region a pair of linear regressions (black segment, red segment) was made on portions of data to calculate the magnitude of the 'step', relative to the average value of my quantity of interest (the varying resistance of a thermistor), which is plotted in the bottom panel of the same figure.
[Figure: automatic identification of 8 heating regions (top panel) and computed relative step magnitude (bottom panel)]
Results of this type of analysis are summarized in a data frame (a NumPy ndarray at present) but, ideally, I would also like to produce a graphical representation with some annotations in each subplot, including information from the corresponding line in the results dataframe:
[Figure: step analysis via a pair of linear regressions and further computation]
The general output would look something like this last figure, with each subplot including the same essential information from the previous individual plot.
The output is, in this specific case, a (2,4) grid because there were exactly 8 regions to analyse.
This was created by hand, without any iteration, using this portion of code in a Jupyter notebook:
%matplotlib inline
results_fig = pyplt.figure(figsize=(20,10))
results_grid = matplotlib.gridspec.GridSpec(2, 4, hspace=0.2, wspace=0.3)
results_fig.suptitle("Faceted presentation of calorimetric runs", fontsize=15)
ax1 = results_fig.add_subplot(results_grid[0,0])
ax1.scatter(time,resistance, marker ='o', s=20, c='blue')
ax1.plot(time[x1[0]:xmid[0]], line_pre[0], color='black', linewidth=3.0)
ax1.plot(time[xmid[0]:x4[0]], line_post[0], color='red', linewidth=3.0)
ax1.set_xlim(xlim1[0],xlim2[0])
ax1.set_ylabel("resistance [Ohm]")
# [... continues for each subplot in the grid ... ]
Given that the number of 'heating regions' may vary considerably from file to file, i.e. I cannot determine it before analyzing each experimental output datafile, here is my pair of questions:
How can I produce a grid of subplots without prior knowledge of how many subplots it will show? One of the dimensions of the grid could be four, as in the example provided here, but the other is unknown. I could iterate over the length of one of the axes of the numpy results array, but then I would need to span over two axes in my plot grid.
Without re-inventing the wheel, is there a python module that can assist in this direction?
Thanks

Here is how you create a grid of n x 4 subplots and iterate over them:
import numpy as np
import matplotlib.pyplot as plt
numplots = 10  # number of plots to create
m = 4  # number of columns
n = int(np.ceil(numplots / m))  # number of rows
fig, axes = plt.subplots(nrows=n, ncols=m)
fig.subplots_adjust(hspace=0.2, wspace=0.3)
# alldata is assumed to be a sequence of (x, y) pairs, one per plot
for data, ax in zip(alldata, axes.flatten()):
    ax.plot(data[0], data[1], color='black')
    # further plotting, label setting etc.
# optionally, remove empty plots from grid
if n*m > numplots:
    for ax in axes.flatten()[numplots:]:
        ax.remove()
        ## or
        # ax.set_visible(False)
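As a self-contained illustration of the same idea, here is a minimal sketch that wraps the grid creation in a helper. The function name `make_grid`, the `panel_size` parameter and the random stand-in data are assumptions made for this example; they are not part of the original code.

```python
import math

import matplotlib.pyplot as plt
import numpy as np


def make_grid(numplots, ncols=4, panel_size=(4, 3)):
    """Create a grid with `ncols` columns and just enough rows for `numplots` panels."""
    nrows = math.ceil(numplots / ncols)
    fig, axes = plt.subplots(
        nrows=nrows,
        ncols=ncols,
        figsize=(panel_size[0] * ncols, panel_size[1] * nrows),
        squeeze=False,  # always return a 2D array, even for a single row
    )
    flat = axes.flatten()
    for ax in flat[numplots:]:
        ax.remove()  # drop the unused panels in the last row
    return fig, flat[:numplots]


# Hypothetical stand-in data: one (x, y) pair per heating region.
rng = np.random.default_rng(0)
regions = [(np.arange(50), rng.normal(size=50).cumsum()) for _ in range(10)]

fig, used_axes = make_grid(len(regions), ncols=4)
for index, (ax, (x, y)) in enumerate(zip(used_axes, regions)):
    ax.scatter(x, y, s=10, c="blue")
    ax.set_title(f"region {index + 1}")
```

Passing `squeeze=False` keeps `axes` two-dimensional even when only one row is needed, so the same flattening code works for any number of regions. Call `plt.show()` or `fig.savefig(...)` afterwards to display or store the result.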

Related

How to plot hyperparameter tuning results?

I have the result of a grid search as follows.
"trial","learning_rate","batch_size","accuracy","f1","loss"
1,0.000007,70,0.789,0.862,0.467
2,0.000008,100,0.710,0.822,0.563
3,0.000008,90,0.823,0.874,0.524
4,0.000007,90,0.833,0.878,0.492
5,0.000009,110,0.715,0.825,0.509
6,0.000006,90,0.883,0.885,0.932
7,0.000009,80,0.850,0.895,0.408
8,0.000006,110,0.683,0.812,0.593
9,0.000005,90,0.769,0.848,0.468
10,0.000005,80,0.816,0.868,0.462
11,0.000003,100,0.852,0.901,0.448
12,0.000004,100,0.705,0.818,0.512
13,0.000003,110,0.708,0.818,0.567
14,0.000002,90,0.683,0.812,0.552
15,0.000008,100,0.791,0.857,0.438
16,0.000006,110,0.683,0.812,0.604
17,0.000007,70,0.693,0.816,0.592
18,0.000005,110,0.830,0.883,0.892
19,0.000004,90,0.693,0.816,0.591
20,0.000008,70,0.696,0.818,0.570
I want to create a plot more or less similar to this using matplotlib. I know this is plotted using Weights & Biases, but I cannot use that.
Though I don't care for the inference part. I just want the plot. I've been trying to do this using twinx but have not been successful. This is what I have so far.
from csv import DictReader
import matplotlib.pyplot as plt
trials = list(DictReader(open("hparams_trials.csv")))
trials = {f"trial_{trial['trial']}": [int(trial["batch_size"]),
                                      float(trial["f1"]),
                                      float(trial["loss"]),
                                      float(trial["accuracy"]),
                                      float(trial["learning_rate"])] for trial in trials}
items = ["batch_size", "f1", "loss", "accuracy", "learning_rate"]
host_y_values_index = 0
parts_y_values_indexes = [1, 2, 3, 4]
fig, host = plt.subplots(figsize=(8, 5)) # (width, height) in inches
fig.dpi = 300. # Figure resolution
# Removing extra spines
host.spines.top.set_visible(False)
host.spines.bottom.set_visible(False)
host.spines.right.set_visible(False)
# Creating subplots which share the same x axis.
parts = {index: host.twinx() for index in parts_y_values_indexes}
# Setting the limits of the host plot
host.set_xlim(0, len(trials["trial_1"]))
host.set_ylim(min([i[host_y_values_index] for i in trials.values()]),
              max([i[host_y_values_index] for i in trials.values()]))
# Removing the extra spines from the other plots and setting y limits
for part in parts_y_values_indexes:
    parts[part].spines.top.set_visible(False)
    parts[part].spines.bottom.set_visible(False)
    parts[part].set_ylim(min([trial[part] for trial in trials.values()]),
                         max([trial[part] for trial in trials.values()]))
# Colors of the trials
colors = ["gold", "lightcoral", "maroon", "springgreen", "cyan", "steelblue", "darkmagenta", "fuchsia", "crimson",
          "lime", "mediumblue", "cadetblue", "dodgerblue", "olivedrab", "sandybrown", "bisque", "orangered", "black",
          "rosybrown", "chocolate"]
# The plots
plots = []
# Plotting the trials. This is where I'm having problems.
for index, trial in enumerate(trials):
    plots.append(host.plot(items, trials[trial], color=colors[index], label=trial)[0])
# Creating the legend
host.legend(handles=plots, fancybox=True, loc='right', facecolor="snow", bbox_to_anchor=(1.02, 0.495), framealpha=1)
# Defining the positions of the spines.
spines_positions = [-104.85 * i for i in parts_y_values_indexes]
# Repositioning the spines
for part in parts_y_values_indexes:
    parts[part].spines['right'].set_position(('outward', spines_positions[-part]))
# Adjust spacings around fig
fig.tight_layout()
host.grid(True)
# This is better than the one above but it appears on top of the legend.
# plt.grid(True)
plt.draw()
plt.show()
I'm having several problems with that code.
First, I cannot place each value of a single trial on its own spine and then connect the values to one another. Each trial has a batch size, an f1, a loss, an accuracy and a learning rate. Each of those needs to be plotted against its own spine while being connected to the others in that order, so that there is one line per trial. For now I have placed everything in the host plot, but I know that is wrong and have no idea what the correct approach is.
Second, the ticks of the learning rate change. They are shown as a range of 2 to 9, and then a 1e-6 appears at the top. I want to keep the original values.
The third problem is probably part of the second one. The 1e-6 appears at the top right, above the legend, rather than above the spine.
I'm struggling to resolve all three of these problems and would appreciate any help. If what I am doing is totally wrong, please help me find the correct solution. I'm going in circles here and haven't been able to find a working solution so far.
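One possible direction, offered as a sketch rather than a definitive answer: rescale every column onto the value range of the first column, draw one connected line per trial on the host axes, and use the twin axes only to show each column's own tick labels. The example uses the first five trials from the table; the variable names (`names`, `values`, `scaled`) and the legend placement are choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FormatStrFormatter

names = ["batch_size", "f1", "loss", "accuracy", "learning_rate"]
# First five trials from the table above, in the column order of `names`.
values = np.array([
    [70, 0.862, 0.467, 0.789, 0.000007],
    [100, 0.822, 0.563, 0.710, 0.000008],
    [90, 0.874, 0.524, 0.823, 0.000008],
    [90, 0.878, 0.492, 0.833, 0.000007],
    [110, 0.825, 0.509, 0.715, 0.000009],
])

lows = values.min(axis=0)
spans = values.max(axis=0) - lows
# Rescale every column onto the range of the first column, so that a single
# Axes (the host) can draw one connected line per trial.
scaled = (values - lows) / spans * spans[0] + lows[0]

fig, host = plt.subplots(figsize=(8, 5))
axes = [host] + [host.twinx() for _ in names[1:]]
for i, ax in enumerate(axes):
    ax.set_ylim(lows[i], lows[i] + spans[i])  # each spine keeps its own units
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    if ax is not host:
        ax.spines["left"].set_visible(False)
        ax.yaxis.set_ticks_position("right")
        # Place the spine at a fraction of the host width, above its x tick.
        ax.spines["right"].set_position(("axes", i / (len(names) - 1)))
# Fixed-point tick labels avoid the separate "1e-6" offset text.
axes[-1].yaxis.set_major_formatter(FormatStrFormatter("%.6f"))

host.set_xlim(0, len(names) - 1)
host.set_xticks(range(len(names)))
host.set_xticklabels(names)
host.spines["right"].set_visible(False)

lines = [host.plot(range(len(names)), row, label=f"trial_{n + 1}")[0]
         for n, row in enumerate(scaled)]
host.legend(handles=lines, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
fig.tight_layout()
```

Because every line lives on the host axes, the values of one trial are connected across all spines, while each twin axis only supplies tick labels in the original units. The sketch assumes no column is constant, since a zero span would cause a division by zero in the rescaling step.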

Python visualization - histograms

The following two questions are about a histogram I am trying to build.
1) I want the bins to be as follows:
[0-10, 10-20, ..., 580-590, 590-600]. I tried the following code:
bins_range = []
for i in range(0, 610, 10):
    bins_range.append(i)
plt.hist(df['something'], bins=bins_range, rwidth=0.95)
I expected to see bins as above with their corresponding amount of samples for each bin, but instead I got only 10 bins (as the default parameter).
2) How can I change the y-axis as follows: say my max bin contains 40 samples, so instead of 40 on the y-axis I want it to be 100%, and the others correspondingly. I.e., 30 will be 75%, 20 will be 50% and so on.
Your code seems to be working OK. You can even pass the range command directly to the bins parameter of hist.
To get the y-axis as percentages, I think you need two passes: first calculate the histogram to know how many samples the highest bin contains. Then, do the plotting using 1/highest as weights. NumPy's np.histogram does all the calculations without plotting.
Use the PercentFormatter() to display the axis in percentages. It gets a parameter to tell how many 100% represents. Use PercentFormatter(max(hist)) to get the highest value as 100%. If you just want the total as 100%, just pass PercentFormatter(len(x)), without the need to calculate the histogram twice. As internally the y-axis is still in values, the ticks don't show up at the desired positions. You can use plt.yticks(np.linspace(0, max(hist), 11)) to have ticks for every 10%.
To get nicer separations between the bars, you can set an explicit edge color. This works best without rwidth=0.95.
Example code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
x = np.random.rayleigh(200, 50000)
hist, bins = np.histogram(x, bins=range(0, 610, 10))
plt.hist(x, bins=bins, ec='white', fc='darkorange')
plt.gca().yaxis.set_major_formatter(PercentFormatter(max(hist)))
plt.yticks(np.linspace(0, max(hist), 11))
plt.show()
PS: To use matplotlib's standard yticks, and having the y-axis also internally in percentages, you can use the weights parameter of hist. This can be handy when you want to interactively resize or zoom the plot, or need horizontal lines at specific percentages.
plt.hist(x, bins=bins, ec='white', fc='dodgerblue', weights=np.ones_like(x)/max(hist))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
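To check the binning without drawing anything, the range can be passed straight to np.histogram. This is a small sketch with uniformly distributed stand-in samples (an assumption replacing `df['something']`); the name `percent_of_tallest` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(0, 600, size=1000)  # hypothetical stand-in for df['something']

# 61 edges (0, 10, ..., 600) define 60 bins of width 10.
counts, edges = np.histogram(samples, bins=range(0, 610, 10))

# Height of each bin relative to the tallest bin, in percent.
percent_of_tallest = 100 * counts / counts.max()
print(len(counts), edges[0], edges[-1])
```

If the number of bins reported here is 60 but the plot still shows 10 bars, the bins argument is probably not reaching plt.hist, for instance because an older cell was run.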

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# initial parameters
for column in data:
    n_iter = len(data.index)  # number of iterations equal to sample numbers
    sz = (n_iter,)  # size of array
    z = data[column]  # observations
    Q = 1e-5  # process variance
    # allocate space for arrays
    xhat = np.zeros(sz)  # a posteriori estimate of x
    P = np.zeros(sz)  # a posteriori error estimate
    xhatminus = np.zeros(sz)  # a priori estimate of x
    Pminus = np.zeros(sz)  # a priori error estimate
    K = np.zeros(sz)  # gain or blending factor
    R = 1.0**2  # estimate of measurement variance, change to see effect
    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0
    for k in range(1, n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1] + Q
        # measurement update
        K[k] = Pminus[k] / (Pminus[k] + R)
        xhat[k] = xhatminus[k] + K[k] * (z[k] - xhatminus[k])
        P[k] = (1 - K[k]) * Pminus[k]
    # add new data to created dataframe
    filtered.assign(a = [xhat])
    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z, 'k+', label='noisy measurements')
    plt.plot(xhat, 'b-', label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the limit at which matplotlib warns (a parameter you can change, set to 20 by default). This is because you create a new figure in each iteration of your for loop. Depending on how many times the loop runs, you are potentially opening hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are putting a very large load on your system. Either it is processing very slowly because of this or it is crashing altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop but it seems like each iteration of your loop corresponds to one time step and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and figure options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time and store it in an easy-to-plot datatype like lists, then plot it once at the end.
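A minimal sketch of the "compute first, plot once" approach follows. It assumes a small random DataFrame in place of the question's `data`, and the helper name `kalman_1d` is invented for this example. It also stores results by column assignment, because `DataFrame.assign` returns a new frame rather than modifying `filtered` in place.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def kalman_1d(z, Q=1e-5, R=1.0):
    """Scalar Kalman filter for a constant-value model, as in the question's loop."""
    z = np.asarray(z, dtype=float)
    xhat = np.zeros(len(z))  # a posteriori state estimate
    P = np.zeros(len(z))     # a posteriori error estimate
    xhat[0], P[0] = z[0], 1.0
    for k in range(1, len(z)):
        Pminus = P[k - 1] + Q                              # time update
        K = Pminus / (Pminus + R)                          # gain
        xhat[k] = xhat[k - 1] + K * (z[k] - xhat[k - 1])   # measurement update
        P[k] = (1 - K) * Pminus
    return xhat


# Hypothetical stand-in for the question's `data` DataFrame.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(-0.4, 0.1, size=(50, 6)), columns=list("abcdef"))

# Filter every column first and keep the results in one DataFrame.
filtered = pd.DataFrame(index=data.index)
for column in data:
    filtered[column] = kalman_1d(data[column].to_numpy())

# One figure with one row per column, instead of a new figure per iteration.
fig, axes = plt.subplots(nrows=len(data.columns), sharex=True,
                         figsize=(10, 8), squeeze=False)
for ax, column in zip(axes[:, 0], data):
    ax.plot(data[column].to_numpy(), "k+", label="noisy measurements")
    ax.plot(filtered[column].to_numpy(), "b-", label="a posteriori estimate")
    ax.set_ylabel(column)
axes[0, 0].legend()
```

If a separate figure per column is really wanted, calling `plt.close(fig)` after each figure has been shown or saved releases its memory and keeps the number of open figures low.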

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1) Setting different color for each series in scatter plot on matplotlib
2) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4
xmax = 3e-3
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x = "{}".format(header_list[i]), y = "{}".format(header_list[i + 1]), ax=ax)
dfplot.set_xlim(xmin,xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing
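One way to get a distinct color per series, sketched here under assumptions: sample evenly spaced colors from a colormap with strong contrast and pass one to each scatter call. The synthetic DataFrame stands in for the CSV, `nipy_spectral` is just one possible colormap choice, and the sketch calls `ax.scatter` directly instead of `data.plot.scatter`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: columns alternate x0, y0, x1, y1, ...
rng = np.random.default_rng(0)
columns = {}
for k in range(6):
    x = np.linspace(1e-4, 3e-3, 20)
    columns[f"x{k}"] = x
    columns[f"y{k}"] = (k + 1) * x + rng.normal(scale=1e-4, size=x.size)
data = pd.DataFrame(columns)

header_list = list(data)
n_series = len(header_list) // 2
# Sample one color per series, evenly spaced across the chosen colormap.
cmap = plt.get_cmap("nipy_spectral")
colors = [cmap(v) for v in np.linspace(0.05, 0.95, n_series)]

fig, ax = plt.subplots()
for k in range(n_series):
    x_name, y_name = header_list[2 * k], header_list[2 * k + 1]
    ax.scatter(data[x_name], data[y_name], color=colors[k], s=15, label=y_name)
ax.set_xlim(1e-4, 3e-3)
ax.legend(fontsize="small", ncol=2)
```

With 100 or more series, neighboring colors inevitably become similar whatever the colormap, so varying the marker shape as well may help to tell series apart.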

How to divide the area between two co-ordinates into blocks and assign some values to those blocks?

Basically I have to create a heatmap of the crowd present in an area.
I have two coordinates. X starts from 0 and its maximum is 119994. Y ranges from -14,000 to +27,000. I have to divide these coordinates into as many blocks as I wish, count the number of people in each block and create a heatmap of this whole area.
Basically show the crowdedness of the area divided as blocks.
I have data in the below format:-
Employee_ID X_coord Y_coord_start Y_coord_end
23 1333 0 6000
45 3999 7000 17000
I tried dividing both coordinate maximums by 100 (to make 100 blocks) and tried finding the block coordinates, but that was very complex.
As I have to make a heatmap I have to prepare a matrix of values in the form of blocks. Every block will have a count of people which I can count and find out from my data but the problem is how to make these blocks of coordinates?
I have another question regarding scatter plot:-
My data is:-
Batch_ID Pieces_Productivity
181031008780 4.578886
181031008781 2.578886
When I plot it using the following code:-
plt.scatter(list(df_books_location.Batch_ID),list(df_books_location['Pieces_productivity']), s=area, alpha=0.5)
It doesn't give me proper plot. But when I plot with small integers(0-1000) for Batch_ID I get good graph. How to handle large integers for plotting?
I don't know which of the two Y_coord columns should give the actual Y coordinate, and I also don't know whether your plot should evaluate the data on a strict "grid" or rather smooth it out; hence I am using both an imshow() and a sns.kdeplot() in the code below:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
### generate some data
np.random.seed(0)
data = np.random.multivariate_normal([0, 0], [(1, .6), (.6, 1)], 100)
## this would e.g. be X,Y=df['X_coord'], df['Y_coord_start'] :
X,Y=data[:,0],data[:,1]
fig,ax=plt.subplots(nrows=1,ncols=3,figsize=(10,5))
ax[0].scatter(X,Y)
# recent seaborn versions expect keyword arguments x=, y= and fill= instead of shade=
sns.kdeplot(x=X, y=Y, fill=True, ax=ax[1], cmap="viridis")
## the X,Y points are binned into 10x10 bins here, you will need
# to adjust the amount of bins so that it looks "nice" for you
heatmap, xedges, yedges = np.histogram2d(X, Y, bins=(10,10))
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
im=ax[2].imshow(heatmap.T, extent=extent,
                origin="lower",aspect="auto",
                interpolation="nearest") ## also play with different interpolations
## Loop over heatmap dimensions and create text annotations:
# note that we need to "push" the text from the lower left corner of each pixel
# into the center of each pixel
## also try to choose a text color which is readable on all pixels,
# or e.g. use vmin=… vmax= to adjust the colormap such that the colors
# don't clash with e.g. white text
pixel_center_x=(xedges[1]-xedges[0])/2.
pixel_center_y=(yedges[1]-yedges[0])/2.
for i in range(np.shape(heatmap)[1]):
    for j in range(np.shape(heatmap)[0]):
        text = ax[2].text(pixel_center_x+xedges[j], pixel_center_y+yedges[i],'{0:0.0f}'.format(heatmap[j, i]),
                          ha="center", va="center", color="w",fontsize=6)
plt.colorbar(im)
plt.show()
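This yields a figure with three panels: the scatter plot, the KDE plot, and the annotated 2D-histogram heatmap.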
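For the coordinate ranges given in the question, the blocks can be defined by explicit bin edges instead of being computed by hand. The following is a sketch under assumptions: the positions are random stand-ins, and the block counts `n_blocks_x` and `n_blocks_y` are arbitrary choices.

```python
import numpy as np

# Hypothetical positions; in the question these would come from X_coord and a Y column.
rng = np.random.default_rng(0)
x = rng.uniform(0, 119994, size=500)
y = rng.uniform(-14000, 27000, size=500)

n_blocks_x, n_blocks_y = 10, 10  # choose any block counts
x_edges = np.linspace(0, 119994, n_blocks_x + 1)
y_edges = np.linspace(-14000, 27000, n_blocks_y + 1)

# counts[i, j] is the number of people whose x falls in block i and y in block j.
counts, _, _ = np.histogram2d(x, y, bins=(x_edges, y_edges))

# 0-based block indices of each individual point, if per-person labels are needed.
x_block = np.clip(np.digitize(x, x_edges) - 1, 0, n_blocks_x - 1)
y_block = np.clip(np.digitize(y, y_edges) - 1, 0, n_blocks_y - 1)
```

The `counts` matrix can be passed to `imshow` in the same way as `heatmap` in the code above, with `extent` built from the first and last entries of `x_edges` and `y_edges`.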
