python pdpbox plot with unscaled feature values - python-3.x

I'm trying to plot Partial Dependence (PDP) and ICE plots for a multi-layer perceptron classifier trained on the UCI Adult dataset. I label-encoded the categorical features, scaled the whole dataframe, and then performed a train-test split on the scaled dataframe.
When I plot the PDP and ICE plots, the Age values on the x-axis are the scaled ones and therefore hard to interpret. I want the axis to show the original Age values from before scaling. How can I achieve this?
This is the code for the plots:
from pdpbox import pdp, info_plots
pdp_age = pdp.pdp_isolate(model=mlp, dataset=X_train, model_features=X_train.columns, feature='Age')
#PDP Plot
fig, axes = pdp.pdp_plot(pdp_age, 'Age', plot_lines=False, center=False, frac_to_plot=0.5, plot_pts_dist=True,x_quantile=True, show_percentile=True)
#ICE Plot
fig, axes = pdp.pdp_plot(pdp_age, 'Age', plot_lines=True, center=False, frac_to_plot=0.5, plot_pts_dist=True,x_quantile=True, show_percentile=True)
As the plot shows, the scaled Age values are hard to interpret; I want the x-axis to show Age in its original units.

I solved this by using a Pipeline object. I one-hot encoded the categorical variables and moved the scaling and the classifier into the Pipeline. I could then pass the encoded but unscaled X_train to the partial dependence plot, which showed the actual Age value ranges I was looking for.
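The Pipeline idea can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the column names and labels are made up, and the partial dependence is computed by hand rather than with pdpbox, so that the effect of keeping the scaler inside the model is visible.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Adult data (hypothetical columns and labels).
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "Age": rng.integers(17, 91, size=300).astype(float),
    "Hours": rng.integers(1, 100, size=300).astype(float),
})
y_train = (X_train["Age"] + X_train["Hours"] > 100).astype(int)

# Scaling lives inside the pipeline, so X_train itself stays in original units.
model = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)),
])
model.fit(X_train, y_train)

# Hand-rolled partial dependence on a grid of raw (unscaled) Age values:
# fix Age to each grid value for all rows and average the predicted probability.
age_grid = np.percentile(X_train["Age"], np.arange(0, 101, 10))
partial_dependence = []
for age in age_grid:
    X_mod = X_train.copy()
    X_mod["Age"] = age
    partial_dependence.append(model.predict_proba(X_mod)[:, 1].mean())
partial_dependence = np.array(partial_dependence)
```

Because the grid is built from the unscaled frame, plotting partial_dependence against age_grid gives an x-axis in years. Passing the fitted pipeline and the unscaled X_train to pdp.pdp_isolate, as in the question, should behave the same way, since pdpbox then builds its grid from the unscaled column.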

Related

How to plot hyperparameter tuning results?

I have the result of a grid search as follows.
"trial","learning_rate","batch_size","accuracy","f1","loss"
1,0.000007,70,0.789,0.862,0.467
2,0.000008,100,0.710,0.822,0.563
3,0.000008,90,0.823,0.874,0.524
4,0.000007,90,0.833,0.878,0.492
5,0.000009,110,0.715,0.825,0.509
6,0.000006,90,0.883,0.885,0.932
7,0.000009,80,0.850,0.895,0.408
8,0.000006,110,0.683,0.812,0.593
9,0.000005,90,0.769,0.848,0.468
10,0.000005,80,0.816,0.868,0.462
11,0.000003,100,0.852,0.901,0.448
12,0.000004,100,0.705,0.818,0.512
13,0.000003,110,0.708,0.818,0.567
14,0.000002,90,0.683,0.812,0.552
15,0.000008,100,0.791,0.857,0.438
16,0.000006,110,0.683,0.812,0.604
17,0.000007,70,0.693,0.816,0.592
18,0.000005,110,0.830,0.883,0.892
19,0.000004,90,0.693,0.816,0.591
20,0.000008,70,0.696,0.818,0.570
I want to create a plot more or less similar to this using matplotlib. I know this one was made with Weights & Biases, but I cannot use that.
I don't care about the inference part; I just want the plot. I've been trying to do this using twinx but have not been successful. This is what I have so far.
from csv import DictReader
import matplotlib.pyplot as plt
trials = list(DictReader(open("hparams_trials.csv")))
trials = {f"trial_{trial['trial']}": [int(trial["batch_size"]),
                                      float(trial["f1"]),
                                      float(trial["loss"]),
                                      float(trial["accuracy"]),
                                      float(trial["learning_rate"])] for trial in trials}
items = ["batch_size", "f1", "loss", "accuracy", "learning_rate"]
host_y_values_index = 0
parts_y_values_indexes = [1, 2, 3, 4]
fig, host = plt.subplots(figsize=(8, 5)) # (width, height) in inches
fig.dpi = 300. # Figure resolution
# Removing extra spines
host.spines.top.set_visible(False)
host.spines.bottom.set_visible(False)
host.spines.right.set_visible(False)
# Creating subplots which share the same x axis.
parts = {index: host.twinx() for index in parts_y_values_indexes}
# Setting the limits of the host plot
host.set_xlim(0, len(trials["trial_1"]))
host.set_ylim(min([i[host_y_values_index] for i in trials.values()]),
              max([i[host_y_values_index] for i in trials.values()]))
# Removing the extra spines from the other plots and setting y limits
for part in parts_y_values_indexes:
    parts[part].spines.top.set_visible(False)
    parts[part].spines.bottom.set_visible(False)
    parts[part].set_ylim(min([trial[part] for trial in trials.values()]),
                         max([trial[part] for trial in trials.values()]))
# Colors of the trials
colors = ["gold", "lightcoral", "maroon", "springgreen", "cyan", "steelblue", "darkmagenta", "fuchsia", "crimson",
"lime", "mediumblue", "cadetblue", "dodgerblue", "olivedrab", "sandybrown", "bisque", "orangered", "black",
"rosybrown", "chocolate"]
# The plots
plots = []
# Plotting the trials. This is where I'm having problems with.
for index, trial in enumerate(trials):
    plots.append(host.plot(items, trials[trial], color=colors[index], label=trial)[0])
# Creating the legend
host.legend(handles=plots, fancybox=True, loc='right', facecolor="snow", bbox_to_anchor=(1.02, 0.495), framealpha=1)
# Defining the positions of the spines.
spines_positions = [-104.85 * i for i in parts_y_values_indexes]
# Repositioning the spines
for part in parts_y_values_indexes:
    parts[part].spines['right'].set_position(('outward', spines_positions[-part]))
# Adjust spacings around fig
fig.tight_layout()
host.grid(True)
# This is better than the one above but it appears on top of the legend.
# plt.grid(True)
plt.draw()
plt.show()
I'm having several problems with that code.
First, I cannot place each value of a trial on its own spine and then connect the values to one another. Each trial has a batch size, an f1, a loss, an accuracy and a learning rate. Each of those needs to be plotted against its own spine and connected to the others in that order, giving one line per trial. For now I have plotted everything on the host axes, but I know that is wrong and have no idea what the correct approach is.
Second, the ticks of the learning rate change: they are shown as a range of 2 to 9, and a 1e-6 appears at the top. I want to keep the original values.
Third, and probably related to the second problem, the 1e-6 appears at the top right above the legend rather than above the spine.
I'm struggling to resolve all three problems and would appreciate any help. If what I am doing is totally wrong, please help me find the correct approach. I'm going in circles and haven't found a working solution so far.
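One common way to get this kind of parallel-coordinates plot in plain matplotlib is to rescale every column onto the host axis range, draw all lines on the host, and use the twinx axes only to display each column's real tick values. The sketch below is one possible approach, not a definitive solution. It uses only the first four trials from the table, and pinning the learning-rate ticks to the observed values is a choice made here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FormatStrFormatter

columns = ["batch_size", "f1", "loss", "accuracy", "learning_rate"]
# First four trials from the table above, in `columns` order.
rows = np.array([
    [70, 0.862, 0.467, 0.789, 0.000007],
    [100, 0.822, 0.563, 0.710, 0.000008],
    [90, 0.874, 0.524, 0.823, 0.000008],
    [90, 0.878, 0.492, 0.833, 0.000007],
])

lows, highs = rows.min(axis=0), rows.max(axis=0)
# Rescale every column onto the host range (column 0) so one line can cross all axes.
scaled = (rows - lows) / (highs - lows) * (highs[0] - lows[0]) + lows[0]

fig, host = plt.subplots(figsize=(8, 5))
axes = [host] + [host.twinx() for _ in columns[1:]]
for i, ax in enumerate(axes):
    ax.set_ylim(lows[i], highs[i])  # each axis shows its own real value range
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    if ax is not host:
        ax.spines["left"].set_visible(False)
        ax.yaxis.set_ticks_position("right")
        # Move this axis's spine to its column position along the x axis.
        ax.spines["right"].set_position(("axes", i / (len(columns) - 1)))

# Show learning-rate ticks as the observed values in plain decimals (no 1e-6 offset).
axes[-1].set_yticks(np.unique(rows[:, -1]))
axes[-1].yaxis.set_major_formatter(FormatStrFormatter("%.6f"))

host.set_xlim(0, len(columns) - 1)
host.set_xticks(range(len(columns)))
host.set_xticklabels(columns)
host.spines["right"].set_visible(False)

# One line per trial, drawn on the host axes using the rescaled values.
lines = [host.plot(range(len(columns)), row, label=f"trial_{n}")[0]
         for n, row in enumerate(scaled, start=1)]
host.legend(handles=lines, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=4)
fig.tight_layout()
```

Because the lines are drawn in host coordinates, the twin axes never need data of their own; they only carry the tick labels. Call plt.show() or fig.savefig(...) to view the result.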

Spacing out dates on the X-Axis in Matplotlib

I'm analyzing coronavirus data in my country and I want to plot the data on new deaths, and total deaths in one plot. Initially I plotted it using the inbuilt plot function on the dataframe.
reports.plot(x='date', y='new_deaths')
reports.plot(x='date', y='total_deaths')
plt.xlabel('date')
plt.ylabel('cases')
plt.show()
Which yielded these images.
However, I wanted to change the legend labels and have both series in one plot, so I went ahead and used plt.subplot(111).
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(ph_reports_april.date, ph_reports_april.new_deaths, label='New Deaths')
ax.plot(ph_reports_april.date, ph_reports_april.total_deaths, label='Total Deaths')
plt.xlabel('Date')
plt.ylabel('Cases')
ax.legend()
plt.show()
It got the job done except for one little problem.
The dates are congested. Is there a way to space the dates the way the dataframe.plot() function does? I've browsed through this site and haven't found anything useful about spacing out the date values on the x-axis.
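A likely cause, assuming the date column holds strings, is that matplotlib treats each string as a separate category and draws a tick for every one. A minimal sketch of one fix is to convert the column to real datetimes and let a date locator choose the ticks. The dataframe below is a hypothetical stand-in for ph_reports_april.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for ph_reports_april, with dates stored as strings.
ph_reports_april = pd.DataFrame({
    "date": pd.date_range("2020-04-01", periods=30).strftime("%Y-%m-%d"),
    "new_deaths": range(30),
})
ph_reports_april["total_deaths"] = ph_reports_april["new_deaths"].cumsum()

# Real datetimes let matplotlib pick sparse, evenly spaced ticks.
dates = pd.to_datetime(ph_reports_april["date"])

fig, ax = plt.subplots()
ax.plot(dates, ph_reports_april["new_deaths"], label="New Deaths")
ax.plot(dates, ph_reports_april["total_deaths"], label="Total Deaths")

locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.autofmt_xdate()  # slant the labels so they do not overlap

ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
```

If the column is already a datetime type, the pd.to_datetime call is harmless and the locator and formatter lines alone may be enough.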

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1) Setting different color for each series in scatter plot on matplotlib
2) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4
xmax = 3e-3
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x=header_list[i], y=header_list[i + 1], ax=ax)
        dfplot.set_xlim(xmin, xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing
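One way to give each series its own color automatically is to sample a colormap once per x-y pair inside the loop. The sketch below is an illustration under assumptions: the data is synthetic, the colormap name "nipy_spectral" is just one wide-ranging choice, and matplotlib's ax.scatter is used directly instead of DataFrame.plot.scatter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: columns alternate x0, y0, x1, y1, ... as in the question.
rng = np.random.default_rng(0)
x = np.linspace(1e-4, 3e-3, 20)
columns = {}
for k in range(5):
    columns[f"x{k}"] = x
    columns[f"y{k}"] = (k + 1) * x + rng.normal(0, 1e-4, x.size)
data = pd.DataFrame(columns)

header_list = list(data)
n_series = len(header_list) // 2
cmap = plt.get_cmap("nipy_spectral")

fig, ax = plt.subplots()
for k in range(n_series):
    x_col, y_col = header_list[2 * k], header_list[2 * k + 1]
    # Spread the series evenly over the colormap, from 0.0 to 1.0.
    color = cmap(k / max(n_series - 1, 1))
    ax.scatter(data[x_col], data[y_col], color=color, label=y_col)
ax.set_xlim(1e-4, 3e-3)
ax.legend()
```

With 100+ series, neighboring colors will look similar in any continuous colormap, so varying the marker shape as well may help.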

Concatenating multiple barplots in seaborn

My data-frame contains the following column headers: subject, Group, MASQ_GDA, MASQ_AA, MASQ_GDD, MASQ_AD
I was successfully able to plot one of them using a bar plot with the following specifications:
bar_plot = sns.barplot(x="Group", y='MASQ_GDA', units="subject", ci = 68, hue="Group", data=demo_masq)
However, I am trying to create several such bar plots side by side, one for each of the remaining three variables (MASQ_AA, MASQ_GDD, MASQ_AD). Does anyone know how I can accomplish this? Here is an example of what I am trying to achieve.
If you look at the documentation for sns.barplot(), you will see that the function accepts an ax= parameter, which lets you tell seaborn which Axes object to draw the result on:
ax : matplotlib Axes, optional
    Axes object to draw the plot onto, otherwise uses the current Axes.
Therefore, the simple way to obtain the desired output is to create the Axes beforehand and then call sns.barplot() with the corresponding ax parameter:
fig, axs = plt.subplots(1,4) # create 4 subplots on 1 row
for ax, col in zip(axs, ["MASQ_GDA", "MASQ_AA", "MASQ_GDD", "MASQ_AD"]):
    sns.barplot(x="Group", y=col, units="subject", ci=68, hue="Group", data=demo_masq, ax=ax) # <- notice ax= argument
Another option, and maybe one more in line with the philosophy of seaborn, is to use a FacetGrid. This automatically creates the required number of subplots depending on the number of categories in your dataset. However, it requires reshaping your dataframe so that the contents of your MASQ_* columns are in a single column, with a new column showing which category each value corresponds to.
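The reshaping step described above can be sketched with pandas melt. The dataframe here is a hypothetical miniature of demo_masq, and the new column names "scale" and "score" are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of demo_masq with the column names from the question.
demo_masq = pd.DataFrame({
    "subject": [1, 2, 3, 4],
    "Group": ["A", "A", "B", "B"],
    "MASQ_GDA": [10, 12, 14, 16],
    "MASQ_AA": [20, 22, 24, 26],
    "MASQ_GDD": [30, 32, 34, 36],
    "MASQ_AD": [40, 42, 44, 46],
})

# Wide to long: one row per (subject, scale) pair.
long_masq = demo_masq.melt(
    id_vars=["subject", "Group"],
    value_vars=["MASQ_GDA", "MASQ_AA", "MASQ_GDD", "MASQ_AD"],
    var_name="scale",
    value_name="score",
)
```

The long frame can then be passed to a figure-level function such as sns.catplot(data=long_masq, x="Group", y="score", col="scale", kind="bar"), which builds a FacetGrid with one column of subplots per scale.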

Distributing plots across a grid of variable axes length in python

I have written a few lines of Python 3 code to help automate the analysis of data generated using a technique called calorimetry (for radiation dosimetry). In the enclosed example, the analysis of the input file returned eight 'heating regions' (top panel). In each region, a pair of linear regressions (black segment, red segment) was made on portions of the data to calculate the magnitude of the 'step' relative to the average value of my quantity of interest (the varying resistance of a thermistor); this is plotted in the bottom panel of the same figure.
[Figure: automatic identification of 8 heating regions (top panel) and computed relative step magnitude (bottom panel)]
Results of this type of analysis are summarized in a data frame (currently a NumPy ndarray), but ideally I would also like to produce a graphical representation with some annotations in each subplot, including information from the corresponding row of the results dataframe:
[Figure: step analysis via a pair of linear regressions and further computation]
The general output would look something like this last figure, with each subplot including the same essential information from the previous individual plot.
In this specific case the output is a (2, 4) grid, because there were exactly 8 regions to analyse.
This was created by hand, without any iteration, using this portion of code in a Jupyter notebook:
%matplotlib inline
results_fig = pyplt.figure(figsize=(20,10))
results_grid = matplotlib.gridspec.GridSpec(2, 4, hspace=0.2, wspace=0.3)
results_fig.suptitle("Faceted presentation of calorimetric runs", fontsize=15)
ax1 = results_fig.add_subplot(results_grid[0,0])
ax1.scatter(time,resistance, marker ='o', s=20, c='blue')
ax1.plot(time[x1[0]:xmid[0]], line_pre[0], color='black', linewidth=3.0)
ax1.plot(time[xmid[0]:x4[0]], line_post[0], color='red', linewidth=3.0)
ax1.set_xlim(xlim1[0],xlim2[0])
ax1.set_ylabel("resistance [Ohm]")
# [... continues for each subplot in the grid ... ]
Given that the number of 'heating regions' may vary considerably from file to file, i.e. I cannot determine it before analyzing each experimental output datafile, here is my pair of questions:
1) How can I produce a grid of subplots without prior knowledge of how many subplots it will show? One of the dimensions of the grid could be four, as in the example provided here, but the other is unknown. I could iterate over the length of one of the axes of the numpy results array, but then I would need to span over two axes in my plot grid.
2) Without re-inventing the wheel, is there a python module that can assist in this direction?
Thanks
Here is how you create a grid of n x 4 subplots and iterate over them:
import numpy as np
import matplotlib.pyplot as plt
# alldata is assumed to be a sequence of (x, y) pairs, one per subplot
numplots = 10 # number of plots to create
m = 4 # number of columns
n = int(np.ceil(numplots / m)) # number of rows
fig, axes = plt.subplots(nrows=n, ncols=m)
fig.subplots_adjust(hspace=0.2, wspace=0.3)
for data, ax in zip(alldata, axes.flatten()):
    ax.plot(data[0], data[1], color='black')
    # further plotting, label setting etc.
# optionally, remove empty plots from grid
if n*m > numplots:
    for ax in axes.flatten()[numplots:]:
        ax.remove()
        ## or
        # ax.set_visible(False)
