visualize predict_proba for multiclass classification - scikit-learn

With model.predict_proba(X) I just get a big array with lots of numbers.
I am looking for a way to visualize the probabilities of a classification for all classes (in my case 13). I use a RandomForestClassifier.
Any recommendation?

A heatmap is a nice way to visualize a 2D matrix such as the output of predict_proba. Of course, if X contains many records, it is hard to show everything at once, so you will probably have to sample records. Here I show the first 10 records and label each class whose predicted probability is greater than 0.1.
Check out this example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
X, y = make_classification(n_samples=10000,n_features=40,
n_informative=30, n_classes=13,
n_redundant=0, n_clusters_per_class=1,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
pred = forest.predict_proba(X_test)[:10]
fig, ax = plt.subplots(figsize= (20,8))
im = ax.imshow(pred, cmap='Blues')
ax.grid(axis='y')
ax.set_xticklabels([])
ax.set_yticks(np.arange(pred.shape[0]))
plt.ylabel('Records', fontsize='xx-large')
plt.xlabel('Classes', fontsize='xx-large')
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
for i in range(pred.shape[0]):
    for j in range(13):
        if pred[i, j] > .1:
            ax.text(j, i, j,
                    ha="center", va="center", color="w", fontsize=30)

If your input space is 2D, or if you use some dimensionality reduction technique to embed it in 2D, you could plot the multiclass decision surface:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
# generate toy data
X, y = make_blobs(n_samples=1000, centers=13)
# fit classifier
clf = RandomForestClassifier().fit(X, y)
# create decision surface
xx, yy = np.meshgrid(np.linspace(-13, 12, 100),
                     np.linspace(-13, 12, 100))
Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Paired')
ax.contourf(xx, yy, Z, cmap='Paired', alpha=0.5)
Note that this only shades by predicted label (predict, not predict_proba), but you may be able to extend it to shade by probability instead.
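One way that extension might look is sketched below. It is not part of the original answer: it assumes that shading by the probability of the winning class is what you want, and it uses 5 blobs instead of 13 purely to keep the picture legible. The variable names grid and confidence are made up for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Toy 2D data; 5 classes instead of 13 keeps the picture legible.
X, y = make_blobs(n_samples=500, centers=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Evaluate predict_proba on a grid that covers the data.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100),
)
grid = np.column_stack([xx.ravel(), yy.ravel()])
proba = clf.predict_proba(grid)  # shape (n_grid_points, n_classes)

# Probability of the winning class at each grid point.
confidence = proba.max(axis=1).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(8, 8))
cs = ax.contourf(xx, yy, confidence, levels=10, cmap='Greys_r', alpha=0.6)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Paired', s=10)
fig.colorbar(cs, ax=ax, label='max class probability')
```

Dark regions are where the forest is least sure, which typically happens along the boundaries between blobs. Call plt.show() or fig.savefig(...) to see the result.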

Related

Generate Random Forest feature importance plots from 3D arrays

After carrying out a librosa MFCC feature extraction on 1000 audio files, I end up with an X_test array of size 1000 x 40 x 174 (40 features, as I set n_mfcc=40). To pass this through the random forest classifier, I scaled and then flattened the array. My new X_test now has a size of 1000 x 6960. How do I go about correctly generating the feature importance histogram?
This is the code that I used for the feature importance plot, but I am not sure if this is the correct approach:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
x_train # this has a shape of 1000 x 40 x 174
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
X_train = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X = pd.DataFrame(X_train) # X_train here is already flattened to 1000 x 6960
feature_names = [f"feature {i}" for i in range(X.shape[1])]
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots(figsize=(13, 10))  # a separate plt.figure() call would open a second, empty figure
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
With this code, I get this plot:
Can you tell me if this is the correct approach? If this approach is correct, how can I generate a more "readable" plot for the Feature Importance? Thanks!
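A possible way to get a more readable picture, sketched here under assumptions rather than offered as a definitive answer: since each sample was flattened in row-major order, the 6960 importances can be reshaped back to 40 x 174 and shown as a heatmap, or summed per MFCC coefficient to give a 40-bar chart. The example uses a small synthetic array (4 coefficients, 6 frames) in place of the real data, and the names importance_map, per_coefficient and per_frame are invented for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for the MFCC array: (samples, coefficients, frames).
# The real data in the question would be (1000, 40, 174).
rng = np.random.default_rng(0)
n_samples, n_mfcc, n_frames = 200, 4, 6
x = rng.normal(size=(n_samples, n_mfcc, n_frames))
y = (x[:, 2, :].mean(axis=1) > 0).astype(int)  # label depends on coefficient 2 only

# Row-major flatten, equivalent to calling .flatten() on each sample.
X_flat = x.reshape(n_samples, -1)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_flat, y)

# Undo the flattening so each importance sits at its (coefficient, frame) position.
importance_map = clf.feature_importances_.reshape(n_mfcc, n_frames)

# Aggregate: one value per MFCC coefficient, and one value per time frame.
per_coefficient = importance_map.sum(axis=1)
per_frame = importance_map.sum(axis=0)

fig, (ax_map, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))
im = ax_map.imshow(importance_map, aspect='auto', cmap='viridis')
ax_map.set_xlabel('Frame')
ax_map.set_ylabel('MFCC coefficient')
fig.colorbar(im, ax=ax_map, label='Mean decrease in impurity')
ax_bar.bar(np.arange(n_mfcc), per_coefficient)
ax_bar.set_xlabel('MFCC coefficient')
ax_bar.set_ylabel('Summed importance')
fig.tight_layout()
```

The reshape is only valid if the flattening order matches, which is the case for NumPy's default flatten() used in the question.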

Residual plot for MultiOutputRegressor with yellowbrick

I am dealing with a multi-output regression problem and applied "MultiOutputRegressor" accompanied by "XGBRegressor" algorithms on the corresponding data.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(),
              np.pi * np.cos(X).ravel()]).T
y += 0.5 - rng.rand(*y.shape)
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=400, test_size=200, random_state=4)
regr_multi = MultiOutputRegressor(XGBRegressor())
regr_multi.fit(X_train, y_train)
y_pred = regr_multi.predict(X_test)
What I would like to visualize is the residuals of the model predictions, using ResidualsPlot from the yellowbrick package.
When I use the following code:
from yellowbrick.regressor import ResidualsPlot
vis = ResidualsPlot(regr_multi)
vis.fit(X_train, y_train)
vis.score(X_test, y_test)
vis.show()
I get an error saying: The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors were provided.
Does yellowbrick support residual plots for multi-output models, or is this just an error that can be solved easily?
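Independently of what yellowbrick supports, one workaround is to draw the residuals per output by hand with matplotlib. The sketch below is an assumption-laden illustration, not a yellowbrick feature: plot_residuals_per_output is a hypothetical helper, residuals are taken as observed minus predicted, and random stand-in arrays are used so that the snippet runs without xgboost.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_residuals_per_output(y_true, y_pred):
    """Draw one residuals-vs-predicted panel per output column (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    residuals = y_true - y_pred  # convention chosen here: observed minus predicted
    n_outputs = y_true.shape[1]
    fig, axes = plt.subplots(1, n_outputs, figsize=(5 * n_outputs, 4), squeeze=False)
    for k, ax in enumerate(axes[0]):
        ax.scatter(y_pred[:, k], residuals[:, k], s=10, alpha=0.6)
        ax.axhline(0.0, color='k', lw=1)
        ax.set_xlabel(f'Predicted value (output {k})')
        ax.set_ylabel('Residual')
    fig.tight_layout()
    return fig, residuals


# Stand-in arrays; with the code from the question you would pass y_test and y_pred.
rng = np.random.RandomState(1)
y_true_demo = rng.rand(200, 2)
y_pred_demo = y_true_demo + 0.1 * rng.randn(200, 2)
fig, residuals = plot_residuals_per_output(y_true_demo, y_pred_demo)
```

With the fitted model from the question, the call would be plot_residuals_per_output(y_test, regr_multi.predict(X_test)).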

Confidence Interval from RandomForestRegressor in scikit-learn

scikit-learn has a quantile-regression-based confidence interval implementation for GBM (example from the docs).
Is there a reason why it doesn't provide a similar quantile based loss implementation for RandomForestRegressor?
There is a scikit-learn compatible Quantile Regression Forest implementation that can be used to generate confidence intervals here: https://github.com/zillow/quantile-forest
Setup should be as easy as:
pip install quantile-forest
Then, as an example, to generate CIs on a full dataset:
import matplotlib.pyplot as plt
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import KFold
X, y = datasets.fetch_california_housing(return_X_y=True)
qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
kf = KFold(n_splits=5)
kf.get_n_splits(X)
y_true = []
y_pred = []
y_pred_lower = []
y_pred_upper = []
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (
        X[train_index], X[test_index], y[train_index], y[test_index]
    )
    qrf.set_params(max_features=X_train.shape[1] // 3)
    qrf.fit(X_train, y_train)
    # Get predictions at 95% prediction intervals and median.
    y_pred_i = qrf.predict(X_test, quantiles=[0.025, 0.5, 0.975])
    y_true = np.concatenate((y_true, y_test))
    y_pred = np.concatenate((y_pred, y_pred_i[:, 1]))
    y_pred_lower = np.concatenate((y_pred_lower, y_pred_i[:, 0]))
    y_pred_upper = np.concatenate((y_pred_upper, y_pred_i[:, 2]))
fig = plt.figure(figsize=(10, 4))
y_pred_interval = y_pred_upper - y_pred_lower
sort_idx = np.argsort(y_pred_interval)
y_true = y_true[sort_idx]
y_pred_lower = y_pred_lower[sort_idx]
y_pred_upper = y_pred_upper[sort_idx]
# Center data, with the mean of the prediction interval at 0.
mean = (y_pred_lower + y_pred_upper) / 2
y_true -= mean
y_pred_lower -= mean
y_pred_upper -= mean
plt.plot(y_true, marker=".", ms=5, c="r", lw=0)
plt.fill_between(
np.arange(len(y_pred_upper)),
y_pred_lower,
y_pred_upper,
alpha=0.2,
color="gray",
)
plt.plot(np.arange(len(y)), y_pred_lower, marker="_", c="0.2", lw=0)
plt.plot(np.arange(len(y)), y_pred_upper, marker="_", c="0.2", lw=0)
plt.xlim([0, len(y)])
plt.xlabel("Ordered Samples")
plt.ylabel("Observed Values and Prediction Intervals (Centered)")
plt.show()
There seems to be a contributed scikit-learn package for this (the example below, for RandomForestRegressor, is copied from there).
I had to install the development version to get the correct import path for the current scikit-learn:
pip install git+https://github.com/scikit-learn-contrib/forest-confidence-interval.git
https://github.com/scikit-learn-contrib/forest-confidence-interval
Example (copy pasted from the link above):
# Regression Forest Example
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import sklearn.model_selection as xval
from sklearn.datasets import fetch_openml
import forestci as fci
# retrieve mpg data from machine learning library
mpg_data = fetch_openml('autompg')
# separate mpg data into predictors and outcome variable
mpg_X = mpg_data["data"]
mpg_y = mpg_data["target"]
# remove rows where the data is nan
not_null_sel = np.invert(
np.sum(np.isnan(mpg_data["data"]), axis=1).astype(bool))
mpg_X = mpg_X[not_null_sel]
mpg_y = mpg_y[not_null_sel]
# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
    mpg_X, mpg_y, test_size=0.25, random_state=42)
# Create RandomForestRegressor
n_trees = 2000
mpg_forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train)
mpg_y_hat = mpg_forest.predict(mpg_X_test)
# Plot predicted MPG without error bars
plt.scatter(mpg_y_test, mpg_y_hat)
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
# Calculate the variance
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train,
mpg_X_test)
# Plot error bars for predicted MPG using unbiased variance
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()

Confidence score too low

I'm wondering why the model score is so low, only 0.13. I have already made sure the data is clean and scaled, and the features are highly correlated with each other, but the score with linear regression is still very low. Why is this happening, and how can I solve it? This is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
path = r"D:\python projects\avocado.csv"
df = pd.read_csv(path)
df = df.reset_index(drop=True)
df.set_index('Date', inplace=True)
df = df.drop(['Unnamed: 0', 'year', 'type', 'region', 'AveragePrice'], axis=1)
df.rename(columns={'4046': 'Small HASS sold',
                   '4225': 'Large HASS sold',
                   '4770': 'XLarge HASS sold'},
          inplace=True)
print(df.head())
sns.heatmap(df.corr())
sns.pairplot(df)
df.plot()
_=plt.xticks(rotation=20)
forecast_line = 35
df['target'] = df['Total Volume'].shift(-forecast_line)
X = np.array(df.drop(['target'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_line:]
X = X[:-forecast_line]
df.dropna(inplace=True)
y = np.array(df['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lr = LinearRegression()
lr.fit(X_train,y_train)
confidence = lr.score(X_test,y_test)
print(confidence)
This is the link to the dataset I use: https://www.kaggle.com/neuromusic/avocado-prices
So the score function you are using is:
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum
of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible
score is 1.0 and it can be negative (because the model can be
arbitrarily worse). A constant model that always predicts the expected
value of y, disregarding the input features, would get a R^2 score of
0.0.
So, as you can see, a score of 0.13 is already above the constant prediction, which would score 0.0.
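To make the definition quoted above concrete, here is a small NumPy sketch that computes R^2 by hand. The function name r2 and the toy numbers are made up for the illustration; the formula is the one from the docstring.

```python
import numpy as np


def r2(y_true, y_pred):
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v


y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y_true, y_true))                     # perfect prediction: 1.0
print(r2(y_true, np.full(4, y_true.mean())))  # always predicting the mean: 0.0
print(r2(y_true, y_true[::-1]))               # worse than the mean: -3.0
```

A score of 0.13 therefore means the model removes about 13% of the squared error that a constant mean prediction would make on the test set.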
My advice: try plotting your data to see what kind of regression you should use. Here you can see an overview of which types of linear models are available: https://scikit-learn.org/stable/modules/linear_model.html
Logistic regression only makes sense if your target follows a logistic curve, meaning that the target values are mostly close to either 0 or 1, with few points in between. In scikit-learn it is a classification model, so it is not a fit for a continuous target.

What do the "n_features" and "centers" parameters mean in make_blobs in scikit-learn?

I have gone through the documentation for the n_features and centers parameters of the make_blobs function in scikit-learn. However, none of the explanations I have seen are clear to me, since I am new to scikit-learn and mathematics. What do these two parameters, n_features and centers, do in a make_blobs call such as the one below?
make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)
Thank you in advance.
The make_blobs function is part of sklearn.datasets (older releases also exposed it as sklearn.datasets.samples_generator, which has since been removed). The functions in that module help us generate data samples or datasets. In machine learning, which is what scikit-learn is all about, datasets are used to evaluate the performance of models. This is an example of how to evaluate a KNN classifier:
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_blobs(n_features=2, centers=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))
Now, as you mentioned, n_features determines how many columns, or features, the generated dataset will have. In machine learning, features are the numerical characteristics of the data. For example, the Iris dataset has 4 features (sepal length, sepal width, petal length and petal width), so it has 4 numerical columns. By increasing n_features in make_blobs, we add more features and hence increase the complexity of the generated dataset.
As for centers, it is easier to understand by visualizing the generated dataset. I use matplotlib to help with that:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# plot 1
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 1')
plt.savefig('centers_1.png')  # save after setting the title so it appears in the file
# plot 2
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')
# plot 3
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')
plt.show()
If you run the code above, you can easily see that centers corresponds to the number of classes generated in the data. The term centers is used because samples that belong to the same class tend to gather around a center (a coordinate).
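The same two points can also be checked without a plot, by looking at the shapes and labels that make_blobs returns. This is a minimal sketch; the second call with 5 features and 3 centers is an extra example that is not in the question.

```python
import numpy as np
from sklearn.datasets import make_blobs

# The call from the question: 2 features, 2 centers.
X, y = make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)
print(X.shape)       # (50, 2): 50 rows, one column per feature
print(np.unique(y))  # [0 1]: one integer label per center

# More features and more centers.
X5, y5 = make_blobs(n_samples=50, n_features=5, centers=3, random_state=75)
print(X5.shape)       # (50, 5)
print(np.unique(y5))  # [0 1 2]
```

So n_features sets the number of columns of X, while centers sets how many distinct labels appear in y.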

Resources