Generate Random Forest feature importance plots from 3D arrays - python-3.x

After carrying out a librosa MFCC feature extraction on 1000 audio files, I end up with an X_test array of size 1000 x 40 x 174 (40 features, as I set n_mfcc=40). To pass this through the random forest classifier, I scaled and then flattened the array, so my new X_test has a size of 1000 x 6960. How do I go about correctly generating the feature importance histogram?
This is the code that I used for the feature importance plot, but I am not sure if this is the correct approach:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# x_train has a shape of 1000 x 40 x 174
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
X_train = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X = pd.DataFrame(X_train) # X_train here is already flattened to 1000 x 6960
feature_names = [f"feature {i}" for i in range(X.shape[1])]
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots(figsize=(13, 10))  # a separate plt.figure() call would open a second, empty figure
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
With this code, I get this plot:
Can you tell me if this is the correct approach? If this approach is correct, how can I generate a more "readable" plot for the Feature Importance? Thanks!
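One way to get a more readable plot, sketched here under the assumption that the 6960 columns come from a row-major flatten of a 40 x 174 (coefficient x frame) array as in the question, is to fold the importance vector back into that shape and sum over frames, which leaves 40 bars instead of 6960. The helper name fold_importances is hypothetical, and the random vector below only stands in for clf.feature_importances_.

```python
import numpy as np

def fold_importances(importances, n_mfcc=40, n_frames=174):
    """Reshape a flat importance vector back to (n_mfcc, n_frames).

    Assumes each sample was flattened in row-major (C) order, so flat
    index k maps to coefficient k // n_frames and frame k % n_frames.
    """
    importances = np.asarray(importances)
    if importances.size != n_mfcc * n_frames:
        raise ValueError("importance vector does not match n_mfcc * n_frames")
    grid = importances.reshape(n_mfcc, n_frames)
    per_coefficient = grid.sum(axis=1)   # one value per MFCC coefficient
    per_frame = grid.sum(axis=0)         # one value per time frame
    return grid, per_coefficient, per_frame

# Stand-in for clf.feature_importances_ (random values that sum to 1).
rng = np.random.default_rng(0)
fake_importances = rng.random(40 * 174)
fake_importances /= fake_importances.sum()

grid, per_coefficient, per_frame = fold_importances(fake_importances)
print(grid.shape, per_coefficient.shape, per_frame.shape)
```

per_coefficient can then go through the same pd.Series(...).plot.bar() call used in the question, and grid can be shown with ax.imshow as a coefficient-by-frame heatmap.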

Related

visualize predict_proba for multiclass classification

With model.predict_proba(X) I just get a big array with lots of numbers.
I am looking for a way to visualize the probabilities of a classification for all classes (in my case 13). I use a RandomForestClassifier.
Any recommendation?
A heatmap would be a nice way to visualise a 2D matrix. Of course, if the number of records in your X is large, it is hard to visualise everything in a single go, so you will probably have to sample records. Here I'm showing the visuals for the first 10 records, labelling the predicted classes if the predicted probability is greater than 0.1.
Check out this example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
X, y = make_classification(n_samples=10000,n_features=40,
n_informative=30, n_classes=13,
n_redundant=0, n_clusters_per_class=1,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
pred = forest.predict_proba(X_test)[:10]
fig, ax = plt.subplots(figsize= (20,8))
im = ax.imshow(pred, cmap='Blues')
ax.grid(axis='y')
ax.set_xticklabels([])
ax.set_yticks(np.arange(pred.shape[0]))
plt.ylabel('Records', fontsize='xx-large')
plt.xlabel('Classes', fontsize='xx-large')
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
for i in range(pred.shape[0]):
    for j in range(13):
        if pred[i, j] > .1:
            ax.text(j, i, j,
                    ha="center", va="center", color="w", fontsize=30)
If your input space is 2D, or if you use some dimensionality reduction technique to embed it in 2D, you could plot the multiclass decision surface:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.ensemble
# generate toy data
X, y = sklearn.datasets.make_blobs(n_samples=1000, centers=13)
# fit classifier
clf = sklearn.ensemble.RandomForestClassifier().fit(X, y)
# create decision surface
xx, yy = np.meshgrid(np.linspace(-13, 12, 100),
np.linspace(-13, 12, 100))
Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap='Paired')
ax.contourf(xx, yy, Z, cmap='Paired', alpha=0.5)
Note this is only shading per label (predict not predict_proba) but you may be able to extend this to shade differently based on the probability.
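A possible extension to probability-based shading, as a rough sketch: evaluate predict_proba on the grid and use the largest class probability at each grid point as a confidence value. The grid size, n_estimators and the helper name plot_confidence are choices made here for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Toy 2D data; 13 centers mirror the 13 classes in the question.
X, y = make_blobs(n_samples=500, centers=13, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Grid covering the data range with a small margin.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 60),
                     np.linspace(y_min, y_max, 60))
grid_points = np.column_stack([xx.ravel(), yy.ravel()])

# predict_proba returns one column per class; the row maximum is the
# probability of the predicted class, used here as a confidence score.
proba = clf.predict_proba(grid_points)
confidence = proba.max(axis=1).reshape(xx.shape)

def plot_confidence(ax):
    """Draw the confidence surface and the training points on `ax`."""
    surface = ax.contourf(xx, yy, confidence, levels=10, cmap='Blues')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Paired', s=10)
    return surface
```

To display it, create an axes with fig, ax = plt.subplots() and call fig.colorbar(plot_confidence(ax), ax=ax). Darker regions are where the forest is more certain of its predicted label.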

Plotting AUC score for multiple model for multiclass classification in Python

I am working on a multiclass classification problem. There are a total of 46 unique classes in my dataset. I have computed the AUC score for every class and plotted it, but I want to plot the AUC scores of several models in one graph: LogisticRegression, XGBoost and two more models that are used to solve the multiclass problem. This is the code I have so far:
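import pandas as pd
from sklearn.metrics import auc, roc_curve
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
n_classes = 46
best_C = 1000
best_gamma = 0.0001
svc_model_grid_param = SVC(C=best_C, kernel="rbf", gamma=best_gamma)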
model_OVR_svc = OneVsRestClassifier(svc_model_grid_param)
y_score = model_OVR_svc.fit(X_train, y_train).decision_function(X_valid)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
# calculate dummies once
y_test_dummies = pd.get_dummies(y_valid, drop_first=False).values
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
Plotting--
import matplotlib.pyplot as plt
lists = sorted(roc_auc.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
plt.xlabel('Class')
plt.ylabel('AUC Score')
plt.plot(x, y)
plt.show()
Graph--
What I want to do--
Can anyone help me do this? Thanks in advance.
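One possible approach, offered as a sketch rather than a definitive answer: compute the per-class AUC for each model in a loop, store the results in a dict keyed by model name, and draw one line per model on the same axes. The synthetic data (5 classes instead of 46), the two models and the helper name plot_auc_per_model are all assumptions made for illustration; any classifier exposing predict_proba or decision_function could be dropped into the dict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Small synthetic stand-in for the real data (5 classes instead of 46).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0)

classes = np.unique(y_train)
y_valid_bin = label_binarize(y_valid, classes=classes)  # one column per class

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# auc_per_model[name][k] is the one-vs-rest AUC of class classes[k].
auc_per_model = {}
for name, model in models.items():
    scores = model.fit(X_train, y_train).predict_proba(X_valid)
    auc_per_model[name] = [roc_auc_score(y_valid_bin[:, k], scores[:, k])
                           for k in range(len(classes))]

def plot_auc_per_model(ax):
    """Plot one line per model: class label on x, AUC on y."""
    for name, aucs in auc_per_model.items():
        ax.plot(classes, aucs, marker='o', label=name)
    ax.set_xlabel('Class')
    ax.set_ylabel('AUC Score')
    ax.legend()
```

Calling plot_auc_per_model(ax) on an axes from plt.subplots() gives one labelled curve per model. The stratified split is there so that every class appears in the validation set, which the per-class AUC needs.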

What do the "n_features" and "centers" parameters mean in make_blobs in SciKit?

I have gone through the documentation for the n_features and centers parameters of the make_blobs function in SciKit. However, none of the explanations I've seen are clear to me, since I am new to SciKit and mathematics. What do these two parameters, n_features and centers, do in a make_blobs call like the one below?
make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)
Thank you in advance.
The make_blobs function is part of sklearn.datasets (older releases also exposed it as sklearn.datasets.samples_generator, which has since been removed). All functions in that module help us generate data samples or datasets. In machine learning, which is what scikit-learn is all about, datasets are used to evaluate the performance of machine learning models. This is an example of how to evaluate a KNN classifier:
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_blobs(n_features=2, centers=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))
Now, as you mentioned, n_features determines how many columns or features the generated dataset will have. In machine learning, features are the numerical characteristics of the data. For example, the Iris dataset has 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width), so there are 4 numerical columns in the dataset. By increasing n_features in make_blobs, we add more features and hence increase the complexity of the generated dataset.
As for centers, it is easier to understand by visualizing the generated dataset. I use matplotlib to help with that:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# plot 1
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 1')
plt.savefig('centers_1.png')  # save after setting the title so it appears in the file
# plot 2
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')
# plot 3
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')
plt.show()
If you run the code above, you can easily see that centers corresponds to the number of classes generated in the data. The term centers is used because samples that belong to the same class tend to gather close to a center (coordinate).
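The same point can be checked without a plot. This small sketch, using the call from the question, prints the shape of X (rows are samples, columns are features) and the distinct labels in y (one per center). The second call, with 5 features and 3 centers, is added only for contrast.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)

print(X.shape)        # (n_samples, n_features)
print(np.unique(y))   # one integer label per center

# More features and more centers: wider X, more distinct labels.
X5, y5 = make_blobs(n_samples=50, n_features=5, centers=3, random_state=75)
print(X5.shape, np.unique(y5))
```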

ValueError: Expected 2D array, got 1D array instead:

While practicing Simple Linear Regression Model I got this error,
I think there is something wrong with my data set.
Here is my data set:
Here is independent variable X:
Here is dependent variable Y:
Here is X_train
Here Is Y_train
This is error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is My code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Splitting the dataset into Training set and Test Set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]]. The -1 tells numpy to infer the number of rows from the length of the array and the remaining dimensions, and the 1 is the number of columns, giving us an n x 1 matrix where n is the input length.
Questions about reshape have been answered in the past. This one, for example, should explain what reshape(-1,1) fully means: What does -1 mean in numpy reshape? (Some of the other answers below explain this very well too.)
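A quick way to see what the two reshape calls from the error message do, as a minimal sketch on the three-element example above:

```python
import numpy as np

x = np.array([1, 2, 3])          # 1D, shape (3,)

column = x.reshape(-1, 1)        # 3 samples, 1 feature  -> shape (3, 1)
row = x.reshape(1, -1)           # 1 sample, 3 features  -> shape (1, 3)

print(x.shape, column.shape, row.shape)
print(column.tolist())           # [[1], [2], [3]]
```

For a single-feature regression like the one in the question, the column form is the one that fit and predict expect.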
A lot of times when doing linear regression problems, people like to envision this graph
On the input, we have an X of X = [1,2,3,4,5]
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices. It's multiple features (ex: number of rooms, location, etc. )
If you look at the documentation you will see this
It tells us that rows consist of the samples while the columns consist of the features.
However, consider what happens when we have one feature as our input. Then we need an n x 1 dimensional input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means: choose the number of rows that fits, given the number of columns provided. See the image for how it changes the input.
If you look at the documentation of scikit-learn's LinearRegression:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has 2 dimensions, whereas your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
Before fitting and predicting the model.
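If x_test is a single scalar value rather than an array, wrap it in a nested list so that predict receives a 1 x 1 array: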
y_pred = regressor.predict([[x_test]])
I would suggest reshaping X at the beginning, before you split into train and test datasets:
import pandas as pd
import matplotlib.pyplot as plt
# import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick: make x a 2D column of shape (n_samples, 1)
x = x.reshape(-1, 1)
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
This is what I use
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
For a single scalar x_test, this is the solution:
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
If x_train and x_test are pandas Series (for NumPy arrays, drop the .values), modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))

How to binarize RandomForest to plot a ROC in python?

I have 21 classes and I am using RandomForest. I want to plot a ROC curve, so I checked the scikit-learn ROC example.
That example uses an SVM, which has parameters such as probability and decision_function_shape that RF does not have.
So how can I binarize RandomForest and plot a ROC?
Thank you
EDIT
To create the fake data: there are 20 features and 21 classes (3 samples for each class).
df = pd.DataFrame(np.random.rand(63, 20))
label = np.arange(len(df)) // 3 + 1
df['label']=label
df
#TO TRAIN THE MODEL: IT IS A STRATIFIED SHUFFLED SPLIT
clf = make_pipeline(RandomForestClassifier())
xSSSmean10 = []
for i in range(10):
    sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=i)
    scoresSSS = cross_validation.cross_val_score(clf, x, y, cv=sss)
    xSSSmean10.append(scoresSSS.mean())
result_list.append(xSSSmean10)
print("")
For multilabel random forest, each of your 21 labels has a binary classification, and you can create a ROC curve for each of the 21 classes.
Your y_train should be a matrix of 0 and 1 for each label.
Assume you have fitted a multilabel random forest from sklearn and called it rf, and that you have an X_test and y_test after a train/test split. You can plot the ROC curve in python for your first label using this:
from sklearn import metrics
probs = rf.predict_proba(X_test)
fpr, tpr, threshs = metrics.roc_curve(y_test['name_of_your_first_tag'],probs[0][:,1])
Hope this helps. If you provide your code and data I could write this more specifically.
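For the single-label multiclass case described in the question, a common alternative is to binarize the labels with label_binarize and compute one ROC curve per class from the matching predict_proba column. The following is only a sketch on synthetic data (4 classes instead of 21, so that each class has enough test samples); the dataset and parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in: 4 classes here instead of 21 to keep it small.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# For a multiclass target, predict_proba returns a single array of shape
# (n_samples, n_classes) whose columns follow rf.classes_.
probs = rf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=rf.classes_)

fpr, tpr, roc_auc = {}, {}, {}
for k, label in enumerate(rf.classes_):
    fpr[label], tpr[label], _ = roc_curve(y_test_bin[:, k], probs[:, k])
    roc_auc[label] = auc(fpr[label], tpr[label])

print(roc_auc)
```

Each (fpr[label], tpr[label]) pair can then be passed to plt.plot to draw that class's curve. With only 3 samples per class, as in the fake data above, the held-out set would rarely contain positives for every class, so the curves need more data per class to be meaningful.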
