Removing the redundant feature from classification dataset ( make_classification ) - scikit-learn

In the make_classification method,
X, y = make_classification(n_samples=10, n_features=8, n_informative=7, n_redundant=1, n_repeated=0, n_classes=2, random_state=6)
Docstring about n_redundant: The number of redundant features. These features are generated as
random linear combinations of the informative features.
Docstring about n_repeated: The number of duplicated features, drawn randomly from the informative and the redundant features.
The n_repeated features are easy to spot because they are highly correlated with the informative features they duplicate.
The docstrings for the repeated and redundant features indicate that both are derived from the informative features.
My question is: how can redundant features be removed or highlighted, and what are their characteristics?
Attached is the correlation heatmap among all the features. Which feature in the image is redundant?
Please help.

To check how many columns are linearly independent, use np.linalg.matrix_rank(X).
To find the indices of the linearly independent columns of matrix X, use sympy.Matrix(X).rref().
DEMO
Generate dataset and check number of independent columns (matrix rank):
import numpy as np
from sklearn.datasets import make_classification
from sympy import Matrix

X, _ = make_classification(
    n_samples=10, n_features=8, n_redundant=2, random_state=6
)
np.linalg.matrix_rank(X, tol=1e-3)
# 6
Find indices of linearly independent columns:
_, inds = Matrix(X).rref(iszerofunc=lambda x: abs(x)<1e-3)
inds
#(0, 1, 2, 3, 6, 7)
Remove dependent columns and check matrix rank (num of independent columns):
#linearly independent
X_independent = X[:,inds]
assert np.linalg.matrix_rank(X_independent, tol=1e-3) == X_independent.shape[1]
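As a follow-up check (not part of the original answer), you can verify that every dropped column is numerically a linear combination of the kept columns, which is exactly what the docstring means by "redundant". This sketch reuses X, inds and X_independent from the demo above:
import numpy as np

dropped = [j for j in range(X.shape[1]) if j not in inds]
for j in dropped:
    # least-squares fit of the dropped column on the independent columns
    coef, *_ = np.linalg.lstsq(X_independent, X[:, j], rcond=None)
    print(j, np.allclose(X_independent @ coef, X[:, j], atol=1e-6))  # expect True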

Related

Scaling row-wise with MinMaxScaler from Sklearn

By default, scalers from Sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np

# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])

# %% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T  # scaling line-wise
x_test = scaler.transform(x_test)  # <-------- Error here
But I am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand what's wrong here. Why does it say it is expecting 2 features, when all my arrays (x, x_train and x_test) have 3 features? How can I fix this?
MinMaxScaler (like StandardScaler) is stateful: when you fit it, it calculates and saves per-column statistics (each column's minimum and maximum); when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: each row has saved statistics, and then your test set doesn't have the same rows, so transform cannot work correctly (throwing an error if there is a different number of rows, or silently mis-scaling if the number of rows happens to match).
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of every other row. So you could just do this kind of transformation before splitting, or use fit_transform on the test set('s transpose).
For l2 normalization of rows, there's a builtin for this: Normalizer (docs). I don't think there's an analogue for min-max normalization, but I think you could write a FunctionTransformer to do it.
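Here is a minimal sketch of such a stateless row-wise min-max scaler built on FunctionTransformer (the helper name row_minmax is my own, not a scikit-learn API):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def row_minmax(X):
    """Scale each row to [0, 1] independently (stateless, so fit learns nothing)."""
    X = np.asarray(X, dtype=float)
    row_min = X.min(axis=1, keepdims=True)
    row_range = X.max(axis=1, keepdims=True) - row_min
    row_range[row_range == 0] = 1.0  # avoid division by zero for constant rows
    return (X - row_min) / row_range

row_scaler = FunctionTransformer(row_minmax)  # no state is learned, so fit is a no-op

x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
print(row_scaler.fit_transform(x))  # each row now spans [0, 1]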
This is possible to do. I can think of a scenario where this would be useful. Normally, MinMaxScaler would scale each x, y, and z with respect to other observations of that feature. That's the "series" scaling. Now imagine that instead, you wanted to map each point constrained by x+y+z = 1. I think this is what OP is asking for. I have done this in the past, I will describe how I did it.
You need to treat your individual observations as a column multi-index and treat it like a higher-dimensional feature. Then you need to build a pipeline within which the observations are transformed from column-wise to row-wise, after which you do the min/max scaling. This gets you to x+y+z=1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline, you'll need to use something like a DataframeFunctionTransformer, which I have seen on the interwebs and reproduce below. This way you can use pandas functions to reshape the data and merge back in with the indices.
class DataframeFunctionTransformer():
    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        return self
Regarding the statefulness of MinMaxScaler, I think in a scenario such as this, the state of MinMaxScaler doesn't get used, it is purely acting as a transformer that maps these points to a different space meeting the constraint that x, y, and z are scaled such that they add up to 1.
@Murilo, hope this gets you started with a solution. Must be an interesting problem.
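For what it's worth, here is a rough usage sketch of that transpose-then-scale idea inside a Pipeline (my own illustration, not from the original answer; the column names x, y, z are made up):
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"x": [-1.0, -0.5, 3.0], "y": [4.0, 8.0, 2.0], "z": [2.0, 9.0, 3.0]})

pipeline = Pipeline([
    # transpose so that each original row becomes a column ...
    ("transpose", DataframeFunctionTransformer(lambda d: d.T)),
    # ... and the column-wise scaler therefore scales per original row
    ("scale", MinMaxScaler()),
])
scaled = pipeline.fit_transform(df)
# transpose back and restore the original index/columns
scaled = pd.DataFrame(scaled, index=df.columns, columns=df.index).T
print(scaled)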

Sklearn TruncatedSVD not showing explained variance ratio in descending order, or does the first number mean something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ratio_)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
Is there a bug in the TruncatedSVD implementation? Or why is the first explained variance (0.02...) behaving like this? Or does that first number mean something else?
Summary:
That is because TruncatedSVD and PCA use different SVD functions!
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The SVD solver used by each algorithm is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
On the other hand, TruncatedSVD (with algorithm='arpack') uses scipy.sparse.linalg.svds, which relies on the ARPACK solver for the decomposition.
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you chose randomized as a solver (which is set by default) in both algorithms, yet you obtained different results with regards to the order of the variance.
That is because in PCA the variance is obtained from the actual singular values (called Sigma or S in the Scikit-Learn implementation), which are already sorted.
On the other hand, the variance in TruncatedSVD is obtained from X_transformed, which results from multiplying the data matrix by the components. That does not necessarily preserve order, because the data are not centered, nor is preserving that order the purpose of TruncatedSVD, which is used in the first place for sparse matrices.
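For reference, here is a rough sketch of that computation (paraphrasing the scikit-learn source rather than quoting it), reusing the svd and X objects fitted in the question:
import numpy as np

X_transformed = svd.transform(X)                      # approximately U * Sigma
explained_variance = np.var(X_transformed, axis=0)    # per-component variance
ratio = explained_variance / np.var(X, axis=0).sum()  # not guaranteed to be sorted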
Now if you center your data properly, you will get the ratios sorted. (Note that X = X - X.mean() does not actually center the features: X.mean() is the overall scalar mean, not the per-column means. The StandardScaler below both centers and scales each feature.)
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
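As an additional hedged check (not in the original answer), mean-centering alone, without scaling, should also restore the sorted order, since PCA is essentially an SVD of the column-centered data:
import numpy as np
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD

X = datasets.load_digits().data
X_centered = X - X.mean(axis=0)   # per-feature centering, unlike X - X.mean()

svd = TruncatedSVD(n_components=5, random_state=2021)
svd.fit(X_centered)
print(svd.explained_variance_ratio_)  # sorted, and should closely match PCA's ratios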
Important: further reading is recommended.

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point in this curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working on Pyspark.
Here is a post describing how to do it with R... but, again, I need to do it with Pyspark
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most).
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java.
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

preds = predictions.select('label', 'probability') \
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# Returns a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x_val, y_val)
This results in an ROC curve plot (image not shown). Here's an example of an F1 score curve by threshold value if you aren't married to ROC (image likewise not shown); a sketch of how to produce it with the same wrapper follows below.
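The same wrapper can pull other threshold-based curves from the underlying Scala BinaryClassificationMetrics. A hedged sketch, reusing preds and the CurveMetrics class defined above and assuming the Scala fMeasureByThreshold method is reachable through py4j just like roc:
# Each returned point is (threshold, F1 score)
f1_points = CurveMetrics(preds).get_curve('fMeasureByThreshold')

plt.figure()
plt.title('F1 score by threshold')
plt.xlabel('Threshold')
plt.ylabel('F1 score')
plt.plot([p[0] for p in f1_points], [p[1] for p in f1_points])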
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels [1]:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
[1] I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However, in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1 - p for the probabilities (since the class probabilities sum to 1).
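Once you have fpr, tpr and thresholds, one common way to pick a single cutoff (a hedged suggestion, not part of the original answer) is Youden's J statistic, i.e. the threshold that maximizes TPR - FPR:
import numpy as np

# roc_curve returns numpy arrays, so TPR - FPR can be computed element-wise
j_scores = tpr - fpr
best = int(np.argmax(j_scores))
print("optimal cutoff:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])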

The dimension of dual_coef_ in sklearn.SVC

In SVC() for multi-class classification, one-vs-one classifiers are trained. So there are supposed to be n_class * (n_class - 1) / 2 classifiers in total. But why does clf.dual_coef_ return only (n_class - 1) * n_SV coefficients? What does each row represent then?
The dual coefficients of a sklearn.svm.SVC in the multiclass setting are tricky to interpret. There is an explanation in the scikit-learn documentation. The sklearn.svm.SVC uses libsvm for the calculations and adopts the same data structure for the dual coefficients. Another explanation of the organization of these coefficients is in the FAQ. In the case of the coefficients you find in the fitted SVC classifier, interpretation goes as follows:
The support vectors identified by the SVC each belong to a certain class. In the dual coefficients, they are ordered according to the class they belong to.
Given a fitted SVC estimator, e.g.
from sklearn.svm import SVC
svc = SVC()
svc.fit(X, y)
you will find
svc.classes_ # represents the unique classes
svc.n_support_ # represents the number of support vectors per class
The support vectors are organized according to these two variables. Since each support vector is clearly identified with one class, it follows that it can be involved in at most n_classes - 1 one-vs-one problems, namely every comparison with all the other classes. But it is entirely possible that a given support vector will not be involved in all one-vs-one problems.
Taking a look at
import numpy as np

support_indices = np.cumsum(svc.n_support_)

# dual_coef_ has shape (n_classes - 1, n_SV): one column per support vector.
# Transposing gives one row per support vector and n_classes - 1 columns.
svc.dual_coef_.T[0:support_indices[0]]
# weights on support vectors of class 0
# for problems 0v1, 0v2, ..., 0v(n-1)
# so n-1 columns for each of the
# svc.n_support_[0] support vectors

svc.dual_coef_.T[support_indices[0]:support_indices[1]]
# weights on support vectors of class 1
# for problems 0v1, 1v2, ..., 1v(n-1)
# so n-1 columns for each of the
# svc.n_support_[1] support vectors

...

svc.dual_coef_.T[support_indices[n_classes - 2]:support_indices[n_classes - 1]]
# weights on support vectors of class n-1
# for problems 0v(n-1), 1v(n-1), ..., (n-2)v(n-1)
# so n-1 columns for each of the
# svc.n_support_[-1] support vectors
gives you the weights of the support vectors for the classes 0, 1, ..., n-1 in their respective one-vs-one problems. Comparisons to all other classes except its own are made, resulting in n_classes - 1 columns. The order in which this happens follows the order of the unique classes exposed above. There are as many rows in each group as there are support vectors.
Possibly what you are looking for are the primal weights, which live in feature space, so that you can inspect their "importance" for classification. This is only possible with a linear kernel. Try this
from sklearn.svm import SVC
svc = SVC(kernel="linear")
svc.fit(X, y) # X is your data, y your labels
Then take a look at
svc.coef_
This is an array of shape ((n_class * (n_class -1) / 2), n_features) and represents the aforementioned weights.
According to the doc the weights are ordered as:
class 0 vs class 1
class 0 vs class 2
...
class 0 vs class n-1
class 1 vs class 2
class 1 vs class 3
...
...
class n-2 vs class n-1
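To make these shapes concrete, here is a small sketch (using the iris dataset purely as a stand-in; it is not part of the original question):
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svc = SVC(kernel="linear").fit(X, y)

print(svc.classes_)          # [0 1 2]  -> n_classes = 3
print(svc.n_support_)        # number of support vectors per class
print(svc.dual_coef_.shape)  # (n_classes - 1, n_SV) = (2, sum(n_support_))
print(svc.coef_.shape)       # (n_classes * (n_classes - 1) // 2, n_features) = (3, 4)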

How to find key trees/features from a trained random forest?

I am using Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method which seems relevant in the documentation (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example of how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Let's set up a training dataset. We'll make 100 entries, each with 19 features, and
# each row classified as either 0 or 1. We'll control the first 3 features, artificially
# setting them to fixed values in rows classified as "1", so that we know these are the
# "important" features. If we do it right, the model should point out these three as
# important. The rest of the features will just be noise.
train_data = []  # must be all floats
for _ in range(100):
    line = []
    if random.random() > 0.5:
        line.append(1.0)
        # Add the 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for _ in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        # this is a "0" row, so fill it entirely with noise
        line.append(0.0)
        for _ in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)

# Create the random forest object which will include all the parameters for the fit.
# (Feature importances are always computed in current scikit-learn; the old
# compute_importances=True flag no longer exists.)
Forest = RandomForestClassifier(n_estimators=100)

# Fit the training data to the training output and create the decision trees.
# The first column in our data is the classification, and the rest are the features.
Forest = Forest.fit(train_data[0::, 1::], train_data[0::, 0])

# Now you can see the importance of each feature in Forest.feature_importances_.
# These values all add up to one. Let's call the "important" ones those above average.
important_features = []
for x, i in enumerate(Forest.feature_importances_):
    if i > np.average(Forest.feature_importances_):
        important_features.append(str(x))
print('Most important features:', ', '.join(important_features))
# We see that the model correctly detects that the first three features are the most
# important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now to extract a "key tree" one would first require you to define what it is and what you are expecting to do with it.
You could rank the individual trees by computing their score on a held-out test set, but I don't know what you would expect to get out of that.
Do you want to prune the forest to make it faster to predict by reducing the number of trees without decreasing the aggregate forest accuracy?
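As a rough illustration of both points, here is a hedged sketch that ranks the individual trees by held-out accuracy and visualizes the best one; Forest is the fitted forest from the example above, and X_test / y_test stand for a held-out split you would have to prepare yourself:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Score each tree on the held-out set (this works here because the labels are already
# 0/1, matching the encoded classes the individual trees were trained on).
scores = [(i, tree.score(X_test, y_test)) for i, tree in enumerate(Forest.estimators_)]
scores.sort(key=lambda pair: pair[1], reverse=True)
print("Best single tree:", scores[0])

# Visualize the top few levels of the best-scoring tree
plt.figure(figsize=(12, 6))
plot_tree(Forest.estimators_[scores[0][0]], max_depth=2, filled=True)
plt.show()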
Here is how I visualize the tree:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
import numpy as np
import matplotlib.pyplot as plt

# get importances from the fitted random forest
importances = classifier.feature_importances_
# argsort gives the feature indices in ascending order of importance,
# so the most important feature ends up at the top of the horizontal bar chart
indices = np.argsort(importances)
# get the feature names from the original data set
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
This yields a horizontal bar chart of the relative feature importances (plot not shown).
