Selecting Samples in Scikit-Learn - scikit-learn

Is there any way of automatically selecting the 'training samples' from the collection of features for better fit of the model (DT or SVM)? I know about selecting the 'features'. But I am talking about selecting the 'samples' after selecting the features.

There are a couple different ways to split your set into training, testing, and cross validation sets. Check out sklearn.cross_validation.train_test_split. But also take a look at the plethora of advanced splitting methods that are also available in SK-Learn.
Here's an example with test_train_split:
In:
import numpy as np
from sklearn.cross_validation import train_test_split
a, b = np.arange(10).reshape((5, 2)), range(5)
a
Out:
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
In:
list(b)
Out:
[0, 1, 2, 3, 4]
In:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)
a_train
Out:
array([[4, 5],
[0, 1],
[6, 7]])
In:
b_train
Out:
[2, 0, 3]
In:
a_test
Out:
array([[2, 3],
[8, 9]])
In:
b_test
Out:
[1, 4]

There are generally two ways to do feature selections: Univariate Feature Selection and L1-based Sparse Feature Selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
# simulate some artificial data: 2000 obs, features: 1000-dim
# but only 2 out 1000 features are informative, the rest 998 features are noises
X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)
X.shape
Out[153]: (2000, 1000)
# Univariate Feature Selection: select 20 best from 1000 features
# ==========================================================================
# classification F-test
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_selected.shape
# or to visualize each f-score/p-value of 1000 features
X_f_scores, X_f_pval = f_classif(X, y)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(X_f_scores)
ax.set_title('Univariate Feature Selection: Classification F-Score')
ax.set_xlabel('features')
ax.set_ylabel('F-score')
# which features are most important: top 10
np.argsort(X_f_scores)[-10:] # argsort is from smallest to largest
Out[154]: array([940, 163, 574, 969, 994, 977, 360, 291, 838, 524])
# L1-based Sparse Feature Selection: any algo implementation penalty 'l1'
# ==========================================================================
# use LinearSVC for example here
# other popular choices: logistic regression, Lasso (for regression)
feature_selector = LinearSVC(C=0.01, penalty='l1', dual=False)
feature_selector.fit(X, y)
# get features with non-zero coefficients: exactly 2
(feature_selector.coef_ != 0.0).sum()
Out[155]: 2
X_selected_l1 = feature_selector.transform(X)
# or X[:, feature_selector.coef_ != 0.0]

Related

Get features for maximum value of Scikit-learn estimator

I have the following very simple code trying to model a simple dataset:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
data = {'Feature_A': [1, 2, 3, 4], 'Feature_B': [7, 8, 9, 10], 'Feature_C': [2, 3, 4, 5], 'Label': [7, 7, 8, 9]}
data = pd.DataFrame(data)
data_labels = data['Label']
data = data.drop(columns=['Label'])
pipeline = Pipeline([('imputer', SimpleImputer()),
('std_scaler', StandardScaler())])
data_prepared = pipeline.fit_transform(data)
lin_reg = LinearRegression()
lin_grid = {"n_jobs": [20, 50]}
error = "max_error"
grid_search = GridSearchCV(lin_reg, param_grid=lin_grid, verbose=3, cv=2, refit=True, scoring=error, return_train_score=True)
grid_search.fit(data_prepared, data_labels)
print(grid_search.best_estimator_.coef_)
print(grid_search.best_estimator_.intercept_)
print(list(data_labels))
print(list(grid_search.best_estimator_.predict(data_prepared)))
That gives me the following results:
[0.2608746 0.2608746 0.2608746]
7.75
[7, 7, 8, 9]
[6.7, 7.4, 8.1, 8.799999999999999]
From there, is there a way of computing the values of the features that would give me the maximum label, within the boundaries of the dataset?
If I understand your question correctly, this should work:
import numpy as np
id_max = np.argmax(grid_search.predict(data)) # find id of the maximum predicted label
print(data.loc[id_max])

LDA covariance matrix not match calculated covariance matrix

I'm looking to better understand the covariance_ attribute returned by scikit-learn's LDA object.
I'm sure I'm missing something, but I expect it to be the covariance matrix associated with the input data. However, when I compare .covariance_ against the covariance matrix returned by numpy.cov(), I get different results.
Can anyone help me understand what I am missing? Thanks and happy to provide any additional information.
Please find a simple example illustrating the discrepancy below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])
# Covariance matrix via np.cov
print(np.cov(X.T))
# Covariance matrix via LDA
clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(clf.covariance_)
In sklearn.discrimnant_analysis.LinearDiscriminantAnalysis, the covariance is computed as follow:
In [1]: import numpy as np
...: cov = np.zeros(shape=(X.shape[1], X.shape[1]))
...: for c in np.unique(y):
...: Xg = X[y == c, :]
...: cov += np.count_nonzero(y==c) / len(y) * np.cov(Xg.T, bias=1)
...: print(cov)
array([[0.66666667, 0.33333333],
[0.33333333, 0.22222222]])
So it corresponds to the sum of the covariance of each individual class multiplied by a prior which is the class frequency. Note that this prior is a parameter of LDA.

sklearn train_test_split returns some elements in both test/train

I have a data-set X with 260 unique observations.
when running x_train,x_test,_,_=test_train_split(X,y,test_size=0.2) I would assume that
[p for p in x_test if p in x_train] would be empty, but it is not. Actually it turns out that only two observations in x_test is not in x_train.
Is that intended or...?
EDIT (posted the data I am using):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test if p in x_train]) #is not 0
EDIT 2.0: Showing that the test works
a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])
len([p for p in a if p in b]) #1
This is not a bug with the implementation of train_test_split in sklearn, but a weird peculiarity of how the in operator works on numpy arrays. The in operator first does an elementwise comparison between two arrays, and returns True if ANY of the elements match.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5]])
a in b # True
The correct way to test for this kind of overlap is using the equality operator and np.all and np.any. As a bonus, you also get the indices that overlap for free.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
a in b # True
z = np.any(np.all(a == b[:, None, :], -1)) # False
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
a in b # True
overlap = np.all(a == b[:, None, :], -1)
z = np.any(overlap) # True
indices = np.nonzero(overlap) # (1, 0)
You need to check using the following:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test.tolist() if p in x_train.tolist()])
0
Using x_test.tolist() the in operator will work as intended.
Reference: testing whether a Numpy array contains a given row

Train/fit a Linear Regression in sklearn with only one feature/variable

So I am understanding lasso regression and I don't understand why it needs two input values to predict another value when it's just a 2 dimensional regression.
It says in the documentation that
clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
which I don't understand. Why is it [0,0] or [1,1] and not just [0] or [1]?
[[0,0], [1, 1], [2, 2]]
means that you have 3 samples/observations and each is characterised by 2 features/variables (2 dimensional).
Indeed, you could have these 3 samples with only 1 features/variables and still be able to fit a model.
Example using 1 feature.
from sklearn import datasets
from sklearn import linear_model
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :1] # we only take the feature
y = iris.target
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X,y)
print(clf.coef_)
print(clf.intercept_)

array-like shape (n_samples,) vs [n_samples] in sklearn documents

For the sample_weight, the requirement of its shape is array-like shape (n_samples,), sometimes is array-like shape [n_samples]. Does (n_samples,) means 1d array? and [n_samples] means list? Or they're equivalent to each other?
Both forms can be seen here: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
You can use a simple example to test this:
import numpy as np
from sklearn.naive_bayes import GaussianNB
#create some data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
#create the model and fit it
clf = GaussianNB()
clf.fit(X, Y)
#check the type of some attributes
type(clf.class_prior_)
type(clf.class_count_)
#check the shapes of these attributes
clf.class_prior_.shape
clf.class_count_
Or more advanced searching:
#verify that it is a numpy nd array and NOT a list
isinstance(clf.class_prior_, np.ndarray)
isinstance(clf.class_prior_, list)
Similarly, you can check all the attributes.
Results
numpy.ndarray
numpy.ndarray
(2,)
array([ 3., 3.])
True
False
The results indicate that these atributes are numpy nd arrays.

Resources