For sample_weight, the documented shape is sometimes "array-like, shape (n_samples,)" and sometimes "array-like, shape [n_samples]". Does (n_samples,) mean a 1-D array and [n_samples] mean a list? Or are the two equivalent?
Both forms can be seen here:
You can use a simple example to test this:
import numpy as np
from sklearn.naive_bayes import GaussianNB
#create some data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
#create the model and fit it
clf = GaussianNB().fit(X, Y)
#check the type of some attributes
#check the shapes of these attributes
Or more advanced searching:
#verify that it is a numpy nd array and NOT a list
isinstance(clf.class_prior_, np.ndarray)
isinstance(clf.class_prior_, list)
Similarly, you can check all the attributes.
array([ 3., 3.])
The results indicate that these attributes are NumPy ndarrays.
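To connect this back to sample_weight itself, here is a minimal sketch. It assumes that both notations describe the same thing, a 1-D array-like of length n_samples, and checks that a plain list and a 1-D ndarray are treated alike. The weight values are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Illustrative weights (not from the original post): one weight per sample.
w_list = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]   # plain Python list of length n_samples
w_array = np.asarray(w_list)              # 1-D ndarray, shape (n_samples,)

clf_list = GaussianNB().fit(X, Y, sample_weight=w_list)
clf_array = GaussianNB().fit(X, Y, sample_weight=w_array)

print(w_array.shape)  # a 1-tuple: the array is one-dimensional
print(np.allclose(clf_list.class_prior_, clf_array.class_prior_))
print(type(clf_array.class_prior_), clf_array.class_prior_.shape)
```

If the two fits agree, the list and the 1-D array were interpreted the same way, which is what "array-like" suggests.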
So I mean something where you have a categorical feature $X$ (suppose you have turned it into ints already) and say you want to embed that in some dimension using the features $A$ where $A$ is arity x n_embed.
What is the usual way to do this? Is using a for loop and vmap correct? I do not want something like jax.nn, something more efficient like
For example consider high arity and low embedding dim.
Is it jnp.take as in the flax.linen implementation here?
Indeed the typical way to do this in pure jax is with jnp.take. Given array A of embeddings of shape (num_embeddings, num_features) and categorical feature x of integers shaped (n,) then the following gives you the embedding lookup.
jnp.take(A, x, axis=0) # shape: (n, num_features)
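As a quick sanity check, the sketch below compares the jnp.take lookup with the naive one-hot matrix product it replaces. The table values and indices are made up for illustration; only jnp.take and jax.nn.one_hot are used.

```python
import jax
import jax.numpy as jnp

# Illustrative embedding table: 4 categories, 3 features each.
A = jnp.arange(12, dtype=jnp.float32).reshape(4, 3)
# Categorical feature already encoded as integers, shape (n,).
x = jnp.array([3, 1, 1, 0])

# Gather rows of A directly.
emb_take = jnp.take(A, x, axis=0)

# Same result via a dense one-hot matmul, which materializes an
# (n, num_embeddings) matrix and is wasteful when the arity is high.
emb_onehot = jax.nn.one_hot(x, A.shape[0]) @ A

print(emb_take.shape)
print(jnp.allclose(emb_take, emb_onehot))
```

The gather avoids building the one-hot matrix, which is the main reason it is preferred for high arity and low embedding dimension.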
If using Flax then the recommended way would be to use the flax.linen.Embed module and would achieve the same effect:
import flax.linen as nn
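class Model(nn.Module):
    @nn.compact  # needed because the submodule is created inline in __call__
    def __call__(self, x):
        emb = nn.Embed(num_embeddings, num_features)(x)  # shape: (n, num_features)
        return emb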
Suppose that A is the embedding table and x is an array of indices of any shape. Besides jnp.take, there are three more options:
A[x], which is like jnp.take(A, x, axis=0) but simpler.
vmap-ed A[x], which parallelizes along axis 0 of x.
nested vmap-ed A[x], which parallelizes along all axes of x.
Here is the source code for your reference.
import jax
import jax.numpy as jnp
embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)
print("\ntake\n", jnp.take(embs, x, axis=0))
print("\nuse []\n", embs[x])
print(
    "\nvmap\n",
    jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0)(embs, x),
)
print(
    "\nnested vmap\n",
    jax.vmap(
        jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0),
        in_axes=[None, 0],
    )(embs, x),
)
BTW, I learned the nested-vmap trick from the IREE GPT2 model code by James Bradbury.
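A small check, reusing the same embs and x as above, that the four variants return the same array. The helper name lookup is only for illustration.

```python
import jax
import jax.numpy as jnp

embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)

def lookup(table, idx):
    # Plain integer-array indexing into the embedding table.
    return table[idx]

by_take = jnp.take(embs, x, axis=0)
by_index = embs[x]
# Map over axis 0 of x; the table is shared (in_axes None).
by_vmap = jax.vmap(lookup, in_axes=(None, 0))(embs, x)
# Map over both axes of x, so the innermost call sees a scalar index.
by_nested = jax.vmap(jax.vmap(lookup, in_axes=(None, 0)), in_axes=(None, 0))(embs, x)

for name, out in [("take", by_take), ("index", by_index),
                  ("vmap", by_vmap), ("nested vmap", by_nested)]:
    print(name, out.shape, bool(jnp.array_equal(out, by_take)))
```

Every output has shape x.shape + (num_features,), so the choice between them is mostly a matter of style.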
I'm looking to better understand the covariance_ attribute returned by scikit-learn's LDA object.
I'm sure I'm missing something, but I expect it to be the covariance matrix associated with the input data. However, when I compare .covariance_ against the covariance matrix returned by numpy.cov(), I get different results.
Can anyone help me understand what I am missing? Thanks and happy to provide any additional information.
Please find a simple example illustrating the discrepancy below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])
# Covariance matrix via np.cov
# Covariance matrix via LDA
clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
In sklearn.discriminant_analysis.LinearDiscriminantAnalysis, the covariance is computed as follows:
In [1]: import numpy as np
   ...: cov = np.zeros(shape=(X.shape[1], X.shape[1]))
   ...: for c in np.unique(y):
   ...:     Xg = X[y == c, :]
   ...:     cov += np.count_nonzero(y == c) / len(y) * np.cov(Xg.T, bias=1)
   ...: print(cov)
array([[0.66666667, 0.33333333],
       [0.33333333, 0.22222222]])
So it corresponds to the sum of the covariance of each individual class multiplied by a prior which is the class frequency. Note that this prior is a parameter of LDA.
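To make the discrepancy concrete, here is a short sketch on the question's data. It compares clf.covariance_ with the prior-weighted within-class covariance and with the total covariance from np.cov. It assumes the default solver and default (empirical) priors.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])

# Total covariance of the pooled data (what the question expected).
total_cov = np.cov(X, rowvar=False)

# Within-class covariance: biased per-class covariances weighted by class frequency.
within_cov = sum(
    np.mean(y == c) * np.cov(X[y == c], rowvar=False, bias=True)
    for c in np.unique(y)
)

clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

print(np.allclose(clf.covariance_, within_cov))  # matches the within-class estimate
print(np.allclose(clf.covariance_, total_cov))   # does not match np.cov of all data
```

The total covariance also contains the spread between the class means, which is why it is much larger here.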
If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix
My understanding is by doing above, I've now saved terms into the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in corpus again.
So I wondered why there is not a method like .document_term_matrix?
It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .fit, .transform, and .fit_transform return the matrix.
Other Info:
I'm using Anaconda and Jupyter Notebook.
You can simply assign the fit to a variable dtm, and, since it is a Scipy sparse matrix, use the toarray method to print it:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)
# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}
dtm.toarray()
# array([[0, 0, 0],
#        [0, 0, 1],
#        [1, 0, 0],
#        [0, 1, 0]], dtype=int64)
although I guess for any realistic document-term matrix this will be really impractical... You could use the nonzero method instead:
dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
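As for why there is no .document_term_matrix attribute: as far as I can tell, the fitted vectorizer keeps the vocabulary, not the matrix. A minimal sketch that illustrates this by calling transform on the already fitted object, without refitting:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')

# Fit once and keep the returned sparse matrix.
dtm = vectorizer.fit_transform(corpus)

# Later: rebuild the matrix from the stored vocabulary without refitting.
dtm_again = vectorizer.transform(corpus)

print(vectorizer.vocabulary_)
print((dtm.toarray() == dtm_again.toarray()).all())
```

So transform needs the documents again, but it reuses the learned vocabulary; keeping dtm in a variable avoids even that.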
Is there any way of automatically selecting the 'training samples' from the collection of features for better fit of the model (DT or SVM)? I know about selecting the 'features'. But I am talking about selecting the 'samples' after selecting the features.
There are a couple of different ways to split your set into training, testing, and cross-validation sets. Check out sklearn.model_selection.train_test_split (it lived in sklearn.cross_validation before scikit-learn 0.18, and that module was later removed). Also take a look at the other, more advanced splitting methods available in scikit-learn.
Here's an example with train_test_split:
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18
a, b = np.arange(10).reshape((5, 2)), range(5)
a
# array([[0, 1],
#        [2, 3],
#        [4, 5],
#        [6, 7],
#        [8, 9]])
list(b)
# [0, 1, 2, 3, 4]
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)
a_train
# array([[4, 5],
#        [0, 1],
#        [6, 7]])
b_train
# [2, 0, 3]
a_test
# array([[2, 3],
#        [8, 9]])
b_test
# [1, 4]
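One of the more advanced options is a stratified split, which keeps the class proportions the same in both parts. A minimal sketch with made-up, imbalanced labels (15 of class 0, 5 of class 1), using the stratify parameter of train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples, 2 features, imbalanced labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y asks for the same class ratio in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_train))  # class counts in the training part
print(np.bincount(y_test))   # class counts in the test part
```

This does not pick "better" samples, but it does stop a random split from under-representing a rare class.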
There are generally two ways to do feature selection: univariate feature selection and L1-based sparse feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
# simulate some artificial data: 2000 obs, features: 1000-dim
# but only 2 out of 1000 features are informative, the remaining 998 features are noise
X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)
X.shape
Out[153]: (2000, 1000)
# Univariate Feature Selection: select 20 best from 1000 features
# ==========================================================================
# classification F-test
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
# or to visualize each f-score/p-value of 1000 features
X_f_scores, X_f_pval = f_classif(X, y)
fig, ax = plt.subplots(figsize=(8,6))
ax.set_title('Univariate Feature Selection: Classification F-Score')
# which features are most important: top 10
np.argsort(X_f_scores)[-10:] # argsort is from smallest to largest
Out[154]: array([940, 163, 574, 969, 994, 977, 360, 291, 838, 524])
# L1-based Sparse Feature Selection: any algo implementation penalty 'l1'
# ==========================================================================
# use LinearSVC for example here
# other popular choices: logistic regression, Lasso (for regression)
feature_selector = LinearSVC(C=0.01, penalty='l1', dual=False).fit(X, y)
# get features with non-zero coefficients: exactly 2
(feature_selector.coef_ != 0.0).sum()
Out[155]: 2
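# estimators no longer expose .transform in recent scikit-learn, so index the columns;
# coef_ has shape (1, n_features), hence the ravel()
X_selected_l1 = X[:, feature_selector.coef_.ravel() != 0.0]
# or wrap the estimator in sklearn.feature_selection.SelectFromModel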
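For the L1-based route, a sketch using SelectFromModel, which wraps the estimator and exposes get_support and transform. The data here is smaller than above (500 x 50) to keep it quick, and the number of selected features is not asserted because it depends on C and on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Smaller illustrative problem than in the answer above.
X, y = make_classification(n_samples=500, n_features=50, n_informative=2, random_state=0)

# Fit the L1-penalized model inside the selector.
selector = SelectFromModel(LinearSVC(C=0.01, penalty='l1', dual=False)).fit(X, y)

mask = selector.get_support()       # boolean mask over the 50 features
X_selected = selector.transform(X)  # keeps only the selected columns

print(mask.sum(), X_selected.shape)
```

The same fitted selector can then be applied to new data with selector.transform.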
I have a training set of data. The Python script for creating the model also calculates the attributes into a numpy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have 0 variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict) but wrong shape x_predict.shape[1] which is still the original length. My classifier then throws an error due to wrong shape
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\", line 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 2, 0, 3],
[0, 1, 4, 3],
[0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector = selector.fit(X)  # get_support requires a fitted selector
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
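Applied to the original problem, the point is to index columns (X[:, idxs]), not rows (X[idxs]), and to reuse the selector fitted on the training data. A minimal sketch, where X_new stands in for the data to predict on and its values are made up:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_train = np.array([[0, 2, 0, 3],
                    [0, 1, 4, 3],
                    [0, 1, 1, 3]])
# Hypothetical data to predict on, with the same 4 original features.
X_new = np.array([[1, 5, 6, 7],
                  [8, 9, 10, 11]])

# Fit the selector on the training data only.
selector = VarianceThreshold().fit(X_train)
idxs = selector.get_support(indices=True)

# Option 1: select the columns by index (note the leading ":").
X_new_by_index = X_new[:, idxs]
# Option 2: let the fitted selector do it.
X_new_by_transform = selector.transform(X_new)

print(idxs)
print(X_new_by_index)
print(np.array_equal(X_new_by_index, X_new_by_transform))
```

Either result has as many columns as the classifier was trained on, which is what the ValueError was complaining about.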