If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix
My understanding is that by doing the above, I've now saved the terms into the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in the corpus again.
So I wonder why there isn't a method like .document_term_matrix?
It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .fit, .transform, and .fit_transform return the matrix.
Docs: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit
Other Info:
I'm using Anaconda and Jupyter Notebook.
You can simply assign the result of fit_transform to a variable dtm and, since it is a SciPy sparse matrix, use the toarray method to print it:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)
# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}
dtm.toarray()
# array([[0, 0, 0],
# [0, 0, 1],
# [1, 0, 0],
# [0, 1, 0]], dtype=int64)
although I guess for any realistic document-term matrix this will be really impractical... You could use the nonzero method instead:
dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
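If you want something more readable than raw indices, here is a small sketch that pairs each nonzero entry with its term (it assumes the dtm and vectorizer fitted above, and a scikit-learn version that provides get_feature_names_out):
# Pair each nonzero (document, term) entry with the actual term string
terms = vectorizer.get_feature_names_out()
rows, cols = dtm.nonzero()
for r, c in zip(rows, cols):
    print(f"doc {r}: {terms[c]} -> {dtm[r, c]}")
# doc 1: quick -> 1
# doc 2: brown -> 1
# doc 3: fox -> 1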
I'm trying to calculate the 'accuracy' of one-hot label encoded tensors, such that for the following example, I'd get 0.5.
tensor([[0,0,1], [1,0,0]]) == tensor([[0,0,1], [0,1,0]])
I want to know what proportion of the predictions are correctly labelled.
What's the most elegant way to achieve this with a pytorch tensor?
I would suggest using torchmetrics for computing metrics out-of-the-box:
import torch
import torchmetrics
a = torch.tensor([[0, 0, 1], [1, 0, 0]])
b = torch.tensor([[0, 0, 1], [0, 1, 0]])
torchmetrics.functional.accuracy(a, b, subset_accuracy=True)
output:
tensor(0.5000)
If I understand correctly, you want all values in a row to match for that row to be counted as a correct prediction. Then it should be something like this:
(tensor([[0,0,1], [1,0,0]]) == tensor([[0,0,1], [0,1,0]])).all(dim=1).float().mean()
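For readability, here is the same computation spelled out step by step (a small sketch using the hypothetical names pred and target):
import torch

pred = torch.tensor([[0, 0, 1], [1, 0, 0]])
target = torch.tensor([[0, 0, 1], [0, 1, 0]])

row_correct = (pred == target).all(dim=1)  # one bool per row: tensor([True, False])
accuracy = row_correct.float().mean()      # proportion of rows that match completely
print(accuracy)                            # tensor(0.5000)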
I'm doing a binary classification on time-series data. Since it's for an academic project, I want to test classical ML models such as RandomForestClassifier as well.
However, while using TimeSeriesSplit k-fold cross-validation, it is possible that a training fold contains only one class instead of both, which raises a ValueError.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(class_weight={1:10, 0:1})
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 0])
This gives,
ValueError: Class label 1 not present.
I know it doesn't make sense to train with only one label, but then it works fine if we don't specify class_weight. Is this a bug?
How do I get around this programmatically if I'm automating my testing?
I think the issue was fixed on 11 Mar 2022.
I am able to run the code below without any problem today:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(class_weight={1:10, 0:1})
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 1])
rfc.predict([[1, 1, 1]])
Output: array([1])
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 0])
rfc.predict([[1, 1, 1]])
Output: array([0])
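To address the "how do I get around this programmatically" part of the question: regardless of the fix, a common defensive pattern when automating TimeSeriesSplit cross-validation is to skip (or flag) any fold whose training slice contains only one class. A minimal sketch with hypothetical toy data:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the real time series: early folds see only class 0
X = np.random.rand(100, 5)
y = np.r_[np.zeros(60, dtype=int), np.ones(40, dtype=int)]

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    y_train = y[train_idx]
    if np.unique(y_train).size < 2:
        print(f"fold {fold}: only one class in the training slice, skipping")
        continue
    rfc = RandomForestClassifier(class_weight={1: 10, 0: 1})
    rfc.fit(X[train_idx], y_train)
    print(f"fold {fold}: test accuracy {rfc.score(X[test_idx], y[test_idx]):.2f}")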
I'm puzzled about how the cosine metric works in sklearn's clustering algorithms.
For example, DBSCAN has a parameter eps that specifies the maximum distance used when clustering. However, a bigger cosine similarity means two vectors are closer, which is just the opposite of the usual notion of distance.
I found that there are cosine_similarity and cosine_distance (simply 1 - cos()) in pairwise_metric, and it seems that when we specify the metric as cosine we use cosine_similarity.
So, when clustering, how does DBSCAN compare the cosine similarity against the parameter eps to decide whether two vectors get the same label?
An example:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1)
result = clf.fit_predict(samples)
print(result)
It outputs [-1, -1, -1, -1], which I took to mean that these four points are in the same cluster.
However:
For the point pair [1, 1] and [2, 2], the cosine similarity is 4/4 = 1, so the cosine distance is 1 - 1 = 0 and they should end up in the same cluster.
For the point pair [1, 1] and [1, 0], the cosine similarity is 1/sqrt(2), so the cosine distance is 1 - 1/sqrt(2) ≈ 0.2929, which is bigger than our eps of 0.1. Why did DBSCAN cluster them into the same cluster?
Thanks to @Stanislas Morbieu's answer, I finally understand that the cosine metric means cosine_distance, which is 1 - cosine similarity.
The implementation of DBSCAN in scikit-learn relies on NearestNeighbors (see the implementation of DBSCAN).
Here is an example to see how it works with cosine metric:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
neigh = NearestNeighbors(radius=0.1, metric='cosine')
neigh.fit(samples)
rng = neigh.radius_neighbors([[1, 1]])
print([samples[i] for i in rng[1][0]])
It outputs [[1, 1], [2, 2]], i.e. the points which are closest to [1, 1] in a radius of 0.1.
So points which have a cosine distance smaller than eps in DBSCAN tend to be in the same cluster.
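To double-check the numbers from the question, the pairwise cosine distances can be computed directly with cosine_distances from sklearn.metrics.pairwise:
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

samples = np.array([[1, 0], [0, 1], [1, 1], [2, 2]])

# Cosine distance (1 - cosine similarity) from [1, 1] to every sample
d = cosine_distances([[1, 1]], samples)[0]
print(np.round(d, 4))
# approximately [0.2929 0.2929 0. 0.] -> only [1, 1] and [2, 2] lie within eps=0.1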
The parameter min_samples of DBSCAN also plays an important role here. Since it is set to 5 by default, no point in this example can be considered a core point.
Setting it to 1, the example code:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1, min_samples=1)
result = clf.fit_predict(samples)
print(result)
outputs [0 1 2 2] which means that [1, 1] and [2, 2] are in the same cluster (numbered 2).
By the way, the output [-1, -1, -1, -1] doesn't mean that the points are in the same cluster; it means that all points were labeled as noise, i.e. they belong to no cluster.
I have a training set of data. The Python script for creating the model also computes the attributes into a NumPy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have 0 variance (e.g. all 0 or 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error due to the wrong shape:
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", li
ne 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
How should I best use scikit-learn for the following supervised classification problem (simplified), with binary features:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier()
c.fit(train_data, train_targets)
p = c.predict(np.array([[1, 1, 1, 1]], dtype=bool))  # predict expects a 2-D array
print(p)
# -> [1]
That works fine. However, suppose now that I know a priori that the presence of feature 0 excludes class 1. Can additional information of this kind be easily included in the classification process?
Currently, I'm just doing some (problem-specific and heuristic) postprocessing to adjust the resulting class. I could perhaps also manually preprocess and split the dataset in two according to that feature and train two classifiers separately (but with K such features, this ends up in 2^K splits).
Can additional information of this kind be easily included in the classification process?
Domain-specific hacks are left to the user. The easiest way to do this is to predict probabilities...
>>> prob = c.predict_proba(X)
and then rig the probabilities to get the right class out.
>>> invalid = (prob[:, 1] == 1) & (X[:, 0] == 1)
>>> prob[invalid, 1] = -np.inf
>>> pred = c.classes_[np.argmax(prob, axis=1)]
That's -np.inf instead of 0 so the 1 label doesn't come up as a result of tie-breaking vs. other zero-probability classes.
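Putting the question's classifier and the rigging together, here is a self-contained sketch (note: it applies the rule to every row where feature 0 is set, a slightly broader condition than the prob == 1 check above):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Training data from the question
train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier().fit(train_data, train_targets)

X = np.array([[1, 1, 1, 1]], dtype=bool)
prob = c.predict_proba(X)

# Rule: the presence of feature 0 excludes class 1
col_1 = list(c.classes_).index(1)       # column of class 1 in predict_proba
prob[X[:, 0] == 1, col_1] = -np.inf
pred = c.classes_[np.argmax(prob, axis=1)]
print(pred)  # never class 1 when feature 0 is present (the exact class depends on the fitted tree)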