I'm trying to calculate the 'accuracy' of one-hot encoded label tensors, such that for the following example I'd get 0.5.
tensor([[0,0,1], [1,0,0]]) == tensor([[0,0,1], [0,1,0]])
I want to know what proportion of the predictions are correctly labelled.
What's the most elegant way to achieve this with a PyTorch tensor?
I would suggest using torchmetrics for computing metrics out-of-the-box:
import torch
import torchmetrics
a = torch.tensor([[0, 0, 1], [1, 0, 0]])
b = torch.tensor([[0, 0, 1], [0, 1, 0]])
torchmetrics.functional.accuracy(a, b, subset_accuracy=True)
output:
tensor(0.5000)
If I understand correctly, you want all values in a row to match for that row to count as a correct prediction. In that case it should be something like this:
(tensor([[0,0,1], [1,0,0]]) == tensor([[0,0,1], [0,1,0]])).all(dim=1).float().mean()
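For reference, here is a self-contained sketch of the same idea (the tensors are the ones from the question; the argmax variant at the end is just an assumption about how you might compare class indices instead, in case your predictions are not exactly one-hot):
import torch

pred = torch.tensor([[0, 0, 1], [1, 0, 0]])
target = torch.tensor([[0, 0, 1], [0, 1, 0]])

# a row counts as correct only if every element matches
acc = (pred == target).all(dim=1).float().mean()
print(acc)  # tensor(0.5000)

# if each row encodes exactly one class, comparing argmax indices gives the same result
print((pred.argmax(dim=1) == target.argmax(dim=1)).float().mean())  # tensor(0.5000)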
I mean the case where you have a categorical feature $X$ (suppose you have already turned it into ints) and you want to embed it in some dimension using an embedding table $A$, where $A$ has shape arity x n_embed.
What is the usual way to do this? Is using a for loop and vmap correct? I don't want something like jax.nn; I want something more efficient, like
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
For example, consider high arity and a low embedding dimension.
Is it jnp.take as in the flax.linen implementation here? https://github.com/google/flax/blob/main/flax/linen/linear.py#L624
Indeed, the typical way to do this in pure JAX is with jnp.take. Given an array A of embeddings with shape (num_embeddings, num_features) and a categorical feature x of integers with shape (n,), the following gives you the embedding lookup:
jnp.take(A, x, axis=0) # shape: (n, num_features)
If you are using Flax, the recommended way is the flax.linen.Embed module, which achieves the same effect:
import flax.linen as nn

class Model(nn.Module):
    num_embeddings: int
    num_features: int

    @nn.compact
    def __call__(self, x):
        # embedding lookup; output shape: (*x.shape, num_features)
        return nn.Embed(self.num_embeddings, self.num_features)(x)
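As a rough usage sketch (hypothetical sizes; model.init and model.apply are the standard Flax calls):
import jax
import jax.numpy as jnp

model = Model(num_embeddings=4, num_features=3)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)
params = model.init(jax.random.PRNGKey(0), x)  # initializes the embedding table
emb = model.apply(params, x)                   # shape: (2, 2, 3)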
Suppose that A is the embedding table and x is an array of indices of any shape. There are a few equivalent options:
A[x], which is like jnp.take(A, x, axis=0) but simpler.
vmap-ed A[x], which parallelizes along axis 0 of x.
nested vmap-ed A[x], which parallelizes along all axes of x.
Here is the source code for your reference.
import jax
import jax.numpy as jnp
embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)
print("\ntake\n", jnp.take(embs, x, axis=0))
print("\nuse []\n", embs[x])
print(
"\nvmap\n",
jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0)(embs, x),
)
print(
"\nnested vmap\n",
jax.vmap(
jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0),
in_axes=[None, 0],
out_axes=0,
)(embs, x),
)
BTW, I learned the nested-vmap trick from the IREE GPT2 model code by James Bradbury.
If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix
My understanding is that by doing the above, I've now saved the terms into the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in the corpus again.
So I wondered why there isn't a method like .document_term_matrix?
It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .fit, .transform, and .fit_transform return the matrix.
Docs: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit
Other Info:
I'm using Anaconda and Jupyter Notebook.
You can simply assign the result of fit_transform to a variable, dtm, and, since it is a SciPy sparse matrix, use the toarray method to print it:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)
# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}
dtm.toarray()
# array([[0, 0, 0],
# [0, 0, 1],
# [1, 0, 0],
# [0, 1, 0]], dtype=int64)
although I guess for any realistic document-term matrix this will be really impractical... You could use the nonzero method instead:
dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
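If you want the matrix with readable column labels, one option is a pandas DataFrame (a sketch, assuming pandas is available; on older scikit-learn versions the method is get_feature_names instead of get_feature_names_out):
import pandas as pd

df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
#    brown  fox  quick
# 0      0    0      0
# 1      0    0      1
# 2      1    0      0
# 3      0    1      0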
I'm trying to test my Scikit-learn machine learning algorithm with a simple R^2 score, but for some reason it always returns zero.
import numpy
from sklearn.metrics import r2_score
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294]).reshape(1, -1)
training = numpy.array([0, 3, 1, 0]).reshape(1, -1)
r2 = r2_score(training, prediction, multioutput="raw_values")
print(r2)
# [ 0.  0.  0.  0.]
This is a single four-part value, not four separate values. How do I get proper R^2 scores?
If you are trying to calculate the R^2 value between two vectors, you should just pass two one-dimensional arrays. See the documentation.
In the example you provided, the first item is compared to the first item, but note that you only have one row in each of prediction and training, so it is calculating R^2 for 0.1567 against 0, which is 0, then for 4.7528 against 3, which is also 0, and so on. It sounds like you want the R^2 for the two vectors, like the following:
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294])
training = numpy.array([0, 3, 1, 0])
print(r2_score(training, prediction))
0.472439485
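As a sanity check, the same number can be computed by hand from R^2 = 1 - SS_res / SS_tot (using the two arrays defined just above):
ss_res = numpy.sum((training - prediction) ** 2)        # residual sum of squares
ss_tot = numpy.sum((training - training.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)  # ~0.4724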
If you have multi-dimensional arrays you can use the multioutput flag to determine what the output should look like:
#modified from the scikit-learn example
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(y_true, y_pred, multioutput='raw_values'))
array([ 0.96543779, 0.90816327])
Here the first value in the output compares the first item of each list in y_true against the first item of each list in y_pred, the second value compares the second items, and so on.
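For completeness, a small sketch of the default behaviour on the same data: multioutput='uniform_average' (the default) simply averages the per-output scores shown above:
from sklearn.metrics import r2_score

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

# (0.96543779 + 0.90816327) / 2
print(r2_score(y_true, y_pred, multioutput='uniform_average'))  # ~0.9368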
I just did an experiment. I provided only two training cases, [0, 1] and [1, 0], which belong to two different categories. The test case is [0, 0], which is on the decision boundary. The classifier assigns it to class 0. Is it because class 0 is the first class? Does that really make sense?
>>> X=numpy.array([[0,1],[1,0]])
>>> y=numpy.array([0,1])
>>> clf.fit_transform(X,y)
array([[0, 1],
[1, 0]])
>>> clf.predict(numpy.array([[0,0]]))
array([0])
>>> clf.decision_function(numpy.array([[0,0]]))
array([ 0.])
>>> clf.coef_
array([[ 0.66666667, -0.66666667]])
>>> clf.predict(numpy.array([[0,1]]))
array([0])
>>> clf.decision_function(numpy.array([[0,1]]))
array([-0.66666667])
>>> clf.intercept_
array([ 0.])
>>> clf.intercept_ > 0
array([False], dtype=bool)
Personally, I would take your experiment as the answer to the question.
Points sitting on the decision boundary are ambiguous. What should the behavior be? Should it predict one of the two classes? Error out? Predict NaN?
By your experiment, scikit-learn predicts 0. I would take that to mean that in the general case it chooses the first class (clf.classes_ is kept in sorted order) for boundary cases.
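As a rough sketch of that behaviour (assuming a binary linear classifier like the one in your session, where predict is effectively a threshold on the decision function):
import numpy as np

scores = clf.decision_function(np.array([[0, 0], [0, 1], [1, 0]]))
# scores > 0 picks clf.classes_[1]; scores <= 0 (including exactly 0) picks clf.classes_[0]
print(clf.classes_[(scores > 0).astype(int)])  # [0 0 1]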
If the boundary case matters for your application, you will have to write special code that checks the decision function for exact 0, and does something different. Like this:
import numpy as np
scores = clf.decision_function(X)
predictions = (scores > 0).astype(float)  # 0.0 or 1.0
predictions[scores == 0] = np.nan         # flag exact-boundary cases explicitly
How should I best use scikit-learn for the following supervised classification problem (simplified), with binary features:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier()
c.fit(train_data, train_targets)
p = c.predict(np.array([[1, 1, 1, 1]], dtype=bool))  # note the 2-D shape (1, n_features)
print(p)
# -> [1]
That works fine. However, suppose now that I know a priori that the presence of feature 0 excludes class 1. Can additional information of this kind be easily included in the classification process?
Currently, I'm just doing some (problem-specific and heuristic) postprocessing to adjust the resulting class. I could perhaps also manually preprocess and split the dataset into two according to the feature, and train two classifiers separately (but with K such features, this blows up into 2^K splits).
Can additional information of this kind be easily included in the classification process?
Domain-specific hacks are left to the user. The easiest way to do this is to predict probabilities...
>>> prob = c.predict_proba(X)
and then rig the probabilities to get the right class out.
>>> invalid = X[:, 0] == 1  # rows where feature 0 is present, so class 1 must be ruled out
>>> prob[invalid, 1] = -np.inf
>>> pred = c.classes_[np.argmax(prob, axis=1)]
That's -np.inf instead of 0 so the 1 label doesn't come up as a result of tie-breaking vs. other zero-probability classes.
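Putting it together with the toy data from the question, a minimal end-to-end sketch (X_test is a hypothetical test set; only the rows with feature 0 set are affected by the rule):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier().fit(train_data, train_targets)

X_test = np.array([[1, 1, 1, 1],
                   [0, 0, 1, 1]], dtype=bool)
prob = c.predict_proba(X_test)            # shape: (n_samples, n_classes)
col_1 = list(c.classes_).index(1)         # column of class 1 in predict_proba output
prob[X_test[:, 0] == 1, col_1] = -np.inf  # feature 0 present => class 1 ruled out
pred = c.classes_[np.argmax(prob, axis=1)]
print(pred)  # the first sample can no longer be predicted as class 1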