sklearn dcg_score not working as expected - scikit-learn

This is my code:
import numpy as np
from sklearn.metrics import dcg_score
true_relevance = np.asarray([[10]])
scores = np.asarray([[.1]])
dcg_score(true_relevance, scores)
The code above should produce 10 as the dcg_score: the formula from Wikipedia gives 10/log2(2) = 10. Instead, I get ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead
Has anyone encountered this?

Since computing DCG on a single element is not meaningful, sklearn requires at least two elements per row in the y_true and y_score arrays.
You can check this by exploring the sklearn code (or through debugging): https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/utils/multiclass.py#L158
For example:
true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])
dcg_score(true_relevance, scores)
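For reference, the Wikipedia formula mentioned in the question can be checked by hand. The sketch below is only an illustration, not sklearn's implementation: manual_dcg is a hypothetical helper that ranks items by predicted score and sums relevance / log2(rank + 1), and it assumes there are no tied scores.

```python
import numpy as np
from sklearn.metrics import dcg_score


def manual_dcg(relevance, scores):
    """DCG by the Wikipedia formula: sum of rel_i / log2(i + 1), ranks i starting at 1."""
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(scores)[::-1]  # highest predicted score first
    ranked = relevance[order]
    discounts = np.log2(np.arange(2, ranked.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(ranked / discounts))


# Single element: the value the question expected from the formula.
single = manual_dcg([10], [0.1])

# Two elements: same input as the working dcg_score call above.
by_hand = manual_dcg([10, 5], [0.1, 0.2])
by_sklearn = dcg_score(np.asarray([[10, 5]]), np.asarray([[0.1, 0.2]]))
print(single, by_hand, by_sklearn)
```

For the two-element input, the hand computation and dcg_score should agree, which suggests the error in the question comes from the input shape check rather than from the formula.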

Related

only one element tensors can be converted to Python scalars error with pca.fit_transform

I am trying to perform dimensionality reduction using PCA, where outputs is a list of tensors where each tensor has a shape of (1, 3, 32,32). Here is the code:
from sklearn.decomposition import PCA
pca = PCA(10)
pca_result = pca.fit_transform(outputs)
But I keep getting this error, no matter what I try:
ValueError: only one element tensors can be converted to Python scalars
I know that the tensors with size (1, 3, 32, 32) are causing the issue, since it is looking for one-element tensors, as the error puts it, but I do not know how to solve it.
I have tried flattening each tensor by looping over outputs (I don't know if it's the right way to solve this issue) with the following code, but it still leads to an error in PCA:
new_outputs = []
for i in outputs:
    for j in i:
        j = j.cpu()
        j = j.detach().numpy()
        j = j.flatten()
        new_outputs.append(j)
pca_result = pca.fit_transform(new_outputs)
I would appreciate it if anybody could help with this error and tell me whether the flattening approach I took is correct.
PS: I have read the existing posts (post1, post2) discussing this error, but none of them solved my problem.
Assuming your tensors are stored in a single tensor with a shape like (10, 3, 32, 32), where 10 is the number of tensors, you should flatten each one like this:
import torch
from sklearn.decomposition import PCA
data = torch.rand((10, 3, 32, 32))
pca = PCA(10)
pca_result = pca.fit_transform(data.flatten(start_dim=1))
data.flatten(start_dim=1) reshapes your data to (10, 3*32*32).
The error you posted is actually related to one of the posts you linked. The fit() method of the PCA estimator expects an array-like object, and you provided a list of tensors.
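If the data really is a list of separate (1, 3, 32, 32) items, as in the question, it has to be joined into one 2D array first. The sketch below uses NumPy arrays as stand-ins for the tensors (an assumption made so the example runs without PyTorch); the list length of 12 and n_components=5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for the question's `outputs`: 12 arrays of shape (1, 3, 32, 32).
# With real tensors, each item would first be converted with
# t.detach().cpu().numpy().
rng = np.random.default_rng(0)
outputs = [rng.random((1, 3, 32, 32)) for _ in range(12)]

# Join along the first axis, then flatten everything except that axis.
stacked = np.concatenate(outputs, axis=0)      # (12, 3, 32, 32)
flat = stacked.reshape(stacked.shape[0], -1)   # (12, 3072)

# n_components may not exceed min(n_samples, n_features), here 12.
pca = PCA(n_components=5)
pca_result = pca.fit_transform(flat)
print(flat.shape, pca_result.shape)
```

The key point is that PCA receives one array of shape (n_samples, n_features) rather than a Python list of 4D items.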

Sklearn TruncatedSVD not showing explained variance ratio in descending order, or first number means something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ratio_)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
Is there a bug in the TruncatedSVD implementation? If not, why is the first explained variance ratio (0.02...) behaving like this, and what does it mean?
Summary:
That is because TruncatedSVD and PCA use different SVD functions.
Note: Your case is due to Reason 2 below, but I have included another reason for future readers.
Details:
Reason 1: The solver set by the user differs between the two algorithms:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you used the randomized solver (which is set by default) in both algorithms, yet you obtained different results with regard to the order of the variances.
That is because in PCA, the variance is obtained from the actual singular values (called Sigma or S in the Scikit-Learn implementation), which are already sorted.
On the other hand, the variance in TruncatedSVD is obtained from X_transformed, which results from multiplying the data matrix by the components. The latter does not necessarily preserve the order because the data are not centered, and doing so is not the purpose of TruncatedSVD, which is primarily meant for sparse matrices.
Now if you center your data, you will get them sorted (note that you did not center the data properly: X - X.mean() subtracts one global mean, whereas centering means subtracting each feature's own mean; the StandardScaler used here additionally divides each feature by its standard deviation):
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
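As a complementary check, per-feature centering alone (without scaling) should also be enough to make TruncatedSVD agree with PCA. The sketch below assumes exact solvers (algorithm="arpack" and svd_solver="full"), which differ from the randomized solver discussed above and are chosen here only so that the two results are directly comparable.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, TruncatedSVD

X = datasets.load_digits().data
X_centered = X - X.mean(axis=0)  # subtract each feature's own mean

# Exact solvers, chosen here so the two results are directly comparable.
svd = TruncatedSVD(n_components=5, algorithm="arpack").fit(X_centered)
pca = PCA(n_components=5, svd_solver="full").fit(X)  # PCA centers internally

svd_ratio = svd.explained_variance_ratio_
pca_ratio = pca.explained_variance_ratio_
print(svd_ratio)
print(pca_ratio)
```

With the data centered per feature, the projected columns have zero mean, so their variances are proportional to the squared singular values and come out in descending order.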
Important: Further read.

Building vocabulary using document vector

I am not able to build the vocabulary and am getting an error:
TypeError: 'int' object is not iterable
Here is my code, which is based on this Medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried providing both a pandas Series and a list to the build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But, it's unclear if what you've fed into the dataframe, then out via dataframe-bracket-access, then through the train_test_split(), is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
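The checks suggested above can be sketched as a small helper. Everything here is illustrative: looks_like_tagged_document is a hypothetical function, and the TaggedDocument below is a stand-in namedtuple with the same two fields (words, tags) as gensim's class, so the example runs without gensim. Whether a bare integer tag is the actual cause in the question can only be confirmed from the full stack trace, but it would be consistent with the "'int' object is not iterable" message.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (itself a
# namedtuple of `words` and `tags`), so this sketch runs without gensim.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def looks_like_tagged_document(doc):
    """Return True if doc.words is a list of strings and doc.tags is a list of str/int tags."""
    words_ok = isinstance(doc.words, list) and all(isinstance(w, str) for w in doc.words)
    tags_ok = isinstance(doc.tags, list) and all(isinstance(t, (str, int)) for t in doc.tags)
    return words_ok and tags_ok


# Tags passed as a bare int, the way the question's loop builds them.
bare_int_tag = TaggedDocument(["some", "plot", "words"], tags=3)
# Tags wrapped in a list, as the answer describes.
list_tag = TaggedDocument(["some", "plot", "words"], tags=[3])

print(looks_like_tagged_document(bare_int_tag))
print(looks_like_tagged_document(list_tag))
```

Running the same kind of check on mylist[0] and X_train["tdoc"].iloc[0] would show at which step the data stops having the expected form.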
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.

k means cluster method score negative

Hi guys. I am still a beginner trying to learn ML, so do forgive me for such a simple question. I had a dataset from the UCI ML Repository, so I started applying all kinds of unsupervised algorithms, including the K-Means clustering algorithm. When I printed out the accuracy score it was negative, not just once but many times. As far as I know, scores aren't negative. So could you please help me understand why it's negative?
Any help is appreciated.
import pandas as pd
import numpy as np
a = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names = ["a", "b", "c", "d","e","f","g","h","i"])
b = a
c = b.filter(a.columns[[8]], axis=1)
a.drop(a.columns[[8]], axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
le1.fit(a.a)
a.a = le1.transform(a.a)
from sklearn.preprocessing import OneHotEncoder
x = np.array(a)
y = np.array(c)
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit(x)
x = ohe.transform(x).toarray()
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(x,y,test_size=0.2)
from sklearn import cluster
kmean = cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
kmean.fit(xtr,ytr)
print(kmean.score(xts,yts))
Thank you!!
The k-means score is an indication of how far the points are from the centroids.
In scikit-learn, the score is better the closer it is to zero.
Bad scores are large negative numbers, whereas good scores are close to zero. Generally, you will want to take the absolute value of the output of the score method for better visualization.
Clustering is not classification.
Note that the 'y' argument of fit is ignored. Kmeans will always predict 0,1,...,k-1. So it will never make a correct label on this data set, because it doesn't even know what a label is supposed to look like. It really doesn't work to transfer what you did in classification to clustering. You need to relearn this from scratch. Different workflow, different evaluation.
It is explained in the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.
On page 243 of the book (Chapter 9), it says that "The score() method returns the negative inertia. Why negative? Because a predictor’s score() method must always respect Scikit-Learn’s “greater is better” rule: if a predictor is better than another, its score() method should return a greater score."
Hope this helped!
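The "negative inertia" statement quoted above can be checked numerically. The following is a minimal sketch on synthetic data (the two Gaussian blobs and the train/test split are assumptions made purely for illustration): it computes the sum of squared distances to the nearest centroid by hand and compares it with score().

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(6.0, 1.0, (100, 2))])
X_train, X_test = X[::2], X[1::2]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# Inertia by hand: squared distance from each point to its nearest centroid.
distances = kmeans.transform(X_test)  # shape (n_samples, n_clusters)
inertia_test = np.sum(distances.min(axis=1) ** 2)

score_test = kmeans.score(X_test)
print(score_test, -inertia_test)
```

Because the score is a negated sum of squared distances, it can never be positive, and it is not an accuracy.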

Cosine Similarity score in scikit learn for two different vectorization technique is same

I am working on an assignment where the task is to use the 20_newsgroups dataset with 3 different vectorization techniques (Bag of Words, TF, TFIDF) to represent documents as vectors, and then to analyze the difference in average cosine similarity between each pair of classes in the 20_Newsgroups dataset. Here is what I am trying to do in Python: I read the data and pass it to the fit() and transform() functions of sklearn.feature_extraction.text.CountVectorizer for the Bag of Words technique, and of TfidfVectorizer for the TFIDF technique.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
import numpy
import math
import csv
===============================================================================================================================================
categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware', 'comp.windows.x','misc.forsale','rec.autos','rec.motorcycles','rec.sport.baseball','rec.sport.hockey',
'sci.crypt','sci.electronics','sci.med','sci.space','soc.religion.christian','talk.politics.guns',
'talk.politics.mideast','talk.politics.misc','talk.religion.misc']
twenty_newsgroup = fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'),shuffle=True, random_state=42)
dataset_groups = []
for group in range(0,20):
    category = []
    category.append(categories[group])
    dataset_groups.append(fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'),shuffle=True,random_state=42,categories=category))
===============================================================================================================================================
bag_of_word_vect = CountVectorizer(stop_words='english',analyzer='word') #,min_df = 0.09
bag_of_word_vect = bag_of_word_vect.fit(twenty_newsgroup.data,twenty_newsgroup.target)
datamatrix_bow_groups = []
for group in dataset_groups:
    datamatrix_bow_groups.append(bag_of_word_vect.transform(group.data))
similarity_matrix = []
for i in range(0,20):
    means = []
    for j in range(i,20):
        result_of_group_ij = cosine_similarity(datamatrix_bow_groups[i], datamatrix_bow_groups[j])
        means.append(numpy.mean(result_of_group_ij))
    similarity_matrix.append(means)
===============================================================================================================================================
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
tf_vectorizer = tf_vectorizer.fit(twenty_newsgroup.data)
datamatrix_tf_groups = []
for group in dataset_groups:
    datamatrix_tf_groups.append(tf_vectorizer.transform(group.data))
similarity_matrix = []
for i in range(0,20):
    means = []
    for j in range(i,20):
        result_of_group_ij = cosine_similarity(datamatrix_tf_groups[i], datamatrix_tf_groups[j])
        means.append(numpy.mean(result_of_group_ij))
    similarity_matrix.append(means)
The two should technically give different similarity_matrix values, but they are yielding the same ones. More precisely, tf_vectorizer should create a similarity_matrix with values closer to 1.
The problem is that the vectors created by the two techniques for the same document of the same class (for example, alt.atheism) are different, as they should be. But when I calculate a similarity score between the documents of one class and another class, the cosine similarity scorer gives me the same value. Theoretically, TFIDF represents a document in a finer sense in vector space, so the cosine value should be nearer to 1 than what I get from the Bag of Words technique, right? But it gives the same similarity score. I tried printing the values of the matrices created by the BOW and TFIDF techniques. It would be a great help if somebody could give me a good reason for this or a way to resolve it.
I am new to this platform so please ignore any mistakes and let me know if you need more info.
Thanks & Regards,
Darshan Sonagara
The problem is this line in your code:
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
You have set use_idf to False. This means the inverse document frequency is not calculated, so only the term frequency is used. Basically, you are using the TfidfVectorizer like a CountVectorizer: with the default norm='l2', each document vector is just the count vector rescaled to unit length, and cosine similarity is unaffected by rescaling. Hence both give the same cosine similarities.
Using tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=True) will result in a cosine similarity matrix for TFIDF that is different from the CountVectorizer one.
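The effect can be reproduced on a tiny example. The sketch below uses a made-up four-sentence corpus (an assumption for illustration, not the 20 Newsgroups data) and compares the cosine similarity matrices produced by the three vectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the mat and the log",
]

bow = CountVectorizer().fit_transform(docs)
tf = TfidfVectorizer(use_idf=False).fit_transform(docs)  # L2-normalized counts
tfidf = TfidfVectorizer(use_idf=True).fit_transform(docs)

sim_bow = cosine_similarity(bow)
sim_tf = cosine_similarity(tf)
sim_tfidf = cosine_similarity(tfidf)

# Counts vs. TF without IDF: same similarities, since only the scale differs.
print(np.allclose(sim_bow, sim_tf))
# Counts vs. full TF-IDF: IDF reweights terms, so the similarities change.
print(np.allclose(sim_bow, sim_tfidf))
```

The individual vectors from CountVectorizer and TfidfVectorizer(use_idf=False) do differ, which matches what the question observed, but the difference is only a per-document scale factor that cosine similarity ignores.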
