I'm using Support Vector Machines to classify phrases. Before using the SVM, I understand I should do some kind of normalization on the phrase-vectors. One popular method is TF-IDF.
The terms with the highest TF-IDF score are often the terms that best characterize the topic of the document.
But isn't that exactly what SVM does anyway? Giving the highest weight to the terms that best characterize the document?
Thanks in advance :-)
The weight of a term (as assigned by an SVM classifier) may or may not be directly proportional to the relevance of that term to a particular class. This depends on the kernel of the classifier as well as the regularization used. SVM does NOT assign weights to terms that best characterize a single document.
Term-frequency (tf) and inverse document frequency (idf) are used to encode the value of a term in a document vector. This is independent of the SVM classifier.
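For concreteness, here is a minimal sketch of that division of labour, assuming scikit-learn (toy data, illustrative only): the TF-IDF weighting happens in the vectorizer before the SVM ever sees the data, and the SVM then learns its own per-feature weights on top of those values.

```python
# TF-IDF encoding and SVM weighting are separate steps (toy data, illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

phrases = ["cheap flights to paris", "book a hotel room", "my flight was delayed again"]
labels = ["travel", "travel", "complaint"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(phrases, labels)

print(clf.predict(["the flight was delayed"]))
print(clf.named_steps["linearsvc"].coef_)  # the SVM's learned per-term weights
```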
TL;DR How can the Pearson correlation coefficient between ground truth labels and cosine similarity scores evaluate the performance of a sentence embedding model? A positive/negative linear relationship between the two doesn't necessarily indicate that a model is accurate, just that they move together, which to me doesn't seem like a good way to evaluate the performance of a sentence embedding model.
I'm training a model to be able to tell if two questions are similar or not. I first continue pre-training using MLM (masked language modeling) and finally fine-tune on the STS dataset. For fine-tuning, I'm using this example python file https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py. At the end of the file, it says to "load the stored model and evaluate its performance on STS benchmark dataset", and it uses this file to evaluate the performance of the model https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py.
The second file has a few metrics for evaluation (cosine similarity being one of them), and it uses the Pearson correlation coefficient and Spearman correlation coefficient for each metric to evaluate the performance of the model. What I'm not understanding is: how does calculating the relationship (correlation coefficient) between the ground truth labels and the cosine similarity contribute to measuring the performance of the model? Even if the two have similar movement patterns i.e. a high correlation coefficient, that doesn't mean the model is performing well, does it?
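For what it's worth, this is roughly what that evaluator computes (the embeddings and gold scores below are random placeholders standing in for the output of model.encode(...) and the human-annotated STS labels):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(100, 8))    # embeddings of the first sentence in each pair
emb2 = rng.normal(size=(100, 8))    # embeddings of the second sentence in each pair
gold = rng.uniform(0, 5, size=100)  # human similarity judgements (0-5 in STS-B)

pred = [cosine(a, b) for a, b in zip(emb1, emb2)]
print(pearsonr(gold, pred))   # linear agreement with the gold scores
print(spearmanr(gold, pred))  # rank agreement with the gold scores
```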
Basically what I want is to know how similar a specific sentence/document is to my training corpus.
I think I might have half an idea of how to approach this but I'm not too sure.
So my idea is to calculate an average vector for the document and then somehow compute the similarity from that. I just don't know how I would calculate the similarity.
So say I have a training corpus filled with text about dogs. If I then check how similar the sentence "The airplane has 100 seats." is to the training corpus, I want the output to be a low similarity score.
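To make the averaging idea concrete, here is a toy sketch (the 3-d vectors are hand-made placeholders; in practice they would come from word2vec, GloVe, fastText, etc.):

```python
import numpy as np

word_vectors = {                       # hypothetical 3-d embeddings, purely illustrative
    "dog": np.array([0.9, 0.1, 0.0]),
    "bark": np.array([0.8, 0.2, 0.1]),
    "airplane": np.array([0.0, 0.9, 0.4]),
    "seats": np.array([0.1, 0.8, 0.5]),
}

def avg_vector(tokens):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus_vec = avg_vector(["dog", "bark", "dog"])   # "training corpus" about dogs
query_vec = avg_vector(["airplane", "seats"])     # unrelated sentence
print(cosine(corpus_vec, query_vec))              # comes out relatively low
```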
This is a semantic textual similarity problem. You can have a look at state-of-the-art models here https://nlpprogress.com/english/semantic_textual_similarity.html
Usually you would pass your document through an encoder to create a representation (an embedding of the document), then do the same with the sentence (usually with the same encoder). The vectors can be fed into additional layers for further processing. A similarity metric like cosine can then be applied to the embeddings, or a joint final representation can be used for classification.
You can use a pretrained language model in the encoding step and fine-tune it for your use case.
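A minimal sketch of that pipeline with a pretrained encoder, assuming the sentence-transformers library (the model name is just one commonly used example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example pretrained encoder

corpus = ["Dogs love to play fetch.", "My dog barks at strangers."]
query = "The airplane has 100 seats."

corpus_emb = model.encode(corpus)
query_emb = model.encode(query)

# Cosine similarity of the query against each corpus sentence; low scores expected here
print(util.cos_sim(query_emb, corpus_emb))
```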
What is the difference between word2vec and GloVe?
Are both ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model's predictions are, so as the model trains to make better predictions it will produce better embeddings.
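If it helps, here is a small gensim sketch (toy corpus, illustrative settings); sg=1 selects skip-gram, sg=0 selects CBOW, and window controls the context size:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "barks", "at", "the", "mailman"],
    ["the", "cat", "sleeps", "on", "the", "sofa"],
    ["dogs", "and", "cats", "are", "popular", "pets"],
]

# Train a tiny skip-gram model; vector_size/window/epochs are just example values
model = Word2Vec(sentences, vector_size=50, window=3, sg=1, min_count=1, epochs=50)
print(model.wv.most_similar("dog", topn=3))
```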
GloVe is based on matrix-factorization techniques applied to a word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e. for each "word" (the rows), you count how frequently (the matrix values) that word appears in some "context" (the columns) over a large corpus. The number of "contexts" can be very large, since it is essentially combinatorial in size. This matrix is then factorized to yield a lower-dimensional (words x features) matrix, where each row gives a vector representation for its word. In general, this is done by minimizing a "reconstruction loss", which looks for the lower-dimensional representations that explain most of the variance in the high-dimensional data.
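A bare-bones illustration of that factorization idea, using plain truncated SVD on a toy count matrix (GloVe's actual objective is a weighted least squares, not SVD, so this only shows the general shape of the computation):

```python
import numpy as np

vocab = ["ice", "steam", "solid", "gas", "water"]
# Toy (word x context) co-occurrence counts, purely illustrative
X = np.array([
    [0, 1, 8, 1, 5],
    [1, 0, 1, 8, 5],
    [8, 1, 0, 0, 2],
    [1, 8, 0, 0, 2],
    [5, 5, 2, 2, 0],
], dtype=float)

# Log-transform the counts, then keep the top-k singular directions;
# truncated SVD minimizes an (unweighted) reconstruction loss.
U, S, Vt = np.linalg.svd(np.log1p(X), full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]   # one low-dimensional vector per word
for w, v in zip(vocab, word_vectors):
    print(w, v.round(2))
```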
Before GloVe, algorithms for word representations could be divided into two main streams: the count-based (e.g., LSA) and the learning-based (e.g., Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) on the co-occurrence matrix, while Word2Vec employs a shallow neural network to do a center-context word-pair classification task, where the word vectors are just a by-product.
The most amazing point of Word2Vec is that similar words are located together in the vector space and that arithmetic operations on word vectors can expose semantic or syntactic relationships, e.g., "king" - "man" + "woman" -> "queen" or "better" - "good" + "bad" -> "worse". However, LSA does not maintain such linear relationships in the vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly from the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is a hybrid method that applies machine learning to a matrix of corpus statistics, and this is the general difference between GloVe and Word2Vec.
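For reference, the weighted least-squares objective from the GloVe paper is (with $w_i$ and $\tilde{w}_j$ the word and context vectors, $b_i, \tilde{b}_j$ bias terms, $X_{ij}$ the co-occurrence count, and $f$ a weighting function that down-weights rare pairs and caps very frequent ones):

$$ J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 $$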
If we dive into the derivation of the equations in GloVe, we find the difference in the underlying intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from Stanford NLP's GloVe page (Global Vectors for Word Representation), which considers the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
Word2Vec, by contrast, works on the raw co-occurrence probabilities: it maximizes the probability that the words surrounding a target word appear as its context.
In practice, to speed up training, Word2Vec employs negative sampling, substituting the softmax function with a sigmoid operating on the real data and on noise data. This implicitly results in words clustering into a cone in the vector space, while GloVe's word vectors are located more discretely.
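For a given (target, context) pair, the negative-sampling objective from the word2vec papers maximizes (with $\sigma$ the sigmoid, $v_{w_I}$ the input vector of the target word, $v'_{w_O}$ the output vector of the observed context word, and $k$ noise words $w_i$ drawn from a noise distribution $P_n(w)$):

$$ \log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] $$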
I am looking at various semantic similarity methods such as word2vec, Word Mover's Distance (WMD), and fastText. In my tests, fastText is not better than Word2Vec as far as semantic similarity is concerned, and WMD and Word2Vec give very similar results.
I was wondering if there is an alternative which has outperformed the Word2Vec model for semantic accuracy?
My use case:
Finding word embeddings for two sentences, and then using cosine similarity to measure how similar they are.
Whether any technique "outperforms" another will depend highly on your training data, the specific metaparameter options you choose, and your exact end-task. (Even "semantic similarity" may have many alternate aspects depending on the application.)
There's no one way to go from word2vec word-vectors to a sentence/paragraph vector. You could add the raw vectors. You could average the unit-normalized vectors. You could perform some other sort of weighted-average, based on other measures of word-significance. So your implied baseline is unclear.
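As a sketch of those composition options (word_vecs below is a placeholder dict of word -> vector; the weights could come from TF-IDF or any other word-significance measure):

```python
import numpy as np

def sum_vectors(tokens, word_vecs):
    # Plain sum of the raw word vectors
    return np.sum([word_vecs[t] for t in tokens if t in word_vecs], axis=0)

def mean_of_unit_vectors(tokens, word_vecs):
    # Average of the length-normalized word vectors
    units = [word_vecs[t] / np.linalg.norm(word_vecs[t])
             for t in tokens if t in word_vecs]
    return np.mean(units, axis=0)

def weighted_average(tokens, word_vecs, weights):
    # Weighted average using per-word significance weights
    kept = [t for t in tokens if t in word_vecs]
    return (np.sum([weights[t] * word_vecs[t] for t in kept], axis=0)
            / sum(weights[t] for t in kept))

# Tiny placeholder vocabulary, purely illustrative
word_vecs = {"good": np.array([1.0, 0.0]), "movie": np.array([0.5, 0.5])}
weights = {"good": 2.0, "movie": 0.5}
tokens = ["good", "movie"]
print(sum_vectors(tokens, word_vecs))
print(mean_of_unit_vectors(tokens, word_vecs))
print(weighted_average(tokens, word_vecs, weights))
```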
Essentially you have to try a variety of methods and parameters, for your data and goal, with your custom evaluation.
Word Mover's Distance doesn't reduce each text to a single vector, and the pairwise calculation between two texts can be expensive, but it has reported very good performance on some semantic-similarity tasks.
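A short gensim sketch of WMD (a toy model is trained inline just so the snippet is self-contained; gensim's wmdistance also needs the POT package installed):

```python
from gensim.models import Word2Vec

sentences = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

# Lower distance = more similar; with a real pretrained model these two
# sentences should come out close despite sharing few exact words.
print(model.wv.wmdistance(sentences[0], sentences[1]))
```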
FastText is essentially word2vec with some extra enhancements and new modes. Some modes with the extras turned off are exactly the same as word2vec, so using FastText word-vectors in some wordvecs-to-textvecs scheme should closely approximate using word2vec word-vectors in the same scheme. Some modes might help the word-vector quality for some purposes, but make the word-vectors less effective inside a wordvecs-to-textvecs scheme. Some modes might make the word-vectors better for sum/average composition schemes – you should look especially at the 'classifier' mode, which trains word-vecs to be good, when averaged, at a classification task. (To the extent you may have any semantic labels for your data, this might make the word-vecs more composable for semantic-similarity tasks.)
You may also want to look at the 'Paragraph Vectors' technique (available in gensim as Doc2Vec), or other research results that go by the shorthand names 'fastSent' or 'sent2vec'.
One can measure the goodness of fit of a statistical model using the Akaike Information Criterion (AIC), which accounts for both goodness of fit and the number of parameters used to build the model. AIC involves the maximized value of the likelihood function for that model (L).
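For reference, with $k$ estimated parameters and maximized likelihood $\hat{L}$:

$$ \mathrm{AIC} = 2k - 2\ln(\hat{L}) $$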
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
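For example, if the classifier outputs class probabilities (say an unpenalized logistic regression), the log-likelihood is just the sum of the log predicted probabilities of the true classes, and the AIC follows from it. A sketch assuming scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
# penalty=None (recent scikit-learn; 'none' on older versions) so the fit
# actually maximizes the likelihood rather than a penalized version of it.
clf = LogisticRegression(penalty=None).fit(X, y)

proba = clf.predict_proba(X)
log_likelihood = np.sum(np.log(proba[np.arange(len(y)), y]))
k = clf.coef_.size + clf.intercept_.size   # number of fitted parameters
aic = 2 * k - 2 * log_likelihood
print(aic)
```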
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.