How Does Latent Semantic Analysis Handle Semantics? - nlp

I have gone through the LSA method. It is said that LSA can be used for semantic analysis, but I cannot understand how this works in LSA. Can anyone please tell me how LSA handles semantics?

Are you familiar with the vector space model (VSM)?
In LSA you can compute document similarity as well as type (i.e. word) similarity, just as you would with the traditional VSM: you compute the cosine between two type vectors or two document vectors (LSA actually also lets you compute type-document similarity).
The problem with the VSM is that the cosine similarity of two documents that do not share a single word is 0.
In LSA, the singular value decomposition (SVD) reveals latent semantic dimensions that let you compute a non-zero cosine similarity between documents which have no words in common but do share some underlying characteristics.
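To make this concrete, here is a minimal sketch using scikit-learn (one of several ways to run LSA); the toy documents, the number of latent dimensions, and the variable names are illustrative assumptions only:

# LSA sketch: TF-IDF vectors, then SVD into a small latent space, then cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a feline rested upon its rug",     # shares no term with the first document
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # traditional VSM representation

svd = TruncatedSVD(n_components=2, random_state=0)   # latent semantic dimensions
latent = svd.fit_transform(tfidf)

# In the raw VSM the first two documents share no term, so their cosine is 0.
print(cosine_similarity(tfidf[0], tfidf[1]))
# In the latent space the cosine need not be 0; with a realistic corpus, documents
# that use different words in similar contexts tend to come out similar.
print(cosine_similarity(latent[0:1], latent[1:2]))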

Related

calculating semantic similarity between sets of sentences

I have two sets of short messages. I want to compute the similarity between these two sets and identify whether they are talking about the same sub-topic based on their semantic similarity. I know how to use pairwise similarity; my problem is that I want to compute the overall similarity among all the sentences in the two sets, not just between 2 sentences. Is there a way to use tf-idf or word2vec/doc2vec with cosine similarity to calculate the overall score?
Basically what I did is take the vectors of each word in each sentence, average them to get one vector per sentence, and then compute the cosine similarity between the averaged vectors.
Of course, before you do that you need a trained word2vec model. doc2vec's similarity does the same thing, as internally it keeps a word2vec model.
So you have two options: train a doc2vec model and use its built-in similarity, or train a word2vec model and do the work yourself. A rough sketch of the second option is below.
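This is a hedged sketch of the averaging approach with gensim; the model file name, the helper functions, and the example messages are assumptions made purely for illustration:

# Average the word vectors of each sentence, then compare the two sets with cosine similarity.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")   # assumes a model you trained yourself

def sentence_vector(sentence, model):
    # Average the vectors of the in-vocabulary words of one sentence.
    words = [w for w in sentence.lower().split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

set_a = ["the delivery was late again", "my parcel has not arrived"]
set_b = ["shipping is delayed this week"]

# One crude overall score: average each set's sentence vectors, then compare.
vec_a = np.mean([sentence_vector(s, model) for s in set_a], axis=0)
vec_b = np.mean([sentence_vector(s, model) for s in set_b], axis=0)
print(cosine(vec_a, vec_b))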
InferSent also helps in finding semantic similarity.

What is the difference between word2vec, glove, and elmo? [duplicate]

What is the difference between word2vec and GloVe?
Are both ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will produce better embeddings.
GloVe is based on matrix factorization of the word-context matrix. It first constructs a large (words x contexts) matrix of co-occurrence information, i.e. for each “word” (the rows) you count how frequently (the matrix values) that word appears in some “context” (the columns) in a large corpus. The number of “contexts” is very large, since it is essentially combinatorial in size. So this matrix is factorized to yield a lower-dimensional (words x features) matrix, where each row now gives a vector representation for the corresponding word. In general, this is done by minimizing a “reconstruction loss”, which tries to find the lower-dimensional representation that explains most of the variance in the high-dimensional data.
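As a hedged illustration of how each is typically used, here is a small gensim sketch: it trains Word2Vec on a toy tokenized corpus and, since GloVe vectors are normally obtained pre-trained rather than trained locally with gensim, it loads the glove-wiki-gigaword-50 vectors from gensim-data. The toy corpus is invented, so its nearest neighbours are of course not meaningful:

# Word2Vec: predictive training over your own corpus; GloVe: load pre-trained vectors.
from gensim.models import Word2Vec
import gensim.downloader as api

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# sg=1 selects skip-gram (predict context words from the target word).
w2v = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=50)
print(w2v.wv.most_similar("king", topn=3))

# GloVe vectors are fitted to a global co-occurrence matrix; here we simply load a
# pre-trained set shipped via gensim-data.
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))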
Before GloVe, algorithms for word representation could be divided into two main streams: the count-based (e.g. LSA) and the learning-based (e.g. Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a shallow three-layer neural network to do a center-context word pair prediction task in which the word vectors are just a by-product.
The most amazing point about Word2Vec is that similar words are located together in the vector space, and arithmetic operations on word vectors can capture semantic or syntactic relationships, e.g. “king” - “man” + “woman” -> “queen”, or “better” - “good” + “bad” -> “worse”. However, LSA does not maintain such linear relationships in its vector space.
The motivation of GloVe is to force the model to learn such linear relationships from the co-occurrence matrix explicitly. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. In that sense it is a hybrid method that applies machine learning to a statistical co-occurrence matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the derivation of the equations in GloVe, we find the difference in the underlying intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from StanfordNLP (GloVe: Global Vectors for Word Representation) and consider the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does the noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
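Schematically, that ratio argument can be written as follows (this is just a restatement of the quoted observation, with no specific corpus counts assumed), where P(k | w) denotes the probability that probe word k appears in the context of word w:

\[
\frac{P(k \mid \text{ice})}{P(k \mid \text{steam})}
\begin{cases}
\gg 1, & k \text{ related to ice only (e.g. } k = \text{solid}) \\
\ll 1, & k \text{ related to steam only (e.g. } k = \text{gas}) \\
\approx 1, & k \text{ related to both or to neither (e.g. } k = \text{water}, \text{fashion})
\end{cases}
\]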
However, Word2Vec works on the raw co-occurrence probabilities, maximizing the probability that the words surrounding a target word appear as its context.
In practice, to speed up the training process, Word2Vec employs negative sampling, replacing the softmax function with sigmoid functions operating on real data and noise data. This implicitly results in words clustering into a cone in the vector space, while GloVe’s word vectors are located more discretely.

Calculating the similarity between two vectors

I did LDA over a corpus of documents with topic_number=5. As a result, I have five vectors of words, each word associated with a weight or degree of importance, like this:
Topic_A = {(word_A1,weight_A1), (word_A2, weight_A2), ... ,(word_Ak, weight_Ak)}
Topic_B = {(word_B1,weight_B1), (word_B2, weight_B2), ... ,(word_Bk, weight_Bk)}
.
.
Topic_E = {(word_E1,weight_E1), (word_E2, weight_E2), ... ,(word_Ek, weight_Ek)}
Some of the words are common between the topics. Now I want to know how I can calculate the similarity between these vectors. I can calculate cosine similarity (and other similarity measures) by programming it from scratch, but I was thinking there might be an easier way to do it. Any help would be appreciated. Thank you in advance for spending time on this.
I am programming with Python 3.6 and the gensim library (but I am open to any other library).
I know someone else has asked a similar question (Cosine Similarity and LDA topics) but because he didn't get an answer, I am asking it again.
After LDA you have topics characterized as distributions over words. If you plan to compare these probability vectors (weight vectors if you prefer), you can simply use any cosine similarity implementation available for Python, sklearn for instance.
However, this approach will only tell you which topics put similar probability mass on the same words.
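A minimal sketch with gensim and sklearn could look like this; the toy documents and the number of topics are illustrative, and with real data you would plug in your own corpus and trained model:

# Train a small LDA model, then compare its topic-word distributions with cosine similarity.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.metrics.pairwise import cosine_similarity

docs = [["cat", "dog", "pet"], ["dog", "bone", "park"], ["stock", "market", "trade"],
        ["trade", "price", "market"], ["pet", "food", "cat"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, random_state=0)

# Each row of get_topics() is one topic's probability distribution over the vocabulary.
topic_term = lda.get_topics()            # shape: (num_topics, vocabulary_size)
print(cosine_similarity(topic_term))     # num_topics x num_topics similarity matrix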
If you want to measure similarities based on semantic information instead of word occurrences, you may want to use word vectors (such as those learned by Word2Vec, GloVe or FastText).
These are learned vectors that represent words as low-dimensional vectors encoding certain semantic information. They are easy to use in Gensim, and the typical approach is to load a pre-trained model learned from Wikipedia articles or news.
If you have topics defined by words, you can represent these words as vectors and take the average of the cosine similarities between the words of two topics (we did this for a workshop); a rough sketch of that idea follows below. There are also sources that use these word vectors (also called word embeddings) to represent topics or documents in some way. For instance, this one.
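Here is one way the “average cosine similarity between topic words” idea could look, using pre-trained vectors from gensim-data; the model name and the two word lists are purely illustrative assumptions:

# Average pairwise word-vector similarity between the top words of two topics.
import itertools
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # any pre-trained word vectors would do

topic_a = ["dog", "cat", "pet", "food"]
topic_b = ["puppy", "kitten", "animal", "feed"]

pairs = [(a, b) for a, b in itertools.product(topic_a, topic_b) if a in wv and b in wv]
score = sum(wv.similarity(a, b) for a, b in pairs) / len(pairs)
print(score)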
There are some recent publications combining Topic Models and Word Embeddings, you can look for them if you're interested.

different approaches for document similarity (LDA, LSA, cosine)

I have a set of short documents (1 or 2 paragraphs each). I have used three different approaches for document similarity:
- simple cosine similarity on the tf-idf matrix
- applying LDA on the whole corpus, using the LDA model to create a vector for each document, and then applying cosine similarity
- applying LSA on the whole corpus, using the LSA model to create a vector for each document, and then applying cosine similarity
Based on my experiments I am getting better results with simple cosine similarity on the tf-idf matrix, without any LDA or LSA. Based on what I have read, LDA or LSA should improve the result, but in my case they do not!
Is there any idea why LDA or LSA give worse results?
Both LDA and LSA, when trained for more than 1000 rounds, find similarity higher than 90% between some documents that are totally unrelated!
Is there any justification for that?
Thanks
I have used the LDA4j implementation and got better results than with TF-IDF, and similarly for LSI I have used the semantic-vectors implementation. If you have your own implementation, share the model sketch. One more thing: you need to normalize the corpus to get better results.
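LDA4j and semantic-vectors are Java libraries; purely to illustrate the general pipeline being compared (normalize the corpus, then plain TF-IDF cosine versus LSI cosine), here is a hedged gensim sketch with invented toy documents:

# Approach 1 (TF-IDF cosine) versus approach 3 (LSI cosine) in gensim.
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess

raw_docs = ["The delivery arrived late.", "My parcel was delayed in shipping.",
            "Stock prices rose sharply today."]

# Basic normalization: lowercase and tokenize.
docs = [simple_preprocess(d) for d in raw_docs]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# TF-IDF cosine similarities of document 0 against every document.
tfidf = TfidfModel(bow)
tfidf_index = MatrixSimilarity(tfidf[bow], num_features=len(dictionary))
print(tfidf_index[tfidf[bow[0]]])

# LSI on top of TF-IDF, then cosine in the latent space.
lsi = LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)
lsi_index = MatrixSimilarity(lsi[tfidf[bow]], num_features=2)
print(lsi_index[lsi[tfidf[bow[0]]]])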

Is TF-IDF necessary when using SVM?

I'm using Support Vector Machines to classify phrases. Before using the SVM, I understand I should do some kind of normalization on the phrase-vectors. One popular method is TF-IDF.
The terms with the highest TF-IDF score are often the terms that best characterize the topic of the document.
But isn't that exactly what SVM does anyway? Giving the highest weight to the terms that best characterize the document?
Thanks in advance :-)
The weight of a term (as assigned by an SVM classifier) may or may not be directly proportional to the relevance of that term to a particular class. This depends on the kernel of the classifier as well as the regularization used. SVM does NOT assign weights to terms that best characterize a single document.
Term-frequency (tf) and inverse document frequency (idf) are used to encode the value of a term in a document vector. This is independent of the SVM classifier.
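To make that separation concrete, here is a minimal scikit-learn sketch in which TF-IDF encodes the phrases and the SVM then learns per-class weights over those features; the example phrases and labels are invented:

# Two separate steps: TF-IDF feature encoding, then SVM classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

phrases = ["great product, works well", "terrible, broke after a day",
           "really happy with this", "waste of money"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(phrases, labels)
print(clf.predict(["broke immediately", "works great"]))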
