Different approaches for document similarity (LDA, LSA, cosine) - text

I have a set of short documents (1 or 2 paragraphs each). I have used three different approaches for document similarity:
- simple cosine similarity on the TF-IDF matrix
- applying LDA to the whole corpus, using the LDA model to create a vector for each document, and then applying cosine similarity
- applying LSA to the whole corpus, using the LSA model to create a vector for each document, and then applying cosine similarity
Based on my experiments I get better results with simple cosine similarity on the TF-IDF matrix, without any LDA or LSA. From what I have read, LDA or LSA should improve the results, but in my case they do not!
Does anyone have an idea why LDA and LSA give worse results?
Also, when trained for more than 1000 iterations, both LDA and LSA report a similarity higher than 90% between some documents that are totally unrelated!
Is there any justification for that?
Thanks

I have used an LDA4j implementation and got better results than with TF-IDF, and similarly for LSI I have used the semantic-vectors implementation. If you have your own implementation, share the model sketch. One more thing: you need to normalize the corpus to get better results.
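For reference, here is a minimal sketch of the three pipelines described in the question, using gensim; the documents, the number of topics and the number of passes are illustrative placeholders, not tuned values.

```python
# Minimal sketch of the three approaches: TF-IDF + cosine, LDA + cosine, LSI + cosine.
# Corpus contents and hyperparameters below are illustrative assumptions.
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

docs = ["first short document about topic one",
        "second short document about topic two",
        "third short document about topic one again"]
texts = [simple_preprocess(d) for d in docs]              # basic normalization/tokenization
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# 1) Cosine similarity on the TF-IDF matrix
tfidf = models.TfidfModel(bow)
index_tfidf = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))
print(index_tfidf[tfidf[bow[0]]])                         # similarities of doc 0 to all docs

# 2) LDA topic vectors, then cosine similarity
lda = models.LdaModel(bow, id2word=dictionary, num_topics=5, passes=10)
index_lda = similarities.MatrixSimilarity(lda[bow], num_features=lda.num_topics)
print(index_lda[lda[bow[0]]])

# 3) LSI (LSA) vectors, then cosine similarity
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=5)
index_lsi = similarities.MatrixSimilarity(lsi[tfidf[bow]], num_features=lsi.num_topics)
print(index_lsi[lsi[tfidf[bow[0]]]])
```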

Related

calculating semantic similarity between sets of sentences

I have two sets of short messages and I want to compute the similarity between these two sets to identify whether they are talking about the same sub-topic, based on their semantic similarity. I know how to use pairwise similarity; my problem is that I want to compute the overall similarity among all the sentences in the two sets, not just between two sentences. Is there a way to use TF-IDF or word2vec/doc2vec with cosine similarity to calculate the overall score?
Basically what I did is take the vectors of each word on each side and average them, so that each set is represented by one vector.
Then compute the cosine similarity between the two averaged vectors.
Of course, before you do that you need a trained word2vec model. doc2vec's similarity does the same thing, as it keeps a word2vec model internally.
So you have two options: train a doc2vec model and use its built-in similarity, or train a word2vec model and do the averaging yourself.
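A minimal sketch of that second option, assuming you already have a trained gensim word2vec model (the model path and example messages are hypothetical):

```python
# Average the word vectors of each set of messages, then compare the two
# averaged vectors with cosine similarity. Path and sentences are illustrative.
import numpy as np
from gensim.models import Word2Vec

def set_vector(sentences, wv):
    """Average the vectors of all in-vocabulary words across a set of sentences."""
    words = [w for s in sentences for w in s.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

model = Word2Vec.load("word2vec.model")   # hypothetical path to your trained model
set_a = ["the service was slow", "waiting time was too long"]
set_b = ["support took ages to reply", "very slow customer service"]

va, vb = set_vector(set_a, model.wv), set_vector(set_b, model.wv)
score = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
print(score)                              # overall similarity between the two sets
```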
InferSent also helps in finding semantic similarity.

Sentence similarity using word2vec

Basically what I want is to know how similar a specific sentence/document is to my training corpus.
I think I might have half an idea of how to approach this but I'm not too sure.
So my idea is to calculate an average vector for the document and then somehow compute the similarity against the corpus. I just don't know how I would calculate that similarity.
So say I have a training corpus filled with text about dogs. If I then want to check how similar the sentence "The airplane has 100 seats." is to the training corpus, I want it to output a low similarity score.
This is a semantic textual similarity problem. You can have a look at state-of-the-art models here https://nlpprogress.com/english/semantic_textual_similarity.html
Usually you would pass your document through an encoder to create a representation (an embedding of the document) and then do the same with the sentence (usually using the same encoder). The vectors can be fed into additional layers for further processing. A similarity metric like cosine can then be applied to the embeddings, or a joint final representation can be used for classification.
You can use pretrained language models in the encoding step and fine-tune them for your use case.
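As one concrete example of this recipe, here is a hedged sketch using the sentence-transformers library as the pretrained encoder (the library and model name are my choice, not something from the answer above; any sentence encoder would work):

```python
# Encode the corpus and the query sentence with the same pretrained encoder,
# then score the sentence against the corpus with cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice

corpus = ["My dog loves to fetch.", "Puppies need a lot of exercise.", "Dogs are loyal pets."]
query = "The airplane has 100 seats."

corpus_emb = encoder.encode(corpus, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)

# One simple corpus-level score: the maximum (or mean) similarity to any corpus sentence.
scores = util.cos_sim(query_emb, corpus_emb)
print(scores.max().item())                          # should be relatively low for an unrelated sentence
```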

Using cosine similarity for classifying documents

I have a set of files for five different categories, and most of them are not labelled correctly. The objective is to predict the correct category of a file whenever it is uploaded. I used cosine similarity along with TF-IDF and predict the class of the document for which the cosine similarity is maximal. As of now I am getting good results, but I am really not sure how well this will work down the road. Also, why isn't cosine similarity used for building document classifiers, instead of machine learning models, when the categories of the files are labelled correctly? I would really appreciate your feedback on my approach as well as an answer to that question.
Cosine similarity measures the angle between two n-dimensional vectors. These vectors are mostly produced by embeddings, i.e. pretrained models that map words to fixed-size vectors.
Cosine similarity is mostly used with vectors produced by word embeddings. If you are using something like Doc2Vec, then you get a vector for the whole document. These vectors could be categorized by using cosine similarity.
In your case, you could try an LSTM text classifier using Embedding layers. 1D convolution layers can also be useful.
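A hedged Keras sketch of that LSTM-with-Embedding idea (the vocabulary size, sequence length and layer sizes are illustrative assumptions, not values from the answer):

```python
# Simple LSTM text classifier on top of an Embedding layer for 5 categories.
# All sizes below are placeholders to be tuned on real data.
from tensorflow.keras import layers, models

vocab_size, max_len, num_classes = 10000, 100, 5

model = models.Sequential([
    layers.Input(shape=(max_len,)),                   # padded sequences of token ids
    layers.Embedding(vocab_size, 128),                # token ids -> dense vectors
    layers.LSTM(64),                                  # sequence -> fixed-size representation
    layers.Dense(num_classes, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```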
Also, regarding TF-IDF, it is useful for text classification that depends on specific words in the corpus. Words with a higher term frequency and a lower document frequency get a higher TF-IDF score, and the model learns to classify texts based on such scores.
In many cases, RNNs work very well for classifying texts, and using pretrained embeddings makes the model more efficient.
Last but not least, you can give Naive Bayes text classification a try; it has been super useful in spam classification.
Tip:
You can combine the above methods into a single text classification system, following a process like this:
1. Generate embeddings from Doc2Vec.
2. Compare the similarity of the input with the other texts and thereby determine its class.
3. Use the embeddings in an LSTM network to produce class probabilities.
4. Apply Naive Bayes text classification.
Steps 2, 3 and 4 give three predictions. If the majority prediction is CLASS1, then the system outputs CLASS1. A small sketch of the cosine-similarity and Naive Bayes pieces follows.
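The sketch uses scikit-learn: TF-IDF + cosine matching (standing in for step 2, with TF-IDF in place of Doc2Vec embeddings) and the Naive Bayes of step 4. The training texts, labels and the new document are made-up examples.

```python
# Predict a class by maximum TF-IDF cosine similarity to labelled files,
# and compare with a Naive Bayes classifier trained on the same vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

train_texts = ["invoice for payment", "football match report", "loan agreement terms"]
train_labels = ["finance", "sports", "legal"]
new_doc = ["monthly payment invoice attached"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_new = vec.transform(new_doc)

# Cosine-similarity "classifier": take the label of the most similar training file.
sims = cosine_similarity(X_new, X_train)
print(train_labels[int(np.argmax(sims))])

# Naive Bayes on the same TF-IDF features.
nb = MultinomialNB().fit(X_train, train_labels)
print(nb.predict(X_new)[0])
```

In a real system each piece would vote and the majority class would be returned, as the tip above describes.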

I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec for this? Or is there a better approach?

I want to apply doc2vec to various reviews that we extracted from a source, and classify these reviews into different classes defined by the user. How can I do this?
I consider this an interesting question. I will give you some approaches depending on the size of your set of observations/reviews.
You can apply LSA (SVD on the document-term matrix, built from either incidence or TF-IDF vectors); you get three matrices as output (U, S, V), and the factors on the document side of the decomposition give you the sentence embeddings.
Use these embeddings as input to your model for classification.
I recommend using LSA when your corpus size is large.
Resources: link
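A minimal scikit-learn sketch of this LSA route, with made-up reviews and labels and an arbitrary number of components:

```python
# TF-IDF vectors reduced by truncated SVD (LSA), then fed to a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works well", "terrible quality, broke fast",
           "excellent value for money", "awful, do not buy"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2),   # size of the "document embedding"
                    LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["excellent product"]))
```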
Similarly, instead of LSA you can use pre-trained embeddings, say GloVe: you get word embeddings, and to create document vectors you can use an inverse weighted frequency method. Use these document vectors for classification.
Resources: link
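A hedged sketch of this second route, under the assumption that "inverse weighted frequency" means a smoothed inverse-frequency weighting of pre-trained GloVe vectors (the weighting constant, the GloVe model name and the data are all illustrative):

```python
# Weighted-average GloVe document vectors (rarer words get larger weights),
# then a simple classifier on top.
import numpy as np
import gensim.downloader as api
from collections import Counter
from sklearn.linear_model import LogisticRegression

glove = api.load("glove-wiki-gigaword-50")          # pre-trained word embeddings

reviews = ["great product works well", "terrible quality broke fast",
           "excellent value for money", "awful do not buy"]
labels = ["positive", "negative", "positive", "negative"]

tokens = [r.split() for r in reviews]
freq = Counter(w for t in tokens for w in t)
total, a = sum(freq.values()), 1e-3                 # smoothing constant (assumption)

def doc_vector(words):
    """Weight each word vector by a / (a + p(w)) and average."""
    vecs = [(a / (a + freq.get(w, 1) / total)) * glove[w] for w in words if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

X = np.vstack([doc_vector(t) for t in tokens])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector("really good purchase".split())]))
```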

Calculating the similarity between two vectors

I ran LDA over a corpus of documents with topic_number=5. As a result, I have five vectors of words, where each word is associated with a weight or degree of importance, like this:
Topic_A = {(word_A1,weight_A1), (word_A2, weight_A2), ... ,(word_Ak, weight_Ak)}
Topic_B = {(word_B1,weight_B1), (word_B2, weight_B2), ... ,(word_Bk, weight_Bk)}
.
.
Topic_E = {(word_E1,weight_E1), (word_E2, weight_E2), ... ,(word_Ek, weight_Ek)}
Some of the words are shared between topics. Now I want to know how I can calculate the similarity between these vectors. I could implement cosine similarity (and other similarity measures) from scratch, but I was thinking there might be an easier way to do it. Any help would be appreciated. Thank you in advance for spending time on this.
I am programming in Python 3.6 with the gensim library (but I am open to any other library).
I know someone else has asked a similar question (Cosine Similarity and LDA topics), but because he didn't get an answer, I am asking it again.
After LDA you have topics characterized as distributions over words. If you plan to compare these probability vectors (weight vectors, if you prefer), you can simply use any cosine similarity implementation available for Python, for instance the one in sklearn.
However, this approach will only tell you which topics put, in general, similar probabilities on the same words.
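A minimal sketch of that suggestion with gensim and scikit-learn, assuming `lda` is the already-trained gensim LdaModel from the question:

```python
# Compare the rows of the topic-term probability matrix with cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

topic_term = lda.get_topics()            # shape: (num_topics, vocabulary_size)
sims = cosine_similarity(topic_term)     # 5 x 5 matrix of topic-to-topic similarities
print(sims)
```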
If you want to measure similarities based on semantic information instead of word occurrences, you may want to use word vectors (as those learned by Word2Vec, GloVe or FastText).
These models learn low-dimensional vector representations of words that encode certain semantic information. They are easy to use in Gensim, and the typical approach is to load a pre-trained model learned from Wikipedia articles or news.
If you have topics defined by words, you can represent these words as vectors and obtain an average of the cosine similarities between the words in two topics (we did it for a workshop). There are some sources that use these word vectors (also called word embeddings) to represent topics or documents in some way; for instance, this one.
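A rough sketch of that averaging idea, using pre-trained vectors from gensim-data; the topic word lists here are made up for illustration:

```python
# Represent each topic by its top words and average the pairwise cosine
# similarities between the two word sets.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # illustrative pre-trained model

topic_a = ["car", "engine", "wheel", "driver"]
topic_b = ["bus", "road", "traffic", "vehicle"]

pairs = [wv.similarity(a, b) for a in topic_a for b in topic_b if a in wv and b in wv]
print(np.mean(pairs))                     # average similarity between the two topics
```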
There are some recent publications combining Topic Models and Word Embeddings, you can look for them if you're interested.
