I have two sets of short messages, and I want to compute the similarity between these two sets to identify whether they are talking about the same sub-topic based on their semantic similarity. I know how to use pairwise similarity; my problem is that I want to compute the overall similarity among all the sentences in the two sets, not just between two sentences. Is there a way to use tf-idf or word2vec/doc2vec with cosine similarity to calculate the overall score?
Basically, what I did is take the vectors of each word in each sentence.
Then average the word vectors of each sentence to get one vector per sentence, and compute the cosine similarity between the two sentence vectors.
Of course, before you do that you need a trained word2vec model. doc2vec's similarity does the same thing, as internally it keeps a word2vec model.
So you have two options: train a doc2vec model and use its built-in similarity, or train a word2vec model and do the work yourself.
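A minimal sketch of that approach, assuming a word2vec model trained with gensim (the model path and the two example sets are placeholders):

import numpy as np
from gensim.models import Word2Vec

# Assumed: a word2vec model you have already trained on your own corpus.
model = Word2Vec.load("my_word2vec.model")  # hypothetical path

def sentence_vector(sentence, model):
    # Average the vectors of the in-vocabulary words of a sentence.
    words = [w for w in sentence.lower().split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For an overall score between two sets, average each set into one vector and compare.
set_a = ["the dog chased the ball", "puppies love playing fetch"]
set_b = ["cats sleep most of the day", "kittens chase laser pointers"]

vec_a = np.mean([sentence_vector(s, model) for s in set_a], axis=0)
vec_b = np.mean([sentence_vector(s, model) for s in set_b], axis=0)
print(cosine(vec_a, vec_b))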
InferSent also helps in finding semantic similarity.
Related
Basically what I want is to know how similar a specific sentence/document is to my training corpus.
I think I might have half an idea of how to approach this but I'm not too sure.
So my idea is to calculate an average vector for the document and then somehow calculating the similarity like that. I just don't know how I would calculate the similarity then.
So say I have a training corpus filled with text about dogs. If I then want to check how similar the sentence "The airplane has 100 seats." is to the training corpus, I want it to output a low similarity score.
This is a semantic textual similarity problem. You can have a look at state-of-the-art models here https://nlpprogress.com/english/semantic_textual_similarity.html
Usually you would pass your document through an encoder to create a representation (an embedding of the document), then do the same with the sentence (usually using the same encoder). The vectors could be fed into additional layers for further processing. A similarity metric like cosine could then be used on the vectors (embeddings), or a joint final representation could be used for classification.
You can use some pretrained language models in the encoding step and fine tune them for your use-case.
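As one concrete illustration (not the only option), a pretrained sentence encoder such as Sentence-Transformers can produce both embeddings, with cosine similarity on top; the checkpoint name below is just a common default:

from sentence_transformers import SentenceTransformer, util

# Assumed: the sentence-transformers package with this pretrained checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_doc = "Dogs are loyal companions. They love walks and playing fetch."
sentence = "The airplane has 100 seats."

doc_emb = model.encode(corpus_doc)
sent_emb = model.encode(sentence)

# Should be low, since the sentence is unrelated to the dog corpus.
print(float(util.cos_sim(doc_emb, sent_emb)))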
I have a set of files for five different categories, and most of them are not labelled correctly. The objective is to predict the correct category of a file whenever one is uploaded. I used cosine similarity along with tf-idf and predict the class whose document has the maximum cosine similarity. As of now I am getting good results, but I am really not sure how well this will work down the road. Also, why isn't cosine similarity used in building document classifiers instead of machine learning models when the categories of the files are labelled correctly? I would really appreciate your feedback on my approach as well as your answer to the question.
Cosine similarity is used for calculating the angle between two n-dimensional vectors. These vectors are mostly produced by embedding models, i.e. pretrained models that map words (or documents) to fixed-size vectors.
Cosine similarity is mostly used with vectors produced by word embeddings. If you are using something like Doc2Vec, then you get a vector for the whole document. These vectors could be categorized by using cosine similarity.
In your case, you should try an LSTM text classifier with Embedding layers; 1D convolution layers can also be useful.
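A minimal Keras sketch of such a classifier (the vocabulary size, layer sizes and number of classes are illustrative, not tuned values):

import tensorflow as tf

# Illustrative hyperparameters; tune them on your own corpus.
VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 20000, 128, 5

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=5)  # after tokenizing and padding your documents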
Also, referring to TF-IDF, it is useful for text classification that depends on certain words in the corpus. Words with a higher term frequency and a lower document frequency get a higher TF-IDF score, and the model learns to classify texts based on such scores.
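A compact sketch of the tf-idf plus cosine-similarity approach you describe, using scikit-learn (the reference documents per category are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder reference text for each category (you would use your labelled files).
category_docs = {
    "invoices": "invoice payment amount due billing",
    "contracts": "agreement party terms termination clause",
}

vectorizer = TfidfVectorizer()
names = list(category_docs)
ref_matrix = vectorizer.fit_transform(category_docs.values())

new_doc = "please find the attached invoice for this billing period"
scores = cosine_similarity(vectorizer.transform([new_doc]), ref_matrix)[0]
print(names[scores.argmax()])  # category with the highest cosine similarity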
In most cases, RNNs work well for classifying texts, and using pretrained embeddings makes the model more efficient.
Also, not least, you can give naive Bayes text classification a try. It has been super useful in spam classification.
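A naive Bayes baseline is only a few lines with scikit-learn (the training texts and labels below are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data: texts and their (correct) category labels.
texts = ["invoice payment due", "terms of the agreement", "monthly billing statement"]
labels = ["invoices", "contracts", "invoices"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["attached is the invoice"]))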
Tip:
You can combine the above methods into a single text classification system, following a process like this:
1. Generate embeddings with Doc2Vec.
2. Compare the similarity of the input with other texts and thereby determine its class.
3. Feed the embedding into an LSTM network to produce class probabilities.
4. Apply naive Bayes text classification.
Steps 2, 3 and 4 give three predictions. If the majority prediction is CLASS1, then the system outputs CLASS1.
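A tiny sketch of that majority vote, assuming the three predictions are already available as strings:

from collections import Counter

# Hypothetical predictions from steps 2, 3 and 4.
predictions = ["CLASS1", "CLASS3", "CLASS1"]
final_class, _ = Counter(predictions).most_common(1)[0]
print(final_class)  # CLASS1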
I am trying to build a Fake news classifier and I am quite new in this field. I have a column "title_1_en" which has the title for fake news and another column called "title_2_en". There are 3 target labels; "agreed", "disagreed", and "unrelated" if the title of the news in column "title_2_en" agrees, disagrees or is unrelated to that in the first column.
I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This gives a cosine similarity score, but it needs a lot of improvement because synonyms and semantic relationships have not been considered at all.
import numpy as np

def L2(vector):
    # Euclidean (L2) norm of the vector
    norm_value = np.linalg.norm(vector)
    return norm_value

def Cosine(fr1, fr2):
    # Cosine similarity: dot product divided by the product of the norms
    cos = np.dot(fr1, fr2) / (L2(fr1) * L2(fr2))
    return cos
The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that and the most naive way is:
Convert each and every word into a vector - this can be done using standard pre-trained vectors such as word2vec or GloVe.
Now every sentence is just a bag of word vectors. This needs to be converted into a single vector, i.e., mapping the full sentence to one vector. There are many ways to do this too. For a start, just take the average of the bag of vectors in the sentence.
Compute cosine similarity between the two sentence vectors.
spaCy's similarity is a good place to start; it uses the averaging technique. From the docs:
By default, spaCy uses an average-of-vectors algorithm, using pre-trained vectors if available (e.g. the en_core_web_lg model). If not, the doc.tensor attribute is used, which is produced by the tagger, parser and entity recognizer. This is how the en_core_web_sm model provides similarities. Usually the .tensor-based similarities will be more structural, while the word vector similarities will be more topical. You can also customize the .similarity() method, to provide your own similarity function, which can be trained using supervised techniques.
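A minimal usage sketch (assuming the en_core_web_lg model has been downloaded):

import spacy

# Requires: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("The new movie was awesome.")
doc2 = nlp("The fresh film was great.")
print(doc1.similarity(doc2))  # averaged-word-vector cosine similarity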
I did LDA over a corpus of documents with topic_number=5. As a result, I have five vectors of words, each word associates with a weight or degree of importance, like this:
Topic_A = {(word_A1,weight_A1), (word_A2, weight_A2), ... ,(word_Ak, weight_Ak)}
Topic_B = {(word_B1,weight_B1), (word_B2, weight_B2), ... ,(word_Bk, weight_Bk)}
.
.
Topic_E = {(word_E1,weight_E1), (word_E2, weight_E2), ... ,(word_Ek, weight_Ek)}
Some of the words are common between documents. Now, I want to know, how I can calculate the similarity between these vectors. I can calculate cosine similarity (and other similarity measures) by programming from scratch, but I was thinking, there might be an easier way to do it. Any help would be appreciated. Thank you in advance for spending time on this.
I am programming with Python 3.6 and the gensim library (but I am open to any other library).
I know someone else has asked a similar question (Cosine Similarity and LDA topics), but because he didn't get an answer, I am asking it again.
After LDA you have topics characterized as distributions over words. If you plan to compare these probability vectors (weight vectors if you prefer), you can simply use any cosine similarity implemented for Python, sklearn for instance.
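For example, a short sketch with scikit-learn, assuming lda is your already-trained gensim LdaModel with topic_number=5:

from sklearn.metrics.pairwise import cosine_similarity

# Assumed: `lda` is the gensim LdaModel you trained with five topics.
topic_term_matrix = lda.get_topics()           # shape: (5, vocabulary_size)
topic_similarities = cosine_similarity(topic_term_matrix)
print(topic_similarities)                      # 5 x 5 matrix; entry [0, 1] compares Topic_A and Topic_B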
However, this approach will only tell you which topics put similar probability mass on the same words.
If you want to measure similarities based on semantic information instead of word occurrences, you may want to use word vectors (as those learned by Word2Vec, GloVe or FastText).
These are learned vectors that represent words as low-dimensional vectors encoding certain semantic information. They're easy to use in gensim, and the typical approach is to load a pre-trained model learned from Wikipedia articles or news.
If you have topics defined by words, you can represent these words as vectors and take the average of the cosine similarities between the words in two topics (we did this for a workshop). There are some sources using these word vectors (also called word embeddings) to represent topics or documents in some way. For instance, this one.
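A rough sketch of that averaging idea with gensim's pre-trained vectors (the embedding name and the topic word lists are placeholders; you could also weight each pair by the LDA word weights):

import itertools
import gensim.downloader as api

# Assumed: a pre-trained embedding available through gensim's downloader.
wv = api.load("glove-wiki-gigaword-100")

topic_a = ["dog", "puppy", "leash"]
topic_b = ["cat", "kitten", "whiskers"]

pairs = [(a, b) for a, b in itertools.product(topic_a, topic_b) if a in wv and b in wv]
avg_sim = sum(wv.similarity(a, b) for a, b in pairs) / len(pairs)
print(avg_sim)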
There are some recent publications combining Topic Models and Word Embeddings, you can look for them if you're interested.
I have a set of short documents (1 or 2 paragraphs each). I have used three different approaches for document similarity:
- simple cosine similarity on the tf-idf matrix
- applying LDA on the whole corpus, then using the LDA model to create a vector for each document, and applying cosine similarity
- applying LSA on the whole corpus, then using the LSA model to create a vector for each document, and applying cosine similarity
Based on my experiments, I am getting better results with simple cosine similarity on the tf-idf matrix, without any LDA or LSA. From what I have read, LDA or LSA should improve the results, but in my case they do not!
Is there any idea why LDA or LSA give worse results?
Also, both LDA and LSA, when trained for more than 1000 iterations, find similarity between some documents with probability higher than 90%, even though the documents are totally unrelated!
Is there any justification for that?
Thanks
I have used the LDA4j implementation and got better results than with TF-IDF, and similarly for LSI I have used the semantic-vector implementation. If you have your own implementation, share the model sketch. One more thing: you should normalize the corpus for better results.
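As an illustration of that normalization step, a small preprocessing sketch with gensim (the stopword list and minimum token length are just one reasonable choice):

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def normalize(doc):
    # Lowercase, strip punctuation, drop stopwords and very short tokens.
    return [tok for tok in simple_preprocess(doc, min_len=3) if tok not in STOPWORDS]

docs = ["This is a short document about topic modelling!",
        "Another short document, also about topic models."]
normalized_corpus = [normalize(d) for d in docs]
print(normalized_corpus)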