Methods of calculating text string similarity? - nlp

Let’s say I have an array of strings and I need to sort them into clusters. I am currently doing the analysis using n-grams, e.g.:
Cluster 1:
Pipe fixing
Pipe fixing in Las Vegas
Movies about Pipe fixing
Cluster 2:
Classical music
Why classical music is great
What is classical music
etc.
Let’s say within this array I have these two strings of text (among others):
Japanese students
Students from Japan
Now, the n-gram method will obviously not put these two strings together, as they do not share the same tokenized structure. I tried Damerau-Levenshtein distance and TF-IDF, but both pick up too much noise. Which other techniques can I use to recognize that these two strings belong in a single cluster?

You can use a simple bag-of-words representation of the phrases, taking both unigrams and bigrams (possibly after stemming), putting them into a feature vector, and then measuring the similarity between vectors with, e.g., the cosine. This is meant for longer documents, but it might work well enough for your purposes.
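For instance, a minimal sketch of that idea using scikit-learn's CountVectorizer (an illustrative choice, not something prescribed here) to build unigram+bigram vectors and compare them with the cosine:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ["Japanese students", "Students from Japan", "Why classical music is great"]
# Count unigrams and bigrams; lowercasing is applied by default.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(phrases)
print(cosine_similarity(X))  # pairwise cosine similarity matrix
Note that without stemming, "Japan" and "Japanese" still won't match, which is why stemming (or the distributed models below) helps for that particular pair.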
A more sophisticated technique is to train a distributed bag-of-words model from a corpus of documents, and then use it to find similarities in pairs of documents.
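For reference, a minimal gensim Doc2Vec sketch of that idea (dm=0 selects the distributed bag-of-words, PV-DBOW, mode); with only a handful of phrases the vectors will be poor, so treat this as showing the moving parts rather than a working clusterer:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

phrases = ["Pipe fixing in Las Vegas", "Movies about Pipe fixing", "Students from Japan"]
docs = [TaggedDocument(words=p.lower().split(), tags=[str(i)]) for i, p in enumerate(phrases)]

# dm=0 trains the distributed bag-of-words (PV-DBOW) variant.
model = Doc2Vec(docs, vector_size=50, dm=0, min_count=1, epochs=100)
print(model.dv.similarity('0', '1'))  # similarity between the first two phrases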
[edit]
You can also compute these similarities with word2vec. For example, in Python with the gensim library and a pre-trained Google News word2vec model:
from gensim.models import KeyedVectors
# Load the pre-trained Google News word2vec vectors (binary format).
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print(model.n_similarity(['students', 'Japan'], ['Japanese', 'students']))
Output:
0.8718219720170907

You have a normalization problem. String equivalence drives your matching algorithms and "Japan" and "Japanese" are not string equivalent. Several options:
1) Normalize tokens into a root form, so "Japanese" is normalized to "Japan" or something like that. Normalization comes with some problems: you don't want "Jobs" normalized to "Job" when talking about "Steve Jobs". The Porter stemmer and other morphological tools can help with this.
2) Use character n-grams for your string equivalence. With 3-5 grams there would be instances of "Japan" in both sentences to cluster around. I am a big fan of this for classification, less sure for clustering; see the sketch after this list.
3) Use latent techniques to help cluster, like Latent Dirichlet Allocation. Roughly speaking, you associate "Japan" with "Japanese" via other string-equivalent words strongly associated with both, like "Tokyo" or the like.
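A minimal sketch of option 2, comparing the two phrases by the Jaccard overlap of their character 3-grams (plain Python, purely for illustration):
def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a = char_ngrams("Japanese students")
b = char_ngrams("Students from Japan")
print(len(a & b) / len(a | b))  # nonzero overlap thanks to shared 'jap', 'apa', 'stu', ...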
Breck

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked at gensim and GloVe but am struggling to make them work.
How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How may this be achieved, and is there any helpful material on the topic?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on a part of the Google News Dataset and generalizes well enough to most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model on your corpus.
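For example, a quick way to guard against OOV words before querying the model (a small illustrative sketch; wv is the model loaded above):
for word in ['father', 'sing', 'qwertyuiop']:
    if word in wv:                 # KeyedVectors supports membership tests
        print(word, wv[word][:3])  # first few dimensions of the vector
    else:
        print(word, 'is out of vocabulary')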

Text representations : How to differentiate between strings of similar topic but opposite polarities?

I have been clustering a certain corpus, obtaining results that group sentences together by computing their tf-idf and checking for similarity weights above a certain threshold value, using a gensim model.
tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model,stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)
The problem is that despite setting high threshold values, sentences on similar topics but with opposite polarities get clustered together. For example, the similarity weight obtained between "don't like it" and "i like it" is high enough for them to land in the same cluster.
Are there any other methods, libraries or alternative models that can differentiate the polarities effectively by assigning them very low similarities or opposite vectors?
This is so that the outputs "i like it" and "dont like it" are in separate clusters.
PS: Pardon me if there are any conceptual errors as I am rather new to NLP. Thank you in advance!
The problem is in how you represent your documents. Tf-idf is good for representing long documents where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards the polarity because negative particles like "no" or "not" will appear in most documents and they will always receive a low weight.
I would recommend trying some neural embeddings that might capture the polarity. If you want to keep using Gensim, you can try doc2vec but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.
Even averaging word embeddings can work reasonably well (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.
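A minimal sketch of that averaging idea with pre-trained FastText vectors from gensim's downloader (the model name 'fasttext-wiki-news-subwords-300' is one of the available pre-trained sets); note that a plain average may still not separate the polarities well, which is the point of the next answer:
import numpy as np
import gensim.downloader as api

wv = api.load('fasttext-wiki-news-subwords-300')

def sentence_vector(sentence):
    # Average the vectors of the in-vocabulary tokens.
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

a = sentence_vector("i like it")
b = sentence_vector("dont like it")
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity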
Unfortunately, simple text representations based merely on the sets-of-words don't distinguish such grammar-driven reversals-of-meaning very well.
The method needs to be sensitive to meaningful phrases, and the hierarchical, grammar-driven inter-word dependencies, to model that.
Deeper neural networks using convolutional or recurrent techniques, or methods that model sentence structure as a tree, do better at this.
For ideas see for example...
"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
...or a more recent summary presentation...
"Representations for Language: From Word Embeddings to Sentence Meanings"

Cluster similar words using word2vec

I have various restaurant labels, and some words that are unrelated to restaurants as well, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels related to food choices and leave out words like "Oil and Lube" or "transportation".
I tried using word2vec, but some labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way, using NLP or Word2Vec, to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often covers other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out and tinker with. (For example: whether words that are drop-in replacements for each other, or words that are common in the same topics, end up closest to each other can be influenced by whether the window parameter is smaller or larger. So you could find that tuning the Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
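As an illustration of the multi-word-token point, here is a minimal sketch (with made-up label texts) that uses gensim's Phrases to merge frequent word pairs into single tokens before training a small domain Word2Vec model:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Hypothetical tokenized texts from your own domain (in practice, many more).
sentences = [
    ['vegan', 'pizza', 'burger', 'coffee'],
    ['oil', 'and', 'lube', 'change', 'transportation'],
    ['vegetarian', 'coffee', 'burger'],
]

# With enough data, frequent pairs such as ('oil', 'and') get merged into single
# tokens like 'oil_and'; a second Phrases pass merges longer spans like 'oil_and_lube'.
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[sentences], min_count=5, threshold=10))
phrased = [trigram[bigram[s]] for s in sentences]

# Train a small Word2Vec model on the phrase-merged corpus.
model = Word2Vec(phrased, vector_size=100, window=3, min_count=1)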
OK, thank you for the details.
In order to train word2vec you should take into account the following points:
You need a large and varied text dataset. Review your training set and make sure it contains the data you need in order to obtain what you want.
Set one sentence/phrase per line.
For preprocessing, you need to delete punctuation and set all strings to lower case.
Do NOT lemmatize or stem; the text will become less complex!
Try different settings:
5.1 Algorithm: I used word2vec, and I can say CBOW (continuous bag-of-words) provided better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: Vector length = 300 is OK.
Now run the training algorithm. Then, use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine provides satisfactory results: the similarity between two words is a value between -1 and 1. Synonyms have high cosine values; you must find the threshold between words that are synonyms and those that are not.
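For instance, a minimal sketch of that thresholding idea using a small pre-trained model from gensim's downloader (an illustrative shortcut rather than the custom training described above; the seed word 'food' and the 0.3 cutoff are made up):
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # small pre-trained model, for illustration
labels = ['vegan', 'pizza', 'burger', 'coffee', 'transportation', 'bookstores']

for label in labels:
    if label in wv:  # skip out-of-vocabulary labels
        score = wv.similarity('food', label)
        print(label, round(float(score), 3), 'keep' if score > 0.3 else 'drop')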

Semantic Similarity across multiple languages

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).
So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)?
Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.
It should be possible to do what you've suggested, provided that:
you have good sets of word-vectors for each language's words
the coordinate spaces of the word-vectors are compatible, meaning the words for the exact-same things in both languages have nearly-identical coordinates (and other words with similar meanings have close coordinates)
That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally-separate English/Dutch corpuses won't likely place equivalent words at the same coordinates.
But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
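A minimal numpy sketch of that idea: given a handful of anchor word pairs (the pairs below are hypothetical, and real work uses thousands), learn a linear map from Dutch vectors to English vectors by least squares, then apply it to the rest of the Dutch vocabulary; en_wv and nl_wv stand for gensim KeyedVectors assumed to be loaded for each language:
import numpy as np

# Hypothetical anchor pairs: (Dutch word, English word) known to be translations.
anchors = [('student', 'student'), ('muziek', 'music'), ('stad', 'city')]

X = np.stack([nl_wv[nl] for nl, _ in anchors])  # source-language vectors
Y = np.stack([en_wv[en] for _, en in anchors])  # target-language vectors

# Solve for W minimizing ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def to_english_space(dutch_word):
    # Map a Dutch vector into the English coordinate space.
    return nl_wv[dutch_word] @ W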
In fact this very idea was used in one of the first word2vec papers:
"Exploiting Similarities among Languages for Machine Translation"
If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
I've recently produced a Python implementation of the technique mentioned in the paper from @gojomo's answer: transvec.
You'll need to provide word translation pairs as training data (I just threw words from my corpus into Google Translate to get as many such pairs as I can) and then you can use a wrapper model from transvec to produce comparable word embeddings for multiple languages. Here's an example:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
For the case of documents rather than words, things are a little trickier because Doc2Vec can't use pre-trained Word2Vec models as a starting point. However, you can get an approximate document vector by simply taking the mean of all the word vectors from that document. If you provide a 2d array to TranslationWordVectorizer's transform method, it will do exactly this and provide you with an approximate document vector so you can find documents with similar meaning even if the languages are different.
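A rough sketch of that mean-of-word-vectors approximation (this is the general idea, not the transvec API itself; en_model is the GloVe model loaded above, and the tokenization is deliberately naive):
import numpy as np

def doc_vector(text, wv):
    # Average the vectors of the in-vocabulary tokens; crude but serviceable.
    tokens = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0)

vec = doc_vector("students from japan study classical music", en_model)
print(vec.shape)  # (300,)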

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on what you want to solve with this; it comes down to the task.
Essentially, lemmatization shrinks the vocabulary and makes the data less sparse, which can help if you don't have enough training data.
But since Word2Vec is usually trained on fairly large corpora, if you have enough training data, lemmatization shouldn't gain you much.
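For reference, a minimal lemmatization pass with NLTK's WordNetLemmatizer (an illustrative choice; spaCy or any other lemmatizer works just as well):
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

print([lemmatizer.lemmatize(t) for t in ['students', 'movies', 'fixing']])  # treats tokens as nouns by default
print(lemmatizer.lemmatize('fixing', pos='v'))  # 'fix' when treated as a verb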
Something more interesting is how to do tokenization with respect to the existing dictionary of word-vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
The current project I am working on involves identifying gene names within biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:
The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
As noted above, you get more surface forms representing a certain "meaning", but at the same time that meaning gets split among its representatives; let me clarify with an example.
We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word that has different forms (badly, worst, ...). Since Word2vec builds its vectors based on the probabilities of a word's context (its surrounding words), when you don't lemmatize some of these forms you might end up losing the relationships between some of these words. This way, in the BAD case, you might end up with a word closer to gene names than to adjectives in the vector space.
