Paraphrase recognition using sentence level similarity - nlp

I'm a new entrant to NLP (Natural Language Processing). As a starter project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the SimMetrics package developed by the University of Sheffield, which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code works only at the character level, whereas I require code at the sentence level (i.e. treating a single word as the unit instead of a character). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or could someone provide the code) at the sentence level for the above-mentioned measures?
Thanks a lot in advance for your time and effort helping me.

I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as Wordnet, and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
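To make pointer (1) a bit more concrete, here is a minimal sketch of a word-level bigram language model with add-one smoothing that scores a sentence by its log-likelihood. The function names, the boundary markers and the choice of add-one smoothing are my own assumptions for illustration; a real system would use a much larger corpus and pick the cut-off from annotated similar/dissimilar examples.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent word pairs
    vocab = {w for tokens in sentences for w in tokens} | {"</s>"}
    return contexts, bigrams, vocab

def log_likelihood(tokens, model):
    """Add-one smoothed log-likelihood of a tokenized sentence."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen words
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + v))
        for prev, word in zip(padded, padded[1:])
    )

def is_similar(tokens, model, cutoff):
    """Compare the per-token log-likelihood against a tuned cut-off."""
    return log_likelihood(tokens, model) / (len(tokens) + 1) >= cutoff

model = train_bigram_model([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(log_likelihood(["the", "cat", "sat"], model))
print(log_likelihood(["sat", "cat", "the"], model))
```

The score is divided by the number of predicted tokens so that long sentences are not penalized just for being long; whether that normalization suits your data is something to check empirically.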

As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start with something simpler (if relatively boring) such as chunking.
Have a look at the docs and books for the Python NLTK library - there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there: the state of the art isn't good enough for a simple yes/no or even a probability.
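As for the word-level measures the question asks about, here is a rough sketch (not SimMetrics code) of Levenshtein distance computed over words instead of characters, plus a Manhattan distance over word-count vectors. Whitespace tokenization and lower-casing are simplifying assumptions, and the function names are made up for this example.

```python
from collections import Counter

def word_levenshtein(sentence_a, sentence_b):
    """Edit distance where the unit of insertion/deletion/substitution is a word."""
    s, t = sentence_a.lower().split(), sentence_b.lower().split()
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, s_word in enumerate(s, start=1):
        curr = [i]
        for j, t_word in enumerate(t, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,          # delete s_word
                            curr[j - 1] + 1,      # insert t_word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def manhattan_distance(sentence_a, sentence_b):
    """L1 distance between the word-count vectors of two sentences."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    return sum(abs(counts_a[w] - counts_b[w]) for w in set(counts_a) | set(counts_b))

print(word_levenshtein("the cat sat on the mat", "the dog sat on a mat"))
print(manhattan_distance("the cat sat on the mat", "the dog sat on a mat"))
```

Jaro-Winkler can be adapted the same way in principle, by running the algorithm over lists of words instead of strings of characters.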

Related

Domain-specific word similarity

Does anyone know of an accurate tool or method that can be used to compute word embeddings or find similarity among domain-specific words? I'm working on an NLP project that involves computing cosine similarity between technical terms, such as "address" and "socket", but pre-trained models like word2vec aren't giving useful embeddings or accurate cosine similarities because they aren't specific to technical terms. Since the more general, nontechnical meanings of "address" and "socket" aren't similar to one another, these pretrained models aren't giving them sufficiently high similarity scores for the purposes of my project. I would appreciate any advice people are able to offer. Thank you!
With sufficient data from your specific domain, you can train your own word2vec model - whose resulting word-vectors, being only influenced by your domain data, will be far more reflective of the in-domain meanings.
Similarly, if you have a mixture of data where you have hints that some word uses are for different senses of a polysemous word, you could try preprocessing your text, using those hints, replacing the ambiguous tokens (like say 'address') with a larger number of distinct tokens (like 'address*networking', 'address*delivery', etc). Even with a lot of error in such a process, its results might be sufficient for a specific purpose.
For example, maybe you'd assume all docs of a certain type – like articles from a particular publication – always mean 'address*networking' when they write 'address'. That crude replacement, on just some subset of docs sufficient to collect enough varied examples of 'address*networking' usage, might leave you with a good-enough word-vector for 'address*networking'.
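A crude version of that replacement step could look like the sketch below. The document types, the rule table and the function name are all hypothetical; the only idea taken from the text is rewriting an ambiguous token as 'token*sense' based on a hint about where the document came from.

```python
# Hypothetical hints: which sense an ambiguous token takes in each kind of document.
SENSE_RULES = {
    "networking_article": {"address": "networking", "socket": "networking"},
    "shipping_guide": {"address": "delivery"},
}

def tag_senses(tokens, doc_type, sense_rules=SENSE_RULES):
    """Replace ambiguous tokens with sense-specific stand-ins such as 'address*networking'."""
    senses = sense_rules.get(doc_type, {})
    return [f"{tok}*{senses[tok]}" if tok in senses else tok for tok in tokens]

print(tag_senses(["bind", "the", "socket", "to", "an", "address"], "networking_article"))
print(tag_senses(["ship", "to", "this", "address"], "shipping_guide"))
```

The rewritten token lists would then be used as the training sentences, so each stand-in gets its own vector.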
(More generally, deciding which word sense of multiple candidates is meant by a particular word is called "word sense disambiguation", and it might be possible to use other preexisting code for performing that to help preprocess texts - replacing ambiguous tokens with more-specific stand-ins – before performing word2vec training.)
Even without such assistive pre-processing, there've been a number of research attempts to extend word2vec to better model words with multiple contrasting meanings. Googling for [word2vec polysemy] or [polysemous embeddings] should turn up a bunch of examples.
But I don't know any of those techniques that have become widely-used, or that are explicitly supported by major word2vec libraries, so I can't specifically recommend or show working code for any. I don't know a standard best-practice or off-the-shelf solution – you'd have to treat adopting those ideas from research papers as an R&D project, performing a lot of your own implementation/evaluation to see if any help with your goals.

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick the labels that are related to food choices and leave out words like oil and lube, transportation.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually. But I want to know whether there is a way to use NLP or Word2Vec to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like say the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or include multi-word tokens like 'oil_and_lube'. But, if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often other forms of close-relation including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity-values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out & tinker with. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common-in-the-same-topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
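To illustrate the two points above, here is a small sketch that turns multi-word labels into single tokens like 'oil_and_lube' and then keeps the labels whose vectors are close to the average of a few known food labels. The two-dimensional vectors are toy values I made up, not real word2vec output, and the seed labels and threshold are assumptions you would have to tune on your own vectors.

```python
import math

def label_to_token(label):
    """'Oil and Lube' -> 'oil_and_lube', so a multi-word label is one vocabulary entry."""
    return "_".join(label.lower().split())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_labels(vectors, seeds, threshold):
    """Keep tokens whose vector is within `threshold` cosine similarity of the seed average."""
    dim = len(next(iter(vectors.values())))
    centroid = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    return [tok for tok, vec in vectors.items() if cosine(vec, centroid) >= threshold]

# Toy vectors standing in for trained word-vectors (made-up numbers).
TOY_VECTORS = {
    "vegan": [0.9, 0.1],
    "pizza": [0.8, 0.2],
    "coffee": [0.7, 0.3],
    label_to_token("Oil and Lube"): [0.1, 0.9],
    "transportation": [0.2, 0.8],
}

print(related_labels(TOY_VECTORS, seeds=["vegan", "pizza"], threshold=0.9))
```

With real vectors the gap between food and non-food labels will be far less clean than in this toy data, which is why the cutoff needs experimentation.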
OK, thank you for the details.
In order to train word2vec you should take the following into account:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the data you need in order to obtain what you want.
2. Put one sentence/phrase per line.
3. For preprocessing, delete punctuation and convert all strings to lower case.
4. Do NOT lemmatize or stem, because the text will be less complex!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say Continuous Bag-of-Words (CBOW) provided better results, on different training sets, than SkipGram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: Vector length = 300 is OK.
6. Now run the training algorithm. Then use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. vectors) with cosine similarity. From my experience, cosine provides a satisfactory result: the similarity between two words is a double that usually falls between 0 and 1 (cosine similarity can in general range from -1 to 1). Synonyms have high cosine values; you must find the limit between words which are synonyms and others that are not.
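The preprocessing in steps 2 to 4 can be sketched roughly as follows. The function names are made up for this example, and stripping every ASCII punctuation character is one simple reading of "delete punctuation"; you may want to keep hyphens or apostrophes depending on your labels.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_line(line):
    """Lower-case one sentence, drop punctuation, split into tokens (no stemming)."""
    return line.lower().translate(_PUNCT_TABLE).split()

def preprocess_corpus(lines):
    """One sentence/phrase per input line; lines that end up empty are skipped."""
    return [tokens for tokens in (preprocess_line(line) for line in lines) if tokens]

corpus = ["Best Vegan Pizza, in town!", "   ", "Oil and lube: open 24/7."]
print(preprocess_corpus(corpus))
```

The resulting list of token lists is the usual input shape for word2vec trainers; the vector size and the CBOW/SkipGram choice from step 5 are then set on the trainer itself.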

Unsupervised sentiment Analysis using doc2vec

Folks,
I have searched Google for different type of papers/blogs/tutorials etc but haven't found anything helpful. I would appreciate if anyone can help me. Please note that I am not asking for code step-by-step but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying the positive and negative tone of a sentence, I want to find whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use bag of words approach to sum up the number of forward-looking words/phrases such as "going forward", "in near future" or "In 5 years from now" etc. I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
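A rough sketch of the first methodology is below. It covers the term filter and the "distance from average" check only. Note the big assumption: `embed` is a plain word-count vector standing in for a doc2vec inferred vector, purely so the example runs without a trained model; the function names and the punctuation handling are also mine.

```python
import math
from collections import Counter

TEMPORAL_TERMS = {"days", "years", "months", "now", "later"}

def tokenize(sentence):
    return [tok.strip(".,!?;:") for tok in sentence.lower().split()]

def has_temporal_term(sentence):
    """Step 2: keep only sentences containing at least one key term."""
    return any(tok in TEMPORAL_TERMS for tok in tokenize(sentence))

def embed(sentence, vocab):
    """Stand-in for doc2vec inference: word counts over a fixed vocabulary."""
    counts = Counter(tokenize(sentence))
    return [counts[w] for w in vocab]

def distance_from_average(sentence, seed_sentences):
    """Euclidean distance between a new sentence and the mean seed vector."""
    vocab = sorted({tok for s in seed_sentences for tok in tokenize(s)})
    seed_vectors = [embed(s, vocab) for s in seed_sentences]
    average = [sum(col) / len(seed_vectors) for col in zip(*seed_vectors)]
    return math.dist(embed(sentence, vocab), average)

candidates = ["We expect growth in five years.", "Revenue grew last quarter.",
              "Growth will come in later months."]
seeds = [s for s in candidates if has_temporal_term(s)]
print(seeds)
print(distance_from_average("We expect growth in coming years", seeds))
```

A new sentence would be accepted when its distance falls under some limit chosen by inspection; the GMM clustering variant would replace the single average with several cluster centres.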
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
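The collocation part of this second methodology might look something like the sketch below. To keep it short it ranks n-grams by raw frequency, which is only a stand-in for a proper collocation measure such as PMI, and the function names and `min_count` parameter are assumptions of mine. The classifier step is left out.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def temporal_collocations(sentences, temporal_terms, min_count=1):
    """Collect bigrams and trigrams that contain a temporal term and occur often enough."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in (2, 3):
            counts.update(g for g in ngrams(tokens, n)
                          if any(tok in temporal_terms for tok in g))
    return sorted(g for g, c in counts.items() if c >= min_count)

def binary_features(sentence, collocations):
    """Bag-of-collocations: 1 if the collocation occurs in the sentence, else 0."""
    tokens = sentence.lower().split()
    present = set(ngrams(tokens, 2)) | set(ngrams(tokens, 3))
    return [1 if c in present else 0 for c in collocations]

terms = {"days", "years", "months", "now", "later"}
colls = temporal_collocations(["in five years", "five years later"], terms, min_count=2)
print(colls)
print(binary_features("sales double in five years", colls))
```

The binary vectors would then be the input features for whatever classifier you train on labelled forward-looking and non-forward-looking sentences.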
This sounds like a good case for a Bootstrapping approach if you have large amounts of texts.
Both are semi-supervised really, since there is some need for finding initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.

Calculating grammar similarity between two sentences

I'm making a program which provides English sentences that the user wants to learn more of.
For example:
First, I provide a sentence "I have to go school today" to user.
Then if the user wants to learn more sentences like that, I find some sentences which have high grammar similarity with that sentence.
I think the only way for providing sentences is to calculate similarity.
Is there a way to calculate grammar similarity between two sentences?
or is there a better way to make that algorithm?
Any advice or suggestions would be appreciated. Thank you.
My approach to this problem would be to do part-of-speech tagging using a tool like NLTK and compare the tree structure of your phrase with those in your database.
Alternatively, if you already have a training dataset, use WEKA to apply a machine learning approach to connect the phrases.
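As a simplified illustration of the tagging idea, the sketch below compares flat part-of-speech tag sequences rather than full trees, using Python's standard difflib. The tags are assumed to have been produced beforehand by a tagger such as NLTK's; the example tags and the small sentence database are hand-written for this example.

```python
from difflib import SequenceMatcher

def tag_sequence_similarity(tags_a, tags_b):
    """Ratio in [0, 1] of matching POS tags between two tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()

def most_similar(query_tags, tagged_db, top_n=3):
    """Return the sentences whose tag sequences best match the query."""
    scored = [(tag_sequence_similarity(query_tags, tags), sentence)
              for sentence, tags in tagged_db.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_n]]

# Hand-written tags standing in for tagger output.
query = ["PRP", "VBP", "TO", "VB", "NN", "NN"]  # "I have to go school today"
db = {
    "The red car is fast": ["DT", "JJ", "NN", "VBZ", "JJ"],
    "You need to see doctor tomorrow": ["PRP", "VBP", "TO", "VB", "NN", "NN"],
}
print(most_similar(query, db, top_n=1))
```

Comparing tag sequences ignores nesting, so two sentences with the same tags but different phrase structure would still score as identical; a tree-based comparison would be needed to tell them apart.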
You can parse your sentence as either a constituent or dependency tree and use these representations to formulate some form of query that you can use to find candidate sentences with similar structures.
You can check this available tool from Stanford NLP:
Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions"). Tregex comes with Tsurgeon, a tree transformation language. Also included from version 2.0 on is a similar package which operates on dependency graphs (class SemanticGraph), called semgrex.

Article "Matching" Algorithm

I have a rather specific question, at least for me. Specific because after quite a lot of searching I couldn't find anything useful. As the title says, I am looking for an algorithm that determines whether two input articles "match", not in the sense of usual string matching, but in the sense of whether they discuss the same topic. I expect the "match" should be compared against some threshold, using some kind of weights to determine how much they "match". The concept is therefore fuzzy: we can't talk about a complete "match", only about a degree of "match".
Sadly, I don't have anything more. I would be really grateful if someone could help me with this topic; theoretical ideas are welcome too.
Thank you.
There are many ways to find 'similarity' of articles, and it really depends on what you know about the articles, and what you use as your test case to show how good your results are.
One simple solution is using Jaccard Similarity on the vocabulary used by these documents. Pseudo code:
similarity(doc1, doc2):
    set1 <- getWords(doc1)
    set2 <- getWords(doc2)
    intersection <- set_intersection(set1, set2)
    union <- set_union(set1, set2)
    return size(intersection) / size(union)
Note that instead of getWords you can use also bigrams,trigrams,...n-grams.
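The pseudocode above translates fairly directly to Python. In this sketch `get_words` is a naive lower-cased whitespace split, which is my own simplification; swap in whatever tokenizer or n-gram extractor suits your articles.

```python
def get_words(doc):
    """Vocabulary of a document as a set of lower-cased whitespace tokens."""
    return set(doc.lower().split())

def jaccard_similarity(doc1, doc2):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set1, set2 = get_words(doc1), get_words(doc2)
    union = set1 | set2
    if not union:  # both documents empty: avoid dividing by zero
        return 0.0
    return len(set1 & set2) / len(union)

print(jaccard_similarity("the cat sat", "the cat ran"))
```

The result lies between 0 (no shared words) and 1 (identical vocabularies), so it can be compared directly against a threshold.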
A more complex unsupervised solution could be building a language model from each document and calculating their Jensen-Shannon divergence to judge whether they are similar or not, based on the language models.
A simple language model is P(word|document) = #occurrences(word,document)/size(document)
Usually we use some smoothing techniques to make sure no word has probability 0.
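Here is a minimal sketch of that idea: a smoothed unigram language model per document and the Jensen-Shannon divergence between the two. Additive (add-alpha) smoothing over the joint vocabulary and the base-2 logarithm are choices I made for the example, not the only options.

```python
import math
from collections import Counter

def language_model(doc, vocab, alpha=1.0):
    """P(word|document) with add-alpha smoothing so no word has probability 0."""
    counts = Counter(doc.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

def js_divergence(doc1, doc2, alpha=1.0):
    """Jensen-Shannon divergence in bits: 0 for identical models, at most 1."""
    vocab = set(doc1.lower().split()) | set(doc2.lower().split())
    p = language_model(doc1, vocab, alpha)
    q = language_model(doc2, vocab, alpha)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence("a b c", "a b d"))
print(js_divergence("a b c", "x y z"))
```

Lower values mean more similar documents, so the threshold works in the opposite direction from the Jaccard score.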
Other solutions use supervised learning algorithms such as SVM. Your features can be the words (tf-idf model / bag of words model /...), and you use these features to classify whether doc1,doc2 are 'similar'. This requires obtaining a 'training set' that is basically a set of samples (doc1,doc2) and labels that tell you whether (doc1,doc2) are 'similar' or not. Feed the training data to a learner and build a model - that will later be used to classify new pairs of documents.

Resources