Article "Matching" Algorithm - string

I have a rather specific question, at least it is for me. Specific because after doing quite a lot of searching I couldn't find anything useful. As the title says, I am looking for an algorithm that determines whether two articles given as input "match", not in the sense of usual string matching, but in the sense of whether they talk about the same topic. What I expect is that the "match" should be compared against some threshold, using some kind of weights to determine how much they "match". The concept is therefore fuzzy, so we can't talk about a complete "match", only about a degree of "match".
Sadly, I don't have anything more to go on. I would be really grateful if any of you could help me with this topic; theoretical ideas are also welcome.
Thank you.

There are many ways to measure the 'similarity' of articles, and it really depends on what you know about the articles and what you use as your test case to show how good your results are.
One simple solution is using Jaccard similarity on the vocabulary used by these documents. A minimal runnable version in Python:
def get_words(doc):
    # naive whitespace tokenisation; use a proper tokenizer in practice
    return set(doc.lower().split())

def similarity(doc1, doc2):
    set1 = get_words(doc1)
    set2 = get_words(doc2)
    intersection = set1 & set2
    union = set1 | set2
    return len(intersection) / len(union) if union else 0.0
Note that instead of get_words you can also use bigrams, trigrams, ..., n-grams.
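For example, a word-n-gram variant of get_words could be sketched like this (get_ngrams is a hypothetical helper, and the whitespace tokenisation is a simplification):

def get_ngrams(doc, n=2):
    # returns the set of word n-grams (n=2 -> bigrams, n=3 -> trigrams, ...)
    words = doc.lower().split()
    return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# usage: compute the Jaccard similarity over get_ngrams(doc, 2) instead of get_words(doc)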
A more complex unsupervised solution could be to build a language model from each document and calculate the Jensen-Shannon divergence between them to judge, based on the language models, whether the documents are similar.
A simple language model is P(word | document) = #occurrences(word, document) / size(document).
Usually we apply some smoothing technique (such as add-one/Laplace smoothing) to make sure no word has probability 0.
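A rough sketch of that idea, assuming doc1 and doc2 are the two article strings, with whitespace tokenisation and add-one smoothing over a shared vocabulary (the function names are only illustrative):

from collections import Counter
from math import log2

def language_model(doc, vocab):
    # P(word | document) with add-one (Laplace) smoothing over a shared vocabulary
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def js_divergence(p, q):
    # Jensen-Shannon divergence between two distributions over the same vocabulary
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * log2(a[w] / b[w]) for w in a)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# doc1, doc2: the two article strings being compared
vocab = set(doc1.lower().split()) | set(doc2.lower().split())
jsd = js_divergence(language_model(doc1, vocab), language_model(doc2, vocab))
# 0 means identical language models; values near 1 (with log base 2) mean very different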
Other solutions use supervised learning algorithms such as SVM. Your features can be the words (tf-idf model / bag-of-words model / ...), and you use these features to classify whether doc1 and doc2 are 'similar'. This requires obtaining a 'training set', which is basically a set of sample pairs (doc1, doc2) with labels that tell you whether each pair is 'similar' or not. Feed the training data to a learner and build a model, which will later be used to classify new pairs of documents.
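A very rough sketch of that supervised setup with scikit-learn (train_pairs and train_labels are assumed to be your own labelled document pairs, and the absolute-difference pair features are just one possible choice):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# train_pairs: list of (doc1, doc2) strings; train_labels: 1 = similar, 0 = not similar
vectorizer = TfidfVectorizer()
vectorizer.fit([d for pair in train_pairs for d in pair])

def pair_features(doc1, doc2):
    # one simple choice of pair features: |tfidf(doc1) - tfidf(doc2)|
    v1, v2 = vectorizer.transform([doc1, doc2]).toarray()
    return np.abs(v1 - v2)

X = np.array([pair_features(d1, d2) for d1, d2 in train_pairs])
clf = LinearSVC().fit(X, train_labels)
# later: clf.predict([pair_features(new_doc1, new_doc2)])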

Related

Domain-specific word similarity

Does anyone know of an accurate tool or method that can be used to compute word embeddings or find similarity among domain-specific words? I'm working on an NLP project that involves computing cosine similarity between technical terms, such as "address" and "socket", but pre-trained models like word2vec aren't giving useful embeddings or accurate cosine similarities because they aren't specific to technical terms. Since the more general, non-technical meanings of "address" and "socket" aren't similar to one another, these pretrained models aren't giving them sufficiently high similarity scores for the purposes of my project. Would appreciate any advice people are able to offer. Thank you!
With sufficient data from your specific domain, you can train your own word2vec model, whose resulting word-vectors, being influenced only by your domain data, will be far more reflective of the in-domain meanings.
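For example, with the gensim library this might look roughly as follows (corpus_sentences is assumed to be your own list of tokenised in-domain sentences, and the hyperparameters are just typical starting values to tune):

from gensim.models import Word2Vec

# corpus_sentences: tokenised sentences from your domain,
# e.g. [["bind", "the", "socket", "to", "an", "address"], ...]
model = Word2Vec(corpus_sentences, vector_size=100, window=5, min_count=5, workers=4)

print(model.wv.most_similar("socket", topn=10))
print(model.wv.similarity("address", "socket"))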
Similarly, if you have a mixture of data where you have hints that some word uses are for different senses of a polysemous word, you could try preprocessing your text, using those hints, replacing the ambiguous tokens (like say 'address') with a larger number of distinct tokens (like 'address*networking', 'address*delivery', etc). Even with a lot of error in such a process, its results might be sufficient for a specific purpose.
For example, maybe you'd assume all docs of a certain type – like articles from a particular publication – always mean 'address*networking' when they write 'address'. That crude replacement, on just some subset of docs sufficient to collect enough varied examples of 'address*networking' usage, might leave you with a good-enough word-vector for 'address*networking'.
(More generally, deciding which of multiple candidate senses is meant by a particular word is called "word sense disambiguation", and it might be possible to use preexisting code for that task to help preprocess texts – replacing ambiguous tokens with more-specific stand-ins – before performing word2vec training.)
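A crude sketch of that kind of preprocessing, tagging by document source as in the example above (the publication names and the tag are made up for illustration):

NETWORKING_SOURCES = {"netdev-weekly", "protocol-digest"}  # made-up publication names

def tag_ambiguous_tokens(tokens, source):
    # crude rule: in documents from known networking publications,
    # assume 'address' always carries the networking sense
    if source in NETWORKING_SOURCES:
        return ["address*networking" if t == "address" else t for t in tokens]
    return tokens

# run this over the corpus before word2vec training, so the model learns a
# separate vector for 'address*networking'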
Even without such assistive pre-processing, there've been a number of research attempts to extend word2vec to better model words with multiple contrasting meanings. Googling for [word2vec polysemy] or [polysemous embeddings] should turn up a bunch of examples.
But I don't know of any of those techniques that has become widely used or that is explicitly supported by major word2vec libraries, so I can't specifically recommend or show working code for any. There is no standard best practice or off-the-shelf solution that I know of – you'd have to treat adopting those ideas from the research papers as an R&D project, doing a lot of your own implementation and evaluation to see whether any of them help with your goals.

How to solve difficult sentences for NLP sentiment analysis

Take the following sentence:
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
Technique 1: "Bag of Words" (BOW)
Here we have two categories: "positive" words with a polarity of 1 and "negative" words with a polarity of 0. In this case the word "good" falls into the positive category, but it would only be correct by accident. Thus, this technique is ruled out.
Technique 2: still BOW, but taking surrounding words into account (a sort of "word embedding")
Here we also consider the surrounding words, in this case the "no" preceding "good", so the unit becomes "no good" rather than the adjective "good" alone. However, "no good" is still not what the author intended in the context of the entire sentence.
Hence this question. Thanks in advance.
Word embeddings are one possible way to take into account the complexity coming from the sequence of terms in your example. Using models pre-trained on general English, such as BERT, should give you interesting results for your sentiment analysis problem. You can leverage the implementations provided by the Hugging Face library.
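For instance, with the Hugging Face transformers library, a quick way to try this is the built-in sentiment pipeline (whether its default model handles this kind of sentence well is something you'd have to verify on your own data):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Don't pay attention to people if they say it's no good."))
# returns a list of {'label': ..., 'score': ...} dicts; check the labels against your expectations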
Another approach, which doesn't rely on compute-intensive techniques (such as word embeddings), would be to use n-grams, which capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation of n-grams in its CountVectorizer class.
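A minimal sketch of that n-gram approach with scikit-learn (texts and labels are assumed to be your own labelled training sentences):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# texts: list of training sentences; labels: their 0/1 sentiment labels
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # unigrams, bigrams and trigrams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["Don't pay attention to people if they say it's no good."]))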

Unsupervised sentiment analysis using doc2vec

Folks,
I have searched Google for different types of papers/blogs/tutorials etc. but haven't found anything helpful. I would appreciate it if anyone could help me. Please note that I am not asking for step-by-step code but rather for an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying the positive or negative tone of a sentence, I want to find out whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that sums up the number of forward-looking words/phrases such as "going forward", "in the near future" or "in 5 years from now" etc. I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
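A rough sketch of the doc2vec and distance steps above, assuming gensim and that temporal_sentences already holds the tokenised sentences containing your key terms:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

# temporal_sentences: tokenised sentences that contain the temporal key terms
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(temporal_sentences)]
model = Doc2Vec(tagged, vector_size=50, min_count=2, epochs=40)

# average vector of the known temporal sentences
centroid = np.mean([model.dv[i] for i in range(len(tagged))], axis=0)

# distance of a new sentence from that average; a small distance suggests a temporal statement
new_vec = model.infer_vector("in five years from now".split())
print(cosine(new_vec, centroid))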
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
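The collocation extraction step could be sketched with NLTK roughly like this (tokens is assumed to be a flat list of word tokens from your corpus; the frequency filter and cut-off are arbitrary):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
# tokens: a flat list of word tokens from your corpus
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                         # drop very rare pairs
bigrams = finder.nbest(bigram_measures.pmi, 200)

temporal_terms = {"days", "years", "months", "now", "later"}
# keep only collocations that contain one of the temporal terms
temporal_collocations = [bg for bg in bigrams if temporal_terms & set(bg)]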
This sounds like a good case for a bootstrapping approach if you have large amounts of text.
Both methodologies are really semi-supervised, since there is some need to find initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.

How word2vec or doc2vec understands user sentiments

I have gone through numerous documents about doc2vec and word2vec. I do understand how powerful it is to represent words as vectors and to perform simple operations like vector addition and subtraction to yield meaningful analogies between words.
One thing I am still not able to understand, though, is how this technique can be used to understand user sentiment.
Can someone please elaborate on how user sentiments are analysed using these techniques?
Thanks
Samir
By representing a document or set of words with feature vectors, you can process text in other machine learning tasks. For example, if you have a dataset which labels each document x with its sentiment y, you can use the pretrained embeddings as the feature vectorisation to represent x as input to your machine learning method and test whether these features help your task.
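A minimal sketch of that idea, using averaged pretrained word-vectors as the feature representation (kv is assumed to be a pretrained gensim KeyedVectors model, and documents/sentiments your own labelled data):

import numpy as np
from sklearn.linear_model import LogisticRegression

# kv: pretrained word vectors (e.g. gensim KeyedVectors)
# documents: list of texts; sentiments: their sentiment labels
def doc_vector(text):
    # average the vectors of the words we have embeddings for
    vecs = [kv[w] for w in text.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

X = np.array([doc_vector(d) for d in documents])
clf = LogisticRegression(max_iter=1000).fit(X, sentiments)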

Paraphrase recognition using sentence level similarity

I'm a new entrant to NLP (Natural Language Processing). As a start up project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the simMetrics package developed by the University of Sheffield which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code is only at character level, whereas I require code at the sentence level (i.e. considering a single word as a unit instead of character-wise). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or someone provide me the code) at the sentence level for the above mentioned measures?
Thanks a lot in advance for your time and effort helping me.
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
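A very small word-level sketch of that idea with add-one smoothing (corpus_sentences and candidate_sentence are placeholders for your own data, and the length normalisation is just one reasonable choice):

from collections import Counter
from math import log

def train_unigram_lm(corpus_sentences):
    counts = Counter(w for s in corpus_sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts)
    # add-one smoothed unigram probability
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def log_likelihood(sentence, prob):
    words = sentence.lower().split()
    return sum(log(prob(w)) for w in words) / max(len(words), 1)  # length-normalised

lm = train_unigram_lm(corpus_sentences)
score = log_likelihood(candidate_sentence, lm)
# compare score against a cut-off tuned on the annotated similar/dissimilar sentences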
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as WordNet, and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
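For the WordNet part, NLTK exposes synset-level similarity measures, e.g.:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

car = wn.synsets('car')[0]
automobile = wn.synsets('automobile')[0]
print(car.wup_similarity(automobile))   # Wu-Palmer similarity; 1.0 when the synsets coincide
print(car.path_similarity(automobile))

Picking the right synset for each word (and which words to compare across the two sentences) is itself part of the problem, as noted above.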
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start off with something simpler (if relatively boring) such as chunking.
Have a look at the docs and books for the Python NLTK library - there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there; the state of the art isn't good enough for a simple yes/no or even a probability.

Resources