NLP text distances

What is the best way to calculate the distance between words for semantic meaning? For example, assume we are searching for the word "fraud" in a document associated with two nouns, "PersonA" and "PersonB". The text is something like the below.
......"PersonA".....fraud.............."PersonB".........................................................................."fraud"
The conclusion is that the noun "PersonA" is more likely to be associated with "fraud", since "fraud" appears nearer to "PersonA" than to "PersonB". Is there a good algorithm/statistical model to measure this for text mining?

First of all, it seems that the measure you're trying to obtain isn't an ordinary 'semantic meaning' distance, or semantic similarity. It's more likely to be an association measure.
So, if you have a lot of occurrences of the words to be processed, then look at PMI (pointwise mutual information) or other distributional association measures (e.g. the week 8 lectures of the Natural Language Processing course).
If you have just a few occurrences, then I'd suggest performing syntactic parsing and measuring the ordinary distance in the parse tree.
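For the many-occurrences case, even something as simple as counting co-occurrences and computing PMI can help. A rough sketch (the toy token list, window size and entity names below are made-up placeholders, not part of any library API):
import math
from collections import Counter

def pmi(tokens, target, entity, window=5):
    # PMI(target, entity) = log( p(target, entity) / (p(target) * p(entity)) ),
    # with the joint probability estimated from co-occurrences within `window` tokens
    n = len(tokens)
    unigrams = Counter(tokens)
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in (target, entity)}
    pair_count = sum(1
                     for i in positions[target]
                     for j in positions[entity]
                     if abs(i - j) <= window)
    if pair_count == 0:
        return float('-inf')
    return math.log((pair_count / n) / ((unigrams[target] / n) * (unigrams[entity] / n)))

tokens = ("PersonA fraud report filed another fraud claim surfaced "
          "PersonB issued a statement later today").lower().split()
# both 'fraud' mentions fall inside PersonA's window, only one inside PersonB's,
# so PersonA gets the higher association score on this toy text
print(pmi(tokens, 'fraud', 'persona'), pmi(tokens, 'fraud', 'personb'))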

Related

Why word embedding techniques work

I have looked into some word embedding techniques, such as:
CBOW: from context to a single word. The weight matrix produced is used as the embedding vectors.
Skip-gram: from a word to its context (from what I see, it's actually word to word, as a single prediction is enough). Again, the weight matrix produced is used as the embedding.
Introductions to these tools always quote "cosine similarity", which says that words of similar meaning map to similar vectors.
But these methods are all based on the 'context', accounting only for the words around a target word. I should say they are 'syntagmatic' rather than 'paradigmatic'. So why does closeness in distance in a sentence indicate closeness in meaning? I can think of many counterexamples that occur frequently:
"Have a good day" ("good" and "day" are vastly different in meaning, though close in distance).
"toilet" and "washroom" (two words of similar meaning, but a sentence containing one is unlikely to contain the other).
Any possible explanation?
This sort of "why" isn't a great fit for StackOverflow, but some thoughts:
The essence of word2vec & similar embedding models may be compression: the model is forced to predict neighbors using far less internal state than would be required to remember the entire training set. So it has to force similar words together, in similar areas of the parameter space, and force groups of words into various useful relative-relationships.
So, in your second example of 'toilet' and 'washroom', even though they rarely appear together, they do tend to appear around the same neighboring words. (They're synonyms in many usages.) The model tries to predict them both, to similar levels, when typical words surround them. And vice-versa: when they appear, the model should generally predict the same sorts of words nearby.
To achieve that, their vectors must be nudged quite close by the iterative training. The only way to get 'toilet' and 'washroom' to predict the same neighbors, through the shallow feed-forward network, is to corral their word-vectors to nearby places. (And further, to the extent they have slightly different shades of meaning – with 'toilet' more the device & 'washroom' more the room – they'll still skew slightly apart from each other towards neighbors that are more 'objects' vs 'places'.)
Similarly, words that are formally antonyms, but easily stand-in for each-other in similar contexts, like 'hot' and 'cold', will be somewhat close to each other at the end of training. (And, their various nearer-synonyms will be clustered around them, as they tend to be used to describe similar nearby paradigmatically-warmer or -colder words.)
On the other hand, your example "have a good day" probably doesn't have a giant influence on either 'good' or 'day'. Both words' more unique (and thus predictively-useful) senses are more associated with other words. The word 'good' alone can appear everywhere, so has weak relationships everywhere, but still a strong relationship to other synonyms/antonyms on an evaluative ("good or bad", "likable or unlikable", "preferred or disliked", etc) scale.
All those random/non-predictive instances tend to cancel-out as noise; the relationships that have some ability to predict nearby words, even slightly, eventually find some relative/nearby arrangement in the high-dimensional space, so as to help the model for some training examples.
Note that a word2vec model isn't necessarily an effective way to predict nearby words. It might never be good at that task. But the attempt to become good at neighboring-word prediction, with fewer free parameters than would allow a perfect-lookup against training data, forces the model to reflect underlying semantic or syntactic patterns in the data.
(Note also that some research shows that a larger window influences word-vectors to reflect more topical/domain similarity – "these words are used about the same things, in the broad discourse about X" – while a tiny window makes the word-vectors reflect a more syntactic/typical similarity - "these words are drop-in replacements for each other, fitting the same role in a sentence". See for example Levy/Goldberg "Dependency-Based Word Embeddings", around its Table 1.)
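To see these knobs concretely, here is a minimal gensim sketch (gensim 4.x assumed; the toy corpus is far too small to produce meaningful vectors and is only meant to show where the window size and skip-gram/CBOW choice go):
from gensim.models import Word2Vec

corpus = [
    ['the', 'toilet', 'is', 'down', 'the', 'hall'],
    ['the', 'washroom', 'is', 'down', 'the', 'hall'],
    ['have', 'a', 'good', 'day'],
]

# sg=1 selects skip-gram (sg=0 is CBOW); a small window favours drop-in/syntactic
# similarity, a large window favours broader topical similarity
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.similarity('toilet', 'washroom'))
print(model.wv.most_similar('toilet', topn=3))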
‘Embedding’ means a semantic vector representation, e.g. how to represent words such that synonyms are nearer than antonyms or other unrelated words.
Embedding algorithms like Word2vec map entities, be they e-commerce items or words (say, in the English language), to N-dimensional vectors.
Now, since you have a mathematical representation of the entities in a Euclidean space, you can use the associated semantics, such as the distance between vectors. For example:
For a given item, say ‘Levis Jeans’, recommend the most related items, which are often co-purchased with it.
This can be done easily: search for the nearest vectors to the vector of ‘Levis Jeans’, and recommend them. You will find that the nearest vectors correspond to items such as T-shirts etc., which are relevant to the Levis Jeans. Similarly, it preserves distance/similarity between words, e.g.: King - Queen = Man - Woman!
Yes, Word2vec captures such co-occurrence relationships when mapping items/words to vectors, also called ‘item/word embeddings’.
This is not specifically targeted at sentence embeddings, but it nevertheless gives some crucial insights that are extremely relevant to the core logic behind embedding generation. Read till the end.
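As a concrete illustration of the nearest-neighbour and analogy lookups described above, here is a small sketch using gensim's downloader with a pretrained GloVe model (assumes network access to fetch the vectors; the 'jeans' query is just an example):
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')   # small pretrained GloVe word vectors

# nearest vectors ~ most related words/items
print(wv.most_similar('jeans', topn=5))

# King - Man + Woman ≈ Queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))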

How to identify words with the same meaning in order to reduce number of tags/categories/classes in a dataset

So here is an example of a column in my data-set:
"industries": ["Gaming", "fitness and wellness"]
The industries column has hundreds of different tags, some of which can have the same meaning, for example, some rows have: "Gaming" and some have "video games" and others "Games & consoles".
I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).
What is the standard solution in this case?
I don't know that there is a "standard" solution, but I can suggest a couple of approaches, ranked by increasing depth of knowledge, or going from the surface form to the meaning.
String matching
Lemmatisation/stemming
Word embedding vector distance
String matching is based on calculating the difference between strings, as a measure of how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common ones. However, depending on the size of your data, it might be a bit inefficient to use. This is a really cool approach to finding the most similar strings in a large data set.
However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
Lemmatisation/stemming goes beyond the surface by analysing the words based on their morphology. In your example, gaming and games both have the same stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching, as you can see that go and went are related.
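A quick sketch of these first two approaches on the tags from the question (nltk's edit_distance and PorterStemmer are just one possible choice of tools here):
import nltk
from nltk.stem import PorterStemmer

tags = ['Gaming', 'video games', 'Games & consoles']

# string matching: Levenshtein edit distance between the raw tag strings
for a in tags:
    for b in tags:
        if a < b:   # print each unordered pair once
            print(a, '|', b, '| edit distance:', nltk.edit_distance(a.lower(), b.lower()))

# stemming: compare stems instead of surface forms
stemmer = PorterStemmer()
print({tag: [stemmer.stem(tok) for tok in tag.lower().split()] for tag in tags})
# 'gaming' and 'games' should both stem to 'game', so these tags can be matched on stems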
Word embeddings go beyond the surface form by encoding meaning as the context in which words appear, and as such might find a semantic similarity between health and fitness that is not apparent from the surface at all! The similarity is measured as the cosine distance/similarity between two word vectors, which is essentially based on the angle between the two vectors.
It seems to me that the third approach might be most suitable for your data.
This is a tough NLU question! Basically: 'what are synonyms or near-synonyms of each other, even if there's no exact string overlap?'
1. Use GLoVE word embeddings to judge synonymous words
It might be interesting to use spaCy's pre-trained GLoVE model (en_vectors_web_lg) for word embeddings, to get the pairwise distances between tokens, and use that as a metric for judging 'closeness'.
import spacy

# load the pre-trained GLoVE vectors package (assumes en_vectors_web_lg is installed)
nlp = spacy.load('en_vectors_web_lg')

doc1 = nlp('board games')
doc2 = nlp('Games & Recreation')
doc3 = nlp('video games')

for doc in [doc1, doc2, doc3]:
    for comp in [doc1, doc2, doc3]:
        if doc != comp:
            print(f'{doc} | {comp} | similarity: {round(doc.similarity(comp), 4)}')
board games | Games & Recreation | similarity: 0.6958
board games | video games | similarity: 0.7732
Games & Recreation | board games | similarity: 0.6958
Games & Recreation | video games | similarity: 0.675
video games | board games | similarity: 0.7732
video games | Games & Recreation | similarity: 0.675
(GLoVE is cool - really nice mathematical intuition for word embeddings.)
PROS: GLoVE is robust, spaCy has it built in, vector space comparisons are easy in spaCy
CONS: It doesn't handle out of vocabulary words well, spaCy's just taking the average of all the token vectors here (so it's sensitive to document length)
2. Try using different distance metrics/fuzzy string matching
You might also look at different kinds of distance metrics -- cosine distance isn't the only one.
FuzzyWuzzy is a good implementation of Levenshtein distance for fuzzy string matching (no vectors required).
This library implements a whole slew of string-matching algorithms.
PROS: Using a preconfigured library saves you some coding, other distance metrics might help you find new correlations, don't need to train a vector model
CONS: More dependencies, some kinds of distance aren't appropriate and will miss synonymous words without literal string overlap
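For example, a minimal FuzzyWuzzy sketch on the tags from the question (pip install fuzzywuzzy; the choice of scorers here is just illustrative):
from fuzzywuzzy import fuzz, process

print(fuzz.ratio('Gaming', 'video games'))                 # plain Levenshtein-based ratio
print(fuzz.token_set_ratio('Gaming', 'Games & consoles'))  # more tolerant of word order/extras

# map a new tag onto the closest existing one
choices = ['Gaming', 'fitness and wellness', 'Games & consoles']
print(process.extractOne('video games', choices))          # returns (best match, score)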
3. Use WordNet to get synonym sets
You could also get a sort of dictionary of synonym sets ('synsets') from WordNet, which was put together by linguists as a kind of semantic knowledge graph.
The nice thing about this is it gets you some textual entailment -- that is, given sentence A, would a reader think sentence B is most likely true?
Because it was handmade by linguists and grad students, WordNet isn't as dependent on string overlap and can give you nice semantic enrichment. It also provides things like hyponyms/meronyms and hypernyms/holonyms -- so you could, e.g., say 'video game' is a subtype of 'game', which is a subtype of 'recreation' -- just based off of WordNet.
You can access WordNet in python through the textblob library.
from textblob import Word
from textblob.wordnet import NOUN
game = Word('game').get_synsets(pos=NOUN)
for synset in game:
    print(synset.definition())
a contest with rules to determine a winner
a single play of a sport or other contest
an amusement or pastime
animal hunted for food or sport
(tennis) a division of play during which one player serves
(games) the score at a particular point or the score needed to win
the flesh of wild animals that is used for food
a secret scheme to do something (especially something underhand or illegal)
the game equipment needed in order to play a particular game
your occupation or line of work
frivolous or trifling behavior
print(game[0].hyponyms())
[Synset('athletic_game.n.01'),
Synset('bowling.n.01'),
Synset('card_game.n.01'),
Synset('child's_game.n.01'),
Synset('curling.n.01'),
Synset('game_of_chance.n.01'),
Synset('pall-mall.n.01'),
Synset('parlor_game.n.01'),
Synset('table_game.n.01'),
Synset('zero-sum_game.n.01')]
Even cooler, you can get the similarity based on these semantic features between any words you like.
print((Word('card_game').synsets[0]).shortest_path_distance(Word('video_game').synsets[0]))
5
PROS: Lets you use semantic information like textual entailment to get at your objective, which is hard to get in other ways
CONS: WordNet is limited to what is in WordNet, so again out-of-vocabulary words may be a problem for you.
I suggest using the word2vec approach or the lemmatisation approach.
With the first one, you can compute vectors starting from words, so you have a projection into a vector space. With this projection you can compute the similarity between words (with cosine similarity, as @Shnipp said) and then set a threshold, below which you say that two words belong to different topics.
Using lemmatisation, you can compare bare words/lemmas using SequenceMatcher. In this case your condition of equality could be based on the presence of very similar lemmas (similarity above 95%).
It's up to you to choose what is best for your purpose. If you want something solid and structured, use word2vec. Otherwise, if you want something simple and fast to implement, use the lemmatisation approach.
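A rough sketch of the lemmatisation + SequenceMatcher route (the 0.95 threshold is the one suggested above; WordNetLemmatizer is just one possible lemmatiser and assumes the WordNet data has been downloaded via nltk):
from difflib import SequenceMatcher
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def same_tag(a, b, threshold=0.95):
    # compare the lemmatised forms of the two tags rather than the raw strings
    la = ' '.join(lemmatizer.lemmatize(tok) for tok in a.lower().split())
    lb = ' '.join(lemmatizer.lemmatize(tok) for tok in b.lower().split())
    return SequenceMatcher(None, la, lb).ratio() >= threshold

print(same_tag('Gaming', 'gaming'))        # True
print(same_tag('Gaming', 'video games'))   # almost certainly False at such a high threshold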

NLP: Curating definitional summaries for a specific term from textbook

I would like to be able to curate definitional summaries for a specific term from a textbook.
For example, from a Biology textbook, I would like to be able form a concise summary for the word "mitochondria". I have tried this by first parsing through the textbook for all sentences that contain the word "mitochondria", and feeding those sentences through summarization algorithms such as TextRank and LexRank, but those algorithms were not able to determine "definitional" sentences that well.
By definitional summaries, I mean useful sentences as far as a definition goes. For example, the sentence "The mitochondria is the powerhouse of the cell" would be a definitional sentence while the sentence "Fungal cells also contain mitochondria and a complex system of internal membranes, including the endoplasmic reticulum and Golgi apparatus" is not really pertinent to the definition of the mitochondria.
Any help or leads would be very much appreciated.
There isn't really a straightforward way to do this, but you do have some options:
Just use a regex for "mitochondria is". It is the stupidest possible thing, but given a textbook it might prove satisfactory. It's simple enough that testing should be easy, and at worst it provides a baseline to compare alternatives against (a minimal sketch follows this list).
Run a parser (example: Stanford Parser) on each sentence with the word "mitochondria", and extract sentences where mitochondria is the subject. This would eliminate the negative example you gave. You would have to tune this, perhaps restricting main verbs, accounting for coordinators, and so on.
Use Information Extraction (example: Stanford OpenIE) to get a list of facts about mitochondria (like is-in(mitochondria, cell)) and do something with that.
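Here is the minimal regex baseline from the first option (textbook_text is a placeholder; a real textbook would need a proper sentence splitter):
import re

textbook_text = (
    "The mitochondria is the powerhouse of the cell. "
    "Fungal cells also contain mitochondria and a complex system of internal membranes."
)

# naive sentence split, then keep only sentences matching "mitochondria is/are ..."
sentences = re.split(r'(?<=[.!?])\s+', textbook_text)
definitional = [s for s in sentences if re.search(r'\bmitochondria\s+(is|are)\b', s, re.I)]
print(definitional)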
This is a very open-ended question. I can try to point out how I would approach it...
One way would be to use some kind of vector representation for the text (word2vec or sent2vec come to mind).
Then, by encoding the average of each sentence in vector format and checking the cosine similarity between this and the vector of the term you seek, you could get something close to the definitional sentences you seek.
Even testing the cosine similarity between the averaged sentences you get out of the summarisation algorithm and the term might help you judge how close you are.
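A small sketch of that idea, with spaCy vectors standing in for word2vec/sent2vec (assumes a model with word vectors such as en_core_web_md is installed; the candidate sentences are the examples from the question):
import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

term = nlp('mitochondria')
candidates = [
    'The mitochondria is the powerhouse of the cell.',
    'Fungal cells also contain mitochondria and a complex system of internal membranes.',
]
for sent in candidates:
    doc = nlp(sent)
    # average the token vectors of the sentence and compare to the term's vector
    avg = np.mean([tok.vector for tok in doc if tok.has_vector], axis=0)
    print(round(cosine(avg, term.vector), 3), '|', sent)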

How is polarity calculated for a sentence? (in sentiment analysis)

How is the polarity of the words in a statement calculated? For example:
"i am successful in accomplishing the task, but in vain"
How is each word scored? (e.g. successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)
How is it calculated? How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis, I have a few things to clarify, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
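A minimal example (assumes the vader_lexicon resource has been downloaded, e.g. via nltk.download('vader_lexicon')):
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("i am successful in accomplishing the task, but in vain"))
# -> a dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]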
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT or my AFINN. They have been scored either by individual experts, by students, or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning with annotated texts, or estimated from word ontologies or co-occurrence patterns.
As for the aggregation of individual words, there are various ways. One way is to sum all the individual scores (valences), another to take the max valence among the words, a third to normalise (divide) by the number of words or by the number of scored words (i.e., getting a mean score), or to divide by the square root of that number. The results may differ a bit.
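A toy illustration of those aggregation options, using the word scores from the question (they are illustrative values, not taken from any real word list):
import math

scores = {'successful': 0.7, 'accomplishing': 0.8, 'but': -0.5, 'vain': -0.8}
words = "i am successful in accomplishing the task but in vain".split()
valences = [scores[w] for w in words if w in scores]

print(sum(valences))                              # plain sum of valences
print(max(valences))                              # max valence among the words
print(sum(valences) / len(valences))              # mean over the scored words
print(sum(valences) / math.sqrt(len(valences)))   # sum scaled by the square root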
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach is with recursive models like Richard Socher's. The sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.

Financial news headers classification to positive/negative classes

I'm doing a small research project where I should try to split financial news article headers into positive and negative classes. For classification I'm using an SVM approach. The main problem I see now is that not many features can be produced for ML. News articles contain a lot of Named Entities and other "garbage" elements (from my point of view, of course).
Could you please suggest ML features which can be used for training? Current results are: precision = 0.6, recall = 0.8.
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set, that is, a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF/IDF representation and then you train a linear SVM to separate the two classes. Depending on the quality and size of your training set, you can achieve something decent, though I'm not sure about a 0.7 break-even point.
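A bare-bones sketch of that baseline with scikit-learn (the two example headers and labels are placeholders, not real data):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

headers = [
    "Company X posts record quarterly profit",
    "Company Y files for bankruptcy after downturn",
]
labels = ['positive', 'negative']

# TF/IDF representation feeding a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(headers, labels)
print(clf.predict(["Employment booming as markets rally"]))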
Then, to get better results, you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview of Sentiment Analysis by Bo Pang and Lillian Lee that you should read.
How about these features?
1. Length of the article header in words
2. Average word length
3. Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4. Ratio of words in that dictionary to total words in the sentence
5. Similar to 3, but the number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
6. Similar to 4, but using the "good"-word dictionary
7. Time of the article's publication
8. Date of the article's publication
9. The medium through which it was published (you'll have to do some subjective classification)
10. A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from the actual article, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at many other dictionaries online such as Ogden's 850 basic english dictionary, and see if bad/good articles would be likely to extract many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
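A sketch of a few of these features for a single header (the "bad"/"good" dictionaries below are the toy ones suggested in the list, not a curated resource):
bad_words = {'terrible', 'horrible', 'downturn', 'bankruptcy'}
good_words = {'boon', 'booming', 'employment'}

def header_features(header):
    tokens = [t.strip('.,!?') for t in header.lower().split()]
    n_bad = sum(t in bad_words for t in tokens)
    n_good = sum(t in good_words for t in tokens)
    return {
        'n_words': len(tokens),                                     # 1. length in words
        'avg_word_len': sum(len(t) for t in tokens) / len(tokens),  # 2. average word length
        'n_bad': n_bad,                                             # 3. "bad"-dictionary count
        'bad_ratio': n_bad / len(tokens),                           # 4. ratio of "bad" words
        'n_good': n_good,                                           # 5. "good"-dictionary count
        'good_ratio': n_good / len(tokens),                         # 6. ratio of "good" words
        'n_exclamations': header.count('!'),                        # 10. punctuation count
    }

print(header_features("Employment booming despite bankruptcy fears!"))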
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further along, if you still aren't close, could be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey the emotion or mood.
I wouldn't be too disheartened by your precision and recall figures, though; an F-score of 0.8 or above is actually quite good.
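A sketch of the POS-filtering step with NLTK (assumes the 'punkt' tokenizer and 'averaged_perceptron_tagger' resources have been downloaded; the headline is a made-up example):
import nltk

headline = "Markets rally as booming employment defies gloomy forecasts"
tagged = nltk.pos_tag(nltk.word_tokenize(headline))

# keep only adjectives (JJ*) and verbs (VB*), which tend to carry the sentiment
kept = [word for word, tag in tagged if tag.startswith(('JJ', 'VB'))]
print(kept)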

Resources