I need labeled data (human judgments) for the structural/hierarchical semantic distance between many pairs (at least hundreds) of words.
For example, d(computer, television) < d(radio, television) < d(dish washer, television).
If we organized all words in a dendrogram or tree, where each node is a category ("electric device", "with screen", etc.) and the words are the leaves, the number would represent the number of steps (nodes) we have to take to get from one word to the other.
Does such dataset exist?
Per-pair ratings are enough; there is no need for a full embedding/tree or to specify the nodes.
An example dataset would be:
Computer Television 1
Radio Television 2
DishWasher Television 3
Thanks!
I'm not aware of such human-judgement datasets, but I guess you could look at semantic networks like WordNet, which is a lexical database of English in the form of a graph. Given two words, you could compute the distance between the nodes representing them in WordNet (a code sketch follows the example below).
Both nouns and verbs are organized into hierarchies, defined by hypernym or IS-A relationships. For instance, one sense of the word dog can be found by following the hypernym hierarchy below; the words at the same level are members of the same synset. Each set of synonyms has a unique index.
dog, domestic dog, Canis familiaris
canine, canid
carnivore
placental, placental mammal, eutherian, eutherian mammal
mammal
vertebrate, craniate
chordate
animal, animate being, beast, brute, creature, fauna
...
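As a rough sketch of that node-distance idea, using NLTK's WordNet interface (this assumes the wordnet corpus has been downloaded and naively picks the first sense of each word, which may not be the one you intend):
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def node_distance(word1, word2):
    # Naively take the first (most frequent) synset for each word;
    # a real application should disambiguate the intended sense.
    s1 = wn.synsets(word1)[0]
    s2 = wn.synsets(word2)[0]
    # Number of edges on the shortest path through the hypernym/hyponym graph.
    return s1.shortest_path_distance(s2)

print(node_distance('computer', 'television'))
print(node_distance('radio', 'television'))
print(node_distance('dishwasher', 'television'))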
If you are looking for a dataset, you could also ask here.
So here is an example of a column in my data-set:
"industries": ["Gaming", "fitness and wellness"]
The industries column has hundreds of different tags, some of which have the same meaning; for example, some rows have "Gaming", some have "video games", and others "Games & consoles".
I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).
What is the standard solution in this case?
I don't know that there is a "standard" solution, but I can suggest a couple of approaches, ranked by increasing depth of knowledge, or going from the surface form to the meaning.
String matching
Lemmatisation/stemming
Word embedding vector distance
String matching is based on calculating the difference between strings, as a measure of how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common measures. However, depending on the size of your data, it might be a bit inefficient to use. This is a really cool approach for finding the most similar strings in a large dataset.
However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
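For illustration, here is a minimal pure-Python Levenshtein distance (no external dependencies; for large datasets you would want an optimised library instead):
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Gaming', 'video games'))  # large distance despite related meaning
print(levenshtein('Gaming', 'Games'))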
Lemmatisation/stemming goes beyond the surface form by analysing words based on their morphology. In your example, gaming and games share the stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching: lemmatisation, for instance, can tell that go and went are related.
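A small sketch of this with NLTK (the Porter stemmer handles the regular gaming/games case, while the irregular went/go case needs the WordNet lemmatiser):
from nltk.stem import PorterStemmer, WordNetLemmatizer  # lemmatiser needs nltk.download('wordnet')

stemmer = PorterStemmer()
print(stemmer.stem('gaming'), stemmer.stem('games'))  # both should reduce to 'game'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('went', pos='v'))          # 'go' -- an irregular form the stemmer would miss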
Word embeddings go beyond the surface form by encoding meaning as the context in which words appear, and as such might find a semantic similarity between health and fitness that is not apparent from the surface at all! The similarity is measured as the cosine similarity between two word vectors, which is essentially the cosine of the angle between them.
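For concreteness, cosine similarity between vectors u and v is u·v / (|u| |v|); here is a toy NumPy sketch with made-up three-dimensional vectors (real embeddings have hundreds of dimensions and come from a trained model):
import numpy as np

# Toy 3-d "embeddings" for illustration only.
health  = np.array([0.8, 0.1, 0.3])
fitness = np.array([0.7, 0.2, 0.4])
pizza   = np.array([0.1, 0.9, 0.2])

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(health, fitness))  # close to 1: similar direction
print(cosine_similarity(health, pizza))    # smaller: different direction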
It seems to me that the third approach might be most suitable for your data.
This is a tough NLU question! Basically: 'what are synonyms or near-synonyms of each other, even if there's no exact string overlap?'
1. Use GloVe word embeddings to judge synonymous words
It might be interesting to use spaCy's pre-trained GloVe vectors (the en_vectors_web_lg model) to get the pairwise distances between tokens, and use that as a metric for judging 'closeness'.
import spacy

nlp = spacy.load('en_vectors_web_lg')

doc1 = nlp('board games')
doc2 = nlp('Games & Recreation')
doc3 = nlp('video games')

for doc in [doc1, doc2, doc3]:
    for comp in [doc1, doc2, doc3]:
        if doc != comp:
            print(f'{doc} | {comp} | similarity: {round(doc.similarity(comp), 4)}')
board games | Games & Recreation | similarity: 0.6958
board games | video games | similarity: 0.7732
Games & Recreation | board games | similarity: 0.6958
Games & Recreation | video games | similarity: 0.675
video games | board games | similarity: 0.7732
video games | Games & Recreation | similarity: 0.675
(GloVe is cool - there's a really nice mathematical intuition behind these word embeddings.)
PROS: GloVe is robust, spaCy has it built in, and vector-space comparisons are easy in spaCy
CONS: It doesn't handle out-of-vocabulary words well, and spaCy is just taking the average of all the token vectors here (so it's sensitive to document length)
2. Try using different distance metrics/fuzzy string matching
You might also look at different kinds of distance metrics -- cosine distance isn't the only one.
FuzzyWuzzy is a good implementation of Levenshtein distance for fuzzy string matching (no vectors required).
This library implements a whole slew of string-matching algorithms.
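For example, a quick sketch with FuzzyWuzzy (assuming the package is installed; scores range from 0 to 100):
from fuzzywuzzy import fuzz

# Plain Levenshtein-based ratio vs. a token-based variant that ignores word order.
print(fuzz.ratio('Gaming', 'video games'))
print(fuzz.token_set_ratio('Games & consoles', 'video games'))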
PROS: Using a preconfigured library saves you some coding, other distance metrics might help you find new correlations, don't need to train a vector model
CONS: More dependencies, some kinds of distance aren't appropriate and will miss synonymous words without literal string overlap
3. Use WordNet to get synonym sets
You could also get a sort of dictionary of synonym sets ('synsets') from WordNet, which was put together by linguists as a kind of semantic knowledge graph.
The nice thing about this is it gets you some textual entailment -- that is, given sentence A, would a reader think sentence B is most likely true?
Because it was handmade by linguists and grad students, WordNet isn't as dependent on string overlap and can give you nice semantic enrichment. It also provides things like hyponyms/meronyms and hypernyms/holonyms -- so you could, e.g., say that 'video game' is a subtype of 'game', which falls under 'recreation' -- just based off of WordNet.
You can access WordNet in python through the textblob library.
from textblob import Word
from textblob.wordnet import NOUN

game = Word('game').get_synsets(pos=NOUN)
for synset in game:
    print(synset.definition())
a contest with rules to determine a winner
a single play of a sport or other contest
an amusement or pastime
animal hunted for food or sport
(tennis) a division of play during which one player serves
(games) the score at a particular point or the score needed to win
the flesh of wild animals that is used for food
a secret scheme to do something (especially something underhand or illegal)
the game equipment needed in order to play a particular game
your occupation or line of work
frivolous or trifling behavior
print(game[0].hyponyms())
[Synset('athletic_game.n.01'),
Synset('bowling.n.01'),
Synset('card_game.n.01'),
Synset('child's_game.n.01'),
Synset('curling.n.01'),
Synset('game_of_chance.n.01'),
Synset('pall-mall.n.01'),
Synset('parlor_game.n.01'),
Synset('table_game.n.01'),
Synset('zero-sum_game.n.01')]
Even cooler, you can get the similarity based on these semantic features between any words you like.
print((Word('card_game').synsets[0]).shortest_path_distance(Word('video_game').synsets[0]))
5
PROS: Lets you use semantic information like textual entailment to get at your objective, which is hard to get in other ways
CONS: WordNet is limited to what is in WordNet, so again out-of-vocabulary words may be a problem for you.
I suggest using the word2vec approach or the lemmatisation approach.
With the first one you can compute vectors from words, giving you a projection into a vector space. With this projection you can compute the similarity between words (with cosine similarity, as @Shnipp said) and then set a threshold above which you consider two words to refer to the same topic.
Using lemmatisation you can compare bare words/lemmas using SequenceMatcher (see the sketch at the end of this answer). In this case your condition for equality could be based on the presence of very similar lemmas (similarity above 95%).
It's up to you to choose the best one for your purpose. If you want something solid and structured, use word2vec. Otherwise, if you want something simple and fast to implement, use the lemmatisation approach.
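A minimal sketch of the SequenceMatcher idea, using the standard library's difflib on bare words or lemmas:
from difflib import SequenceMatcher

def string_similarity(a, b):
    # Ratio in [0, 1]; per the suggestion above, ~0.95 and up could count as "the same".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(string_similarity('gaming', 'games'))
print(string_similarity('gaming', 'fitness'))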
I'm looking to generate an algorithm that can determine the similarity of a series of sentences. Specifically, given a starter sentence, I want to determine if the following sentence is a suitable addition.
For example, take the following:
My dog loves to drink water.
All is good, this is just the first sentence.
The dog hates cats.
All is good, both sentences reference dogs.
It enjoys walks on the beach.
All is good, "it" is neutral enough to be an appropriate communication.
Pizza is great with pineapple on top.
This would not be a suitable addition, as the sentence does not build on to the "narrative" created by the first three sentences.
To outline the project a bit, I've created a library that generated Markov text chains based on the input text. That text is then corrected grammatically to produce viable sentences. I now want to string these sentences together to create coherent paragraphs.
How is the polarity of words in a statement calculated? For example:
"I am successful in accomplishing the task, but in vain"
How is each word scored (e.g., successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)?
How is it calculated? How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis, I have a few things I'd like to clarify, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out VADER (http://www.nltk.org/howto/sentiment.html -- skip down to the VADER section).
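A minimal example (assuming the VADER lexicon has been downloaded via nltk.download('vader_lexicon')):
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I am successful in accomplishing the task, but in vain"))
# Returns a dict with 'neg', 'neu', 'pos' proportions and a 'compound' score in [-1, 1].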
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT, or my AFINN. They have been scored by individual experts, by students, or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning on annotated texts, or estimated from word ontologies or co-occurrence patterns.
As for aggregating the individual words, there are various ways. One way would be to sum all the individual scores (valences); another is to take the max valence among the words; a third is to normalize (divide) by the number of words or by the number of scored words (i.e., taking a mean score), or to divide by the square root of that number. The results may differ a bit.
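As a sketch of these aggregation options, using made-up word valences like the ones in the question (not taken from any real word list):
import math

valences = {'successful': 0.7, 'accomplishing': 0.8, 'but': -0.5, 'vain': -0.8}
scores = list(valences.values())

total = sum(scores)                           # plain sum of valences
maximum = max(scores)                         # max valence among the words
mean = total / len(scores)                    # normalised by the number of scored words
sqrt_norm = total / math.sqrt(len(scores))    # divided by the square root of that number

print(total, maximum, mean, sqrt_norm)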
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach is to use recursive models like Richard Socher's. The sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.
What is the best way to calculate the distance between words for semantic meaning? For example, assume we are searching for the word "fraud" in a document associated with two nouns, "Person A" and "Person B". The text is something like this:
......"PersonA".....fraud.............."PersonB".........................................................................."fraud"
conslusion in "Noun - "PersonA is more likely to be adjective "fraud" since "fraud" is nearer to "PersonA" than "PersonB". Is there any good algorithm/statistical model to measure this for "text mining"
First of all, it seems that the measure you're trying to obtain isn't an ordinary 'semantic meaning' distance or semantic similarity; it's more likely an association measure.
So, if you have many occurrences of the words to be processed, then look at PMI or other distributional similarity measures (e.g., the week 8 lectures of the Natural Language Processing course).
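For reference, PMI for words x and y is log p(x, y) / (p(x) p(y)), estimated from occurrence and co-occurrence counts; a toy sketch with hypothetical counts:
import math

def pmi(count_xy, count_x, count_y, n_windows):
    # Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) ).
    p_xy = count_xy / n_windows
    p_x = count_x / n_windows
    p_y = count_y / n_windows
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: "PersonA" and "fraud" co-occur in 30 of 1000 text windows.
print(pmi(count_xy=30, count_x=50, count_y=100, n_windows=1000))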
If you have just a few occurrences, then I'd suggest performing syntactic parsing and measuring the ordinary distance in the parse tree.
For example: I have 100 books with 1000 words each. They belong to different classes (comedy, drama, ...). Each class consists of 15 different books.
When I do tf-idf on my data, I get the importance of every word in a book in the context of all books.
I see that the books belonging to the same class have similar tfidf values for each variable.
Let's say drama and comedy are pretty similar.
How can I tell what words make a difference in between those two classes?
What words do I have to change in a book that belongs to comedy so that the book belongs to drama?
I can check one by one; but I have 2000 books, 17500 words each; 950 classes. It would take a decade :)
As a first draft, compute the average vector for each class, normalize them to unit length, and compute the absolute differences.
These should give you a rough indication of which words distinguish the two classes.
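A rough sketch of that, assuming X is a dense tf-idf matrix of shape (n_books, n_words), labels is an array with each book's class, and vocab maps column indices to words (all names here are placeholders):
import numpy as np

def distinguishing_words(X, labels, vocab, class_a, class_b, top_k=20):
    # Average tf-idf vector per class, normalised to unit length.
    mean_a = X[labels == class_a].mean(axis=0)
    mean_b = X[labels == class_b].mean(axis=0)
    mean_a = mean_a / np.linalg.norm(mean_a)
    mean_b = mean_b / np.linalg.norm(mean_b)
    # Words with the largest absolute difference distinguish the two classes most.
    top = np.argsort(np.abs(mean_a - mean_b))[::-1][:top_k]
    return [vocab[i] for i in top]

# Hypothetical usage: distinguishing_words(X, labels, vocab, 'comedy', 'drama')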
I would definitely run pairwise tests, i.e. one for each of the 475*949 pairs of classes you have, as the "important variables" can differ very much from case to case. Then run some standard feature selection algorithm, such as chi-square or information gain. See http://www.jmlr.org/papers/volume3/forman03a/forman03a.pdf for an extensive study.
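For the feature-selection step, a sketch with scikit-learn's chi-square scoring on a single pair of classes (X, y and vocab are placeholders for a non-negative document-term matrix, its class labels, and the index-to-word mapping):
from sklearn.feature_selection import SelectKBest, chi2

def top_words_chi2(X, y, vocab, k=100):
    # Score every word against the class labels and keep the k most informative ones.
    selector = SelectKBest(chi2, k=k)
    selector.fit(X, y)
    return [vocab[i] for i in selector.get_support(indices=True)]

# Hypothetical usage: top_words_chi2(X_pair, y_pair, vocab)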