I am playing with WordNet and trying to solve an NLP task.
I was wondering if there exists any way to get a list of words belonging to some large sets, such as "animals" (i.e. dog, cat, cow etc.), "countries", "electronics" etc.
I believe that it should be possible to somehow get this list by exploiting hypernyms.
Bonus question: do you know any other way to classify words into very large classes besides "noun", "adjective" and "verb"? For example, classes like "prepositions", "conjunctions", etc.
Yes, you just check if the category is a hypernym of the given word.
from nltk.corpus import wordnet as wn
def has_hypernym(word, category):
    # Assume the category always uses the most popular sense
    cat_syn = wn.synsets(category)[0]
    # For the input, check all senses
    for syn in wn.synsets(word):
        for match in syn.lowest_common_hypernyms(cat_syn):
            if match == cat_syn:
                return True
    return False
has_hypernym('dog', 'animal') # => True
has_hypernym('bucket', 'animal') # => False
If the broader word (the "category" here) is the lowest common hypernym, that means it's a hypernym (an ancestor, not necessarily the immediate parent) of the query word, so the query word is in the category.
Regarding your bonus question, I have no idea what you mean. Maybe you should look at NER or open a new question.
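In case it helps to see the hypernym check without WordNet installed, here is a minimal sketch on a hand-made toy taxonomy. The dictionary entries and the members_of helper are made up for illustration and are not WordNet data or NLTK API. It also shows the "list everything in a category" direction the question asks about.

```python
# Toy stand-in for WordNet: each word maps to its direct hypernym (parent).
# These entries are invented for illustration, not taken from WordNet.
TOY_HYPERNYM = {
    "animal": "organism",
    "organism": "entity",
    "dog": "animal",
    "cat": "animal",
    "cow": "animal",
    "bucket": "container",
    "container": "entity",
}


def has_hypernym(word, category):
    """Return True if `category` is an ancestor of `word` in the toy taxonomy."""
    node = TOY_HYPERNYM.get(word)
    while node is not None:
        if node == category:
            return True
        node = TOY_HYPERNYM.get(node)
    return False


def members_of(category):
    """List every word in the toy taxonomy that falls under `category`."""
    return sorted(w for w in TOY_HYPERNYM if has_hypernym(w, category))


print(has_hypernym("dog", "animal"))     # True
print(has_hypernym("bucket", "animal"))  # False
print(members_of("animal"))              # ['cat', 'cow', 'dog']
```

With the real WordNet, the same listing is usually done by walking hyponyms downward from a chosen synset, for example with wn.synset('animal.n.01').closure(lambda s: s.hyponyms()), and collecting the lemma names.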
With some help from polm23, I found this solution, which exploits similarity between words and avoids wrong results when the class name is ambiguous.
The idea is that WordNet can be used to compare each word in a list with the synset for animal and compute a similarity score. From the nltk.org webpage:
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
from nltk.corpus import wordnet as wn
def keep_similar(words, similarity_thr):
    # Compare the first noun sense of each word against animal.n.01
    w2 = wn.synset('animal.n.01')
    return [word for word in words
            if wn.synset(word + '.n.01').wup_similarity(w2) > similarity_thr]
For example, if word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon'], the corresponding scores are:
dog 0.875
car 0.4444444444444444
train 0.5
dinosaur 0.7
London 0.3333333333333333
cheese 0.3076923076923077
radon 0.3076923076923077
This can easily be used to generate a list of animals by setting a suitable value of similarity_thr.
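To make the Wu-Palmer definition quoted above concrete, here is a small sketch that computes 2 * depth(LCS) / (depth(a) + depth(b)) by hand. It runs on an invented single-parent toy taxonomy, not on WordNet, and it is not NLTK's implementation (which also has to deal with synsets that have several hypernym paths), so the numbers will not match the scores listed above.

```python
# Toy taxonomy (child -> parent); the entries are invented for illustration.
PARENT = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "car": "artifact",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wup_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)) on the toy taxonomy."""
    ancestors_of_a = set(path_to_root(a))
    # The first node on b's path to the root that also lies on a's path
    # is the most specific common ancestor (the least common subsumer).
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(round(wup_similarity("dog", "animal"), 3))  # 0.857
print(round(wup_similarity("dog", "cat"), 3))     # 0.75
print(round(wup_similarity("car", "animal"), 3))  # 0.333
```

The pattern is the same as in the real scores: the deeper the shared ancestor relative to the two words, the closer the score gets to 1.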
Related
This is my first attempt at Natural Language Processing, so I started with Latent Semantic Analysis and used this tutorial to build the algorithm. After testing it, I see that it only classifies the first semantic words and repeats the same terms over and over on top of the other documents.
I also tried feeding it the documents found here, and it does exactly the same: it repeats the values of the same topic several times in the other ones.
Could anyone help explain what is happening? I've been searching all over and everything seems exactly like in the tutorials.
from sklearn.feature_extraction.text import TfidfVectorizer as tfdv  # tfdv is assumed to be TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
testDocs = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''
#First we apply the standard SKLearn algorithm to compare with.
#tokens.append(tokenizer.tokenize(element.lower()))
#Lowercase each document (rebinding the loop variable, as before, left the list unchanged).
testDocs = [element.lower() for element in testDocs]
print(testDocs)
#Vectorize the features.
vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)#, ngram_range=(1,3))
#Store the values in matrix X.
X = vectorizer.fit_transform(testDocs)
#Apply LSA.
lsa = TruncatedSVD(n_components=3, n_iter=100)
lsa.fit(X)
#Get a list of the terms in the order it was decomposed.
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
print("Terms decomposed from the document: " + str(terms))
print()
#Prints the matrix of concepts. Each number represents how important the term is to the concept and the position relates to the position of the term.
print("Number of components in element 0 of matrix of components:")
print(lsa.components_[0])
print("Shape: " + str(lsa.components_.shape))
print()
for i, comp in enumerate(lsa.components_):
    #Stick each of the terms to the respective components. Zip command creates a tuple from 2 components.
    termsInComp = zip(terms, comp)
    #Sort the terms according to...
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
    print("Concept %d" % i)
    for term in sortedTerms:
        print(term[0], end="\t")
    print()
This is my code:
from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
I want it to detect "the_hill_have_eyes", but the output is:
['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
Phrases is a purely-statistical method for combining some unigram-token-pairs to new bigram-tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.
Note especially that:
even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see
since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data
In particular with regard to that last point & your example, the interpretation of min_count in Phrases default-scoring means even a min_count=1 isn't low enough to cause bigrams for which there is only a single example in the training-corpus to be created.
So, if you expand your training corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training on larger, realistic corpora, so anything you see in tiny contrived examples may not generalize to real uses.
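To illustrate the min_count point, here is a small sketch of the default scoring formula as I understand it from the gensim documentation: (bigram_count - min_count) / worda_count / wordb_count * len_vocab. This is a plain re-implementation for illustration, not gensim's own code, and the counts and vocabulary size below are hypothetical.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Default Phrases-style score, re-implemented from the documented formula."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# A pair seen exactly once with min_count=1: the numerator is zero.
once = original_score(worda_count=1, wordb_count=1, bigram_count=1,
                      len_vocab=12, min_count=1)

# The same pair seen three times (hypothetical counts).
thrice = original_score(worda_count=3, wordb_count=3, bigram_count=3,
                        len_vocab=12, min_count=1)

print(once)    # 0.0
print(thrice)  # (3 - 1) / 3 / 3 * 12, roughly 2.67
```

A score of 0.0 can never exceed a positive threshold, which is why a bigram with a single occurrence is not promoted even with min_count=1. The second call shows that even a few repetitions may still fall short of the default threshold of 10.0 unless threshold is lowered.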
What you want is not actually bigrams but "fourgrams".
This can be achieved by doing something like this (my old piece of code I wrote some months ago):
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath
# read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)
sent = [u'trees', u'graph', u'minors']
# look for words in "sent"
print(bigram[sent])
# output: [u'trees_graph', u'minors']
# to create the bigrams (unigram_sentences is assumed to be an iterable of token lists)
bigram_model = Phrases(unigram_sentences)
# apply the trained model to each sentence, keeping the results as token lists
bigram_sentences = [bigram_model[unigram_sentence] for unigram_sentence in unigram_sentences]
# get a trigram model out of the bigram
trigram_model = Phrases(bigram_sentences)
So here you have a trigram model (detecting 3 words together) and you get the idea on how to implement fourgrams.
Hope this helps. Good luck.
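To show why stacking passes can reach "fourgrams", here is a toy sketch in plain Python. It is not gensim: merge_pairs is a hypothetical helper, and the pair sets are written by hand, whereas a trained Phrases model would derive them from corpus statistics.

```python
def merge_pairs(tokens, pairs):
    """Join adjacent tokens whose (left, right) pair is in `pairs`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical pair sets standing in for two trained passes.
first_pass_pairs = {("new", "york"), ("the", "hill"), ("have", "eyes")}
second_pass_pairs = {("the_hill", "have_eyes")}

tokens = "the mayor of new york was there the hill have eyes".split()
after_first = merge_pairs(tokens, first_pass_pairs)
after_second = merge_pairs(after_first, second_pass_pairs)

print(after_first)
print(after_second)
```

The first pass yields the token list from the question (with the_hill and have_eyes as separate tokens), and the second pass, run on that output, joins them into the_hill_have_eyes. Each pass only ever joins two adjacent tokens, so the longer phrase needs the second pass.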
I am reading up about TF-IDF so that I can filter out common words from my corpus. It appears to me that you get a TF-IDF score for each word, document pair.
Which score do you pay attention to? Do you combine the scores across all documents for a word?
TFIDF ex:
doc1 = "This is doc1"
doc2 = "This is a different document"
corpus = [doc1, doc2]
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
X.toarray()
# returns:
# array([[0.        , 0.70490949, 0.        , 0.50154891, 0.50154891],
#        [0.57615236, 0.        , 0.57615236, 0.40993715, 0.40993715]])
vec.get_feature_names()
So you have a row (a 1-D array) for each doc in the corpus, and that array has length equal to the total vocabulary of your corpus (so it can get quite sparse). Which score you pay attention to depends on what you're doing: to find the most important word in a doc, look for the highest TF-IDF in that doc's row; to find the most important word in the corpus, look across the entire array. If you're trying to identify stop words, you could take the set of N words with the lowest TF-IDF scores. However, I wouldn't really recommend using TF-IDF to find stop words in the first place: it lowers the weight of stop words, but they still occur frequently, which could offset the weight loss. You'd probably be better off finding the most common words and then filtering them out. Either way, you'd want to review the resulting set manually.
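As a rough sketch of the "find the most common words" suggestion, the snippet below counts in how many documents each token appears and keeps the ones above a cutoff. The helper names, the whitespace tokenization, and the min_fraction cutoff are assumptions made for this example, not part of scikit-learn.

```python
from collections import Counter

corpus = ["This is doc1", "This is a different document"]


def document_frequency(docs):
    """Count, for each lowercased token, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def common_words(docs, min_fraction):
    """Tokens that occur in at least `min_fraction` of the documents."""
    df = document_frequency(docs)
    return sorted(w for w, n in df.items() if n / len(docs) >= min_fraction)


candidates = common_words(corpus, min_fraction=1.0)
print(candidates)  # ['is', 'this']
```

On a real corpus you would pick a lower cutoff and still read through the candidate list by hand before filtering anything out.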
I'm testing this sentence to extract entity values:
s = "Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6."
sent = nlp(s)
for ent in sent.ents:
    print(ent.text, ent.label_)
And got some misleading values:
3 CARDINAL
4.0m CARDINAL
3.4 m CARDINAL
4m CARDINAL
5 meters QUANTITY
10 m QUANTITY
6 CARDINAL
Namely, the number in 3m is not paired with its unit m. This happens in many examples, so I can't rely on this engine when I want to separate measurements in meters from plain quantities.
Should I do this manually?
One potential difficulty in your example is that it's not very close to natural language. The pre-trained English models were trained on ~2m words of general web and news text, so they're not always going to perform perfectly out-of-the-box on text with a very different structure.
While you could update the model with more examples of QUANTITY in your specific texts, I think that a rule-based approach might actually be a better and more efficient solution here.
The example in this blog post is actually very close to what you're trying to do:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("en_core_web_sm")
weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]
The statistical named entity recognizer respects pre-defined entities and will "predict around" them. So if you're adding the EntityRuler before it in the pipeline, your custom QUANTITY entities will be assigned first and will be taken into account when the entity recognizer predicts labels for the remaining tokens.
Note that this example is using the latest version of spaCy, v2.1.x. You might also want to add more patterns to cover different constructions. For more details and inspiration, check out the documentation on the EntityRuler, combining models and rules and the token match pattern syntax.
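If all you need is the number and unit pairs from strings like the one in the question, a plain regular expression may already be enough. The sketch below is a swapped-in technique using only Python's re module, not spaCy or the EntityRuler, and the unit list (m, meter, meters) is an assumption for this example.

```python
import re

# Number (integer or decimal), optional whitespace, then a unit of length.
# The unit list is an assumption for this example; extend it as needed.
LENGTH_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(meters?|m)\b")


def extract_lengths(text):
    """Return (value, unit) pairs for every number that is followed by a length unit."""
    return [(float(value), unit) for value, unit in LENGTH_PATTERN.findall(text)]


s = "Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6."
print(extract_lengths(s))
# [(3.0, 'm'), (4.0, 'm'), (3.4, 'm'), (4.0, 'm'), (5.0, 'meters'), (10.0, 'm')]
```

The bare 6 at the end is skipped because no unit follows it, which is the separation between measurements and plain quantities asked about. The trade-off is that a regex sees no linguistic context, so the rule-based spaCy approach is still the better fit when these values appear inside ordinary sentences.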
I am training word2vec in gensim on a corpus stored as nested lists (a list of sentences, each a list of tokenized words) with 408226 sentences and a total of 3150546 tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with size set to 200, window to 15, min_count to 5, iter to 10 and alpha to 0.5. All words are lemmatized, and the resulting model has a vocabulary of 32716.
The results obtained with the default alpha, size and window are not meaningful to me, judging by the similarity values computed on this data. However, a higher alpha of 0.5 gives me meaningful similarity scores between two words. When I compute the top n most similar words, though, the results are again meaningless. Do I need to change all of the parameters used in the initial training?
I still cannot work out why the model behaves well with such a high alpha when computing the similarity between two words of the corpus, yet gives meaningless results when computing the top n similar words with scores for an input word. Why is this the case?
Is it diverging from the optimal solution? How can I check this?
Any idea why this is the case would be deeply appreciated.
Note: I'm using Python 3.7 on Windows machine with anaconda prompt and giving input to the model from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast
path = "F:/Folder/"
def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt','data_d2.txt']:
        with open(path + file, 'r', encoding = 'utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))
load_data()
def initialize_word_embedding():
    model = Word2Vec(Sentences, size = 200, window = 15, min_count = 5, iter = 10, workers = 4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1 = 'structure', w2 = '_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word , score)
initialize_word_embedding()
An example of the Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it to these files, and now give them as input. For lemmatization, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most_similar words for a given input word. I am getting meaningful scores from the model.wv.similarity() method, whereas when calculating the most_similar() words of a word (say, system, as shown above) I am not getting the desired results.
I am guessing the model is diverging from the global minimum because of the high alpha value.
I am confused about what the dimension size and window should be to produce meaningful results, as there are no rules for how to choose the size and window.
Any suggestion is appreciated. The total numbers of sentences and words are given above in the question.
Results I am getting without setting alpha = 0.5:
Edit to Recent Comment:
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
This is meaningless to me, as the value is very low. However, I get 71% when setting alpha to 0.5, which seems meaningful to me, as the word set is the same in both domains.
Explanation: The word set should be the same for both domains (I am comparing the data of two domains through the same word). Don't be confused by the word _set_: it is the same word as set; I added the character _ at the start and end to distinguish the two domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value close to 0.00 for the word set across the two different datasets?