This is my code:
from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
I want it to detect "the_hill_have_eyes", but the output is
['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
Phrases is a purely-statistical method for combining some unigram-token-pairs into new bigram-tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.
Note especially that:
even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see
since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data
In particular with regard to that last point and your example, the way min_count enters the default Phrases scoring means that even min_count=1 isn't low enough to promote bigrams that appear only once in the training corpus.
So, if you expand your training corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training on larger, realistic corpora, so anything you see in tiny contrived examples may not generalize to real uses.
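To illustrate that, here is a minimal, hedged sketch (toy data, purely illustrative counts) showing that once the pair co-occurs several times and the threshold is lowered, Phrases can promote it:

from gensim.models import Phrases

# Toy corpus: repeat the pair so 'the_hill' and 'have_eyes' co-occur often.
# Counts this small are not representative of real corpora.
documents = [
    "the mayor of new york was there the_hill have_eyes",
    "the_hill have_eyes is a horror movie",
    "we watched the_hill have_eyes yesterday",
    "critics discussed the_hill have_eyes at length",
]
sentence_stream = [doc.split(" ") for doc in documents]

# A low threshold makes promotion easier on tiny data; on realistic
# corpora the defaults usually behave better.
bigram = Phrases(sentence_stream, min_count=1, threshold=1)

print(bigram[['the_hill', 'have_eyes', 'was', 'scary']])
# With enough co-occurrences the pair may be joined into 'the_hill_have_eyes'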
What you want is not actually bigrams but "fourgrams".
This can be achieved by doing something like this (an old piece of code I wrote some months ago):
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)

sent = [u'trees', u'graph', u'minors']
# look for phrases in "sent"
print(bigram[sent])
# output: [u'trees_graph', u'minors']
# train the bigram model
bigram_model = Phrases(unigram_sentences)

# apply the trained model to each sentence, keeping token lists
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(list(bigram_model[unigram_sentence]))

# train a trigram model on top of the bigram sentences
trigram_model = Phrases(bigram_sentences)
So here you have a trigram model (detecting 3 words together), and you can see how to extend the idea to fourgrams (a sketch follows).
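To make the fourgram idea concrete, here is a hedged sketch (assuming unigram_sentences is your corpus as an iterable of token lists) that stacks Phrases repeatedly, so tokens covering up to four original words can be produced:

from gensim.models.phrases import Phrases, Phraser

# unigram_sentences: assumed to be a list of token lists
bigram_model = Phraser(Phrases(unigram_sentences, min_count=1, threshold=1))
bigram_sentences = [bigram_model[s] for s in unigram_sentences]

trigram_model = Phraser(Phrases(bigram_sentences, min_count=1, threshold=1))
trigram_sentences = [trigram_model[s] for s in bigram_sentences]

# A further pass can merge pairs of already-merged tokens, so phrases
# spanning up to four (or more) original words become possible.
fourgram_model = Phraser(Phrases(trigram_sentences, min_count=1, threshold=1))
fourgram_sentences = [fourgram_model[s] for s in trigram_sentences]

Whether any given fourgram actually appears still depends on the corpus statistics and the chosen threshold.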
Hope this helps. Good luck.
I'm using the CoNLL2003 dataset to generate word embeddings with Word2Vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is much smaller than that of glove.dictionary.
Here is the code:
Word2Vec:
from gensim.models import Word2Vec

word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output: len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of tokenized sentences. For example:
result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German', 'advice', 'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy', 'sheepmeat', 'countries', 'Britain', 'scientific', 'advice', 'clearer', '.']]
There are 13517 sentences in total in the result list.
Can someone please explain why the list of words for which the embeddings are created is so drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncratic compared to what a larger sampling would show, and they can't be subtly balanced against many other word representations in the manner that's best.
But further, given that in typical word-distributions there are many such low-frequency words, altogether they also tend to make the word-vectors for other more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other more-important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, obtaining more data so that those words are no longer rare is the best approach.
You can also use a lower min_count, including as low as min_count=1 to retain all words. But often discarding such rare words is better for whatever end-purpose for which the word-vectors will be used.
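If your goal is simply to make the two vocabularies comparable in size, a hedged sketch (matching the parameter names in your own snippet, i.e. gensim's older size/wv.vocab API) would be:

from gensim.models import Word2Vec

# min_count=1 keeps every token, so the Word2Vec vocabulary should be
# much closer in size to the Glove dictionary, which keeps all tokens.
word2vecmodel = Word2Vec(result, size=100, window=5, sg=1, min_count=1)
print(len(word2vecmodel.wv.vocab))

Just keep in mind the caveats above: the extra rare-word vectors will mostly be low quality.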
I am playing with WordNet and trying to solve an NLP task.
I was wondering if there exists any way to get a list of words belonging to some large sets, such as "animals" (i.e. dog, cat, cow etc.), "countries", "electronics" etc.
I believe that it should be possible to somehow get this list by exploiting hypernyms.
Bonus question: do you know any other way to classify words into very large classes, besides "noun", "adjective" and "verb"? For example, classes like "prepositions", "conjunctions", etc.
Yes, you just check if the category is a hypernym of the given word.
from nltk.corpus import wordnet as wn

def has_hypernym(word, category):
    # Assume the category always uses the most popular sense
    cat_syn = wn.synsets(category)[0]
    # For the input, check all senses
    for syn in wn.synsets(word):
        for match in syn.lowest_common_hypernyms(cat_syn):
            if match == cat_syn:
                return True
    return False

has_hypernym('dog', 'animal')     # => True
has_hypernym('bucket', 'animal')  # => False
If the broader word (the "category" here) is the lowest common hypernym, that means it's a hypernym (ancestor) of the query word, so the query word is in the category.
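If you also want to enumerate the members of a category (rather than just test membership), one hedged sketch is to walk the hyponym closure of the category's synset; 'animal.n.01' is used here only as an example:

from nltk.corpus import wordnet as wn

# Collect every lemma reachable below the 'animal' synset.
animal = wn.synset('animal.n.01')
animals = {lemma.name()
           for syn in animal.closure(lambda s: s.hyponyms())
           for lemma in syn.lemmas()}

print(len(animals))            # several thousand lemma names
print(sorted(animals)[:10])    # a small sample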
Regarding your bonus question, I have no idea what you mean. Maybe you should look at NER or open a new question.
With some help from polm23, I found this solution, which exploits similarity between words, and prevents wrong results when the class name is ambiguous.
The idea is that WordNet can be used to compare a list of words with the string 'animal' and compute a similarity score. From the nltk.org webpage:
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
from nltk.corpus import wordnet as wn

def keep_similar(words, similarity_thr):
    w2 = wn.synset('animal.n.01')
    similar_words = [word for word in words
                     if wn.synset(word + '.n.01').wup_similarity(w2) > similarity_thr]
    return similar_words
For example, if word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon'], the corresponding scores are:
dog: 0.875
car: 0.4444444444444444
train: 0.5
dinosaur: 0.7
London: 0.3333333333333333
cheese: 0.3076923076923077
radon: 0.3076923076923077
This can easily be used to generate a list of animals by setting a proper value of similarity_thr.
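For instance, a quick hedged usage sketch with a threshold chosen by eye from the scores above:

word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon']

# With the Wu-Palmer scores listed above, a threshold of 0.6 should keep
# only 'dog' and 'dinosaur'.
print(keep_similar(word_list, similarity_thr=0.6))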
I have read a bunch of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests a word from an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
In the papers I have read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words from a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words from an incomplete sentence. Is it feasible? Please enlighten me because I'm really confused. What is the state-of-the-art model I can use for suggesting a list of (semantically coherent) words from an incomplete sentence?
Is it necessary that the suggested output words be included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be part of the overall vocabulary with which this BERT model has been trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()  # turning off the dropout

def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results

print(fill_the_gaps(text='I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text='Im sad because you are [MASK] .'))
print(fill_the_gaps(text='Im worried because you are [MASK] .'))
print(fill_the_gaps(text='Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising - transformer NNs are generally good at copying words. And from a semantic point of view, these symmetric continuations indeed look very likely.
Moreover, if it is not a random word which is missing, but exactly the last word (or last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
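As a hedged sketch of that last idea (assuming the Hugging Face transformers package rather than pytorch-pretrained-bert), GPT-2 can rank candidate next words when the gap is at the end of the sentence:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

text = "Im sad because you are"
input_ids = tokenizer.encode(text, return_tensors='pt')
with torch.no_grad():
    logits = model(input_ids).logits       # shape: (1, seq_len, vocab_size)

# Take the top 5 candidates for the next token.
top_ids = torch.topk(logits[0, -1], k=5).indices
print([tokenizer.decode([int(i)]).strip() for i in top_ids])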
I am reading up on TF-IDF so that I can filter out common words from my corpus. It appears to me that you get a TF-IDF score for each (word, document) pair.
Which score do you pay attention to? Do you combine the scores across all documents for a word?
TF-IDF example:
doc1 = "This is doc1"
doc2 = "This is a different document"
corpus = [doc1, doc2]
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
X.toarray()
return: array([[0. , 0.70490949, 0. , 0.50154891, 0.50154891],
[0.57615236, 0. , 0.57615236, 0.40993715, 0.40993715]])
vec.get_feature_names()
So you have a line/1-d array for each doc in the corpus, and that array has len = total vocab in your corpus (it can get quite sparse). Which score you pay attention to depends on what you're doing: e.g., to find the most important word in a doc, you could look for the highest TF-IDF in that doc; for the most important in the corpus, look across the entire array.
If you're trying to identify stop words, you could consider finding the set of X words with the minimum TF-IDF scores. However, I wouldn't really recommend using TF-IDF to find stop words in the first place: it lowers the weight of stop words, but they still occur frequently, which can offset the weight reduction. You'd probably be better off finding the most common words and then filtering them out. You'd want to look manually at either set you generate, though.
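As a hedged sketch of the corpus-level aggregation described above (using the same toy corpus; averaging is just one reasonable aggregation choice):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is doc1", "This is a different document"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
terms = vec.get_feature_names()

# Highest-scoring term within each document
dense = X.toarray()
per_doc_top = [terms[i] for i in dense.argmax(axis=1)]
print(per_doc_top)

# One possible corpus-level view: average TF-IDF per term across documents,
# then look at the lowest-scoring terms as stop-word candidates.
mean_scores = dense.mean(axis=0)
ranked = sorted(zip(terms, mean_scores), key=lambda t: t[1])
print(ranked[:3])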
I used three txt files for an LDA project.
I tried to split these three txt files in two different ways.
The difference between the two approaches is:
docs = [[doc1.split(' ')], [doc2.split(' ')], [doc3.split(' ')]]
docs1 = [[''.join(i)] for i in re.split(r'\n{1,}', doc11)] + [[''.join(e)] for e in re.split(r'\n{1,}', doc22)] + [[''.join(t)] for t in re.split(r'\n{1,}', doc33)]
dictionary = Dictionary(docs)
dictionary1 = Dictionary(docs1)
corpus = [dictionary.doc2bow(doc) for doc in docs]
corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
And the document counts are:
len(corpus) = 3
len(corpus1) = 1329
But the LDA model produces rubbish results for corpus and relatively good results for corpus1.
I use this model to train on the documents:
model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                        id2word=id2word,
                                        num_topics=10,
                                        random_state=100,
                                        update_every=1,
                                        chunksize=100,
                                        passes=10,
                                        alpha='auto',
                                        per_word_topics=True)
The only difference between the two models is the number of documents; everything else is the same.
Why does LDA produce such different results for these two corpora?
If you study LDA, I think almost everywhere the first line is "LDA works well for a large corpus, whereas it doesn't work well for short texts". In your corpus there are only 3 documents, whereas in corpus1 there are 1329, so it will definitely produce more accurate results for corpus1.
Another point is that LDA works iteratively and draws random samples from the documents for training, so when you have a large corpus (more documents), it's most likely that every sample will be different, compared to repeatedly drawing the same samples from only a few documents, and different samples can lead to more accurate results.
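To make the point concrete, here is a hedged sketch of the paragraph-level splitting that produces many more (shorter) documents, as in your corpus1 (assuming doc11, doc22, doc33 hold the raw text of the three files):

import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

raw_texts = [doc11, doc22, doc33]   # assumed: raw text of the three txt files

# Split on newlines so each paragraph becomes its own tokenized document.
paragraph_docs = [p.split(' ')
                  for raw in raw_texts
                  for p in re.split(r'\n{1,}', raw) if p.strip()]

dictionary = Dictionary(paragraph_docs)
corpus = [dictionary.doc2bow(doc) for doc in paragraph_docs]

model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                 random_state=100, passes=10, alpha='auto')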
Hope this makes sense.