What is the input format of fastText and why doesn't my model give me meaningful similarity output?

My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you', 'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show', 'Bird', 'Box', 'from', 'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between, say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, it should have "rule" as the top similarity, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that didn't work either.
Was my input format correct? I've seen that lots of documentation uses .cor or .bin formats, and I don't know what those are.
Thanks for any reply!

model.wv.most_similar('rule') asks that model's set of word-vectors (.wv) to return the words most similar to 'rule'. That is, you've neither provided a document (multiple words) as a query, nor is there any way for the FastText model to return a document itself, or the name of any document. It can only return words, as it has done.
While FastText trains on texts (lists of word-tokens), it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those results don't look very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc. model, trained with adequate data and parameters, no single sentence (like your first sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
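If what you really want is a word-to-document similarity, a minimal sketch (assuming the trained model from the question and a tokenized document like its first sentence) is to compare the word's vector with the average of the document's word-vectors, which gensim exposes as n_similarity():
import numpy as np

doc = ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household']

# gensim's built-in set-to-set similarity: mean vector of each side, then cosine
print(model.wv.n_similarity(['rule'], doc))

# the same value computed by hand from the (sub)word vectors
doc_vec = np.mean([model.wv[w] for w in doc], axis=0)
word_vec = model.wv['rule']
print(np.dot(word_vec, doc_vec) / (np.linalg.norm(word_vec) * np.linalg.norm(doc_vec)))
With a model trained on only a handful of sentences, that number will still be mostly noise, for the data-size reasons described above.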

Related

How to know which contextual embedding to use at test time

Models like BERT generate contextual embeddings for words with different contextual meanings, such as 'bank' or 'left'.
I don't understand which contextual embedding the model chooses to use at test time. Given a test sentence for classification, when we load the pre-trained BERT, how do we initialize the word (token) embedding to use the right contextual embedding over the other embeddings of the same word?
More specifically, there is a convert_to_id() function which converts a word to an id. How does a single id represent the correct contextual embedding for the input sentence at test time? Thank you.
I searched all over online but only found explanations of the difference between static and contextual embeddings. The high-level concept is easy to get, but how it is really achieved is unclear. I also searched for code examples, but the convert_to_id() function confuses me further, as I asked in my question.
TL;DR There's only one embedding for the word "left." There's also no way to know which meaning the word has if the sequence is only one word. BERT uses the representation of its start-of-sequence token (i.e., [CLS]) to represent each sequence, and that representation will differ depending on the context in which the word "left" is used.
Given your example of text classification, the input sentence is first tokenized using the WordPiece tokenizer, and the [CLS] token's representation is fed to a feedforward layer for classification.
You can't really debate context when given single words, so I'll use two different sentences:
I left my house this morning.
You'll see my house on your left.
The steps to performing text classification typically are:
Tokenize your input text and receive the necessary input_ids and attention_masks.
Feed this tokenized input into your model and receive the outputs.
Feed the [CLS] token's representation to a classifier layer (typically a feedforward network).
The two sentences are tokenized to (using bert-base-uncased):
['[CLS]', 'i', 'left', 'my', 'house', 'this', 'morning', '.', '[SEP]']
['[CLS]', 'you', "'", 'll', 'see', 'my', 'house', 'on', 'your', 'left', '.', '[SEP]']
The [CLS] token's representation for each sentence will be different because the sentences have different words (i.e., contexts). The result is therefore different.
>>> from transformers import AutoModel, AutoTokenizer
>>> bert = AutoModel.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentence1 = "I left my house this morning."
>>> sentence2 = "You'll see my house on your left."
>>> sentence1_inputs = tokenizer(sentence1, return_tensors='pt')
>>> sentence2_inputs = tokenizer(sentence2, return_tensors='pt')
>>> sentence1_outputs = bert(**sentence1_inputs)
>>> sentence2_outputs = bert(**sentence2_inputs)
>>> cls1 = sentence1_outputs[1]  # pooler output, derived from the [CLS] token's hidden state
>>> cls2 = sentence2_outputs[1]
>>> print(cls1[0][:5])
tensor([-0.8832, -0.3484, -0.8044, 0.6536, 0.6409], grad_fn=<SliceBackward0>)
>>> print(cls2[0][:5])
tensor([-0.8791, -0.4069, -0.8994, 0.7371, 0.7010], grad_fn=<SliceBackward0>)
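A minimal sketch of step 3, assuming a binary task (num_labels=2 is just a placeholder): in practice you usually let the transformers library attach the classification head for you with AutoModelForSequenceClassification; the head here is randomly initialized, so its outputs are meaningless until fine-tuning.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# adds a randomly initialized classifier on top of the pooled [CLS] representation
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("I left my house this morning.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # random until the classification head is fine-tuned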

Identifying phrases which contrast two corpora

I would like to identify compound phrases in one corpus (e.g. (w_1, w_2) in Corpus 1) which not only appear significantly more often than their constituents (e.g. (w_1), (w_2) in Corpus 1) within that corpus, but also more often than they do in a second corpus (e.g. (w_1, w_2) in Corpus 2). Consider the following informal example. I have two corpora, each consisting of a set of documents:
[['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy'], ...]
[['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic'], ...].
In this case, I would like new_york to be detected as a compound phrase. However, when corpus 2 is replaced by
[['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york'], ...],
I would like new_york to be relatively disregarded.
I could just use a ratio of n-gram scores between corresponding phrases in the two corpora (a rough sketch is below), but I don't see how to scale that to general n. Normally, phrase detection for n-grams with n > 2 is done by recursing on n and gradually inserting compound phrases into the documents by thresholding a score function. This ensures that at step n, if you want to score the n-gram (w_1, ..., w_n), you can always normalize by the constituent m-grams for m < n. But with a different corpus, these are not guaranteed to appear.
A reference to the literature or a relevant hack will be appreciated.
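For concreteness, here is a rough sketch of the bigram-ratio baseline mentioned above, with toy corpora, a simple PMI-style score, and an arbitrary smoothing constant:
from collections import Counter

def bigram_scores(corpus):
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    total = sum(unigrams.values())
    # PMI-style score: how much more often the pair occurs than chance predicts
    return {bg: (count * total) / (unigrams[bg[0]] * unigrams[bg[1]])
            for bg, count in bigrams.items()}

corpus1 = [['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy']]
corpus2 = [['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic']]

s1, s2 = bigram_scores(corpus1), bigram_scores(corpus2)
contrast = {bg: score / s2.get(bg, 1e-3) for bg, score in s1.items()}  # high when rare/absent in Corpus 2
print(sorted(contrast.items(), key=lambda kv: -kv[1])[:3])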

Word vocabulary generated by Word2vec and Glove models are different for the same corpus

I'm using CONLL2003 dataset to generate word embeddings using Word2vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is much smaller than the number in glove.dictionary.
Here is the code:
Word2Vec:
from gensim.models import Word2Vec

word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output: len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example:
result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German', 'advice', 'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy', 'sheepmeat', 'countries', 'Britain', 'scientific', 'advice', 'clearer', '.']]
There are totally 13517 sentences in the result list.
Can someone please explain why the list of words for which the embeddings are created are drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncratic compared to what a larger sampling would show, and they can't be subtly balanced against many other word representations in the way that works best.
But further, given that typical word distributions contain many such low-frequency words, altogether they also tend to make the word-vectors for other, more frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other, more important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, the best approach is to obtain more data so that those words are no longer rare.
You can also use a lower min_count, even as low as min_count=1 to retain all words. But often discarding such rare words is better for whatever end-purpose the word-vectors will serve.
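To make the min_count effect concrete, here is a minimal sketch, assuming the gensim 3.x API used in the question (its size parameter and wv.vocab attribute) and its result list of token lists:
from gensim.models import Word2Vec

model_default = Word2Vec(result, size=100, window=5, sg=1)            # min_count=5 by default
model_all = Word2Vec(result, size=100, window=5, sg=1, min_count=1)   # keep every word

print(len(model_default.wv.vocab))  # smaller vocabulary: rare words dropped (4653 in the question)
print(len(model_all.wv.vocab))      # should now be much closer to len(glove.dictionary)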

Extract relationship concepts from sentences

Is there an existing model, or how could I train one, that takes a sentence involving two subjects like:
[Meiosis] is a type of [cell division]...
and decides if one is the child or parent concept of the other? In this case, cell division is the parent of meiosis.
Are the subjects already identified, i.e., do you know beforehand for each sentence which words or sequences of words represent the subjects? If you do, I think what you are looking for is relationship extraction.
Unsupervised approach
A simple unsupervised approach is to look for patterns using part-of-speech tags, e.g.:
First you tokenize and get the PoS-tags for each sentence:
sentence = "Meiosis is a type of cell division."
tokens = nltk.word_tokenize("Meiosis is a type of cell division.")
tokens
['Meiosis', 'is', 'a', 'type', 'of', 'cell', 'division', '.']
token_pos = nltk.pos_tag(tokens)
token_pos
[('Meiosis', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('type', 'NN'), ('of', 'IN'),
('cell', 'NN'), ('division', 'NN'), ('.', '.')]
Then you build a parser for a specific pattern based on PoS-tags, a pattern that mediates relationships between two subjects/entities/nouns:
verb = "<VB|VBD|VBG|VBN|VBP|VBZ>*<RB|RBR|RBS>*"
word = "<NN|NNS|NNP|NNPS|JJ|JJR|JJS|RB|WP>"
preposition = "<IN>"
rel_pattern = "({}|{}{}|{}{}*{})+ ".format(verb, verb, preposition, verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)
NOTE: This pattern is based on this paper: http://www.aclweb.org/anthology/D11-1142
You can then apply the parser to all the tokens/PoS-tags except the ones which are part of the subjects/entities:
reverb_pattern.parse(token_pos[1:5])
Tree('S', [Tree('REL_PHRASE', [('is', 'VBZ')]), ('a', 'DT'), ('type', 'NN'), ('of', 'IN')])
If the parser outputs a REL_PHRASE, then there is a relationship between the two subjects. You then need to analyse all these patterns and decide which ones represent a parent-of relationship. One way to achieve that is by clustering them, for instance.
Supervised approach
If your sentences are already tagged with subjects/entities and with the type of relationship, i.e., a supervised scenario, then you can build a model where the features are the words between the two subjects/entities and the type of relationship is the label.
sent: "[Meiosis] is a type of [cell division.]"
label: parent of
You can build a vector representation of is a type of and train a classifier to predict the label parent of (a rough sketch follows below). You will need many examples for this; it also depends on how many different classes/labels you have.
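As an illustration of that supervised setup, here is a minimal sketch with made-up (between-entities text, relation label) pairs and a plain bag-of-words classifier; a real system would need far more examples:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training pairs: the words between the two entities, and the relation label
between_texts = ["is a type of", "is a kind of", "is part of", "contains"]
labels = ["parent-of", "parent-of", "part-of", "has-part"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LogisticRegression())
clf.fit(between_texts, labels)

print(clf.predict(["is a form of"]))  # ideally 'parent-of', given enough real training data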

Doc2Vec: Best practice for feedback loop for training the model

I am learning about Doc2Vec and the gensim library. I have been able to train my model by creating a corpus of documents such as
LabeledSentence(['what', 'happens', 'when', 'an', 'army', 'of', 'wetbacks', 'towelheads', 'and', 'godless', 'eastern', 'european', 'commies', 'gather', 'their', 'forces', 'south', 'of', 'the', 'border', 'gary', 'busey', 'kicks', 'their', 'butts', 'of', 'course', 'another', 'laughable', 'example', 'of', 'reagan-era', 'cultural', 'fallout', 'bulletproof', 'wastes', 'a', 'decent', 'supporting', 'cast', 'headed', 'by', 'l', 'q', 'jones', 'and', 'thalmus', 'rasulala'], ['LABELED_10', '0'])
Note that this particular document has two tags, namely 'LABELED_10' and '0'.
Now after I load my model and perform
print(model.docvecs.most_similar("LABELED_10"))
I get
[('LABELED_107', 0.48432376980781555), ('LABELED_110', 0.4827481508255005), ('LABELED_214', 0.48039984703063965), ('LABELED_207', 0.479473352432251), ('LABELED_315', 0.47931796312332153), ('LABELED_307', 0.47898322343826294), ('LABELED_124', 0.4776897132396698), ('LABELED_222', 0.4768940210342407), ('LABELED_413', 0.47479286789894104), ('LABELED_735', 0.47462597489356995)]
which is perfect, as I get all the tags most similar to LABELED_10.
Now I would like to have a feedback loop while training my model. So if I give my model a new document, I would like to know how good or bad the model's classification is before tagging that document and adding it to my corpus. How would I do that using Doc2Vec? That is, how do I know whether the documents for LABELED_107 and LABELED_10 are actually similar or not? Here is one approach that I have in mind. Here is the code for my random forest classifier:
result = cfun.rfClassifer(n_estimators, trainingDataFV, train["sentiment"],testDataFV)
and here is the function
import logging
from sklearn.ensemble import RandomForestClassifier

def rfClassifer(n_estimators, trainingSet, label, testSet):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    forest = RandomForestClassifier(n_estimators)
    forest = forest.fit(trainingSet, label)
    result = forest.predict(testSet)
    return result
and finally I can do
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("../../submits/Doc2Vec_AvgVecPredict.csv", index=False, quoting=3)
Feedback process
Keep a validation set which is tagged correctly.
Feed the tagged validation set to the classifier after removing the tags and save the result in a csv.
Compare the result with another csv that has the correct tags.
For every mismatch, add those documents to the labeled training set and train the model again.
Repeat for more validation sets.
Is this approach correct? Also, can I incrementally train the Doc2Vec model? Let's say that initially I trained my Doc2Vec model with 100k tagged docs. Now after the validation step, I need my model to be trained on a further 10k documents. Will I have to train my model from the very beginning? That is, will I need to train my model on the initial 100k tagged docs again?
I would really appreciate your insights.
Thanks
From my understanding of Doc2Vec, you can retrain your model as long as you have all the previous vectors with the model.
And from the following link, you can see how they perform their validation:
https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
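One additional sanity check, as a minimal sketch assuming a trained gensim Doc2Vec model stored in model: infer a vector for a held-out document and see which training tags it lands nearest to, before deciding whether to fold it into the corpus.
# placeholder tokens for one held-out validation document
new_doc_tokens = ['tokens', 'of', 'a', 'held', 'out', 'review']
inferred = model.infer_vector(new_doc_tokens)
print(model.docvecs.most_similar([inferred], topn=5))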
