How can I recover the likelihood of a certain word appearing in a given context from word embeddings?

I know that some methods of generating word embeddings (e.g. CBOW) are based on predicting the likelihood of a given word appearing in a given context. I'm working with the Polish language, which is sometimes ambiguous with respect to segmentation, e.g. 'Coś' can be treated either as one word, or as two words that have been conjoined ('Co' + '-ś'), depending on the context. What I want to do is create a tokenizer that is context-sensitive. Assuming that I have the vector representation of the preceding context, and all possible segmentations, could I somehow calculate, or approximate, the likelihood of particular words appearing in this context?

This very much depends on how you got your embeddings. The CBOW model has two parameter matrices: the embedding matrix, denoted v, and the output projection matrix v'. If you want to recover the probabilities that are used in the CBOW model at training time, you need v' as well.
See equation (2) in the word2vec paper. Tools for pre-computing word embeddings usually don't save this second matrix, so you would need to modify them yourself.
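In the paper's notation (v for input vectors, v' for output vectors, W the vocabulary size), that softmax is roughly:

p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}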
Anyway, if you want to compute the probability of a word given a context, you should think about using a (neural) language model rather than a table of word embeddings. If you search the Internet, I am sure you will find something that suits your needs.
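If you do manage to extract both matrices, a minimal numpy sketch of the computation could look like the following. The names input_vectors, output_vectors, vocab and the commented-out variables are assumptions for illustration, not anything a particular library exposes:

import numpy as np

def word_probabilities(context_words, input_vectors, output_vectors, vocab):
    # input_vectors:  V x N matrix of input ("projection") embeddings (v)
    # output_vectors: V x N matrix of output embeddings (v')
    # vocab:          dict mapping word -> shared row index in both matrices
    idx = [vocab[w] for w in context_words if w in vocab]
    h = input_vectors[idx].mean(axis=0)   # CBOW: average of the context's input vectors
    scores = output_vectors @ h           # one score per vocabulary word
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()            # probs[vocab[w]] approximates P(w | context)

# comparing candidate segmentations (hypothetical variables):
# probs = word_probabilities(preceding_context, W_in, W_out, vocab)
# p_cos = probs[vocab['coś']]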

Related

Custom word-embeddings in gensim

I have a word embedding matrix (say M) of order V x N, where V is the size of the vocabulary and N is the size of each word vector. I want gensim's word2vec model to initialise its word embedding matrix with M during training. I am able to load M in the word2vec format using
gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
but I don't know how to feed M into the gensim word2vec model.
The Gensim Word2Vec model isn't designed to be pre-initialized with outside vectors, so there are no built-in helper methods.
But of course, since the source is available and all objects can be modified by your direct tampering, you can still change/replace its usual (random) initialization with anything of your own choosing, with a little effort.
Specifically, you'd first create a Word2Vec model using the constructor without yet supplying any training corpus. (If you supply a corpus, it will automatically do the next two steps, build_vocab() and train(), for you, and you don't want that.)
Then, you'd perform the necessary .build_vocab() step, allowing it to survey your training text data to discover its vocabulary with word-frequencies, and perform its usual model initialization:
model.build_vocab(corpus)
At this point, before doing any other training, you can tamper with the model to replace its random-initialization of words with your alternate word-vectors, as from the vectors you've loaded. If your other vectors are in the KeyedVectors variable loaded_kv, this could be as simple as:
for word in loaded_kv.index_to_key:
    if word in model.wv:
        model.wv[word] = loaded_kv[word]
Note that if your loaded_kv includes words that aren't in the corpus, or too rare (appear fewer than min_count times) in the corpus, the model will not have allocated a space for those vectors – as they won't be used in training – and they won't be part of the final model.
If for some reason you need them to be, you should ensure a sufficient number of valid usage examples of those words appear inside the corpus. You shouldn't add them to the model in a way that changes the total number of vectors in model.wv after .build_vocab(), because the model is not expecting that sort of change, and errors/undefined-behavior are likely.
(You also shouldn't simply force extra words that aren't in the real training data into the model, because while they will remain unchanged through training, all other words will continue to be adjusted through training – meaning any words that weren't also incrementally adjusted, in an interleaved fashion, along with the rest may wind up essentially "incompatible" with the word-vectors that were co-trained in the same full model.)
After you've modified the model's initialization to match your preferences, continue with normal training, something like:
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
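Putting the steps together, a rough end-to-end sketch (assuming gensim 4.x naming, and that model_file and corpus exist as above):

from gensim.models import Word2Vec, KeyedVectors

loaded_kv = KeyedVectors.load_word2vec_format(model_file)   # your matrix M

model = Word2Vec(vector_size=loaded_kv.vector_size)   # constructor, no corpus yet
model.build_vocab(corpus)                              # survey vocabulary & frequencies

# replace the random initialization with your vectors wherever words overlap
for word in loaded_kv.index_to_key:
    if word in model.wv:
        model.wv[word] = loaded_kv[word]

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)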

Finding both target and center word2vec matrices

I've read and heard (in Stanford's CS224) that the Word2Vec algorithm actually trains two matrices (that is, two sets of vectors): the U and V sets, one for words as targets and one for words as context. The final output is the average of these two.
I have two questions in mind. The first is:
Why do we take an average of the two vectors? Why does it make sense? Don't we lose some information?
The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et al. of Microsoft Research (in "Dual Embedding Space Models" & associated writeups), has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.
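For instance, with gensim (4.x naming; treat this as a sketch, since attribute names have changed across versions, and corpus is a hypothetical training iterable), both matrices can be pulled out of a trained negative-sampling model like this:

from gensim.models import Word2Vec

model = Word2Vec(corpus, vector_size=100, sg=1, negative=5)  # negative-sampling training

input_matrix = model.wv.vectors   # the usual word-vectors (projection matrix)
output_matrix = model.syn1neg     # output-side weights, one row per known word

# rows of both matrices are aligned by the same word index
row = model.wv.key_to_index['school']          # 'school' as a hypothetical in-vocab word
in_vec, out_vec = input_matrix[row], output_matrix[row]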

Should I train embeddings using data from the training, validation, and testing corpora?

I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of training embeddings from both a general and a domain-specific corpus.
The point here is: can I use my training, test, and validation datasets (already preprocessed) as a source for creating my own word embeddings? If not, I hope you can share your experience.
Based on my intuition and some experiments, a wider corpus appears to be better, but I'd like to know if there's relevant research or other relevant results.
can I use my training, test, and validation datasets (already preprocessed) as a source for creating my own word embeddings?
Sure. The embeddings are not themselves the features of your machine learning model; they are the "computational representation" of your data. In short, they are words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings can be considered part of the NLP pre-processing step.
Usually (I mean, using the most used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words that it commonly goes along with).
Therefore, to create embeddings, the larger the corpus, the better, since it can better place a word vector in the vector space (and hence compare it to other similar words).
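For example, a minimal gensim sketch of that idea, where train_texts, val_texts and test_texts are assumed to already be lists of tokenized sentences from your splits, and the filename is hypothetical:

from gensim.models import Word2Vec

# embeddings are learned from raw co-occurrence statistics, not labels,
# so pooling all splits (plus any extra in-domain text) just enlarges the corpus
all_texts = train_texts + val_texts + test_texts

emb = Word2Vec(all_texts, vector_size=100, window=5, min_count=2, epochs=10)
emb.wv.save("food_review_vectors.kv")   # hypothetical output filename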

Does the fastText algorithm use only words and subwords, or sentences too?

I read the paper and also googled for a good example of the learning method (or, more precisely, the learning procedure).
For word2vec, suppose there is a corpus sentence
I go to school with lunch box that my mother wrapped every morning
Then with window size 2, it will try to obtain the vector for 'school' by using surrounding words
['go', 'to', 'with', 'lunch']
Now, FastText says that it uses subwords to obtain the vector, so it definitely uses character n-gram subwords, for example with n=3:
['sc', 'sch', 'cho', 'hoo', 'ool', 'school']
Up to here, I understood.
But it is not clear whether the other words are also being used when learning the vector for 'school'. I can only guess that the surrounding words are used as well, as in word2vec, since the paper mentions
=> the terms Wc and Wt are both used in its functions
where Wc is the context word and Wt is the word at position t.
However, it is not clear how FastText learns the vectors for a word.
Could you please explain clearly how the FastText learning procedure works?
More precisely, I want to know whether FastText follows the same procedure as word2vec while additionally learning the character n-gram subwords, or whether only the character n-gram subwords plus the word itself are used.
How does it initialize the subword vectors? etc.
Any context word has its candidate input vector assembled from the combination of both its full-word token and all its character n-grams. So if the context word is 'school', and you're using 3- to 4-character n-grams, the in-training input vector is a combination of the full-word vector for school, and all the n-gram vectors for ['sch', 'cho', 'hoo', 'ool', 'scho', 'choo', 'hool'].
When that candidate vector is adjusted by training, all the constituent vectors are adjusted. (This is a little like how in word2vec CBOW mode, all the words of the single averaged context input vector get adjusted together, when their ability to predict a single target output word is evaluated and improved.)
As a result, those n-grams that happen to be meaningful hints across many similar words – for example, common word-roots or prefixes/suffixes – get positioned where they confer that meaning. (Other n-grams may remain mostly low-magnitude noise, because there's little meaningful pattern to where they appear.)
After training, reported vectors for individual in-vocabulary words are also constructed by combining the full-word vector and all n-grams.
Then, when you encounter an out-of-vocabulary word, to the extent it shares some or many n-grams with morphologically similar in-training words, it will get a similar calculated vector – and thus be better than nothing at guessing what that word's vector should be. (And in the case of small typos or slight variants of known words, the synthesized vector may be pretty good.)
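To make that concrete, here is a rough sketch of the character n-gram extraction and of how gensim's FastText exposes the composed vectors (the toy corpus and parameter values are only illustrative):

from gensim.models import FastText

def char_ngrams(word, min_n=3, max_n=4):
    # fastText wraps each word in boundary markers before slicing out n-grams
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('school'))
# ['<sc', 'sch', 'cho', 'hoo', 'ool', 'ol>', '<sch', 'scho', 'choo', 'hool', 'ool>']

# toy corpus, just to show the API; real training needs far more text
sentences = [['i', 'go', 'to', 'school', 'with', 'lunch', 'box'],
             ['my', 'mother', 'wrapped', 'lunch', 'every', 'morning']]
ft = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=4)

vec_known = ft.wv['school']    # full-word vector combined with its n-gram vectors
vec_oov = ft.wv['schools']     # unseen word: synthesized purely from shared n-grams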
The fastText site states that at least two of the implemented algorithms do use surrounding words in sentences.
Moreover, the original fastText implementation is open source, so you can check exactly how it works by exploring the code.

CBOW vs. skip-gram: why invert context and target words?

On this page, it is said that:
[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]
However, looking at the training dataset it produces, the contents of the X and Y pairs seem to be interchangeable, as in these two (X, Y) pairs:
(quick, brown), (brown, quick)
So, why distinguish that much between context and targets if it is the same thing in the end?
Also, doing Udacity's Deep Learning course exercise on word2vec, I wonder why they make such a distinction between the two approaches in this problem:
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
Wouldn't this yield the same results?
Here is my oversimplified and rather naive understanding of the difference:
As we know, CBOW is learning to predict the word from the context, or to maximize the probability of the target word by looking at the context. And this happens to be a problem for rare words. For example, given the context yesterday was a really [...] day, the CBOW model will tell you that most probably the word is beautiful or nice. Words like delightful will get much less attention from the model, because it is designed to predict the most probable word. Such a word will be smoothed over by the many examples with more frequent words.
On the other hand, the skip-gram model is designed to predict the context. Given the word delightful, it must understand it and tell us that there is a huge probability that the context is yesterday was really [...] day, or some other relevant context. With skip-gram, the word delightful will not try to compete with the word beautiful; instead, delightful+context pairs will be treated as new observations.
UPDATE
Thanks to #0xF for sharing this article
According to Mikolov
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
One more addition to the subject is found here:
In the "skip-gram" mode alternative to "CBOW", rather than averaging
the context words, each is used as a pairwise training example. That
is, in place of one CBOW example such as [predict 'ate' from
average('The', 'cat', 'the', 'mouse')], the network is presented with
four skip-gram examples [predict 'ate' from 'The'], [predict 'ate'
from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse'].
(The same random window-reduction occurs, so half the time that would
just be two examples, of the nearest words.)
It has to do with what exactly you're calculating at any given point. The difference will become clearer if you start to look at models that incorporate a larger context for each probability calculation.
In skip-gram, you're calculating the context word(s) from the word at the current position in the sentence; you're "skipping" the current word (and potentially a bit of the context) in your calculation. The result can be more than one word (but not if your context window is just one word long).
In CBOW, you're calculating the current word from the context word(s), so you will only ever have one word as a result.
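A toy sketch of how the two modes carve the same sentence into training examples (purely illustrative; real implementations add random window-reduction and negative sampling or hierarchical softmax on top):

def training_examples(tokens, window=2, mode='skipgram'):
    # enumerate the (input, output) groupings each mode trains on
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if mode == 'skipgram':
            # one pairwise example per (center word, context word) combination
            examples.extend((center, ctx) for ctx in context)
        else:  # 'cbow'
            # all context words together (summed/averaged) form one example per position
            examples.append((context, center))
    return examples

sent = ['the', 'cat', 'ate', 'the', 'mouse']
print(training_examples(sent, mode='skipgram'))
# [('the', 'cat'), ('the', 'ate'), ('cat', 'the'), ('cat', 'ate'), ('cat', 'the'), ...]
print(training_examples(sent, mode='cbow'))
# [(['cat', 'ate'], 'the'), (['the', 'ate', 'the'], 'cat'), ...]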
In the Deep Learning course on Coursera (https://www.coursera.org/learn/nlp-sequence-models?specialization=deep-learning) you can see that Andrew Ng doesn't switch the context-target concepts. It means that the target word will ALWAYS be treated as the word to be predicted, no matter whether it is CBOW or skip-gram.
