Does the fastText algorithm use only words and subwords, or sentences too?

I read the paper and also searched for a good example of the learning method (or, more precisely, the learning procedure).
For word2vec, suppose the corpus contains this sentence:
I go to school with lunch box that my mother wrapped every morning
Then, with a window size of 2, it tries to learn the vector for 'school' from the surrounding words:
['go', 'to', 'with', 'lunch']
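As a minimal sketch of that window lookup (plain Python, not any library's API; `context_window` is a name made up for illustration, and real word2vec implementations also shrink the window randomly per position, which is ignored here):

```python
def context_window(tokens, index, window=2):
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


sentence = "I go to school with lunch box that my mother wrapped every morning".split()
context = context_window(sentence, sentence.index("school"), window=2)
print(context)
```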
Now, fastText says that it uses subwords to obtain the vector, so it definitely uses character n-gram subwords, for example with n=3:
['sc', 'sch', 'cho', 'hoo', 'ool', 'school']
Up to here, I understand.
But it is not clear whether the other words are also used when learning the vector for 'school'. I can only guess that the surrounding words are used as well, as in word2vec, since the paper mentions
=> the terms Wc and Wt are both used in functions
where Wc is the context word and Wt is the word at position t.
However, it is not clear how fastText learns the vectors for words.
Please explain clearly, step by step, how the fastText learning process works.
More precisely, I want to know whether fastText follows the same procedure as word2vec while additionally learning the character n-gram subwords, or whether it uses only the character n-gram subwords together with the word itself.
How does it initialize the subword vectors? etc.

Any context word has its candidate input vector assembled from the combination of both its full-word token and all its character n-grams. So if the context word is 'school', and you're using 3-4 character n-grams, the in-training input vector is a combination of the full-word vector for school and all the n-gram vectors for ['sch', 'cho', 'hoo', 'ool', 'scho', 'choo', 'hool'].
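The n-gram list above can be reproduced with a few lines of Python. This is only a sketch, not the library's code: `char_ngrams` and its `boundaries` flag are hypothetical names. One detail worth knowing is that the fastText paper wraps each word in '<' and '>' before slicing, so that prefixes and suffixes get their own distinct n-grams; the second call shows what that looks like for n=3.

```python
def char_ngrams(word, min_n=3, max_n=4, boundaries=False):
    """List all character n-grams of `word` with min_n <= n <= max_n."""
    # With boundaries=True the word is wrapped as '<word>' before slicing.
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


plain = char_ngrams("school")
marked = char_ngrams("school", min_n=3, max_n=3, boundaries=True)
print(plain)
print(marked)
```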
When that candidate vector is adjusted by training, all the constituent vectors are adjusted. (This is a little like how, in word2vec CBOW mode, all the words of the single average context input vector get adjusted together, when their ability to predict a single target output word is evaluated and improved.)
As a result, those n-grams that happen to be meaningful hints across many similar words – for example, common word-roots or prefixes/suffixes – get positioned where they confer that meaning. (Other n-grams may remain mostly low-magnitude noise, because there's little meaningful pattern to where they appear.)
After training, reported vectors for individual in-vocabulary words are also constructed by combining the full-word vector and all n-grams.
Then, when you also encounter an out-of-vocabulary word, to the extent it shares some or many n-grams with morphologically-similar in-training words, it will get a similar calculated vector – and thus be better than nothing, in guessing what that word's vector should be. (And in the case of small typos or slight variants of known words, the synthesized vector may be pretty good.)
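To make the out-of-vocabulary case concrete, here is a toy sketch with random vectors standing in for trained ones, so only the shapes and the n-gram overlap mean anything. Everything in it is an assumption for illustration: the tiny vocabulary, the dict-based n-gram table (the real implementation hashes n-grams into a fixed number of buckets), and the use of a mean to combine the parts.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8


def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of '<word>' (boundary markers included)."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


# Toy stand-ins for trained parameters: one vector per known word and per n-gram.
vocab = ["school", "schools", "lunch"]
word_vecs = {w: rng.normal(size=dim) for w in vocab}
ngram_vecs = {}
for w in vocab:
    for g in char_ngrams(w):
        ngram_vecs.setdefault(g, rng.normal(size=dim))


def word_vector(word):
    """Combine known n-gram vectors, plus the full-word vector if in vocabulary."""
    parts = [ngram_vecs[g] for g in char_ngrams(word) if g in ngram_vecs]
    if word in word_vecs:
        parts.append(word_vecs[word])
    if not parts:
        return np.zeros(dim)
    return np.mean(parts, axis=0)


# 'schol' is a typo that never appeared in training, yet it still gets a vector
# because it shares several n-grams with 'school'.
oov = word_vector("schol")
shared = set(char_ngrams("schol")) & set(char_ngrams("school"))
print(sorted(shared))
```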

The fastText site states that at least two of the implemented algorithms do use surrounding words in sentences.
Moreover, the original fastText implementation is open source, so you can check exactly how it works by exploring the code.

Related

How does gensim word2vec word embedding extract training word pair for 1 word sentence?

Refer to the image below (how word2vec skip-gram extracts its training data, the word pairs, from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what is the word pair when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in gensim: when a training sentence contains just one word (and this word does not appear in any other sentence), word2vec can still construct an embedding vector for that word. I am not sure how the algorithm is able to do so.
===============UPDATE===============================
As the answer below explains, I think the embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the starting random-initialization of the word, with no further training. (And, you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently model-able rare words is removed.)
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling - the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training, and sometimes even punctuation (like sentence-ending periods) is also retained as 'words'. Then words from an actually-separate sentence – but still related by having appeared in the same document – will appear in each other's contexts.
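A toy pair generator shows why a 1-word text contributes nothing. This is a plain-Python sketch, not gensim's internals; `skipgram_pairs` is a made-up name, and the random window shrinking and frequent-word subsampling that real implementations apply are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, neighbor) pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


three_words = skipgram_pairs(["I", "love", "you"])
one_word = skipgram_pairs(["Happy"])
print(three_words)
print(one_word)  # empty: no neighbors, so nothing to train on
```

Joining the lone word with the adjacent text, as suggested above, is what gives it neighbors and therefore pairs.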

How can I recover the likelihood of a certain word appearing in a given context from word embeddings?

I know that some methods of generating word embeddings (e.g. CBOW) are based on predicting the likelihood of a given word appearing in a given context. I'm working with the Polish language, which is sometimes ambiguous with respect to segmentation, e.g. 'Coś' can be treated either as one word or as two conjoined words ('Co' + '-ś'), depending on the context. What I want to do is create a context-sensitive tokenizer. Assuming that I have the vector representation of the preceding context, and all possible segmentations, could I somehow calculate or approximate the likelihood of particular words appearing in this context?
This very much depends on how you got your embeddings. The CBOW model has two parameter matrices: the embedding matrix, denoted v, and the output projection matrix, v'. If you want to recover the probabilities that are used in the CBOW model at training time, you need to get v' as well.
See equation (2) in the word2vec paper. Tools for pre-computing word embeddings usually don't do that, so you would need to modify them yourself.
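For illustration only, this is roughly what that computation looks like once you have both matrices. The sketch uses random numbers in place of trained weights and a full softmax over a five-word toy vocabulary; the names `V_in`, `V_out` and `cbow_probs` are made up here, and real tools usually train with hierarchical softmax or negative sampling rather than the full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["co", "coś", "ś", "to", "jest"]
dim = 4

# Random stand-ins for the two trained matrices.
V_in = rng.normal(size=(len(vocab), dim))   # v: input embeddings
V_out = rng.normal(size=(len(vocab), dim))  # v': output projection


def cbow_probs(context_words):
    """Softmax over the vocabulary given the averaged context embedding."""
    idx = [vocab.index(w) for w in context_words]
    h = V_in[idx].mean(axis=0)
    scores = V_out @ h
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()


probs = cbow_probs(["to", "jest"])
for word, p in zip(vocab, probs):
    print(word, round(float(p), 3))
```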
Anyway, if you want to compute a probability of a word, given a context, you should rather think about using a (neural) language model than a table of word embeddings. If you search the Internet, I am sure you will find something that suits your needs.

How to interpret CBOW word embeddings?

In the context of word2vec, it is said that "words occurring in similar contexts have similar word embeddings"; for example, "love" and "hate" may have similar embeddings because they appear with context words such as "I" and "movie".
I get the intuition with skip-gram: the embeddings of both "love" and "hate" should predict the context words "I" and "movie", so the embeddings should be similar. However, I can't see it with CBOW: it says that the average embedding of "I" and "movie" should predict "love" and "hate"; does that necessarily imply that the embeddings of "love" and "hate" should be similar? Or do we interpret the word embeddings of SG and CBOW in different ways?
In practice it's all smoothed out by the diversity of contexts in CBOW – so the same intuition that works for skip-gram should also apply to CBOW.
Even if 'movie' is only 1/Nth of an influence on the average vector of all context words, when that average vector gets backpropagation-corrected to be slightly more predictive of 'love' (for a single training example), each word contributing to it is also backpropagation-corrected.
Over all examples, and all passes, corrections that pull in random directions tend to cancel each other out, but any consistent tendencies – like two words often co-occurring – tend to reinforce similar correction-nudges on their word-vectors, moving them near each other. (Or, other words that are like a word in other aspects.)
Skip-gram is the stark, simple version: force word X to be more predictive of word Y – but expect that lots of other 1:1 corrections will all balance out. CBOW does things in batches: force words X^1, X^2, ... X^n to all be more predictive of word Y - but expect that lots of other somewhat-overlapping batches will pull distinct words together/apart as needed.
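The "batch" behaviour of CBOW can be seen in a single hand-written update step. This is a rough numpy sketch with a full softmax and toy random weights, not how gensim or the original C code implements it; `cbow_step`, `W_in` and `W_out` are names invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["I", "love", "hate", "this", "movie"]
idx = {w: i for i, w in enumerate(vocab)}
dim = 5
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input (word) vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output weights


def cbow_step(context, target, lr=0.1):
    """One full-softmax CBOW update; returns the loss before the update."""
    c = [idx[w] for w in context]          # assumes distinct context words
    h = W_in[c].mean(axis=0)               # averaged context vector
    scores = W_out @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()
    err = p.copy()
    err[idx[target]] -= 1.0                # gradient of the loss w.r.t. scores
    grad_h = W_out.T @ err                 # gradient w.r.t. the averaged vector
    W_out[:] -= lr * np.outer(err, h)
    W_in[c] -= lr * grad_h / len(c)        # every context word gets the same nudge
    return float(-np.log(p[idx[target]]))


before = W_in.copy()
loss = cbow_step(["I", "this", "movie"], "love")
changed = [w for w in vocab if not np.array_equal(before[idx[w]], W_in[idx[w]])]
print(changed)
```

Only the rows of the context words move; 'love' and 'hate' end up with similar input vectors because, over many examples, they sit in contexts that predict the same neighbors.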

Semantic Similarity across multiple languages

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).
So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)?
Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.
It should be possible to do what you've suggested, provided that:
you have good sets of word-vectors for each language's words
the coordinate spaces of the word-vectors are compatible, meaning the words for the exact-same things in both languages have nearly-identical coordinates (and other words with similar meanings have close coordinates)
That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally-separate English/Dutch corpuses won't likely place equivalent words at the same coordinates.
But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
In fact this very idea was used in one of the first word2vec papers:
"Exploiting Similarities among Languages for Machine Translation"
If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
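A minimal sketch of learning such a transformation, assuming a plain least-squares fit on anchor pairs. The data is synthetic: the "target" space is constructed as an exact linear image of the "source" space, so the fit recovers it perfectly, which real embeddings will not do. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Synthetic setup: the target space is the source space under an orthogonal map.
true_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_anchor = rng.normal(size=(20, dim))   # vectors of the anchor words, language A
tgt_anchor = src_anchor @ true_map        # vectors of their translations, language B

# Least-squares solution of src_anchor @ W ~= tgt_anchor.
W, *_ = np.linalg.lstsq(src_anchor, tgt_anchor, rcond=None)

# Project words that were not among the anchors into the target space.
new_src = rng.normal(size=(3, dim))
mapped = new_src @ W
print(np.allclose(mapped, new_src @ true_map))
```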
Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
I've recently produced a Python implementation of the technique mentioned in the paper from @gojomo's answer: transvec.
You'll need to provide word translation pairs as training data (I just threw words from my corpus into Google Translate to get as many such pairs as I can) and then you can use a wrapper model from transvec to produce comparable word embeddings for multiple languages. Here's an example:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
For the case of documents rather than words, things are a little trickier because Doc2Vec can't use pre-trained Word2Vec models as a starting point. However, you can get an approximate document vector by simply taking the mean of all the word vectors from that document. If you provide a 2d array to TranslationWordVectorizer's transform method, it will do exactly this and provide you with an approximate document vector so you can find documents with similar meaning even if the languages are different.
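The averaging step on its own is simple enough to write by hand. This sketch uses a toy dict of 2-d vectors in place of a real model, and `mean_doc_vector` is a hypothetical helper, not part of transvec or gensim.

```python
import numpy as np


def mean_doc_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


toy_vectors = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
doc = mean_doc_vector(["a", "good", "movie"], toy_vectors, dim=2)
print(doc)
```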

CBOW vs. skip-gram: why invert context and target words?

In this page, it is said that:
[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]
However, looking at the training dataset it produces, the contents of X and Y seem to be interchangeable, as in these two (X, Y) pairs:
(quick, brown), (brown, quick)
So, why distinguish that much between context and targets if it is the same thing in the end?
Also, while doing the word2vec exercise in Udacity's Deep Learning course, I wondered why they distinguish so sharply between the two approaches in this problem:
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
Wouldn't this yield the same results?
Here is my oversimplified and rather naive understanding of the difference:
As we know, CBOW learns to predict the word from the context, that is, to maximize the probability of the target word given the context. This turns out to be a problem for rare words. For example, given the context yesterday was a really [...] day, a CBOW model will tell you that the word is most probably beautiful or nice. Words like delightful will get much less attention from the model, because it is designed to predict the most probable word. The rare word will be smoothed over by the many examples with more frequent words.
On the other hand, the skip-gram model is designed to predict the context. Given the word delightful, it must understand it and tell us that there is a high probability that the context is yesterday was a really [...] day, or some other relevant context. With skip-gram, the word delightful does not have to compete with the word beautiful; instead, delightful+context pairs are treated as new observations.
UPDATE
Thanks to @0xF for sharing this article
According to Mikolov
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
One more addition to the subject is found here:
In the "skip-gram" mode alternative to "CBOW", rather than averaging
the context words, each is used as a pairwise training example. That
is, in place of one CBOW example such as [predict 'ate' from
average('The', 'cat', 'the', 'mouse')], the network is presented with
four skip-gram examples [predict 'ate' from 'The'], [predict 'ate'
from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse'].
(The same random window-reduction occurs, so half the time that would
just be two examples, of the nearest words.)
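The quoted description can be turned into a small sketch that builds both kinds of examples for the center word 'ate'. Pairs are written as (input, word to predict), following the direction used in the quote; `training_examples` is a made-up helper and the random window reduction is omitted.

```python
def training_examples(tokens, center_index, window=2):
    """Build the CBOW example and the skip-gram examples for one center word."""
    center = tokens[center_index]
    lo = max(0, center_index - window)
    context = tokens[lo:center_index] + tokens[center_index + 1:center_index + 1 + window]
    cbow = [(tuple(context), center)]                 # one example: all context words together
    skipgram = [(word, center) for word in context]   # one example per context word
    return cbow, skipgram


cbow, sg = training_examples(["The", "cat", "ate", "the", "mouse"], 2)
print(cbow)
print(sg)
```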
It has to do with what exactly you're calculating at any given point. The difference will become clearer if you start to look at models that incorporate a larger context for each probability calculation.
In skip-gram, you're calculating the context word(s) from the word at the current position in the sentence; you're "skipping" the current word (and potentially a bit of the context) in your calculation. The result can be more than one word (but not if your context window is just one word long).
In CBOW, you're calculating the current word from the context word(s), so you will only ever have one word as a result.
In the Deep Learning course on Coursera (https://www.coursera.org/learn/nlp-sequence-models?specialization=deep-learning) you can see that Andrew Ng doesn't switch the context and target concepts. That is, the target word is ALWAYS treated as the word to be predicted, no matter whether the model is CBOW or skip-gram.
