How does gensim word2vec extract training word pairs for a 1-word sentence?

Refer to the image below (it illustrates how word2vec skip-gram extracts its training data, i.e. word pairs, from the input sentences).
E.g. "I love you." ==> [(I, love), (I, you)]
May I ask what the word pairs are when the sentence contains only one word?
Is it "Happy!" ==> [(happy, happy)]?
I tested the word2vec algorithm in gensim: when there is just one word in a training sentence (and this word is not included in any other sentence), word2vec can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
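To illustrate the pair-extraction step, here is a plain-Python sketch of the skip-gram windowing described above (not gensim's actual internals); note that a 1-word sentence yields no pairs at all:

def skipgram_pairs(tokens, window=2):
    """Extract (center, context) training pairs the way skip-gram windowing does."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "love", "you"]))  # [('i', 'love'), ('i', 'you'), ('love', 'i'), ...]
print(skipgram_pairs(["happy"]))             # [] -- a 1-word sentence produces no training pairs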
===============UPDATE===============================
As the answer below explains, the word embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.

No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the word's starting random initialization, with no further training. (And you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently modelable rare words is removed.)
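A small sketch of both points, assuming the gensim 4.x API (toy corpus, purely illustrative):

from gensim.models import Word2Vec

sentences = [["i", "love", "you"], ["you", "love", "cats"], ["happy"]]

# min_count=1 keeps the isolated word; its vector stays at its random initialization,
# because a 1-word sentence contributes no (center, context) training pairs.
m1 = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=5, seed=1)
print("happy" in m1.wv.key_to_index)   # True
print(m1.wv["happy"][:5])              # small random values, never adjusted by training

# min_count=2 (the usual advice for rare words) drops it from the vocabulary entirely.
m2 = Word2Vec(sentences, vector_size=50, window=2, min_count=2, sg=1, epochs=5, seed=1)
print("happy" in m2.wv.key_to_index)   # False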
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with the surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling – the algorithm just works on 'neighbors', it's common to use multi-sentence chunks as the training texts, and sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually-separate sentence – but still related by having appeared in the same document – will appear in each other's contexts.
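If the 1-word texts do sit next to other texts from the same document, a simple pre-processing step along these lines (a sketch; the chunk size is an arbitrary choice) gives them neighbors:

def chunk_texts(sentences, chunk_size=3):
    """Merge consecutive sentences into multi-sentence chunks, so short
    fragments share a context window with their document neighbors."""
    chunks = []
    for i in range(0, len(sentences), chunk_size):
        merged = [tok for sent in sentences[i:i + chunk_size] for tok in sent]
        chunks.append(merged)
    return chunks

docs = [["i", "love", "you"], ["happy"], ["see", "you", "tomorrow"]]
print(chunk_texts(docs))  # [['i', 'love', 'you', 'happy', 'see', 'you', 'tomorrow']]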

Related

How to find a similar sentence from a corpus using word2vec?

I have implemented word2vec on my corpus using the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/word2vec#next_steps
Now I want to give a sentence as input and find a similar sentence in the corpus.
Any leads on how I can perform this?
A simple word2vec model is not capable of such a task, as it only relates word semantics to each other, not the semantics of whole sentences. Inherently, such a model has no generative function; it only serves as a look-up table.
Word2vec models map word strings to vectors in the embedding space. To find words similar to a given sample word, one can simply go through all vectors in the vocabulary and find the ones that are closest (in terms of the 2-norm) to the sample word vector.
However, this does not work for sentences, as it would require a whole vocabulary of sentences from which to pick similar ones, which is not feasible.
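To make the word-level lookup concrete, here is a minimal sketch assuming gensim (the built-in most_similar uses cosine similarity; the explicit loop uses the 2-norm mentioned above; the corpus and query word are toy placeholders):

import numpy as np
from gensim.models import Word2Vec

# toy corpus purely for illustration; in practice this is your own tokenized text
corpus = [["the", "soup", "was", "tasty"],
          ["the", "noodles", "were", "tasty"],
          ["the", "soup", "was", "cold"],
          ["the", "noodles", "were", "cold"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, seed=1)

# built-in lookup (cosine similarity)
print(model.wv.most_similar("soup", topn=3))

# explicit 2-norm (Euclidean) nearest neighbour, as described above
query = model.wv["soup"]
dists = {w: np.linalg.norm(model.wv[w] - query) for w in model.wv.key_to_index if w != "soup"}
print(min(dists, key=dists.get))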
Edit: This seems to be a duplicate of this question.

What is the difference between Sentence Encodings and Contextualized Word Embeddings?

I have seen both terms used while reading papers about BERT and ELMo so I wonder if there is a difference between them.
A contextualized word embedding is a vector representing a word in a specific context. Traditional word embeddings such as Word2Vec and GloVe generate one vector for each word, whereas a contextualized word embedding generates a vector for a word depending on the context. Consider the sentences "The duck is swimming" and "You shall duck when someone shoots at you". With traditional word embeddings, the word vector for "duck" would be the same in both sentences, whereas it should be a different one in the contextualized case.
While word embeddings encode words into a vector representation, there is also the question of how to represent a whole sentence in a way a computer can easily work with. These sentence encodings embed a whole sentence as one vector; doc2vec, for example, generates a vector for a sentence. BERT also generates a representation for the whole sentence, via the [CLS] token.
So in short, a contextualized word embedding represents a word in a context, whereas a sentence encoding represents a whole sentence.
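A small sketch with the Hugging Face transformers library (my choice of toolkit, not the answer's) makes the distinction concrete: the same word gets a different vector in each sentence, while the [CLS] position gives one vector per sentence.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The duck is swimming", "You shall duck when someone shoots at you"]
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    # contextualized embedding of "duck" (assumes "duck" is a single WordPiece token)
    duck_id = tokenizer.convert_tokens_to_ids("duck")
    duck_vec = hidden[inputs["input_ids"][0].tolist().index(duck_id)]
    # sentence encoding: the vector at the [CLS] position (index 0)
    cls_vec = hidden[0]
    print(sent, duck_vec[:3], cls_vec[:3])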

Should I train embeddings using data from the training, validation and testing corpora?

I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of training embeddings from a general plus a domain-specific corpus.
The point here is: can I use the training, test and validation datasets (already preprocessed) as a source for creating my own word embeddings? If not, I hope you can share your experience.
Based on my intuition and some experiments, a wider corpus appears to be better, but I'd like to know if there's relevant research or other relevant results.
"can I use the training, test and validation datasets (already preprocessed) as a source for creating my own word embeddings?"
Sure. Embeddings are not your features for your machine learning model; they are the "computational representation" of your data. In short, they are words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings can be considered part of the pre-processing step of NLP.
Usually (at least with the most widely used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words it commonly appears alongside).
Therefore, when creating embeddings, the larger the corpus the better, since the model can better place each word vector in the vector space (and hence compare it to other, similar words).
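A minimal sketch of that workflow with gensim (the splits below are hypothetical placeholders for your own tokenized reviews):

from gensim.models import Word2Vec

train_texts = [["pho", "ngon", "qua"], ["banh", "mi", "ngon"]]
valid_texts = [["pho", "hoi", "nhat"]]
test_texts  = [["banh", "mi", "nhat"]]

# per the answer above: a larger corpus places word vectors better, and the
# embedding step is unsupervised, so train it on all splits combined
all_texts = train_texts + valid_texts + test_texts
emb = Word2Vec(all_texts, vector_size=50, window=2, min_count=1, seed=1)
print(emb.wv.most_similar("ngon", topn=2))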

What is the difference between Keras Tokenizer.texts_to_sequences and word embeddings?

What is the difference between Tokenizer.fit_on_texts, Tokenizer.texts_to_sequences and word embeddings?
I tried to search on various platforms but didn't find a suitable answer.
Word embeddings are a way of representing words such that words with the same or similar meaning have a similar representation. Two commonly used algorithms that learn word embeddings are Word2Vec and GloVe.
Note that word embeddings can also be learnt from scratch while training your neural network for text processing, on your specific NLP problem. You can also use transfer learning; in that case, it means transferring word representations learned on huge datasets to your problem.
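For the 'learnt from scratch while training your neural network' case, a small Keras sketch (the sizes are arbitrary example values):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len, embedding_dim = 10000, 100, 50   # arbitrary example sizes

model = tf.keras.Sequential([
    layers.Input(shape=(max_len,)),
    # the Embedding layer's weight matrix is the set of word vectors; it is
    # learned jointly with the rest of the network on your specific task
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()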
As for the tokenizer (I assume it's Keras that we're speaking of), taking from the documentation:
tokenizer.fit_on_texts() --> Creates the vocabulary index based on word frequency. For example, if you had the phrase "My dog is different from your dog, my dog is prettier", then word_index["dog"] = 1, because "dog" is the most frequent word (it appears 3 times; indices start at 1).
tokenizer.texts_to_sequences() --> Transforms each text into a sequence of integers. Basically, if you had a sentence, it would assign an integer to each word of your sentence. You can inspect tokenizer.word_index (a dictionary) to verify the integer assigned to each word.
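A runnable version of that example with tf.keras (note the actual method names are plural, fit_on_texts / texts_to_sequences, and word_index is an attribute, not a method):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["My dog is different from your dog, my dog is prettier"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)       # builds the frequency-ordered vocabulary
print(tokenizer.word_index)         # {'dog': 1, 'my': 2, 'is': 3, ...} -- most frequent word gets the lowest index

sequences = tokenizer.texts_to_sequences(texts)
print(sequences)                    # each word replaced by its integer index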

Does the fastText algorithm use only words and subwords, or sentences too?

I read the paper and also googled for a good example of the learning method (or, more precisely, the learning procedure).
For word2vec, suppose the corpus contains the sentence
I go to school with lunch box that my mother wrapped every morning
Then with window size 2, it will try to obtain the vector for 'school' by using surrounding words
['go', 'to', 'with', 'lunch']
Now, FastText says that it uses subwords to obtain the vector, so it definitely uses character n-gram subwords; for example, with n=3:
['sc', 'sch', 'cho', 'hoo', 'ool', 'school']
Up to here, I understood.
But it is not clear whether the other (surrounding) words are also used when learning the vector for 'school'. I can only guess that surrounding words are used as well, like in word2vec, since the paper mentions that
the terms W_c and W_t are both used in the functions,
where W_c is the context word and W_t is the word at position t.
However, it is not clear how fastText learns the vectors for words.
Could you clearly explain how the fastText learning procedure works?
More precisely, I want to know whether fastText follows the same procedure as word2vec while additionally learning the character n-gram subwords, or whether only the character n-gram subwords of the word itself are used.
And how are the subword vectors initialized?
Any context word has its candidate input vector assembled from the combination of both its full-word token and all its character n-grams. So if the context word is 'school', and you're using 3-4 character n-grams, the in-training input vector is a combination of the full-word vector for 'school' and all the n-gram vectors for ['sch', 'cho', 'hoo', 'ool', 'scho', 'choo', 'hool'].
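A tiny sketch of how those character n-grams are generated; note that the actual fastText implementation adds '<' and '>' word-boundary markers, which the lists above omit:

def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams roughly the way fastText builds them, with boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("school"))
# ['<sc', 'sch', 'cho', 'hoo', 'ool', 'ol>', '<sch', 'scho', 'choo', 'hool', 'ool>']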
When that candidate vector is adjusted by training, all the constituent vectors are adjusted. (This is a little like how, in word2vec's CBOW mode, all the words of the single averaged context input vector get adjusted together when their ability to predict a single target output word is evaluated and improved.)
As a result, those n-grams that happen to be meaningful hints across many similar words – for example, common word-roots or prefixes/suffixes – get positioned where they confer that meaning. (Other n-grams may remain mostly low-magnitude noise, because there's little meaningful pattern to where they appear.)
After training, reported vectors for individual in-vocabulary words are also constructed by combining the full-word vector and all n-grams.
Then, when you encounter an out-of-vocabulary word, to the extent it shares some or many n-grams with morphologically similar in-training words, it will get a similar calculated vector – and thus be better than nothing in guessing what that word's vector should be. (And in the case of small typos or slight variants of known words, the synthesized vector may be pretty good.)
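A minimal sketch of that out-of-vocabulary behavior with gensim's FastText implementation (toy corpus, purely illustrative):

from gensim.models import FastText

corpus = [["i", "go", "to", "school", "with", "lunch", "box"],
          ["my", "mother", "wrapped", "lunch", "every", "morning"],
          ["the", "school", "is", "near", "my", "house"]]

model = FastText(corpus, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=4, epochs=10, seed=1)

print("schooling" in model.wv.key_to_index)        # False: never seen during training
vec = model.wv["schooling"]                        # still works: synthesized from shared n-grams
print(model.wv.similarity("school", "schooling"))  # typically fairly high, thanks to 'sch', 'cho', 'hoo', ...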
The fastText site states that at least two of the implemented algorithms do use surrounding words in sentences.
Moreover, the original fastText implementation is open source, so you can check exactly how it works by exploring the code.
