What is the difference between keras.tokenize.text_to_sequences and word embeddings - keras

Difference between tokenize.fit_on_text, tokenize.text_to_sequence and word embeddings?
Tried to search on various platforms but didn't get a suitable answer.

Word embeddings are a way of representing words such that words with the same or similar meaning have a similar representation. Two commonly used algorithms that learn word embeddings are Word2Vec and GloVe.
Note that word embeddings can also be learned from scratch while training your neural network for text processing on your specific NLP problem. You can also use transfer learning; in this case, it means transferring the representation of the words learned from huge datasets to your problem.
As for the tokenizer (I assume it's Keras that we're speaking of), to quote from the documentation:
Tokenizer.fit_on_texts() --> Creates the vocabulary index based on word frequency. For example, if you had the phrase "My dog is different from your dog, my dog is prettier", then word_index["dog"] = 1 and word_index["is"] = 2 (dog appears 3 times, is appears 2 times; index 0 is reserved and never assigned to a word).
Tokenizer.texts_to_sequences() --> Transforms each text into a sequence of integers. Basically, if you had a sentence, it would assign an integer to each word from your sentence. You can access tokenizer.word_index (a dictionary) to verify the integer assigned to each word.
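A minimal sketch of the two calls, assuming TensorFlow's bundled Keras (the exact import path may differ across versions):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["My dog is different from your dog, my dog is prettier"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                 # build the vocabulary index from word frequency
print(tokenizer.word_index)                   # e.g. {'dog': 1, 'is': 2, 'my': 3, ...}; index 0 is reserved

sequences = tokenizer.texts_to_sequences(texts)
print(sequences)                              # each word replaced by its integer index

The word_index dictionary and the produced sequences use the same integers, which is what lets you later feed the sequences into an Embedding layer that learns (or loads) the word embeddings discussed above.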

Related

How to find similar sentence from a corpus on word2vec?

I have implemented word2vec on my corpus using the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/word2vec#next_steps
Now I want to give a sentence as input and find a similar sentence in the corpus.
Any leads on how I can perform this?
A simple word2vec model is not capable of such a task, as it only relates word semantics to each other, not the semantics of whole sentences. Inherently, such a model has no generative function; it only serves as a look-up table.
Word2vec models map word strings to vectors in the embedding space. To find similar words for a given sample word, one can simply go through all vectors in the vocabulary and find the ones that are closest (in terms of the 2-norm) to the sample word vector. For further information you could go here or here.
However, this does not work for sentences as it would require a whole vocabulary of sentences of which to pick similar ones - which is not feasible.
Edit: This seems to be a duplicate of this question.
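As a concrete illustration of the word-level lookup described above, here is a minimal sketch assuming a Word2Vec model trained and saved with gensim (4.x API); the filename is hypothetical:

from gensim.models import Word2Vec
import numpy as np

model = Word2Vec.load("word2vec.model")        # hypothetical path to your trained model

# nearest neighbours of a sample word, ranked by cosine similarity
print(model.wv.most_similar("school", topn=5))

# the same idea "by hand": 2-norm distance from one stored vector to all others
query = model.wv["school"]
dists = np.linalg.norm(model.wv.vectors - query, axis=1)
print([model.wv.index_to_key[i] for i in dists.argsort()[:5]])

Note that this only ranks words from the model's fixed vocabulary; as the answer says, there is no analogous stored vocabulary of sentences to search through.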

How does gensim word2vec word embedding extract training word pair for 1 word sentence?

Refer to the image below (the process of how the word2vec skip-gram model extracts training datasets, i.e. the word pairs, from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what the word pair is when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in gensim: when there is just one word in a training-set sentence (and this word is not included in other sentences), the word2vec algorithm can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
===============UPDATE===============================
As the answer posted below explains, I think the word embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the starting random-initialization of the word, with no further training. (And, you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently model-able rare words is removed.)
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with the surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling – the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training; sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually-separate sentence – but still related by having appeared in the same document – will appear in each other's contexts.
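A small experiment (gensim 4.x, toy corpus made up for illustration) that matches the behaviour described above: with min_count=1 the isolated word ends up in the vocabulary, but its vector is just the untrained random initialization, since it never had neighbors to train against.

from gensim.models import Word2Vec

corpus = [
    ["i", "love", "my", "dog"],
    ["my", "dog", "loves", "long", "walks"],
    ["happy"],                                  # the 1-word "sentence"
]

model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=20)

print("happy" in model.wv.key_to_index)         # True: it is in the vocabulary...
print(model.wv["happy"][:4])                    # ...but this vector never received skip-gram updates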

What is the difference between Sentence Encodings and Contextualized Word Embeddings?

I have seen both terms used while reading papers about BERT and ELMo so I wonder if there is a difference between them.
A contextualized word embedding is a vector representing a word in a specific context. Traditional word embeddings such as Word2Vec and GloVe generate one vector for each word, whereas a contextualized word embedding generates a vector for a word depending on the context. Consider the sentences "The duck is swimming" and "You shall duck when someone shoots at you". With traditional word embeddings, the word vector for duck would be the same in both sentences, whereas it should be different in the contextualized case.
While word embeddings encode words into a vector representation, there is also the question of how to represent a whole sentence in a way a computer can easily work with. Sentence encodings embed a whole sentence as one vector; doc2vec, for example, generates a vector for a sentence. BERT also generates a representation for the whole sentence, via the [CLS] token.
So in short, a contextualized word embedding represents a word in a context, whereas a sentence encoding represents a whole sentence.
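A minimal sketch of both ideas, using the Hugging Face transformers library (an assumption; neither the question nor the answer names a specific toolkit):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden

tokens1, h1 = embed("The duck is swimming")
tokens2, h2 = embed("You shall duck when someone shoots at you")

duck1 = h1[tokens1.index("duck")]                # contextualized vector for "duck" (the noun)
duck2 = h2[tokens2.index("duck")]                # contextualized vector for "duck" (the verb)
print(torch.cosine_similarity(duck1, duck2, dim=0))  # less than 1: same word, different vectors

sentence_encoding = h1[0]                        # the [CLS] position, one common sentence representation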

Fasttext algorithm use only word and subword? or sentences too?

I read the paper and googled as well to find a good example of the learning method (or, more precisely, the learning procedure).
For word2vec, suppose there is a corpus sentence
I go to school with lunch box that my mother wrapped every morning
Then with a window size of 2, it will try to obtain the vector for 'school' by using the surrounding words
['go', 'to', 'with', 'lunch']
Now, FastText says that it uses subwords to obtain the vector, so it definitely uses n-gram subwords; for example with n=3,
['sc', 'sch', 'cho', 'hoo', 'ool', 'school']
Up to here, I understood.
But it is not clear whether the other words are also used when learning 'school'. I can only guess that the surrounding words are used as well, as in word2vec, since the paper mentions
=> the terms Wc and Wt are both used in its functions
where Wc is the context word and Wt is the word at sequence position t.
However, it is not clear how FastText learns the vectors for words.
Could you clearly explain how the FastText learning procedure works, step by step?
More precisely, I want to know whether FastText follows the same procedure as Word2Vec while additionally learning the character n-gram subwords, or whether only the character n-gram subwords together with the word are used.
How does it vectorize the subwords initially? etc.
Any context word has its candidate input vector assembled from the combination of both its full-word token and all its character n-grams. So if the context word is 'school', and you're using 3-4 character n-grams, the in-training input vector is a combination of the full-word vector for school and all the n-gram vectors for ['sch', 'cho', 'hoo', 'ool', 'scho', 'choo', 'hool'].
When that candidate vector is adjusted by training, all the constituent vectors are adjusted. (This is a little like how in word2vec's CBOW mode, all the words of the single averaged context input vector get adjusted together, when their ability to predict a single target output word is evaluated and improved.)
As a result, those n-grams that happen to be meaningful hints across many similar words – for example, common word-roots or prefixes/suffixes – get positioned where they confer that meaning. (Other n-grams may remain mostly low-magnitude noise, because there's little meaningful pattern to where they appear.)
After training, reported vectors for individual in-vocabulary words are also constructed by combining the full-word vector and all n-grams.
Then, when you encounter an out-of-vocabulary word, to the extent it shares some or many n-grams with morphologically similar in-training words, it will get a similar calculated vector – and thus be better than nothing in guessing what that word's vector should be. (And in the case of small typos or slight variants of known words, the synthesized vector may be pretty good.)
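A minimal sketch of this out-of-vocabulary behaviour with gensim's FastText implementation (4.x API; the toy corpus is made up for illustration):

from gensim.models import FastText

corpus = [
    "i go to school with a lunch box every morning".split(),
    "the school bus stops near the old schoolhouse".split(),
]

model = FastText(corpus, vector_size=32, window=2, min_count=1, epochs=20, min_n=3, max_n=4)

print("schooling" in model.wv.key_to_index)          # False: never seen during training
vec = model.wv["schooling"]                          # still works: assembled from shared n-grams
print(model.wv.similarity("school", "schooling"))    # typically high, thanks to shared subwords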
The fastText site states that at least 2 of the implemented algorithms do use surrounding words in sentences.
Moreover, the original fastText implementation is open source, so you can check exactly how it works by exploring the code.

what is dimensionality in word embeddings?

I want to understand what is meant by "dimensionality" in word embeddings.
When I embed a word in the form of a matrix for NLP tasks, what role does dimensionality play? Is there a visual example which can help me understand this concept?
Answer
A word embedding is just a mapping from words to vectors. Dimensionality in word embeddings refers to the length of these vectors.
Additional Info
These mappings come in different formats. Most pre-trained embeddings are available as a space-separated text file, where each line contains a word in the first position and its vector representation next to it. If you were to split these lines, you would find that they are of length 1 + dim, where dim is the dimensionality of the word vectors and 1 corresponds to the word being represented. See the GloVe pre-trained vectors for a real example.
For example, if you download glove.twitter.27B.zip, unzip it, and run the following python code:
#!/usr/bin/python3
with open('glove.twitter.27B.50d.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split() for line in lines]
print(len(lines)) # number of words (aka vocabulary size)
print(len(lines[0])) # length of a line
print(lines[130][0]) # word 130
print(lines[130][1:]) # vector representation of word 130
print(len(lines[130][1:])) # dimensionality of word 130
you would get the output
1193514
51
people
['1.4653', '0.4827', ..., '-0.10117', '0.077996'] # shortened for illustration purposes
50
Somewhat unrelated, but equally important: the lines in these files are sorted according to the word frequency found in the corpus on which the embeddings were trained (most frequent words first).
You could also represent these embeddings as a dictionary where the keys are the words and the values are lists representing the word vectors. The length of these lists would be the dimensionality of your word vectors.
A more common practice is to represent them as matrices (also called lookup tables) of dimension (V x D), where V is the vocabulary size (i.e., how many words you have) and D is the dimensionality of each word vector. In this case you need to keep a separate dictionary mapping each word to its corresponding row in the matrix.
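A short sketch of that lookup-table representation in plain numpy, reusing the glove.twitter.27B.50d.txt file from the code above:

import numpy as np

word_to_row = {}
vectors = []
with open('glove.twitter.27B.50d.txt') as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split()
        word_to_row[parts[0]] = i                        # word -> row index
        vectors.append([float(x) for x in parts[1:]])    # its D numbers

embedding_matrix = np.array(vectors)                     # shape (V, D) = (1193514, 50)
print(embedding_matrix[word_to_row['people']])           # same numbers as lines[130][1:] above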
Background
Regarding your question about the role dimensionality plays, you'll need some theoretical background. But in a few words, the space in which words are embedded presents nice properties that allow NLP systems to perform better. One of these properties is that words that have similar meaning are spatially close to each other, that is, have similar vector representations, as measured by a distance metric such as the Euclidean distance or the cosine similarity.
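A tiny illustration of those two measures, using made-up 3-dimensional vectors (real embeddings simply have more components, e.g. 50 or 300):

import numpy as np

road = np.array([0.9, 0.1, 0.3])
highway = np.array([0.8, 0.2, 0.25])
banana = np.array([-0.5, 0.9, -0.7])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.linalg.norm(road - highway), cosine(road, highway))   # small distance, high similarity
print(np.linalg.norm(road - banana), cosine(road, banana))     # large distance, low similarity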
You can visualize a 3D projection of several word embeddings here, and see, for example, that the closest words to "roads" are "highways", "road", and "routes" in the Word2Vec 10K embedding.
For a more detailed explanation I recommend reading the section "Word Embeddings" of this post by Christopher Olah.
For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.
Word embeddings like word2vec or GloVe don't embed words in two-dimensional matrices; they use one-dimensional vectors. "Dimensionality" refers to the size of these vectors. It is separate from the size of the vocabulary, which is the number of words you actually keep vectors for instead of throwing them out.
In theory larger vectors can store more information since they have more possible states. In practice there's not much benefit beyond a size of 300-500, and in some applications even smaller vectors work fine.
Here's a graphic from the GloVe homepage.
The dimensionality of the vectors is shown on the left axis; decreasing it would make the graph shorter, for example. Each column is an individual vector with color at each pixel determined by the number at that position in the vector.
The "dimensionality" in word embeddings represents the total number of features encoded. Actually, this is an oversimplification of the definition, but we will come to that in a bit.
The selection of features is usually not manual; it happens automatically via the hidden layer during training. Depending on the corpus of literature, the most useful dimensions (features) are selected. For example, if the literature is about romantic fiction, a dimension for gender is much more likely to be represented than in mathematical literature.
Once you have a word embedding vector of, say, 100 dimensions generated by a neural network for 100,000 unique words, it is generally not very useful to investigate the purpose of each dimension and try to label each dimension with a "feature name", because the feature(s) each dimension represents may not be simple or orthogonal, and since the process is automatic, nobody knows exactly what each dimension represents.
For more insight to understand this topic you may find this post useful.
Textual data has to be converted into numeric data before feeding into any Machine Learning algorithm.
Word Embedding is an approach for this where each word is mapped to a vector.
In algebra, a vector is a point in space with magnitude and direction.
In simpler terms, a vector is a 1-dimensional vertical array (or, say, a matrix with a single column), and dimensionality is the number of elements in that 1-D array.
Pre-trained word embedding models like GloVe and Word2vec provide multiple dimensionality options, for instance 50, 100, 200, or 300. Each word represents a point in D-dimensional space, and synonymous words are points close to each other. The higher the dimension, the better the accuracy tends to be, but the computational needs are also higher.
I'm not an expert, but I think the dimensions just represent the variables (aka attributes or features) which have been assigned to the words, although there may be more to it than that. The meaning of each dimension and total number of dimensions will be specific to your model.
I recently saw this embedding visualisation from the Tensor Flow library:
https://www.tensorflow.org/get_started/embedding_viz
This particularly helps reduce high-dimensional models down to something human-perceivable. If you have more than three variables it's extremely difficult to visualise the clustering (unless you are Stephen Hawking apparently).
This Wikipedia article on dimensionality reduction and related pages discuss how features are represented in dimensions, and the problems of having too many.
According to the book Neural Network Methods for Natural Language Processing by Goldberg, dimensionality in word embeddings (d_emb) refers to the number of columns in the first weight matrix (the weights between the input layer and the hidden layer) of embedding algorithms such as word2vec. In the book's figure, N is the dimensionality of the word embedding.
For more information you can refer to this link:
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
