Suppose I want to train an RNN on pseudo-random words (not part of any dictionary) so I can't use word2vec. How can I represent each char in the word using tensorflow?
If you are just working with characters, you can use a one-hot vector of size 128, which can represent every ASCII character. (You may want something smaller, since you're unlikely to need all of ASCII; 26 slots, one per letter, may be enough.) You don't really need anything like word vectors, since the range of possibilities is small.
Actually, when you use one-hot encodings you are still, in effect, learning a vector for each character. Say your first dense layer (or RNN layer) contains 100 neurons. Then this results in multiplying the one-hot encoding by a 128x100 matrix. Since all but one of the values are zero, you are essentially selecting a single row of length 100 from the matrix, which is a vector representation of that character. In other words, that first matrix is just a list of the vectors representing each character, and your model will learn these vector representations. Because of the sparseness of one-hot encodings, it is often faster to just look up the row rather than carry out the full matrix multiply. This is what tf.nn.embedding_lookup or tf.gather is used for.
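To make that equivalence concrete, here is a small TensorFlow 2.x sketch; the alphabet size, layer width, and example indices are illustrative assumptions, not part of the original answer:

import tensorflow as tf

# A 26-letter "alphabet" and a 100-unit first layer (illustrative sizes).
vocab_size, hidden = 26, 100
weights = tf.random.normal([vocab_size, hidden])    # the first layer's weight matrix

char_ids = tf.constant([7, 4, 11, 11, 14])          # e.g. indices for "hello"

# Option 1: explicit one-hot vectors followed by a matrix multiply.
one_hot = tf.one_hot(char_ids, depth=vocab_size)    # shape (5, 26)
dense_out = tf.matmul(one_hot, weights)             # shape (5, 100)

# Option 2: skip the multiply and select the rows directly.
looked_up = tf.nn.embedding_lookup(weights, char_ids)   # shape (5, 100)
# tf.gather(weights, char_ids) gives the same result.

# Both options produce identical rows, because multiplying by a one-hot
# vector simply selects one row of the weight matrix.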
I can't figure out what a subword input vector is. I read in the paper that the subword is hashed; the hash code is a number, not a vector.
Ex: the input vector of the word 'eating' is [0,0,0,1,0,0,0,0,0].
So what are the input vectors of the subwords "eat", "ati", "ing", ...?
Link paper: https://arxiv.org/pdf/1607.04606.pdf
the subword is the hash code, hash code is a number, not a vector
The FastText subwords are, as you've suggested, fragments of the full word. For the purposes of subword creation, FastText will also prepend/append special start-of-word and end-of-word characters. (If I recall correctly, it uses < & >.)
So, for the full word token 'eating', it is considered as '<eating>'.
All the 3-character subwords would be '<ea', 'eat', 'ati', 'tin', 'ing', 'ng>'.
All the 4-character subwords would be '<eat', 'eati', 'atin', 'ting', 'ing>'.
All the 5-character subwords would be '<eati', 'eatin', 'ating', 'ting>'.
I see you've written out a "one-hot" representation of the full word 'eating', [0,0,0,1,0,0,0,0,0], as if 'eating' is the 4th word in a 9-word vocabulary. While diagrams and certain ways of thinking about the underlying model may consider such a one-hot vector, it's useful to realize that in actual code implementations such a sparse one-hot vector for words is never actually created.
Instead, the word is just represented as a single number: the index of the non-zero element. That index is used as a lookup into an array of vectors of the configured 'dense' size, returning one input word-vector of that size for the word.
For example, imagine you have a model with a 1-million word known vocabulary, which offers 100-dimensional 'dense embedding' word-vectors. The word 'eating' is the 543,210th word.
That model will have an array of input vectors with one million slots, and each slot holds a 100-dimensional vector. We could call it word_vector_in. The word 'eating''s vector will be at word_vector_in[543209] (because the 1st vector is at word_vector_in[0]).
At no point during the creation/training/use of this model will an actual 1-million-long one-hot vector for 'eating' be created. Most often, it'll just be referred-to inside the code as the word-index 543209. The model will have a helper lookup dictionary/hashmap, let's call it word_index that lets code find the right slot for a word. So word_index['eating'] will be 543209.
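As a toy illustration of that lookup (the array and dictionary names mirror the prose above, not any real library, and the vocabulary is shrunk to five words):

import numpy as np

word_vector_in = np.random.rand(5, 100)      # 5 known words, one 100-dim vector each
word_index = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'eating': 4}

vec = word_vector_in[word_index['eating']]   # one 100-dimensional vector
print(vec.shape)                             # (100,)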
OK, now to your actual question about the subwords. I've detailed above how the single vector per known full word is stored, in order to contrast it with the different way subwords are handled.
Subwords are also stored in a big array of vectors, but that array is treated as a collision-oblivious hashtable. That is, by design, many subwords can and do all reuse the same slot.
Let's call that big array of subword vectors subword_vector_in. Let's also make it 1 million slots long, where each slot has a 100-dimensional vector.
But now, there is no dictionary that remembers which subwords are in which slots - for example, remembering that subword '<eat' is in arbitrary slot 78789.
Instead, the string '<eat' is hashed to a number, that number is reduced to a valid index into the subword array, and the vector at that index (let's say it's 12344) is used for the subword.
And then when some other subword comes along, maybe '<dri', it might hash to the exact same 12344 slot. That same vector then gets adjusted for that other subword during training, and returned for both those subwords (and possibly many others) during later FastText vector synthesis from the final model.
Notably, now even if there are far more than 1-million unique subwords, they can all be represented inside that single 1-million slot array, albeit with collisions/interference.
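For a concrete feel of what a slot_hash function does, here is a rough Python sketch; FastText's reference implementation uses an FNV-1a-style hash roughly like this, but treat the exact details here as illustrative rather than authoritative:

def slot_hash(subword, buckets=1_000_000):
    # FNV-1a-style hash, reduced to a valid slot index.
    h = 2166136261
    for byte in subword.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF   # keep it a 32-bit value
    return h % buckets

print(slot_hash('<eat'))   # some slot index in [0, buckets)
print(slot_hash('<dri'))   # may or may not land in the same slot (a collision)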
In practice, the collisions are tolerable: collisions among very rare subwords mostly just add random noise to slots, and that noise largely cancels out. The most common subwords, which tend to carry whatever distinct meaning there is (because of the way word roots, prefixes, and suffixes hint at word meaning in English and similar languages), overpower that noise and ensure that each such slot carries at least some hint of the implied meaning(s) of its most common subword(s).
So when FastText assembles its final word-vector, by adding:
word_vector_in[word_index['eating']] # learned known-word vector
+ subword_vector_in[slot_hash('<ea')] # 1st 3-char subword
+ subword_vector_in[slot_hash('eat')]
+ subword_vector_in[slot_hash('ati')]
... # other 3-char subwords
... # every 4-char subword
... # other 5-char subwords
+ subword_vector_in[slot_hash('ting>')] # last 5-char subword
…it gets something that's dominated by the (likely stronger-in-magnitude) known full-word vector, with some useful hints of meaning also contributed by the (probably lower-magnitude) many noisy subword vectors.
And then if we were to imagine that some other word that's not part of the known 1-million word vocabulary comes along, say 'eatery', it has nothing from word_vector_in for the full word, but it can still do:
subword_vector_in[slot_hash('<ea')] # 1st 3-char subword
+ subword_vector_in[slot_hash('eat')]
+ subword_vector_in[slot_hash('ate')]
... # other 3-char subwords
... # every 4-char subword
... # other 5-char subwords
+ subword_vector_in[slot_hash('tery>')] # last 5-char subword
Because at least a few of those subwords likely carry some hints of the meaning of the word 'eatery' (especially meanings around 'eat', or even the venue/vendor sense of the suffix '-ery'), this synthesized guess for an out-of-vocabulary (OOV) word will be better than a random vector, and often better than ignoring the word entirely in whatever higher-level process is using the FastText vectors.
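If you want to see this behavior without implementing it yourself, Gensim's FastText implementation exposes it directly; a hedged sketch (Gensim 4.x parameter names assumed, with a toy corpus as a placeholder):

from gensim.models import FastText

corpus = [["people", "were", "eating", "at", "the", "restaurant"],
          ["she", "was", "eating", "dinner"]]

model = FastText(corpus, vector_size=100, min_count=1,
                 min_n=3, max_n=6, bucket=1_000_000, epochs=10)

print(model.wv['eating'].shape)            # (100,) - known word
print('eatery' in model.wv.key_to_index)   # False: never seen in training
print(model.wv['eatery'].shape)            # (100,) - synthesized from subword vectors only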
Universal Sentence Encoder encodes sentences into a vector of 512 features. My proposition is that if a sentence is gibberish then most of the features will be very close to zero, whereas if a sentence has meaning then some of the 512 features will be much greater than or much less than zero. Can we then decide, just by looking at the distribution of the vector's feature values, which vectors encode meaning and which encode gibberish?
It seems that the USE encodes features in a fairly arbitrary fashion. I conducted a lot of experiments and saw that the features scaled up and down arbitrarily, without regard to whether the sentence was gibberish or meaningful. The experiments included counting the number of positive and negative features in meaningful and gibberish vectors, and finding the mean and standard deviation of the features. But nothing showed any pattern that could delineate the two. Attached are the screenshots.
Below is sample 2. Many more samples (around 30) were taken, and no pattern in the count of positive/negative features, standard deviation, or mean was observed that could separate a gibberish USE vector from a meaningful one.
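For reference, a rough sketch of the kind of experiment described above, assuming the TF-Hub module at https://tfhub.dev/google/universal-sentence-encoder/4 and some placeholder sentences:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["the quick brown fox jumps over the lazy dog",   # meaningful
             "florp gleeb znarf quux blib"]                   # gibberish

vecs = embed(sentences).numpy()   # shape (2, 512)
for sent, v in zip(sentences, vecs):
    print(sent)
    print("  positives:", int((v > 0).sum()), " negatives:", int((v < 0).sum()),
          " mean: %.4f" % v.mean(), " std: %.4f" % v.std())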
For a university project I have to recognize characters from a license plate. I have to do this using python 3. I am not allowed to use OCR functions or use functions that use deep learning or neural networks. I have reached the point where I am able to segment the characters from a license plate and transform them to a uniform format. A few examples of segmented characters are here.
The format of the segmented characters is very dependent on the input. However, I can easily convert this to uniform dimensions using opencv. Additionally, I have a set of template characters and numbers that I can use to predict what character / number it is.
I therefore need a metric to express the similarity between the segmented character and the reference image. In this way, I can say that the reference image with the highest similarity score matches the segmented character. I have tried the following ways to compute the similarity.
For these operations I have made sure that the reference characters and the segmented characters have the same dimensions.
A bitwise XOR operator on the two binary images.
Inverting the reference characters and comparing them pixel by pixel: if a pixel matches, increment the similarity score; if it does not match, decrement it (a rough sketch of this pixel-comparison idea is shown after this list).
Hashing both the segmented character and the reference character using 'imagehash', then comparing the hashes to see which ones are most similar.
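Here is a minimal NumPy sketch of the pixel-comparison scoring described above; it assumes both images are already binarized and resized to the same dimensions, and the names (pixel_similarity, templates, char_img) are just placeholders:

import numpy as np

# Minimal sketch of the pixel-comparison similarity described above.
# Assumes binary images (0 = background, 255 = character) of equal size.
def pixel_similarity(segmented, reference):
    seg = segmented > 127
    ref = reference > 127
    matches = np.sum(seg == ref)       # pixels that agree
    mismatches = np.sum(seg != ref)    # pixels that differ (the XOR count)
    return int(matches) - int(mismatches)

# The predicted character is the template with the highest score, e.g.:
# predicted = max(templates, key=lambda c: pixel_similarity(char_img, templates[c]))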
None of these methods succeeds in giving me an accurate prediction for all characters. Most characters are usually predicted correctly, but the program consistently confuses characters like 8-B, D-0, 7-Z, and P-R.
Does anybody have an idea how to predict the segmented characters? I.e. defining a better similarity score.
Edit: Unfortunately, cv2.matchTemplate and cv2.matchShapes are not allowed for this assignment...
The general procedure for comparing two images consists of extracting features from the two images and then comparing them. What you are actually doing in the first two methods is treating the value of every pixel as a feature. The similarity measure is therefore a distance computation in a space of very high dimension. Such methods are, however, sensitive to noise, and they require very big datasets in order to obtain acceptable results.
For this reason, usually one attempts to reduce the space dimensionality. I'm not familiar with the third method, but it seems to go in this direction.
A way to reduce the space dimensionality consists in defining some custom features meaningful for the problem you are facing.
A possibility for the character classification problem could be to define features that measure the response of the input image on strategic subshapes of the characters (an upper horizontal line, a lower one, a circle in the upper part of the image, a diagonal line, etc.).
You could define a minimal set of shapes that, combined together, can generate every character. Then you should retrieve one feature per shape, by measuring the response of the original image on that particular shape (i.e., integrating the signal of the input image inside the shape). Finally, you should determine the class the image belongs to by taking the nearest reference point in this smaller feature space.
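As a hedged sketch of this idea (the mask set, image size, and helper names here are illustrative choices, not a prescribed recipe), one could compute a handful of shape-response features and classify by the nearest reference in that feature space:

import numpy as np

H, W = 32, 32   # assumed uniform size of the binarized character images (pixels in {0, 1})

def make_masks():
    top = np.zeros((H, W));    top[0:4, :] = 1                # upper horizontal stroke
    bottom = np.zeros((H, W)); bottom[-4:, :] = 1             # lower horizontal stroke
    left = np.zeros((H, W));   left[:, 0:4] = 1               # left vertical stroke
    right = np.zeros((H, W));  right[:, -4:] = 1              # right vertical stroke
    middle = np.zeros((H, W)); middle[H//2-2:H//2+2, :] = 1   # middle horizontal stroke
    return {'top': top, 'bottom': bottom, 'left': left, 'right': right, 'middle': middle}

MASKS = make_masks()

def features(img):
    # "Response" of the image on each mask: fraction of the mask covered by character pixels.
    return np.array([(img * m).sum() / m.sum() for m in MASKS.values()])

def classify(img, templates):
    # templates: dict mapping each character to its binary template image of the same size.
    f = features(img)
    return min(templates, key=lambda c: np.linalg.norm(f - features(templates[c])))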
If I pass a sentence containing 5 words to the Doc2Vec model and the size is 100, there are 100 vectors. I don't understand what those vectors are. If I increase the size to 200, there are 200 vectors for just a simple sentence. Please tell me how those vectors are calculated.
When using a size=100, there are not "100 vectors" per text example – there is one vector, which includes 100 scalar dimensions (each a floating-point value, like 0.513 or -1.301).
Note that the values represent points in 100-dimensional space, and the individual dimensions/axes don't have easily-interpretable meanings. Rather, it is only the relative distances and relative directions between individual vectors that have useful meaning for text-based applications, such as assisting in information-retrieval or automatic classification.
The method for computing the vectors is described in the paper 'Distributed Representations of Sentences and Documents' by Le & Mikolov. It is closely related to the 'word2vec' algorithm, so understanding that first may help, for example via word2vec's first and second papers. If academic papers aren't your style, searches like [word2vec tutorial], [how does word2vec work], or [doc2vec intro] should find more casual introductory descriptions.
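If it helps to see this concretely, a small sketch with Gensim's Doc2Vec (Gensim 4.x API assumed, toy corpus as a placeholder) shows that each text gets one vector whose length equals the configured size:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=['the', 'cat', 'sat', 'on', 'the', 'mat'], tags=[0]),
        TaggedDocument(words=['dogs', 'chase', 'cats'], tags=[1])]

model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

print(model.dv[0].shape)   # (100,) - one vector with 100 dimensions, not 100 vectors
new_vec = model.infer_vector(['a', 'cat', 'and', 'a', 'dog'])
print(new_vec.shape)       # (100,)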
I am going through this paper http://cs.stanford.edu/~quocle/paragraph_vector.pdf
and it states that:
" Theparagraph vector and word vectors are averaged or concatenated
to predict the next word in a context. In the experiments, we use
concatenation as the method to combine the vectors."
How does concatenation or averaging work?
example (if paragraph 1 contains word1 and word2):
word1 vector =[0.1,0.2,0.3]
word2 vector =[0.4,0.5,0.6]
concat method
does paragraph vector = [0.1+0.4,0.2+0.5,0.3+0.6] ?
Average method
does paragraph vector = [(0.1+0.4)/2,(0.2+0.5)/2,(0.3+0.6)/2] ?
Also, describing its PV-DM figure, the paper states:
"The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM)."
Is the paragraph token equal to the paragraph vector which is equal to on?
How does concatenation or averaging work?
You got it right for the average. The concatenation is: [0.1,0.2,0.3,0.4,0.5,0.6].
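In NumPy terms (just restating the arithmetic above):

import numpy as np

word1 = np.array([0.1, 0.2, 0.3])
word2 = np.array([0.4, 0.5, 0.6])

print(np.concatenate([word1, word2]))    # [0.1 0.2 0.3 0.4 0.5 0.6]  (length doubles)
print(np.mean([word1, word2], axis=0))   # [0.25 0.35 0.45]           (length stays 3)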
Is the paragraph token equal to the paragraph vector which is equal to on?
The "paragraph token" is mapped to a vector that is called "paragraph vector". It is different from the token "on", and different from the word vector that the token "on" is mapped to.
A simple (and sometimes useful) vector for a range of text is the sum or average of the text's words' vectors – but that's not what the 'Paragraph Vector' of the 'Paragraph Vectors' paper is.
Rather, the Paragraph Vector is another vector, trained similarly to the word vectors, which is also adjusted to help in word-prediction. These vectors are combined (or interleaved) with the word vectors to feed the prediction model. That is, the averaging (in DM mode) includes the PV alongside word-vectors - it doesn't compose the PV from word-vectors.
In the diagram, 'on' is the target word being predicted by a combination of the closely-neighboring words and the full example's PV, which may perhaps be informally thought of as a special pseudoword that ranges over the entire text example and participates in all the sliding 'windows' of real words.
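If you want to experiment with the two combination modes, Gensim's Doc2Vec exposes them as parameters; a hedged sketch (Gensim 4.x parameter names assumed, toy corpus as a placeholder):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(['the', 'cat', 'sat', 'on', 'the', 'mat'], [0]),
        TaggedDocument(['dogs', 'chase', 'cats'], [1])]

# PV-DM, averaging the paragraph vector with the context word vectors:
pv_dm_mean = Doc2Vec(docs, dm=1, dm_mean=1, vector_size=100, window=2,
                     min_count=1, epochs=40)

# PV-DM, concatenating the paragraph vector with the context word vectors
# (a larger model; concatenation is what the paper reports using):
pv_dm_concat = Doc2Vec(docs, dm=1, dm_concat=1, vector_size=100, window=2,
                       min_count=1, epochs=40)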