RNN Implementation - python-3.x

I am going to implement an RNN using PyTorch. Before that, I am having some difficulty understanding the character-level one-hot encoding that the question asks for.
Please find the question below:
Choose the text you want your neural network to learn, but keep in mind that your
data set must be quite large in order to learn the structure! RNNs have been trained
on highly diverse texts (novels, song lyrics, Linux Kernel, etc.) with success, so you
can get creative. As one easy option, Gutenberg Books is a source of free books where
you may download full novels in a .txt format.
We will use a character-level representation for this model. To do this, you may use
extended ASCII with 256 characters. As you read your chosen training set, you will
read in the characters one at a time into a one-hot-encoding, that is, each character
will map to a vector of ones and zeros, where the one indicates which of the characters
is present:
char → [0, 0, · · · , 1, · · · , 0, 0]
Your RNN will read in these length-256 binary vectors as input.
So, for example, I have read a novel in Python. It has 97 unique characters and roughly 300,000 characters in total.
So, will my input be a 97 x 256 one-hot encoded matrix?
Or will it be a 300,000 x 256 one-hot encoded matrix?

One-hot encoding means each vector has a single 1, in a position unique to the symbol it represents. So if you have 97 unique characters, I think you should use one-hot vectors of size 98 (97 + 1). The extra position maps all unknown characters to a single vector. But you can also use length-256 vectors, as the assignment suggests. So your input will be:
B x N x V (B = batch size, N = number of characters in the sequence, V = one-hot vector size).
If you are using libraries, they usually ask for the index of each character in the vocabulary and handle the index-to-one-hot conversion themselves. Hope that helps.
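To make the shapes concrete, here is a minimal sketch in plain Python. The helper name one_hot_encode and the sample string are illustrative assumptions, not part of the assignment. It produces one length-256 row per character of the text, so a text of N characters gives an N x 256 matrix, however many unique characters it contains.

```python
def one_hot_encode(text, vocab_size=256):
    """Return one length-`vocab_size` one-hot row per character of `text`."""
    rows = []
    for ch in text:
        code = ord(ch)
        if code >= vocab_size:
            raise ValueError(f"character {ch!r} is outside the {vocab_size}-symbol range")
        row = [0] * vocab_size
        row[code] = 1  # the single 1 marks which character is present
        rows.append(row)
    return rows


sample = "hello"  # stand-in for the full novel
encoded = one_hot_encode(sample)
print(len(encoded), len(encoded[0]))  # prints: 5 256
```

For the novel in the question, the same function would return 300,000 rows of length 256. A batch dimension B is then added by cutting that long sequence into chunks.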

Related

Subword vector in fastText?

I can't figure out what a subword input vector is. I read in the paper that the subword is hashed: the subword is the hash code, and a hash code is a number, not a vector.
Ex: Input vector of word eating is [0,0,0,1,0,0,0,0,0]
So what is the input vector of subwords "eat", "ati", "ing",...?
Link paper: https://arxiv.org/pdf/1607.04606.pdf
the subword is the hash code, hash code is a number, not a vector
The FastText subwords are, as you've suggested, fragments of the full word. For the purposes of subword creation, FastText will also prepend/append special start-of-word and end-of-word characters. (If I recall correctly, it uses < & >.)
So, for the full word token 'eating', it is considered as '<eating>'.
All the 3-character subwords would be '<ea', 'eat', 'ati', 'tin', 'ing', 'ng>'.
All the 4-character subwords would be '<eat', 'eati', 'atin', 'ting', 'ing>'.
All the 5-character subwords would be '<eati', 'eatin', 'ating', 'ting>'.
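The subword lists above can be generated mechanically. The following is a small sketch, not FastText's own code: the function name char_ngrams and the 3-to-5 length range are assumptions chosen to mirror this example.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """List every character n-gram of '<word>' for n in [min_n, max_n]."""
    wrapped = f"<{word}>"  # add start-of-word and end-of-word markers
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


grams = char_ngrams("eating")
print(grams[:6])   # the 3-character subwords
print(len(grams))  # prints: 15
```

The wrapped token '<eating>' has 8 characters, so it yields 6 subwords of length 3, 5 of length 4, and 4 of length 5.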
I see you've written out a "one-hot" representation of the full word 'eating' – [0,0,0,1,0,0,0,0,0] – as if 'eating' is the 4th word in a 9-word vocabulary. While diagrams & certain ways of thinking about the underlying model may consider such a one-hot vector, it's useful to realize that in actual code implementations, such a sparse one-hot vector for words is never actually created.
Instead, it's just represented as a single number – the index to the non-zero number. That's used as a lookup into an array of vectors of the configured 'dense' size, returning one input word-vector of that size for the word.
For example, imagine you have a model with a 1-million word known vocabulary, which offers 100-dimensional 'dense embedding' word-vectors. The word 'eating' is the 543,210th word.
That model will have an array of input vectors with one million slots, and each slot has a 100-dimensional vector in it. We could call it word_vectors_in. The vector for 'eating' will be at word_vectors_in[543209] (because the 1st vector is at word_vectors_in[0]).
At no point during the creation/training/use of this model will an actual 1-million-long one-hot vector for 'eating' be created. Most often, it'll just be referred-to inside the code as the word-index 543209. The model will have a helper lookup dictionary/hashmap, let's call it word_index that lets code find the right slot for a word. So word_index['eating'] will be 543209.
OK, now to your actual question, about the subwords. I've detailed above how the single vectors for each known full word are stored, in order to contrast that with the different way subwords are handled.
Subwords are also stored in a big array of vectors, but that array is treated as a collision-oblivious hashtable. That is, by design, many subwords can and do all reuse the same slot.
Let's call that big array of subword vectors subword_vector_in. Let's also make it 1 million slots long, where each slot has a 100-dimensional vector.
But now, there is no dictionary that remembers which subwords are in which slots - for example, remembering that subword '<eat' is in arbitrary slot 78789.
Instead, the string '<eat' is hashed to a number, that number is restricted to the possible indexes into the subwords, and the vector at that index, let's say it's 12344, is used for the subword.
And then when some other subword comes along, maybe '<dri', it might hash to the exact-same 12344 slot. And that same vector then gets adjusted for that other subword (during training), or returned for both those subwords (and possibly many others) during later FastText-vector synthesis from the final model.
Notably, now even if there are far more than 1-million unique subwords, they can all be represented inside that single 1-million slot array, albeit with collisions/interference.
In practice, the collisions are tolerable because many collisions from very-rare subwords essentially just fuzz slots with lots of random noise that mostly cancels out. The most-common subwords tend to carry distinctive meaning, because of the way word-roots/prefixes/suffixes hint at word meaning in English & similar languages. Those very-common examples overpower the other noise, and ensure that a slot, for at least one of its most-common subwords, carries at least some hint of the subword's implied meaning(s).
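The slot lookup described above can be sketched as follows. Everything here is an assumption for illustration: the hash is a generic FNV-1a stand-in rather than necessarily FastText's exact function, and the table has only 8 slots so that collisions are easy to see. Python's built-in hash() is avoided because it is salted per process for strings.

```python
def fnv1a_32(text):
    """Deterministic 32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


NUM_SLOTS = 8  # tiny on purpose; a real table would have millions of slots


def slot_hash(subword, num_slots=NUM_SLOTS):
    """Map a subword to a slot index; no dictionary of subwords is kept."""
    return fnv1a_32(subword) % num_slots


subwords = ["<ea", "eat", "ati", "tin", "ing", "ng>",
            "<dr", "dri", "rin", "ink", "nk>"]
slots = {}
for sw in subwords:
    slots.setdefault(slot_hash(sw), []).append(sw)

for slot, members in sorted(slots.items()):
    print(slot, members)  # any slot listing 2+ subwords is a collision
```

With 11 subwords and 8 slots, at least one slot must be shared, which is exactly the collision-oblivious behaviour the answer describes.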
So when FastText assembles its final word-vector, by adding:
word_vectors_in[word_index['eating']] # learned known-word vector
+ subword_vector_in[slot_hash('<ea')] # 1st 3-char subword
+ subword_vector_in[slot_hash('eat')]
+ subword_vector_in[slot_hash('ati')]
... # other 3-char subwords
... # every 4-char subword
... # other 5-char subwords
+ subword_vector_in[slot_hash('ting>')] # last 5-char subword
…it gets something that's dominated by the (likely stronger-in-magnitude) known full-word vector, with some useful hints of meaning also contributed by the (probably lower-magnitude) many noisy subword vectors.
And then if we were to imagine that some other word that's not part of the known 1-million word vocabulary comes along, say 'eatery', it has nothing from word_vectors_in for the full word, but it can still do:
subword_vector_in[slot_hash('<ea')] # 1st 3-char subword
+ subword_vector_in[slot_hash('eat')]
+ subword_vector_in[slot_hash('ate')]
... # other 3-char subwords
... # every 4-char subword
... # other 5-char subwords
+ subword_vector_in[slot_hash('tery>')] # last 5-char subword
Because at least a few of those subwords likely include some meaningful hints of the meaning of the word 'eatery' – especially meanings around 'eat' or even the venue/vendor aspects of the suffix -tery, this synthesized guess for an out-of-vocabulary (OOV) word will be better than a random vector, & often better than ignoring the word entirely in whatever upper-level process is using the FastText vectors.
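Putting the pieces together, here is a toy sketch of the synthesis step for a known word and an OOV word. It follows the plain sum described in this answer; the real library may normalise differently. The table sizes, random initial values, helper names, and the FNV-1a stand-in hash are all assumptions for illustration.

```python
import random

DIM = 4          # toy vector size (the answer imagines 100)
NUM_SLOTS = 16   # toy subword table size (the answer imagines 1 million)
rng = random.Random(0)


def fnv1a_32(text):
    """Deterministic stand-in hash; not necessarily FastText's own."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def slot_hash(subword):
    return fnv1a_32(subword) % NUM_SLOTS


def char_ngrams(word, min_n=3, max_n=5):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


# Known full words get their own slot via a dictionary.
word_index = {"eating": 0, "drinking": 1}
word_vectors_in = [[rng.uniform(-1, 1) for _ in range(DIM)]
                   for _ in word_index]
# Subwords share slots via hashing; no dictionary is kept.
subword_vector_in = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
                     for _ in range(NUM_SLOTS)]


def synthesize(word):
    """Sum the full-word vector (if known) and all subword vectors."""
    total = [0.0] * DIM
    if word in word_index:
        total = list(word_vectors_in[word_index[word]])
    for gram in char_ngrams(word):
        vec = subword_vector_in[slot_hash(gram)]
        total = [t + v for t, v in zip(total, vec)]
    return total


known = synthesize("eating")  # full-word vector plus subword vectors
oov = synthesize("eatery")    # subword vectors only
print(known)
print(oov)
```

Because 'eating' and 'eatery' share subwords such as '<ea' and 'eat', part of the OOV vector comes from the same slots that were trained on the known word.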

Can we segregate gibberish from meaningful sentences just by looking at the features of the 512 dimensional Universal Sentence Encoder Vector?

Universal Sentence Encoder encodes sentences into a vector of 512 features. My proposition is that if a sentence is gibberish, then most of the features will be very close to zero, whereas if a sentence has meaning, some of the 512 features will be much greater than or much less than zero. Can we then decide which vector encodes meaning and which encodes gibberish just by looking at the distribution of the vector's feature values?
It seems that USE encodes features in a very arbitrary fashion. I conducted a lot of experiments and saw that the features scaled up and down without regard to whether the sentence was gibberish or meaningful. The experiments included counting the number of positive and negative features in meaningful and gibberish vectors, and computing the mean and standard deviation of the features. But nothing showed any pattern that could delineate the two.
Around 30 samples were taken, and no pattern in the count of positive and negative features, the standard deviation, or the mean was observed that could separate a gibberish USE vector from a meaningful one.
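For reference, the per-vector statistics mentioned here can be computed as in the sketch below. The 512 values are random placeholders, not real USE output, and the function name summarize is hypothetical. Obtaining actual embeddings would require loading the encoder itself, which is not shown.

```python
import random
import statistics


def summarize(vector):
    """Count positive/negative features and compute mean and std deviation."""
    return {
        "positives": sum(1 for x in vector if x > 0),
        "negatives": sum(1 for x in vector if x < 0),
        "mean": statistics.mean(vector),
        "stdev": statistics.pstdev(vector),
    }


rng = random.Random(0)
# Placeholder 512-dimensional vector standing in for a sentence embedding.
placeholder = [rng.gauss(0, 0.05) for _ in range(512)]
stats = summarize(placeholder)
print(stats)
```

Running this on embeddings of meaningful and gibberish sentences side by side is the comparison described above.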

Adding noise to genomic data having discrete values (A, G, T, C)

Since genomic sequences vary greatly in length, I have been trying to work on using denoising autoencoders to get a compact representation for any given sequence. My expected input is a sequence of nucleotides (letters - A, G, T, C), for example, "AAAAGGAATTTCTCTGGGG....".
For images, adding noise is easy since it's a continuous space. But in a discrete scenario such as this, what would be a good strategy for adding noise to my input?
My first thought is to randomly replace some of the nucleotides with "N", which means that the nucleotide at that position couldn't be identified accurately during sequencing. But changing even one nucleotide leads to a completely different sequence, unlike images, where adding a small amount of noise doesn't change how the image looks visually. Please let me know if this is right or if there's a better way that I am not aware of.
I'm not sure if this will help you or further complicate your issue, but in biology people normally use FASTQ files to store biological sequences and their corresponding Phred quality scores. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.
For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
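The relationship behind that example is P = 10^(-Q/10), where Q is the Phred score and P is the probability that the base call is wrong. A short sketch, with illustrative function names:

```python
import math


def phred_to_error_prob(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse mapping: Phred score for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 30 gives an error probability of 0.001, which is the 1-in-1000 figure quoted above.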
So you can add noise to the Phred quality scores (i.e. the probabilities that the base calling is correct) without changing the sequence.
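One way this idea could look in code is sketched below. It assumes the common Phred+33 FASTQ encoding (quality character = chr(Q + 33)). The jitter range, the clamping bounds, and the example read are made-up values for illustration, not a recommended noise model.

```python
import random


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def jitter_scores(scores, max_shift=5, q_min=0, q_max=41, rng=None):
    """Add bounded integer noise to each score, clamped to [q_min, q_max]."""
    rng = rng or random.Random()
    return [min(q_max, max(q_min, q + rng.randint(-max_shift, max_shift)))
            for q in scores]


sequence = "AAAAGGAATT"   # the bases themselves are left untouched
quality = "IIIIHHGGFF"    # one quality character per base
scores = decode_phred33(quality)
noisy = jitter_scores(scores, rng=random.Random(0))
print(scores)
print(noisy)
```

The autoencoder input would then pair each unchanged nucleotide with its perturbed quality score, so the noise lives in a numeric channel rather than in the letters.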
Also see this paragraph about current work done on compressing FASTQ files.

How combine word embedded vectors to one vector?

I understand the meaning and methods of word embedding (skip-gram, CBOW). I also know that Google has a word2vec API that, given a word, produces its vector.
But my problem is this: we have a clause that includes a subject, object, verb, and so on, and each word has already been embedded by the Google API. How can we combine these vectors to create a single vector that represents the clause?
Example:
Clause: V = "dog bites man"
After word embedding by Google, we have V1, V2, V3, which map to dog, bites, and man respectively, and we know that:
V = V1 + V2 + V3
How can we compute V?
I would appreciate it if you could explain with an example using real vectors.
A vector is basically just a list of numbers. You add vectors by adding the number in the same position in each list together. Here's an example:
a = [1, 2, 3]
b = [4, 5, 6]
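c = a + b # vector addition (mathematical notation; in Python, + on lists concatenates)
c is [(1+4), (2+5), (3+6)], or [5, 7, 9]
As indicated in this question, a simple way to do this in Python is like this:
list(map(sum, zip(a, b))) # list() is needed in Python 3, where map returns an iterator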
Vector addition is part of linear algebra. If you don't understand operations on vectors and matrices the math around word vectors will be very hard to understand, so you may want to look into learning more about linear algebra in general.
Normally adding word vectors together is a good way to approximate a sentence vector, since for any given set of words there's an obvious order. However, your example of Dog bites man and Man bites dog shows the weakness of adding vectors - the result doesn't change based on word order, so the results for those two sentences would be the same, even though their meanings are very different.
For methods of getting sentence vectors that are affected by word order, look into doc2vec or the just-released InferSent.
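The order-insensitivity of addition is easy to check. The three-dimensional vectors below are invented toy values, not real word2vec output, and the helper names are illustrative.

```python
def add_vectors(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


# Made-up toy embeddings; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "dog":   [0.9, 0.1, 0.3],
    "bites": [0.2, 0.8, 0.5],
    "man":   [0.7, 0.2, 0.6],
}


def clause_vector(clause):
    """Sum the vectors of the words in a whitespace-separated clause."""
    return add_vectors(*(toy_vectors[word] for word in clause.split()))


v1 = clause_vector("dog bites man")
v2 = clause_vector("man bites dog")
# Compare with a tolerance, since float addition order can differ slightly.
print(all(abs(a - b) < 1e-9 for a, b in zip(v1, v2)))  # prints: True
```

Both clauses produce the same vector, which is why order-aware methods are needed when word order carries meaning.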
Two solutions:
Use vector addition of the constituent words of a phrase - this typically works well because addition is a good estimation of semantic composition.
Use paragraph vectors, which can encode an arbitrary-length sequence of words as a single vector.
In this paper: https://arxiv.org/pdf/2004.07464.pdf
the authors combine the image embedding and the text embedding by concatenating them:
X = TE + IE
Here X is the fusion embedding, TE and IE are the text and image embeddings respectively, and + denotes concatenation rather than element-wise addition.
If your TE and IE each have dimension 2048, say, your X will have length 2*2048 = 4096. You can use this directly if that is feasible, or, if you want to reduce the dimension, you can use t-SNE/PCA or https://arxiv.org/abs/1708.03629 (implemented here: https://github.com/vyraun/Half-Size).
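As a minimal illustration of concatenation versus addition, using constant placeholder values rather than real embeddings:

```python
# Placeholder embeddings; real ones would come from text and image models.
text_embedding = [0.1] * 2048
image_embedding = [0.2] * 2048

# Concatenation: the two vectors are laid end to end.
fused = text_embedding + image_embedding  # on Python lists, + concatenates
print(len(fused))  # prints: 4096

# Element-wise addition, by contrast, keeps the original dimension.
summed = [t + i for t, i in zip(text_embedding, image_embedding)]
print(len(summed))  # prints: 2048
```

Concatenation keeps both sources separate at the cost of a larger vector, which is why a dimensionality-reduction step is sometimes applied afterwards.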

Tensorflow : RNN with char input

Suppose I want to train an RNN on pseudo-random words (not part of any dictionary) so I can't use word2vec. How can I represent each char in the word using tensorflow?
If you are just doing characters, you can use a one-hot vector of size 128, which can represent every ASCII character (you may want to use a smaller size, since I doubt you will use all ASCII characters; maybe just 26, one for each letter). You don't really need to use anything like word vectors since the range of possibilities is small.
Actually, when you use one-hot encodings you are effectively learning vectors for each character. Say your first dense layer (or RNN layer) contains 100 neurons. Its weights then form a 128x100 matrix that is multiplied with the one-hot encoding. Since all but one of the values in the encoding are zero, you are essentially selecting a single row of size 100 from the matrix, and that row is a vector representation of that character. Essentially, that first matrix is just a list of the vectors which represent each character, and your model will learn these vector representations. Due to the sparseness of the one-hot encodings, it is often faster to just look up the row rather than carry out the full matrix multiply. This is what the tf.nn.embedding_lookup or tf.gather function is used for.
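The equivalence between the one-hot multiply and a row lookup can be verified in plain Python. The weights below are random placeholders and the helper names are illustrative. In TensorFlow the lookup side would be done by tf.nn.embedding_lookup or tf.gather, as mentioned above.

```python
import random

VOCAB_SIZE = 128  # one slot per ASCII character
HIDDEN = 100      # neurons in the first layer
rng = random.Random(0)

# Placeholder 128x100 weight matrix; in a real model these are learned.
weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN)]
           for _ in range(VOCAB_SIZE)]


def one_hot(index, size=VOCAB_SIZE):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def vector_matrix_product(vec, matrix):
    """Compute vec @ matrix for a 1 x V vector and a V x H matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


char_index = ord("a")
via_matmul = vector_matrix_product(one_hot(char_index), weights)
via_lookup = weights[char_index]  # just pick the row
print(all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup)))  # prints: True
```

The multiply touches all 128 x 100 weights to produce a result that the lookup gets by reading 100 of them, which is the speed advantage the answer refers to.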
