Probability of the next generated character sequence - nlp

I am new to language processing, so apologies if this looks like a very basic question.
Given a training sequence, for example "aaabbcddecbbaaaaabbabbbabccddbbcdaaaaaa" (the real sequence is much longer), I can use a recurrent neural network such as an LSTM to learn patterns and dependencies in the sequence and generate the next characters (a single character or several characters). For example, feeding a sample sequence "aaabb" will generate "c". It is worth mentioning that my alphabet contains only 6 ordered characters: {a,b,c,d,e,f}.
My question is: how do I compute the probability of a particular combination of next characters? For example, given a sequence "aabcdcbbaa", what is the probability of obtaining "cc"?
Many thanks in advance!
UPD
While writing the question, I realised that the probability of a combination of next characters might be computed as a "tensor product" of the single-character probabilities. What I mean is: given a test sample, the LSTM outputs a vector (through the softmax function) with the probability of each character, and these probabilities are then converted into a single character (the most probable outcome). For example, the sequence "aabcdcbbaa" will generate a 6-dimensional vector p1 = (0.1, 0.07, 0.23, 0.15, 0.31, 0.14), which corresponds to the characters (a, b, c, d, e, f). Then, by using each of these characters in turn, we can compute the probabilities of the next (second) character, p2. Finally, by multiplying these two probability vectors, p1 × p2, we can compute the joint probability of obtaining two characters: aa, ab, ac, ad, ...
Am I correct?
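One way to read the update above is as the chain rule of probability: P(c1 c2 | context) = P(c1 | context) * P(c2 | context + c1). Note that p2 is a different vector for each choice of the first character, so the joint distribution is a 6x6 table built row by row. The sketch below illustrates this under stated assumptions: `next_char_probs` is a hypothetical stand-in for the trained LSTM's softmax output (here a toy rule that only looks at the last character), and `sequence_prob` is an illustrative helper name, not part of any library.

```python
from itertools import product

ALPHABET = "abcdef"


def next_char_probs(context):
    """Hypothetical stand-in for a trained model's softmax output.

    Returns P(next char | context) as a dict. This toy version only looks at
    the last character; a real LSTM would condition on the whole context.
    """
    last = context[-1] if context else "a"
    # Toy rule (an assumption for illustration): repeating the last character
    # is three times as likely as any other character.
    weights = {c: (3.0 if c == last else 1.0) for c in ALPHABET}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def sequence_prob(context, continuation):
    """P(continuation | context), computed with the chain rule."""
    prob = 1.0
    for ch in continuation:
        prob *= next_char_probs(context)[ch]
        context += ch  # the next step is conditioned on the character just consumed
    return prob


# Joint distribution over all 36 two-character continuations of the context.
context = "aabcdcbbaa"
joint = {a + b: sequence_prob(context, a + b) for a, b in product(ALPHABET, repeat=2)}

print(joint["cc"])          # probability of the continuation "cc" under the toy model
print(sum(joint.values()))  # the joint probabilities sum to 1 (up to rounding)
```

With a real model, `next_char_probs` would run the network on the extended context at each step; for long continuations it is common to sum log-probabilities instead of multiplying probabilities, to avoid numerical underflow.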

Related

How to convert vectors of varying lengths to one scalar?

I am trying to calculate the emotion of paragraphs by measuring the emotion of their individual sentences. But this ends in vectors of varying length, as a paragraph might be as short as 1 sentence or as long as 30 sentences. How would you suggest converting these vectors to scalars?
The first option is taking the average, but this biases the results: it turns out that shorter paragraphs get a higher score and longer ones a score around the mean.
The second option is summing up the values, but this biases the results again, as longer paragraphs will have larger scores.
The third option is the method used in VADER, which sums the values and then normalizes the sum, but I could not find a reliable resource that explains how the results are normalized. The only thing I found is the following formula from the VADER code:
norm_score = score / math.sqrt((score * score) + alpha)
VADER sets alpha to 15, but how should this number be changed, and based on what? Also, where does this normalization method come from?
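To see what the quoted formula does, here is a minimal sketch that wraps it in a function and evaluates it for a few summed scores. The function name `normalize` and the sample values are illustrative assumptions; only the formula itself and alpha = 15 come from the question. Algebraically, the output always lies strictly between -1 and 1, and it equals 1/sqrt(2) (about 0.707) exactly when the score equals sqrt(alpha), so alpha sets the scale at which the output starts to saturate.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded summed score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Larger alpha means the output saturates more slowly as the sum grows.
for alpha in (5.0, 15.0, 50.0):
    row = [round(normalize(s, alpha), 3) for s in (0.5, 2.0, 5.0, 20.0)]
    print(alpha, row)
```

One possible way to choose alpha under this reading: pick it so that sqrt(alpha) is close to a "clearly emotional" summed score in your own data, since that score will map to roughly 0.707.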

Using Word Embeddings to find similarity between documents with certain words having more weight

Using word embeddings, I am calculating the similarity distance between 2 paragraphs, where the distance between 2 paragraphs is the sum of Euclidean distances between the vectors of 2 words, 1 from each paragraph.
The larger this sum, the less similar the 2 documents are.
How can I assign preferences/weights to certain words while calculating this similarity distance?
It sounds like you've improvised your own paragraph-to-paragraph distance measure based on doing (lots of?) word-to-word distances.
Are you picking the words for each word-to-word comparison randomly, and doing it a lot to find the overall difference?
One naive measure that works better than nothing is to average all the word vectors in a paragraph to get a single vector for the paragraph. You could quite easily overweight words there by assigning each word a weight: 1.0 by default (for a normal average), but larger for words you want to emphasize.
Another more sophisticated comparison based on word-vectors is "Word Mover's Distance" - it essentially considers each word to be a "pile of meaning", and then finds the minimal pairwise "moves" to transform one paragraph (as a bag-of-words) into another. (It's available in Python gensim as wmdistance(), and in other libraries.) It's quite a bit more expensive to calculate, though, especially as a function of text word count.
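The weighted-average idea from the answer can be sketched as follows. This is a minimal pure-Python illustration, not a definitive implementation: the tiny 2-dimensional `vectors` table, the `weights` dict, and the helper names `weighted_mean_vector` and `euclidean` are all assumptions made up for the example. In practice the vectors would come from a trained embedding model.

```python
import math


def weighted_mean_vector(words, vectors, weights=None, default_weight=1.0):
    """Weighted average of the word vectors of a paragraph (bag of words)."""
    weights = weights or {}
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(word, default_weight)
        for i, x in enumerate(vectors[word]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0:
        raise ValueError("paragraph contains no known words")
    return [x / weight_sum for x in total]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 2-d "embeddings" (made up for illustration only).
vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
weights = {"cat": 3.0}  # overweight "cat"; every other word keeps weight 1.0

p1 = weighted_mean_vector(["cat", "car"], vectors, weights)
p2 = weighted_mean_vector(["dog", "car"], vectors, weights)
print(p1, p2, euclidean(p1, p2))
```

Setting a word's weight to 0 removes it from the average entirely, which is one simple way to ignore stop words.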

How word2vec output vectors are used to compute the similarities?

I am a bit confused about the interpretation of word2vec output vectors!
If I want to predict the most probable word that will appear after a specific word (w1), can I use the nearest word to w1?
I mean, can the word with the shortest distance from w1 be interpreted as the next word with the highest probability?
"If I want to predict the most probable word that will appear after a specific word (w1),"
This is called language modeling.
"can I use the nearest word to w1? I mean, can the word with the shortest distance from w1 be interpreted as the next word with the highest probability?"
No: the nearest word to w1 is the word most semantically similar to w1, which is not necessarily the word most likely to follow it.
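To make the distinction concrete, the sketch below shows about the simplest form of language modeling: a bigram model that estimates P(next word | w1) from counts of which word follows which. The toy corpus and the function names `train_bigram_counts` and `next_word_probs` are assumptions for illustration; this is not how word2vec itself works, and a practical model would need far more data and some smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def next_word_probs(follows, word):
    """Maximum-likelihood estimate of P(next word | word)."""
    counts = follows.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


tokens = "the cat sat on the mat the cat ran".split()
follows = train_bigram_counts(tokens)
probs = next_word_probs(follows, "the")
print(probs)                        # distribution over words seen after "the"
print(max(probs, key=probs.get))    # most probable next word after "the"
```

Here the most probable word after "the" is a noun that tends to follow it, whereas a nearest-neighbour lookup in an embedding space would typically return words that behave like "the".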

Is there a standard metric for sorted text?

Given a range of numbers, say from [80,240], it is easy to determine how much of that range lies within [100,105]: (105-100)/(240-80) = 5/160 = .03125. Easy.
So now, how much of a Merriam-Webster dictionary lies between "umbrella" and "velvet"? Even if we assume a uniform distribution of text across the corpus, is there a standard metric for text?
I don't think there is a standard for that. If you had all the entries from Merriam-Webster in a sorted array, you could use the first and last positions as the bounds, so you would have a set going from 1 to n. Then you could take the positions of "umbrella" and "velvet", call them x and y, and calculate your range as (y - x + 1) / n.
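The position-based calculation above can be sketched as follows, assuming the entries are held in a sorted Python list and both words are actual entries. The short word list and the function name `fraction_between` are illustrative assumptions; a real run would load the full list of dictionary headwords.

```python
from bisect import bisect_left


def fraction_between(sorted_words, low, high):
    """Fraction of entries from `low` to `high` inclusive: (y - x + 1) / n."""
    n = len(sorted_words)
    x = bisect_left(sorted_words, low)   # position of `low` via binary search
    y = bisect_left(sorted_words, high)  # position of `high` via binary search
    if x >= n or sorted_words[x] != low or y >= n or sorted_words[y] != high:
        raise KeyError("both words must be entries in the list")
    return (y - x + 1) / n


# Tiny stand-in for the dictionary's headword list (made up for illustration).
words = sorted(["apple", "banana", "umbrella", "under", "unit", "velvet", "yak", "zebra"])
print(fraction_between(words, "umbrella", "velvet"))
```

Because the positions are found by binary search, each lookup costs O(log n) comparisons even for a list of several hundred thousand entries.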
That works if you treat words as elements of an ordered set, so that they behave like real numbers. You are basically dividing the distance between two numbers in a set by the distance between the boundaries of the set. Some forms of algebra deal with them differently - when calculating the Levenshtein distance between any two given words, for example, each word is seen as a vector with as many dimensions as it has characters.
You could define the boundaries of your n-dimensional space by using the longest word in Merriam-Webster (hint: it's "pneumonoultramicroscopicsilicovolcanoconiosis", so your space would have 45 dimensions). However, when considering any A-B pair of words, a third word C of intermediate length may or may not be between those, depending on the operations involved in the transformation from A to B.
You'd have to check every word with a length between that of A and B to see whether it is part of the range between A and B... So it's not a matter of simple calculation, and I don't know whether this would even be feasible with a regular computer nowadays. And that's just considering Merriam-Webster's close to half a million entries.

How to calculate probabilities from confusion matrices? need denominator, chars matrices

This paper contains confusion matrices for spelling errors in a noisy channel. It describes how to correct the errors based on conditional probabilities.
The conditional probability computation is on page 2, left column. In footnote 4, page 2, left column, the authors say: "The chars matrices can be easily replicated, and are therefore omitted from the appendix." I cannot figure out how they can be replicated!
How do I replicate them? Do I need the original corpus? Or did the authors mean they could be recomputed from the material in the paper itself?
Looking at the paper, you just need to calculate them using a corpus, either the same one or one relevant to your application.
In replicating the matrices, note that they implicitly define two different chars matrices: a vector and an n-by-n matrix. For each character x, the vector chars contains a count of the number of times the character x occurred in the corpus. For each character sequence xy, the matrix chars contains a count of the number of times that sequence occurred in the corpus.
chars[x] represents a look-up of x in the vector; chars[x,y] represents a look-up of the sequence xy in the matrix. Note that chars[x] = the sum over chars[x,y] for each value of y.
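The two counts described above can be sketched with Python's `collections.Counter`. This is a minimal illustration on a toy string, not the paper's own code; the function name `char_counts` and the variable names `chars1` and `chars2` are assumptions. One small caveat visible in the example: the row sum over chars[x,y] is one less than chars[x] for the very last character of the corpus, because that occurrence has no following character.

```python
from collections import Counter


def char_counts(text):
    """Return (single-character counts, character-pair counts) for a corpus."""
    unigrams = Counter(text)                 # plays the role of chars[x]
    bigrams = Counter(zip(text, text[1:]))   # plays the role of chars[x, y]
    return unigrams, bigrams


text = "abracadabra"  # toy corpus; a real run would read the corpus text
chars1, chars2 = char_counts(text)

print(chars1["a"])          # lookup of a single character
print(chars2[("a", "b")])   # lookup of the sequence "ab"

# Row sum: total count of pairs that start with "a".
row_sum_a = sum(c for (x, _), c in chars2.items() if x == "a")
print(row_sum_a)            # one less than chars1["a"], since the text ends in "a"
```

On a corpus of millions of characters this off-by-one at the end of the text is negligible.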
Note that their counts are all based on the 1988 AP Newswire corpus (available from the LDC). If you can't use their exact corpus, I don't think it would be unreasonable to use another text from the same genre (i.e. another newswire corpus) and scale your counts such that they fit the original data. That is, the frequency of a given character shouldn't vary too much from one text to another if they're similar enough, so if you've got a corpus of 22 million words of newswire, you could count characters in that text and then double them to approximate their original counts.
