CMU Sphinx4 - Custom Language Model - cmusphinx

I have a very specific requirement. I am working on an application which will allow users to speak their employee number, which has the format HN56C12345 (any sequence of alphanumeric characters), into the app. I have gone through the link http://cmusphinx.sourceforge.net/wiki/tutoriallm but I am not sure whether that would work for my use case.
So my question is threefold:
1. Can Sphinx4 actually recognize an alphanumeric sequence, like an employee number in my case, with high accuracy?
2. If yes, can anyone point me to a concrete example or reference page where someone has built custom language support in Sphinx4 from scratch? I haven't found a detailed step-by-step doc on this yet. Has anyone worked on dictionaries or language models based on alphanumeric sequences?
3. How do I build an acoustic model for this scenario?

You don't need a new acoustic model for this, but rather a custom grammar. See http://cmusphinx.sourceforge.net/wiki/tutoriallm#building_a_grammar and http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/jsgf/JSGFGrammar.html to learn more. Sphinx4 recognizes characters just fine if you put them space-separated in the grammar:
#JSGF V1.0;
grammar jsgf.emplID;
<digit> = zero | one | two | three | four | five | six | seven | eight | nine ;
<digit2> = <digit> <digit> ;
<digit4> = <digit2> <digit2> ;
<digit5> = <digit4> <digit> ;
// This rule accepts IDs of a kind: hn<2 digits>c<5 digits>.
public <id> = h n <digit2> c <digit5> ;
As for accuracy, there are two ways to increase it. If the number of employees isn't too large, you can simply build the grammar from all possible employee IDs. If that is not your case, then a generic grammar is your only option, although it's possible to write a custom scorer that uses context information to predict the employee ID better than the generic algorithm. That route requires some knowledge of both ASR and the CMU Sphinx code.
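The first option (a grammar listing every known employee ID) is easy to generate with a script. Below is a minimal Python sketch, assuming the IDs are available as plain strings; `spoken_form`, `build_grammar` and the grammar name are hypothetical names chosen for illustration, and the output still has to be loaded by Sphinx4 like any other JSGF file.

```python
# Spoken word for each digit; letters are spelled as themselves,
# matching the "h n ... c ..." style of the grammar above.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spoken_form(employee_id):
    """Turn 'HN56C12345' into 'h n five six c one two three four five'."""
    return " ".join(DIGIT_WORDS.get(ch, ch.lower()) for ch in employee_id)

def build_grammar(employee_ids, name="emplID"):
    """Return JSGF text whose public rule accepts exactly the given IDs."""
    alternatives = " |\n    ".join(spoken_form(i) for i in employee_ids)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <id> =\n    {alternatives} ;\n"
    )

if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(build_grammar(["HN56C12345", "HN10C00042"]))
```

Because the recognizer can then only ever output an ID that exists, a misheard digit tends to be corrected to the nearest valid ID instead of producing an unknown one.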

Related

how can i see directly that a language is not regular

Given L = {a^n b^n c^n}, how can I tell directly, without looking at production rules, that this language is not regular? I can use the pumping lemma, but some people say they can tell just by looking at the grammar that it is not regular. How is that possible?
You have three characters in your alphabet, and the count of each of them depends on the same variable, n.
Now, if you had only two of them, say {a^n b^n}, you could easily accomplish the task with this production:
S -> ab | aSb
But you have three of them, and there's no way to link all of them to the same variable. You would have to use two syntactic categories, but as soon as you do, they are unlinked and you can generate a different string from each of them. The only way to link them is with a single syntactic category, and that is impossible.
You can't do:
S -> abc | aSbc
In fact, you can't have a syntax category in your final string, so that is not a string. It needs to be transformed again. And what can you do from that point?
You can do:
aabcbc
or you can do:
aaSbcbc
The first one is a string, but it isn't part of your language. The second is not a string yet, and it's easy to see that you can't derive any allowed string from it.
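To see the failed attempt concretely, here is a small Python sketch that expands S -> abc | aSbc and checks each result against {a^n b^n c^n}. The function names are made up for this illustration. It only shows that this particular grammar misses the language; it is not a proof that no grammar works (that is what the pumping lemma is for).

```python
import re

def derive(n):
    """Apply S -> aSbc (n - 1) times, then finish with S -> abc."""
    sentential = "S"
    for _ in range(n - 1):
        sentential = sentential.replace("S", "aSbc")
    return sentential.replace("S", "abc")

def in_language(s):
    """True if s has the form a^n b^n c^n with n >= 1."""
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))

if __name__ == "__main__":
    for n in range(1, 4):
        s = derive(n)
        print(n, s, in_language(s))
```

Only the n = 1 derivation (abc) lands in the language; from n = 2 on, the b's and c's come out interleaved (aabcbc), exactly as described above.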

How to understand and add syllable break in this example?

I am new to machine learning and computing probabilities. This is an example from LingPipe for adding syllabification to a word using training data.
Given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise. We then seek to find the most likely source message h to have produced message w by:
ARGMAX_h p(h|w) = ARGMAX_h p(w|h) p(h) / p(w)
                = ARGMAX_h p(w|h) p(h)
                = ARGMAX_{h s.t. strip(h)=w} p(h)
where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.
I do not understand how to use it to build a syllabification model.
If there is a training set containing:
a bid jan
a bide
a bie
a bil i ty
a bim e lech
How do I build a model that will syllabify words? I mean, what has to be computed in order to find the possible syllable breaks of a new word?
What should be computed first, and what next? Can you please be specific, with an example?
Thanks a lot.
The method described in the article is based on Bayes' rule, which lets you recover the most probable correct value from an observed noisy one. In other words, the non-syllabified word (like picnic) is treated as the noisy or incorrect value, and the goal is to find the most probable correct value, which is pic-nic.
Here is an excellent video lesson on this very topic (scroll to 1:25, but the whole set of lectures is worth watching).
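To make "first compute what, then compute what" concrete, here is a minimal Python sketch of the idea. It is not LingPipe's implementation: I am assuming a plain character-bigram model with add-one smoothing for p(h), a space as the break symbol (as in the training set in the question), and brute-force enumeration of every possible h, which is only practical for short words. All function names are my own.

```python
import math
from collections import Counter
from itertools import product

BOUNDARY = "#"  # marks the start and end of a word

def train(hyphenated_words):
    """Step 1: count character bigrams over hyphenated words (space = break)."""
    bigrams, contexts, alphabet = Counter(), Counter(), {BOUNDARY}
    for h in hyphenated_words:
        padded = BOUNDARY + h + BOUNDARY
        alphabet.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(alphabet)

def log_p(h, model):
    """Step 2: log p(h) under the bigram model, with add-one smoothing."""
    bigrams, contexts, vocab = model
    padded = BOUNDARY + h + BOUNDARY
    return sum(
        math.log((bigrams[prev, cur] + 1) / (contexts[prev] + vocab))
        for prev, cur in zip(padded, padded[1:])
    )

def candidates(w):
    """Step 3: every h with strip(h) == w (break or no break at each gap)."""
    for gaps in product(["", " "], repeat=max(len(w) - 1, 0)):
        yield "".join(c + g for c, g in zip(w, gaps + ("",)))

def syllabify(w, model):
    """Step 4: the argmax over candidates, i.e. the most likely h for w."""
    return max(candidates(w), key=lambda h: log_p(h, model))

if __name__ == "__main__":
    model = train(["a bid jan", "a bide", "a bie", "a bil i ty", "a bim e lech"])
    print(syllabify("abide", model))
```

With only five training words the model is far too small to generalize; a real one would be trained on a full hyphenation dictionary and would probably use longer n-grams and a smarter search than full enumeration.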
This method is specifically useful for word delimiting, but some use it for syllabification as well. Chinese language has space delimiters only for logical constructs, but most words follow each other with no delimiters. However, each character is a syllable, no exception.
There are other languages that have more complicated grammar. For instance, Thai has no spaces between the words, but each syllable may be constructed from several symbols, e.g. สวัสดี -> ส-วัส-ดี. Rule-based syllabification may be hard but possible.
As for English, I would not bother with Markov chains and N-grams and would instead just use several simple rules that give a pretty good match ratio (not perfect, however):
Two consonants between two vowels VCCV - split between them VC-CV as in cof-fee, pic-nic, except the "cluster consonant" that represents a single sound: meth-od, Ro-chester, hang-out
Three or more consonants between the vowels VCCCV - split keeping the blends together as in mon-ster or child-ren (this seems the most difficult as you cannot avoid a dictionary)
One consonant between two vowels VCV - split after the first vowel V-CV as in ba-con, a-rid
The rule above also has an exception based on blends: cour-age, play-time
Two vowels together VV - split between them, except when they represent a "cluster vowel": po-em, but glacier, earl-ier
I would start with the "main" rules first, and then cover them with "guard" rules that prevent cluster vowels and consonants from being split. There would also be an obvious guard rule to prevent a single consonant from becoming a syllable. When done, I would add another guard rule based on a dictionary.
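A rough Python sketch of the two main consonant rules plus one guard might look like this. The vowel set and the cluster list are my own illustrative assumptions and are far from complete; vowel-vowel splits, silent e and the blend exceptions are not handled; runs of three or more consonants are simply split after the first one as a crude stand-in for a dictionary (so it gives chil-dren, not child-ren); and clusters are always kept with the preceding vowel, so it will not reproduce Ro-chester.

```python
import re

VOWELS = "aeiouy"
# A few consonant pairs that stand for one sound (illustrative only).
CLUSTERS = ("th", "ch", "sh", "ph", "ng", "ck")

def syllabify(word):
    """Insert '-' at likely syllable breaks using the VCCV and VCV rules."""
    word = word.lower()
    # Alternating runs of vowels and consonants: picnic -> p, i, cn, i, c
    runs = re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word)
    out = ""
    for idx, run in enumerate(runs):
        is_consonants = run[0] not in VOWELS
        interior = 0 < idx < len(runs) - 1  # has a vowel run on both sides
        if is_consonants and interior:
            if len(run) == 1:
                out += "-" + run               # VCV  -> V-CV
            elif run in CLUSTERS:
                out += run + "-"               # guard: never split a cluster
            else:
                out += run[0] + "-" + run[1:]  # VCCV -> VC-CV
        else:
            out += run
    return out

if __name__ == "__main__":
    for w in ["picnic", "coffee", "bacon", "method", "monster"]:
        print(w, "->", syllabify(w))
```

Further guard rules (for example a dictionary lookup that overrides the result) would wrap around this function rather than being mixed into the main rules.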

Programming Language with Inflection

Is there a programming language that uses inflections (suffixing a word to add a certain meaning) instead of operators to express instructions? Just wondering.
What I am talking about is using inflections to add a meaning to an identifier such as a variable or type name.
For example:
native type integer
var x : integer = 12
var location : integers = 12, 5, 42
say 0th locationte to_string (( -te replaces "." operator. prints 12 ))
I think Perligata (Perl in Latin) is what you're looking for. :) From the article:
"There is no reason why programming languages could not also use inflexions, rather than position, to denote lexical roles."
Here's an example program (Sieve of Eratosthenes):
#! /usr/local/bin/perl -w
use Lingua::Romana::Perligata;
maximum inquementum tum biguttam egresso scribe.
meo maximo vestibulo perlegamentum da.
da duo tum maximum conscribementa meis listis.
dum listis decapitamentum damentum nexto
fac sic
nextum tum novumversum scribe egresso.
lista sic hoc recidementum nextum cis vannementa da listis.
cis.
This is partially facetious, but... assembly language? Things like conditional jump instructions are often variations on a root ("J" for jump or whatnot) with suffixes added to denote the associated condition ("JNZ" for jump-if-not-zero, et cetera).
The excellent (dare I say fascinating) game-design language Inform 7 is inflected like English. But it's so closely integrated with a host of other design decisions that it's hard to peel away as a separate feature.
Anyone who is interested in language designs that are unusual but successful should check out Inform 7.
Presumably any programming language that uses natural language explicitly or closely as a basis, e.g., Natural-Language Programming. There was some research done at MIT into using English to produce high-level skeletons of programs, which is more in the realm of natural-language processing; the tool they created is called Metafor.
As far as I know, no existing language has support for, say, modifying or extending keywords with inflection. Now you've got me interested, though, so I'm sure I'll come up with something soon!
Of the 40 or so languages I know, the only thing that comes to mind is some rare SQL implementations that include friendly aliases. For example, to select a default database after connecting, the standard is USE <some database name>, but one implementation I used also allowed USING <some database name>.
FORTRAN uses the first letter of the name to determine the type of an implicitly-declared variable.
COBOL has singular and plural versions of its "figurative constants", e.g. SPACE and SPACES.
Python 3.7's standard module contextvars has Context Variables, which can be used for inflection.

Given a document, select a relevant snippet

When I ask a question here, the tool tips for the questions returned by the auto search give the first little bit of the question, but a decent percentage of them don't give any text that is more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out the useless bits of a question?
My first idea is to trim any leading sentences that contain only words from some list (for instance, stop words, plus words from the title, plus words from the SO corpus that correlate very weakly with tags, that is, words that are equally likely to occur in any question regardless of its tags).
Automatic Text Summarization
It sounds like you're interested in automatic text summarization. For a nice overview of the problem, issues involved, and available algorithms, take a look at Das and Martin's paper A Survey on Automatic Text Summarization (2007).
Simple Algorithm
A simple but reasonably effective summarization algorithm is to just select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent ones not including stop list words).
Summarizer(originalText, maxSummarySize):
    // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
    wordFrequencies = getWordCounts(originalText)
    // filter, e.g. [(3, 'language'), (8, 'code')...]
    contentWordFrequencies = filtStopWords(wordFrequencies)
    // sort by freq & drop counts, e.g. ['code', 'language'...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequencies)
    // Split Sentences
    sentences = getSentences(originalText)
    // Select up to maxSummarySize sentences
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break
    // construct summary out of select sentences, preserving original ordering
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence
    return summary
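The pseudocode translates fairly directly into Python. The following is a minimal sketch and not the code of any particular package: the tiny stop list and the regex-based sentence splitter are simplifying assumptions, and `tokenize`, `summarize` and `STOP_WORDS` are names chosen here for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "with", "i",
              "to", "and", "in", "it", "that", "this"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def summarize(text, max_sentences):
    """Pick the first sentence containing each of the most frequent content words."""
    if max_sentences <= 0:
        return ""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sentence_words = [set(tokenize(s)) for s in sentences]
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    chosen = set()  # indexes of the selected sentences
    for word, _count in counts.most_common():
        first = next((i for i, ws in enumerate(sentence_words) if word in ws), None)
        if first is not None:
            chosen.add(first)
        if len(chosen) >= max_sentences:
            break
    # Rebuild the summary in the original sentence order.
    return " ".join(s for i, s in enumerate(sentences) if i in chosen)
```

Storing sentence indexes rather than sentence text keeps the original ordering trivial to restore and avoids trouble when the same sentence appears twice.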
Some open source packages that do summarization using this algorithm are:
Classifier4J (Java)
If you're using Java, you can use Classifier4J's module SimpleSummarizer.
Using the example found here, let's assume the original text is:
Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.
As seen in the following snippet, you can easily create a simple one sentence summary:
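// Request a 1 sentence summary
// (assumes Classifier4J's SimpleSummariser class has been imported)
SimpleSummariser summariser = new SimpleSummariser();
String summary = summariser.summarise(longOriginalText, 1);
Using the algorithm above, this will produce "Classifier4J includes a summariser."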
NClassifier (C#)
If you're using C#, there's a port of Classifier4J to C# called NClassifier
Tristan Havelick's Summarizer for NLTK (Python)
There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I keep only the strings that sound like English words? I.e., discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a Markov chain from a huge English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.
In a nutshell: the Markov chain stores, for each character, the probabilities of which character will follow it. You can extend this idea to two or three characters of context if you have enough memory.
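As a rough illustration of running the chain "in reverse", here is a Python sketch that learns character-to-character transition counts from a word list and then scores a candidate by its average log-probability per transition. The function names, the `^`/`$` boundary markers, the add-one smoothing and the default alphabet size of 27 (26 letters plus the end marker) are all my own assumptions; the training list in the example is a toy stand-in for a huge English text.

```python
import math
from collections import Counter, defaultdict

def train_markov(words):
    """For each character, count which character follows it."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = "^" + word.lower() + "$"  # ^ and $ mark word start and end
        for prev, cur in zip(padded, padded[1:]):
            transitions[prev][cur] += 1
    return transitions

def avg_log_prob(word, transitions, alphabet_size=27):
    """Average log P(next char | previous char), with add-one smoothing."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        followers = transitions.get(prev, Counter())
        total += math.log(
            (followers[cur] + 1) / (sum(followers.values()) + alphabet_size)
        )
    return total / (len(padded) - 1)

if __name__ == "__main__":
    corpus = ["lord", "load", "road", "old", "gold", "order",
              "door", "word", "lead", "real", "keel", "deal"]
    chain = train_markov(corpus)
    for candidate in ["LORD", "RDLO", "KEAL"]:
        print(candidate, round(avg_log_prob(candidate, chain), 3))
```

Averaging over the number of transitions keeps long and short candidates comparable; the cut-off below which a string is discarded has to be tuned on real data.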
The easy way is with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian):
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams (pairs of adjacent letters) and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or for purchase (I didn't find any freely available, but have only done a cursory Google search so far), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram; you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
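The hash-of-counters tally, together with the "simple" rejection rule described earlier, can be sketched in a few lines of Python. `bigram_table`, `looks_english` and the `min_count` threshold are illustrative names and defaults of my own, and the four-word corpus in the example is obviously a toy.

```python
from collections import Counter

def bigram_table(corpus_words):
    """Tally every pair of adjacent letters across the corpus."""
    table = Counter()
    for word in corpus_words:
        word = word.lower()
        table.update(word[i:i + 2] for i in range(len(word) - 1))
    return table

def looks_english(candidate, table, min_count=1):
    """Simple rule: reject if any bigram occurs fewer than min_count times."""
    candidate = candidate.lower()
    return all(table[candidate[i:i + 2]] >= min_count
               for i in range(len(candidate) - 1))

if __name__ == "__main__":
    table = bigram_table(["lord", "keel", "real", "load"])
    for candidate in ["LORD", "KEAL", "RDLO"]:
        print(candidate, looks_english(candidate, table))
```

Switching to trigrams only means changing the slice width from 2 to 3 (and the range bound to match).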
English spelling and English phonetics are (famously!) less than isomorphic, so this technique might well let through strings that "look" English but present troublesome pronunciations. This is another argument for trigrams rather than bigrams: the weirdness produced by sounds that use several letters in sequence to represent a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
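def soundex(name, length=4):
    """Return the Soundex code of name, padded or cut to length characters."""
    # Soundex digit for each letter A..Z; '0' marks letters that get dropped.
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''  # first letter of the name
    for c in name.upper():
        if 'A' <= c <= 'Z':
            if not fc:
                fc = c
            d = digits[ord(c) - ord('A')]
            # Skip a digit that repeats the previous one.
            if not sndx or d != sndx[-1]:
                sndx += d
    # Replace the first digit with the first letter, then drop the zeros.
    sndx = fc + sndx[1:]
    sndx = sndx.replace('0', '')
    return (sndx + length * '0')[:length]

real_words = load_english_dictionary()
soundex_cache = {soundex(word) for word in real_words}
# candidate is the generated string being tested
if soundex(candidate) in soundex_cache:
    print("keep")
else:
    print("discard")
Obviously you'll need to provide an implementation of load_english_dictionary.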
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on a SQL server, it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MS SQL Server has SOUNDEX implemented as an available search algorithm.
Obviously you can implement this yourself if you want, in any language, but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
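For reference, the index of coincidence is the probability that two letters picked at random from the text (without replacement) are the same. Ordinary English text comes out at roughly 0.067, while uniformly random letters give about 1/26, or roughly 0.038. A minimal Python sketch (the function name is mine):

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two letters drawn from text (no replacement) match."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))
```

Note that this statistic is very noisy on a single four-letter string, so it is more useful for judging a whole batch of generated strings than for accepting or rejecting individual ones.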
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that rhythm is a word), but it will provide a method of filtering.
I'm also assuming that you want strings that could be English words (they sound reasonable when pronounced) rather than strings that are definitely words with an English meaning.

Resources