What do the parentheses mean in a CMUSphinx result?

My output is:
['<s>', 'does', 'any', '<sil>', 'unable', 'to(3)', 'bear', 'the', 'senate', 'is', 'touching', 'emotion', 'turned', 'away', '<sil>', 'and(2)', 'ill', 'afford', '<sil>', 'without', 'seeking', 'any', 'further', 'explanation', '<sil>', 'and(2)', 'attracted', 'towards(2)', 'him', 'and', 'irresistible', 'magnetism', 'which', 'draws', 'us', 'towards(2)', 'those', 'who', 'have', 'loved', 'to(3)', 'people', 'for(2)', 'whom', 'we', 'mourn', '<sil>', 'extended', 'his', 'hand', 'towards(2)', 'the(2)', 'young', 'man', '</s>']
I get what <s> and <sil> do. But what about to(3)?

It's hard to say with absolute certainty without checking the dictionary file (normally the file with the .dict extension), which maps each word to its pronunciation. You could then check how it differs from (supposedly) to(2) or plain to, or whether those variants exist at all.
However, since many words with the same spelling have different pronunciations, the convention is to account for them with distinct entries in the dictionary, as stated in the official tutorial:
A dictionary can also contain alternative pronunciations. In that case you can designate them with a number in parentheses:
the TH IH
the(2) TH AH
In the example above, the software recognises whichever variant matches what the speaker actually said.
If you're using a pre-made official model, that is the case here. Assuming you care less about how a word was pronounced and more about which word was pronounced, you can simply ignore the parentheses.
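If all you need is the word sequence, here is a minimal post-processing sketch (the hypothesis list is shortened from the output above, and the marker and variant formats are assumed to be as shown there):
import re

# Drop <s>/<sil>/</s> markers and collapse pronunciation variants like 'to(3)' to 'to'.
hyp = ['<s>', 'unable', 'to(3)', 'bear', 'the', 'senate', '<sil>', 'towards(2)', 'him', '</s>']
words = [re.sub(r'\(\d+\)$', '', w) for w in hyp if not w.startswith('<')]
print(words)  # ['unable', 'to', 'bear', 'the', 'senate', 'towards', 'him']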

Related

Identifying phrases which contrast two corpora

I would like to identify compound phrases in one corpus (e.g. (w_1, w_2) in Corpus 1) which not only appear significantly more often than their constituents (e.g. (w_1),(w_2) in Corpus 1) within the corpus but also more than they do in a second corpus (e.g. (w_1, w_2) in Corpus 2). Consider the following informal example. I have the two corpora each consisting of a set of documents:
[['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy'], ...]
[['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic'], ...].
In this case, I would like new_york to be detected as a compound phrase. However, when corpus 2 is replaced by
[['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york'], ...],
I would like new_york to be relatively disregarded.
I could just use a ratio of n-gram scores between corresponding phrases in the two corpora (sketched below for n=2), but I don't see how to scale this to general n. Normally, phrase detection for n-grams with n>2 is done by recursing on n and gradually inserting compound phrases into the documents by thresholding a score function. This ensures that at step n, if you want to score the n-gram (w_1, ..., w_n), you can always normalize by the constituent m-grams for m<n. But with a different corpus, these are not guaranteed to appear.
A reference to the literature or a relevant hack will be appreciated.
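For concreteness, a minimal sketch of the n=2 case of the ratio idea above; the count-based association score and the smoothing constant are illustrative assumptions, not a recommended formula, and the scaling-to-general-n question remains open:
from collections import Counter

def bigram_scores(corpus, delta=1e-9):
    # count-based association score for each adjacent pair:
    # count(w1, w2) / (count(w1) * count(w2))
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    return {bg: c / (unigrams[bg[0]] * unigrams[bg[1]] + delta)
            for bg, c in bigrams.items()}

corpus1 = [['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy']]
corpus2 = [['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic']]
corpus2_alt = [['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york']]

s1 = bigram_scores(corpus1)

# contrast against the Los Angeles corpus: ('new', 'york') never occurs there, so the ratio is huge
s2 = bigram_scores(corpus2)
print(s1[('new', 'york')] / (s2.get(('new', 'york'), 0.0) + 1e-9))

# contrast against the alternative corpus: ('new', 'york') is common there too, so the ratio is about 1
s2_alt = bigram_scores(corpus2_alt)
print(s1[('new', 'york')] / (s2_alt.get(('new', 'york'), 0.0) + 1e-9))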

What is the input format of fastText and why doesn't my model give me a meaningful similarity output?

My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you', 'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show', 'Bird', 'Box', 'from', 'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, I expected it to return "rule" as the top similarity, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that didn't help either.
Was my input format correct? I've seen lots of documents using .cor or .bin formats, and I don't know what those are.
Thanks for any reply!
model.wv.most_similar('rule') asks that model's set of word vectors (.wv) to return the words most similar to 'rule'. That is, you've provided neither a document (multiple words) as the query, nor is there any way for the FastText model to return a document itself, or the name of any document. Only words, as it has done.
While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those results don't look like words very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc. model, trained with adequate data/parameters, no single sentence (like your first sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
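If the goal really is word-versus-document similarity, one common trick is to average the word vectors on each side and take the cosine similarity between the averages; gensim's n_similarity helper does exactly that. A minimal sketch, assuming the trained model from the question (the quality of the number still depends entirely on how well the model was trained):
# compare 'rule' against the whole first sentence by averaging vectors on each side
doc_tokens = ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household']
print(model.wv.n_similarity(['rule'], doc_tokens))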

How to sort list of strings without using any pre-defined function?

I am new to python and I am stuck to find solution for one problem.
I have a list like ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again'] which I want to sort without using any pre-defined function.
I have thought about it a lot but am not able to solve it properly.
Is there any short and elegant way to sort such a list of strings without using pre-defined functions?
Which algorithm will be best suitable to sort list of strings?
Thanks.
This sounds like you're learning about sorting algorithms. One of the simplest sorting methods is bubblesort. Basically, it's just making passes through the list and looking at each neighboring pair of values. If they're not in the right order, we swap them. Then we keep making passes through the list until there are no more swaps to make, then we're done. This is not the most efficient sort, but it is very simple to code and understand:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']

def bubblesort(values):
    '''Sort a list of values using bubblesort.'''
    sorted = False
    while not sorted:
        sorted = True
        # take a pass through every pair of values in the list
        for index in range(0, len(values)-1):
            if values[index] > values[index+1]:
                # if the left value is greater than the right value, swap them
                values[index], values[index+1] = values[index+1], values[index]
                # also, this means the list was NOT fully sorted during this pass
                sorted = False

print(f'Original: {values}')
bubblesort(values)
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']
There are lots more sorting algorithms to learn about, and they each have different strengths and weaknesses - some are faster than others, some take up more memory, etc. It's fascinating stuff and worth it to learn more about Computer Science topics. But if you're a developer working on a project, unless you have very specific needs, you should probably just use the built-in Python sorting algorithms and move on:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
print(f'Original: {values}')
values.sort()
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']

Extract relationship concepts from sentences

Is there a current model or how could I train a model that takes a sentence involving two subjects like:
[Meiosis] is a type of [cell division]...
and decides if one is the child or parent concept of the other? In this case, cell division is the parent of meiosis.
Are the subjects already identified, i.e., do you know beforehand, for each sentence, which words or sequences of words represent the subjects? If you do, I think what you are looking for is relationship extraction.
Unsupervised approach
A simple unsupervised approach is to look for patterns using part-of-speech tags, e.g.:
First you tokenize and get the PoS-tags for each sentence:
sentence = "Meiosis is a type of cell division."
tokens = nltk.word_tokenize("Meiosis is a type of cell division.")
tokens
['Meiosis', 'is', 'a', 'type', 'of', 'cell', 'division', '.']
token_pos = nltk.pos_tag(tokens)
token_pos
[('Meiosis', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('type', 'NN'), ('of', 'IN'),
('cell', 'NN'), ('division', 'NN'), ('.', '.')]
Then you build a parser for a specific pattern based on PoS-tags, a pattern that mediates relationships between two subjects/entities/nouns:
verb = "<VB|VBD|VBG|VBN|VBP|VBZ>*<RB|RBR|RBS>*"
word = "<NN|NNS|NNP|NNPS|JJ|JJR|JJS|RB|WP>"
preposition = "<IN>"
rel_pattern = "({}|{}{}|{}{}*{})+ ".format(verb, verb, preposition, verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)
NOTE: This pattern is based on this paper: http://www.aclweb.org/anthology/D11-1142
You can then apply the parser to all the tokens/PoS-tags except the ones which are part of the subjects/entities:
reverb_pattern.parse(token_pos[1:5])
Tree('S', [Tree('REL_PHRASE', [('is', 'VBZ')]), ('a', 'DT'), ('type', 'NN'), ('of', 'IN')])
If the parser outputs a REL_PHRASE, then there is a relationship between the two subjects. You then need to analyse all these patterns and decide which of them represent a parent-of relationship. One way to achieve that is by clustering them, for instance.
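For example, a small helper (hypothetical, not part of the paper's method) to pull the matched phrases out of the nltk tree before grouping or clustering them:
def extract_rel_phrases(tree):
    # collect the token text of every REL_PHRASE subtree found by the parser
    return [" ".join(token for token, pos in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "REL_PHRASE"]

print(extract_rel_phrases(reverb_pattern.parse(token_pos[1:5])))
# ['is']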
Supervised approach
If your sentences are already tagged with subjects/entities and with the type of relationship, i.e., a supervised scenario, then you can build a model where the features are the words between the two subjects/entities and the label is the type of relationship.
sent: "[Meiosis] is a type of [cell division.]"
label: parent of
You can build a vector representation of is a type of, and train a classifier to predict the label parent of. You will need many examples for this; how many also depends on how many different classes/labels you have.
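A minimal sketch of that supervised setup with scikit-learn; the toy training pairs and labels below are purely illustrative assumptions:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy data: the words between the two entities, and the relation label (illustrative only)
X_train = ["is a type of", "is a kind of", "is part of", "contains"]
y_train = ["parent-of", "parent-of", "part-of", "has-part"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 3)), LogisticRegression())
clf.fit(X_train, y_train)

print(clf.predict(["is a type of"]))  # expected: ['parent-of']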

CoreNLP API for N-grams?

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.?
For example, I have a string "I have the best car ". I would love to get:
I
I have
the
the best
car
based on the string I am passing.
If you are coding in Java, check out getNgrams* functions in the StringUtils class in CoreNLP.
You can also use CollectionUtils.getNgrams (which is what the StringUtils class uses, too).
You can use CoreNLP to tokenize, but for grabbing n-grams, do it natively in whatever language you're working in. If, say, you're piping this into Python, you can use list slicing and some list comprehensions to split them up:
>>> tokens
['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i,w in enumerate(tokens) if i+1 <= len(tokens)]
>>> bigrams = [tokens[i:i+2] for i,w in enumerate(tokens) if i+2 <= len(tokens)]
>>> trigrams = [tokens[i:i+3] for i,w in enumerate(tokens) if i+3 <= len(tokens)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
CoreNLP is great for doing NLP heavy lifting, like dependencies, coref, POS tagging, etc. It seems like overkill if you just want to tokenize though, like bringing a fire truck to a water gun fight. Using something like TreeTagger might equally fulfill your needs for tokenization.
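And if you'd rather not write one comprehension per n, the slicing above generalises to a single small helper:
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as lists of tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ['I', 'have', 'the', 'best', 'car']
print(ngrams(tokens, 2))
# [['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]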
