Unsupervised HMM training in NLTK

I am just trying to do very simple unsupervised HMM training in nltk.
Consider:
import nltk
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
m = trainer.train_unsupervised(emma)
ValueError: A Uniform probability distribution must have at least one sample.
Can I find an example of using nltk.tag.hmm.HiddenMarkovModelTrainer.train_unsupervised?

Apparently, nltk requires us to manually specify the set of observed symbols and states, and also requires the unlabeled sequences to be in the form of [ [(symb,tag),(symb,tag),...], [(symb,tag),(symb,tag),...], ...].
So we have
s = """"Your humble writer knows a little bit about a lot of things, but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python. Computational linguistics, moreover, is closely related to the fields of artificial intelligence, language/speech recognition, translation, and grammar checking.\nWhat NLTK includes\nIt is natural to think of NLTK as a stacked series of layers that build on each other. Readers familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar -- but deeper -- layers involved in natural language modeling.\nGlossary of terms\nCorpora: Collections of related texts. For example, the works of Shakespeare might, collectively, by called a corpus; the works of several authors, corpora.\nHistogram: The statistic distribution of the frequency of different words, letters, or other items within a data set.\nSyntagmatic: The study of syntagma; namely, the statistical relations in the contiguous occurrence of letters, words, or phrases in corpora.\nContext-free grammar: Type-2 in Noam Chomsky's hierarchy of the four types of formal grammars. See Resources for a thorough description.\nWhile NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units can be classified. 
Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and draw graphs that represent either the processing itself, or statistical aggregates in results.\nIn this article, you'll see some relatively fleshed-out examples from the lower-level capabilities, but most of the higher-level capabilities will be simply described abstractly. Let's now take the first steps past text processing, narrowly construed. """
import random
sentences = s.split('.')[:-1]
# Each unlabeled sequence is a list of (symbol, tag) pairs; the tag is left empty.
seq = [[(w, '') for w in ss.split(' ')] for ss in sentences]
symbols = list({w for sent in seq for (w, _) in sent})
states = list(range(5))
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer(states=states, symbols=symbols)
m = trainer.train_unsupervised(seq)
m.random_sample(random.Random(), 10)

I thought that this was this bug in NLTK:
http://code.google.com/p/nltk/source/diff?spec=svn8791&r=8791&format=side&path=/trunk/nltk/nltk/tag/hmm.py
http://code.google.com/p/nltk/issues/detail?id=681
However the error message "A Uniform probability distribution must have at least one sample." is different from the one you get from the bug.

Why are word embeddings with linguistic features (e.g. Sense2Vec) not used?

Given that embedding systems such as Sense2Vec incorporate linguistic features such as part-of-speech, why are these embeddings not more commonly used?
Across popular work in NLP today, Word2Vec and GloVe are the most commonly used word embedding systems, despite the fact that they only incorporate word information and do not have linguistic features of the words.
For example, in sentiment analysis, text classification or machine translation tasks, it makes logical sense that if the input incorporates linguistic features as well, performance could be improved, particularly when disambiguating words such as "duck" the verb and "duck" the noun.
Is this thinking flawed? Or is there some other practical reason why these embeddings are not more widely used?
It's a very subjective question. One reason is the POS tagger itself. A POS tagger is a probabilistic model, which could add to the overall error/confusion.
For example, say you have dense representations for duck-NP and duck-VB, but at run/inference time your POS tagger tags 'duck' as something else; then you won't even find it. Moreover, it also effectively reduces the number of times your system sees the word duck, hence one could argue that the representations generated would be weak.
To top it off, the main problem which sense2vec was addressing is contextualisation of word representations, which has been solved by contextual representations like BERT and ELMo without producing any of the above problems.
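The two failure modes described above can be seen on a toy example. This is only a sketch: the `word|TAG` key format, the tag names, and the vector values are made up for illustration and are not sense2vec's actual storage format.

```python
from collections import Counter

# Hypothetical POS-keyed embedding table (values are invented).
vectors = {
    "duck|NOUN": [0.9, 0.1],
    "duck|VERB": [0.1, 0.8],
}

def lookup(word, tag, table):
    """Return the vector stored under word|tag, or None if that key is absent."""
    return table.get(f"{word}|{tag}")

# Problem 1: if the tagger emits an unexpected tag at inference time, the lookup misses.
miss = lookup("duck", "ADJ", vectors)
hit = lookup("duck", "NOUN", vectors)

# Problem 2: tagging splits the occurrences of a word across several keys,
# so each tagged key is seen less often than the plain word.
tagged_corpus = [("duck", "NOUN"), ("duck", "VERB"), ("duck", "NOUN"), ("pond", "NOUN")]
plain_counts = Counter(word for word, _ in tagged_corpus)
tagged_counts = Counter(f"{word}|{tag}" for word, tag in tagged_corpus)

print(miss, hit)
print(plain_counts["duck"], tagged_counts["duck|NOUN"], tagged_counts["duck|VERB"])
```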

Semantic Similarity across multiple languages

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).
So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)?
Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.
It should be possible to do what you've suggested, provided that:
you have good sets of word-vectors for each language's words
the coordinate spaces of the word-vectors are compatible, meaning the words for the exact-same things in both languages have nearly-identical coordinates (and other words with similar meanings have close coordinates)
That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally separate English/Dutch corpora are unlikely to place equivalent words at the same coordinates.
But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
In fact this very idea was used in one of the first word2vec papers:
"Exploiting Similarities among Languages for Machine Translation"
If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
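As a rough illustration of learning such a transformation from anchor pairs, here is a minimal least-squares sketch in pure Python. It is restricted to 2-dimensional toy vectors so the normal equations can be solved by hand; the function names and the data are hypothetical, and real word-vectors would need a linear-algebra library.

```python
def fit_linear_map(src, dst):
    """Least-squares 2x2 matrix W minimising sum ||x W - y||^2 over anchor pairs."""
    # Normal equations: (X^T X) W = X^T Y, with X^T X a symmetric 2x2 matrix.
    a = sum(x[0] * x[0] for x in src)
    b = sum(x[0] * x[1] for x in src)
    d = sum(x[1] * x[1] for x in src)
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("anchor vectors do not span the space")
    inv = [[d / det, -b / det], [-b / det, a / det]]
    xty = [[sum(x[i] * y[j] for x, y in zip(src, dst)) for j in range(2)]
           for i in range(2)]
    return [[sum(inv[i][k] * xty[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_map(w, x):
    """Map a row vector x from the source space into the target space."""
    return [x[0] * w[0][0] + x[1] * w[1][0],
            x[0] * w[0][1] + x[1] * w[1][1]]

# Toy anchors: the "target language" space is the source space rotated by 90 degrees.
src_anchors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst_anchors = [(0.0, 1.0), (-1.0, 0.0), (-1.0, 1.0)]
W = fit_linear_map(src_anchors, dst_anchors)
mapped = apply_map(W, (2.0, 3.0))  # a word that was not among the anchors
print(mapped)
```

Once fitted on the anchor pairs, the same matrix is applied to every vector in the source set, which is what puts the two vocabularies into one comparable space.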
Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
I've recently produced a Python implementation of the technique mentioned in the paper from @gojomo's answer: transvec.
You'll need to provide word translation pairs as training data (I just threw words from my corpus into Google Translate to get as many such pairs as I can) and then you can use a wrapper model from transvec to produce comparable word embeddings for multiple languages. Here's an example:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
For the case of documents rather than words, things are a little trickier because Doc2Vec can't use pre-trained Word2Vec models as a starting point. However, you can get an approximate document vector by simply taking the mean of all the word vectors from that document. If you provide a 2d array to TranslationWordVectorizer's transform method, it will do exactly this and provide you with an approximate document vector so you can find documents with similar meaning even if the languages are different.
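The mean-of-word-vectors approximation is simple enough to sketch without any library. The helper names and toy vectors below are hypothetical; this is not transvec's own code, just the averaging idea followed by a cosine comparison.

```python
import math

def mean_vector(word_vectors):
    """Approximate a document vector as the element-wise mean of its word vectors."""
    if not word_vectors:
        raise ValueError("document has no word vectors")
    n = len(word_vectors)
    return [sum(dim) / n for dim in zip(*word_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-d word vectors, assumed to already live in one shared coordinate space.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]
vec_a = mean_vector(doc_a)
vec_b = mean_vector(doc_b)
print(cosine(vec_a, vec_b))
```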

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on the task you want to solve with this.
Essentially by lemmatization, you make the input space sparser, which can help if you don't have enough training data.
But since Word2Vec is fairly big, if you have big enough training data, lemmatization shouldn't gain you much.
Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
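One possible way to handle this, sketched here as an assumption rather than as what any particular tokenizer does, is a greedy post-processing pass that re-joins adjacent tokens whenever the joined phrase exists in the embedding vocabulary. The function name, the `max_len` parameter, and the space-joined vocabulary key are all hypothetical; some pre-trained models store phrases with underscores instead.

```python
def merge_phrases(tokens, vocab, max_len=3):
    """Greedily join adjacent tokens when the joined phrase is a vocabulary entry."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocab:
                out.append(phrase)
                i += n
                break
        else:
            # No multi-word match starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out

tokens = ["Good", "muffins", "cost", "$", "3.88", "in", "New", "York", "."]
vocab = {"New York"}  # toy stand-in for the keys of an embedding model
merged = merge_phrases(tokens, vocab)
print(merged)
```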
The current project I am working on involves identifying gene names within abstracts of biology papers using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, two main problems arise:
The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
As noted above, your space gets less sparse, since you get more representatives of a certain "meaning", but at the same time, some of these meanings might get split among their representatives. Let me clarify with an example.
We are currently interested in a gene recognized by the acronym BAD. At the same time, "bad" is an English word which has different forms (badly, worst, ...). Since Word2vec builds its vectors based on the context (surrounding words) probability, when you don't lemmatize some of these forms, you might end up losing the relationship between some of these words. This way, in the BAD case, you might end up with a word closer to gene names instead of adjectives in the vector space.

Causal Sentences Extraction Using NLTK python

I am extracting causal sentences from the accident reports on water. I am using NLTK as a tool here. I manually created my regExp grammar by taking 20 causal sentence structures [see examples below]. The constructed grammar is of the type
grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
Now the grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences) but low precision. I would like to ask:
How can I train NLTK to build the regexp grammar automatically for extracting a particular type of sentence?
Has anyone ever tried to extract causal sentences? Example causal sentences are:
There was poor sanitation in the village, as a consequence, she had health problems.
The water was impure in her village, For this reason, she suffered from parasites.
She had health problems because of poor sanitation in the village.
I would want to extract only the above type of sentences from a large text.
I had a brief discussion with the author of the book "Python Text Processing with NLTK 2.0 Cookbook", Mr. Jacob Perkins. He said, "a generalized grammar for sentences is pretty hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is usually used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant." Taking his suggestions, I looked at the causal sentences I had and found that these sentences contain words like
consequently
as a result
Therefore
as a consequence
For this reason
For all these reasons
Thus
because
since
because of
on account of
due to
for the reason
so, that
These words are indeed connecting cause and effect in a sentence. Using these connectors it is now easy to extract causal sentences. A detailed report can be found on arxiv: https://arxiv.org/pdf/1507.02447.pdf
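A connector-based filter along these lines might look like the sketch below. It is an illustration only, not the method from the linked report: the sentence splitter is deliberately naive, the connector list omits "so, that" from the list above, and words such as "since" will also match non-causal (temporal) uses.

```python
import re

CONNECTORS = [
    "consequently", "as a result", "therefore", "as a consequence",
    "for this reason", "for all these reasons", "thus", "because",
    "since", "because of", "on account of", "due to", "for the reason",
]

# Longest connectors first so multi-word phrases win over their prefixes.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(
        re.escape(c) for c in sorted(CONNECTORS, key=len, reverse=True)
    ) + r")\b",
    re.IGNORECASE,
)

def causal_sentences(text):
    """Return the sentences of text that contain at least one causal connector."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if _PATTERN.search(s)]

text = (
    "There was poor sanitation in the village, as a consequence, she had "
    "health problems. The village is near a river. She had health problems "
    "because of poor sanitation in the village."
)
found = causal_sentences(text)
for sentence in found:
    print(sentence)
```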

Paraphrase recognition using sentence level similarity

I'm a new entrant to NLP (Natural Language Processing). As a start up project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the simMetrics package developed by the University of Sheffield which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code is only at character level, whereas I require code at the sentence level (i.e. considering a single word as a unit instead of character-wise). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or someone provide me the code) at the sentence level for the above mentioned measures?
Thanks a lot in advance for your time and effort helping me.
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as Wordnet, and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start off with something simpler (if relatively boring) such as chunking.
Have a look at the docs and books for the Python NLTK library; there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there: the state of the art isn't good enough for a simple yes/no or even a probability.
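For the word-level measures asked about in the question, a minimal sketch is possible without SimMetrics: Levenshtein distance computed over whitespace tokens instead of characters, and Manhattan distance over bag-of-words counts. The function names are hypothetical, the tokenization is a plain `split()`, and Jaro-Winkler is not covered here.

```python
from collections import Counter

def word_levenshtein(a, b):
    """Edit distance between two sentences, treating each word as one unit."""
    s, t = a.split(), b.split()
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, s_word in enumerate(s, 1):
        cur = [i]
        for j, t_word in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,                         # delete s_word
                cur[j - 1] + 1,                      # insert t_word
                prev[j - 1] + (s_word != t_word),    # substitute or match
            ))
        prev = cur
    return prev[-1]

def manhattan(a, b):
    """Manhattan (L1) distance between the word-count vectors of two sentences."""
    ca, cb = Counter(a.split()), Counter(b.split())
    return sum(abs(ca[w] - cb[w]) for w in set(ca) | set(cb))

print(word_levenshtein("the cat sat", "the dog sat"))
print(manhattan("the cat sat", "the dog sat"))
```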

Resources