How to use hmmlearn to classify English text? - python-3.x

I want to implement a classic Markov model problem: Train MM to learn English text patterns, and use that to detect English text vs. random strings.
I decided to use hmmlearn so I don't have to write my own. However I am confused about how to train it. It seems to require the number of components in the HMM, but what is a reasonable number for English? Also, can I not do a simple higher order Markov model instead of hidden? Presumably the interesting property is is patterns of ngrams, not hidden states.

hmmlearn is designed for unsupervised learning of HMMs, while your problem is clearly supervised: given examples of English and random strings, learn to distinguish between the two. Also, as you've correctly pointed it out, the notion of hidden states is tricky to define for text data, therefore for your problem plain MMs would be more appropriate. I think you should be able to implement them in <100 lines of code in Python.

Related

Dataset Language identification

I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and what languages are these. The number of languages might be approximately 8-12. I am considering this language detection as a part of the preprocessing. I would like to figure out the languages in order to be able to use the appropriate stop words and see how less data in some of the given languages could affect the occuracy of the classificatin.
Is langid.py or simple langdetect suitable? or any other suggestions?
Thanks
The easiest way to identify the language of a text is to have a list of common grammatical words of each language (pretty much your stop words, in fact), take a sample of the text and count which words occur in your (language-specific) word lists. Then sum them up and the word list with the largest overlap should be the language of the text.
If you want to be more advanced, you can use n-grams instead of words: collect n-grams from a text you know the language of, and use that as a classifier instead of your stop words.
You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-Roberta which is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang tensors to understand which language is used (which is good in your case), and should be able to determine the correct language from the input ids. Besides like any other transformer based model, it comes with its tokenizer so you could jump the preprocessing part.
You could use the Huggingface library to use any of these models.
Check the XLM Roberta Huggingface documentation here

Natural Language Generation - how to go beyond templates

We've build a system that analyzes some data and outputs some results in plain English (i.e. no charts etc.). The current implementation relies on lots of templates and some randomization in order to give as much diversity to the text as possible.
We'd like to switch to something more advanced with the hope that the produced text is less repetitive and sounds less robotic. I've searched a lot on google but I cannot find something concrete to start from. Any ideas?
EDIT: The data fed to the NLG mechanism are in JSON format. Here is an example about web analytics data. The json file may contain for example a metric (e.g. visits), it's value in the last X days, whether the last value is expected or not and which dimensions (e.g. countries or marketing channels) affected its change.
The current implementation could give something like this:
Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
We're looking to finding a way to depend less on templates, sound even more natural and increase the vocabulary.
What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate 100% meaningful diverse and natural sentences. one approach to generate sentences is using n-grams. using these method you can generate sentences that look more natural and diverse that may look good but probably meaningless and grammatically incorrect.
A more up to date approach is using Deep learning. anyway if you want to generate meaningful sentences, maybe your best way is using your current template based method.
You can find an introduction to basics of n-gram based NLG here:
Generating Random Text with Bigrams
this tool sounds to implement some of the most famous techniques for natural language generation: simplenlg
Have you tried Neural Networks especially LSTM and GRU architectures? These models are the most recent developments in predicting sequences of words. Generating natural language means to generate a sequence of words such that it makes sense with respect to the input and earlier words in the sequence. This is equivalent to predicting time series. LSTM is designed for predicting time series. Hence, it is commonly used to predict a sequence of words, given an input sequence, an input word, or any other input that can be embedded in a vector.
Deep learning libraries such as Tensorflow, Keras, and Torch all have sequence to sequence implementations that can be used for generating natural language by predicting a sequence of words given an input.
Note that usually these models need a huge amount of training data.
You need to meet two criteria in order to benefit from such models:
You should be able to represent your input as a vector.
You need a relatively large amount of input/target pairs.

Word sense disambiguation using WEKA

I've got a training DataSet and a Test DataSet. How can we experiment and get results ?
Can WEKA be used for the same ?
The topic is Word Sense Disambiguation using Support Vector Machine Supervised learning Approach
The Document types within both the sets include following file types:
1. 2 XML files
2. README file
3. SENSEMAP format
4. TRAIN format
5. KEY format
6. WORDS format
Machine learning approaches like SVM are not popular with word sense disambiguation.
Are you aware of Wikify, mapping to wikipedia can be considered very fine word-sense disambiguation.
To answer your question, in cases like these; any machine learning technique can give you desired results. One should be more worried about the features to extract and make sure the word features are distinctive enough to resolve the disambiguations at the level you chose. For example in the sentence: Wish you a very Happy Christamas you just want to disambiguate Happy Christmas as either book or festival.

Unguided speech to text conversion

I am trying to come up with a way to convert speech to text. I am trying to use Sphinx to attain this. What I mean by unguided speech to text is that, the speaker is not bound to speak from a definite set of sentences. Rather he might speak any sentence. So its not possible for me to have a grammar file, where each word is one of the alternative pre-written in the grammar file. I understand that I would have to train Sphinx somehow to do this.
But I am a beginner in sphinx. How to start training Sphinx to convert unguided speech? Is it possible to attain unguided conversion with Sphinx?
The task you are up to is, as of right now, is not yet possible to complete, at least not with satisfying accuracy.
As for the Sphinx-based solution: you will have to create dictionary with all the words to be recognized. There is no other way.
Once you have the dictionary, you can generate a simple n-gram model based on it, with ony unigrams - each unigram will be one word. The probability of each may be the same, or you may attempt to do some statistical analysis of the words that will be used.

Baum-Welch algorithm for pos tagger

everyone.
I'm using the Baum-Welch algorithm to train a pos tagger,it is totally in the unsupervised way.
Here comes the problem:
When i get the label result, I only get a sequence of numbers.
I can't figure out which label stands for VV,NN,DT.
How can I solve this problem?
In general, there's no way to do that. Baum-Welch will find classes of word uses that have similar distributions, but there's no particular reason to suppose that those classes will map in any straightforward way to categories posited by any specific linguistic theory. Therefore, unsupervised POS taggers are mainly useful for applications where you care about equivalence classes of words or phrases but not about the specific tags being assigned.
If you really need human-readable labels, though (e.g., during development, to evaluate whether the results you're getting are even remotely plausible), I'd hand-tag a few dozen sentences. Then you could apply your B-W-derived tagger to that labeled mini-corpus to induce a mapping between class numbers and POS labels.

Resources