Difference between GMM and HMM - gaussian

From what I understand:
GMM is a probabilistic model that can represent N normally distributed sub-populations; each component of the mixture is a Gaussian distribution.
HMM is a statistical Markov model with hidden states. When the data are continuous, the emission of each hidden state is modeled as a Gaussian distribution.
If these two statements are correct, what is the difference between a GMM and an HMM?
Also, in the time-series (continuous-data) case, is each state one and only one Gaussian distribution? Is there no emission probability matrix?
Thanks for your help !!! :)

Those two pieces of information are NOT correct. First, you have to understand the distinction between a (stochastic) process and a random variable (RV). An HMM, despite having 'model' in its name, is actually a stochastic process: its random variables change over a time index t. A GMM, by contrast, describes the distribution of a random variable, and is commonly used as the emission distribution of an HMM's hidden states. So comparing an HMM to a GMM is not an apples-to-apples comparison: an HMM needs a time index (t or n), while a GMM does not.
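To make the distinction concrete, here is a minimal GMM sketch: the data are i.i.d. draws from two normally distributed sub-populations, and no time index appears anywhere (all values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two normally distributed sub-populations; note there is no time index
data = np.concatenate([rng.normal(-5, 1, 500),
                       rng.normal(5, 1, 500)]).reshape(-1, 1)

# Fit a 2-component mixture; each component is one Gaussian
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
means = sorted(gmm.means_.ravel())
print(means)  # recovered component means, near -5 and 5
```

An HMM over the same kind of data would additionally need a transition matrix linking the hidden state at time t to the state at time t+1; the GMM has no such structure.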
Q: Also, in the time-series (continuous-data) case, is each state one and only one Gaussian distribution? Is there no emission probability matrix?
A: As stated, the question is unclear. Briefly: with continuous emissions there is no emission probability matrix; each hidden state instead has a continuous emission density, which may be a single Gaussian or a full Gaussian mixture.

Related

How can I specify confidence in training data?

I am classifying data with categorical variables. It is data where people have provided information.
My training dataset is of varying quality. I have greater confidence in some of the data, i.e. a higher confidence that people have provided correct information, whereas for some of the data I am not so sure.
How can I pass this information into a classification algorithm such as Naive Bayes or K nearest neighbour?
Or should I instead look to another algorithm?
I think what you want to do, is to provide individual weights (for the importance/confidence) for each data point you have.
For instance, if you are very certain that one data point is of higher quality and should have a higher weight than others, in which you are less confident in, you can specify that when fitting your classifier.
Sklearn provides, for instance, the Gaussian Naive Bayes classifier (GaussianNB) for that.
Here, you can pass sample_weight when calling the fit() method.
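A minimal sketch of this, with made-up data and confidence weights (the weights are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]])
y = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical confidence scores: the last point of each class is less trusted
sample_weight = np.array([1.0, 1.0, 0.2, 1.0, 1.0, 0.2])

clf = GaussianNB()
clf.fit(X, y, sample_weight=sample_weight)
print(clf.predict([[0.15]]))  # low-valued input -> class 0
```

Down-weighted points still contribute, but their influence on the fitted class means and variances is scaled by their weight.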

RandomForestClassifier in Multi-label problem - how it works?

How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it break the problem into distinct one-label problems?
Just to be clear, I have not really tested it yet but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] at the .fit() function of the RandomForestClassifier.
Let me cite scikit-learn. The user guide of random forest:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
The section multi-output problems of the user guide of decision trees:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
And I hope this will answer your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.
I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
If you go down on the methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode of the classes in that node. This can change if you use weighted classes:
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
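The class-fraction behaviour of predict_proba is easy to verify on a toy tree (data chosen so one leaf holds one sample of each class):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [0.0], [1.0], [1.0], [1.0]]
y = [0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# The leaf containing x=0 holds one sample of each class,
# so the predicted probabilities are the class fractions: 0.5 / 0.5
proba = tree.predict_proba([[0.0]])
print(proba)
```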
Hope this helps! :)

Is TF-IDF necessary when using SVM?

I'm using Support Vector Machines to classify phrases. Before using the SVM, I understand I should do some kind of normalization on the phrase-vectors. One popular method is TF-IDF.
The terms with the highest TF-IDF score are often the terms that best characterize the topic of the document.
But isn't that exactly what SVM does anyway? Giving the highest weight to the terms that best characterize the document?
Thanks in advance :-)
The weight of a term (as assigned by an SVM classifier) may or may not be directly proportional to the relevance of that term to a particular class. This depends on the kernel of the classifier as well as the regularization used. SVM does NOT assign weights to terms that best characterize a single document.
Term-frequency (tf) and inverse document frequency (idf) are used to encode the value of a term in a document vector. This is independent of the SVM classifier.
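To see the two roles side by side, a common pattern is to let TF-IDF do the encoding and the SVM learn the class weights; a minimal sketch with toy phrases (the labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap pills", "agenda for the meeting"]
labels = [1, 0, 1, 0]  # toy labels: 1 = spam-like phrase

# TfidfVectorizer encodes documents; LinearSVC learns per-class term weights
clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(docs, labels)
print(clf.predict(["cheap pills now"]))
```

TF-IDF fixes how each document is represented before the classifier ever sees it; the SVM's weights are then learned on top of that representation, per class rather than per document.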

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be built from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
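The multiply-by-weight trick is a one-liner; a minimal sketch with hypothetical per-feature weights, on toy data where only the second feature carries the label signal:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 1] > 0).astype(int)  # only the second feature is informative

w = np.array([0.1, 10.0])      # hypothetical per-feature weights
# Fit on the rescaled coordinates: distance is now dominated by feature 2
knn = KNeighborsClassifier(n_neighbors=5).fit(X * w, y)
print(knn.score(X * w, y))
```

The same rescaling must be applied to any query point before calling predict, since the tree was built in the weighted space.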
The answer to question 2 is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.

forward-backward algorithm for secondary structure prediction

I want to use HMM (forward backward model) for protein secondary structure prediction.
Basically, a three-state model is used: States = {H=alpha helix, B=beta sheet, C=coil}
and each state has an emission probability pmf of 1-by-20 (for the 20 amino acids).
After running the forward-backward model on a "training set" of sequences, expectation-maximization converges to an optimal transition matrix (3-by-3 between the three states) and an emission probability pmf for each state.
Does anyone know of a dataset (preferably very small) of sequences for which the "correct" values of the transition matrix and emission probabilities are known? I would like to apply the forward-backward algorithm to that dataset in Excel, to build my confidence and check whether I can reproduce the same result.
And then move on to something less primitive than Excel :o)
The best way to do this is probably to produce your own simulated data from distributions you decide. Then you run your program to see if the parameter estimations converge towards your known parameters.
In your case, this will involve writing a Markov chain that changes from state to state with some known, arbitrary probability (for instance, P(helix to coil) = 0.001) and then emits an amino acid with some probability (for instance, P(methionine) = 0.11). At each step, print out the state and the emission. You can then watch your posterior probabilities approach the true state for each site.
You can make these as arbitrary as you want, because when you run your HMM you should converge on the proper distributions.
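A minimal simulator along these lines, with hypothetical transition and emission probabilities (and a toy 4-symbol alphabet standing in for the 20 amino acids):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical transition matrix between H (helix), B (sheet), C (coil);
# rows sum to 1
A = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.05, 0.90]])
# Hypothetical emission pmfs over a toy 4-symbol alphabet
# (a real model would use 1-by-20 pmfs for the amino acids)
E = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.10, 0.10, 0.40, 0.40]])

def simulate(n_steps, start=0):
    """Walk the chain, recording the hidden state and emission at each step."""
    state, states, symbols = start, [], []
    for _ in range(n_steps):
        states.append(state)
        symbols.append(rng.choice(E.shape[1], p=E[state]))
        state = rng.choice(A.shape[0], p=A[state])
    return states, symbols

states, symbols = simulate(1000)
print(len(states), len(symbols))
```

Because you know A and E exactly, you can check whether your forward-backward estimates converge toward them as the simulated sequence grows.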
Good luck!
