I need to know how to use a hidden Markov model (HMM) on top of Apache Spark. It is not present in MLlib.
Are there any alternatives?
Thanks
Elsayed
The best I can find is a two-year-old implementation on Spark.
You might want to investigate using something other than Spark or HMM, or just bite the bullet and implement it yourself. Implementing the Viterbi algorithm is not particularly hard; here is my many-years-old implementation.
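The implementation linked above is not reproduced here. As a rough illustration of why Viterbi is not hard to write, here is a minimal single-machine sketch in plain Python (not Spark). The function name `viterbi`, the dictionary-based model layout, and the toy Healthy/Fever model are assumptions made for this example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, most likely hidden state path)."""
    def log(p):
        # Work in log space so long sequences do not underflow.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Toy model (hypothetical numbers, chosen only for illustration).
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

log_prob, path = viterbi(("normal", "cold", "dizzy"), states, start_p, trans_p, emit_p)
print(path)  # ['Healthy', 'Healthy', 'Fever']
```

The sketch assumes a non-empty observation sequence. Decoding a single sequence is inherently sequential, so on Spark the natural unit of parallelism would probably be one sequence per task rather than one time step per task.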
HMM algorithm - excerpts from https://en.wikipedia.org/wiki/Hidden_Markov_model
A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. A hidden Markov model can be represented as the simplest dynamic Bayesian network.
A hidden Markov model can be considered a generalization of a mixture model where the hidden variables (or latent variables), which control the mixture component to be selected for each observation, are related through a Markov process rather than independent of each other.
Applying the principle of dynamic programming, this problem, too, can be handled efficiently using the forward algorithm.
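As a rough illustration of the forward algorithm mentioned in the excerpt, the following plain-Python sketch computes the probability of an observation sequence by summing over all hidden paths, one time step at a time. The function name `forward`, the dictionary-based model layout, and the toy numbers are assumptions made for this example.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summed over all hidden paths."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())

# Toy model (hypothetical numbers, chosen only for illustration).
hmm_states = ("Healthy", "Fever")
hmm_start = {"Healthy": 0.6, "Fever": 0.4}
hmm_trans = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
hmm_emit = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

likelihood = forward(("normal", "cold", "dizzy"), hmm_states, hmm_start, hmm_trans, hmm_emit)
print(likelihood)
```

The cost is proportional to the sequence length times the square of the number of states, instead of growing exponentially with the sequence length. For long sequences the raw probabilities underflow, so a practical version would rescale `alpha` at each step or work in log space.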
I have not seen algorithms built around the above concepts implemented on Spark.
Spark can support "beyond map-reduce" algorithms but the only thing with dynamic programming I could find was https://github.com/bbengfort/brisera
A Python implementation of a distributed seed and reduce algorithm (similar to BlastReduce and CloudBurst) that utilizes RDDs (resilient distributed datasets) to perform fast iterative analyses and dynamic programming without relying on "chained MapReduce jobs".
Mahout has an HMM implementation, but I am unsure whether it is distributed:
https://mahout.apache.org/users/classification/hidden-markov-models.html
Related
I am developing a machine learning project that analyzes requirement specifications and sorts the non-functional requirements into categories such as database, web socket, backend technology, etc. From my research, Naive Bayes is the better way to categorize, but because I lack a dataset I plan to go with Seed LDA for topic modeling. Would it be okay to use LDA, or should I use something else?
You can try either LDA or clustering.
Based on my experience, k-means clustering could help you better visualize what you are doing and what is happening.
LDA could also work well. You can try it first, since k-means takes much more time.
I implemented an issue tracking system using k-means, which you may like to take a look at: issue tracker
I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, except for the MultiLayerPerceptron algorithm, no other algorithm provides a way to set the initial weights or parameter values.
I am trying to solve incremental learning using Spark. Here, I need to load an old model and re-train it with new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best-performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited to incremental learning, as they look at global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting initial weights with setInitialWeights.
SVM
In theory it could be implemented similarly to the streaming regressions StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such built-in implementation, and since org.apache.spark.mllib is in maintenance mode, there won't be.
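To make the idea behind the logistic regression case concrete, here is a minimal plain-Python sketch of warm-starting: gradient descent for logistic regression simply continues from the previously learned weights when a new batch arrives. This is not the Spark API; the function names `sgd_logistic_update` and `predict`, the step size, and the toy batches are assumptions made for illustration.

```python
import math

def predict(weights, x):
    """Probability of label 1 for feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic_update(weights, batch, step_size=0.5, iterations=50):
    """Continue training from `weights` on a batch of (features, label) pairs."""
    w = list(weights)  # copy so the old model is left untouched
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in batch:
            error = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += error * xi
        w = [wi - step_size * g / len(batch) for wi, g in zip(w, grad)]
    return w

# The first feature is a constant 1.0 acting as the intercept (toy data).
old_batch = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
new_batch = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 3.5], 1), ([1.0, 4.0], 1)]

initial_weights = [0.0, 0.0]
old_model = sgd_logistic_update(initial_weights, old_batch)
new_model = sgd_logistic_update(old_model, new_batch)  # warm start from old model
print(predict(new_model, [1.0, 4.0]), predict(new_model, [1.0, 0.0]))
```

This works for linear models because the model is fully described by a weight vector that the optimizer can resume from. A decision tree has no comparable state, which is the point made above about tree-based algorithms.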
It's not based on Spark, but there is a C++ incremental decision tree.
See gaenari.
Chunks of data can be inserted and updated continuously, and rebuilds can be run if concept drift reduces accuracy.
On one hand, I want to use Spark's capability to compute TF-IDF for a collection of documents; on the other hand, the typical definition of TF-IDF (on which the Spark implementation is based) does not fit my case. I want the TF to be the term frequency across all documents, but in the typical TF-IDF it is computed for each (word, document) pair. The IDF definition is the same as the typical one.
I implemented my customized TF-IDF using Spark RDDs, but I was wondering whether there is any way to customize the source of Spark's TF-IDF so that I can reuse its capabilities, such as hashing.
Actually, I need something like:
public static class newHashingTF implements Something<String>
Thanks
It is pretty simple to implement different hashing strategies, as you can see by the simplicity of HashingTF:
(modern) Dataset version
(old) RDD version
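To illustrate how little machinery the hashing trick needs, here is a minimal plain-Python sketch of the variant the question asks for: term frequency counted across the whole corpus, combined with the usual IDF. It is not Spark code; the names `bucket` and `corpus_tf_idf` are hypothetical, CRC32 stands in for Spark's hash function, and the smoothed IDF log((N + 1) / (df + 1)) is the form given in the Spark MLlib documentation.

```python
import math
import zlib
from collections import Counter

def bucket(term, num_features):
    """Map a term to a stable bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def corpus_tf_idf(documents, num_features=1 << 18):
    """Return {bucket: corpus-wide term frequency * IDF} for tokenized documents."""
    tf = Counter()  # occurrences of each bucket across all documents
    df = Counter()  # number of documents containing each bucket
    for tokens in documents:
        buckets = [bucket(t, num_features) for t in tokens]
        tf.update(buckets)
        df.update(set(buckets))
    n_docs = len(documents)
    return {b: tf[b] * math.log((n_docs + 1) / (df[b] + 1)) for b in tf}

docs = [["spark", "hmm", "spark"], ["spark", "tfidf"], ["hashing", "tfidf"]]
weights = corpus_tf_idf(docs)
print(weights[bucket("spark", 1 << 18)])
```

Both counters are plain sums, so on an RDD the same computation could plausibly be expressed as two count-by-key passes over (bucket, 1) pairs, one over all tokens and one over the distinct tokens of each document.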
This talk and its slides can help and there are many others online.
I'm trying to develop a program in Python that can process raw chat data and cluster sentences with similar intents so they can be used as training examples to build a new chatbot. The goal is to make it as quick and automatic (i.e. no parameters to enter manually) as possible.
1- For feature extraction, I tokenize each sentence, stem its words and vectorize it using Sklearn's TfidfVectorizer.
2- Then I perform clustering on those sentence vectors with Sklearn's DBSCAN. I chose this clustering algorithm because it doesn't require the user to specify the desired number of clusters (like the k parameter in k-means). It throws away a lot of sentences (considering them as outliers), but at least its clusters are homogeneous.
The overall algorithm works on relatively small datasets (10000 sentences) and generates meaningful clusters, but there are a few issues:
On large datasets (e.g. 800000 sentences), DBSCAN fails because it requires too much memory, even with parallel processing on a powerful machine in the cloud. I need a less computationally-expensive method, but I can't find another algorithm that doesn't make weird and heterogeneous sentence clusters. What other options are there? What algorithm can handle large amounts of high-dimensional data?
The clusters that are generated by DBSCAN are sentences that have similar wording (due to my feature extraction method), but the targeted words don't always represent intents. How can I improve my feature extraction so it better captures the intent of a sentence? I tried Doc2vec but it didn't seem to work well with small datasets made of documents that are the size of a sentence...
A standard implementation of DBSCAN is supposed to need only O(n) memory. You cannot get lower than this memory requirement. But I read somewhere that sklearn's DBSCAN actually uses O(n²) memory, so it is not the optimal implementation. You may need to implement this yourself then, to use less memory.
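As a rough sketch of what a DBSCAN that keeps only O(n) extra memory could look like, the following plain-Python version recomputes each point's neighborhood on demand instead of storing pairwise distances. It is not sklearn's implementation; the function name `dbscan`, its signature, and the toy points are assumptions made for illustration.

```python
import math

def dbscan(points, eps, min_pts, dist=math.dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Recomputed on every call: O(n) time each, but nothing is cached.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        queued = set(queue)  # prevents duplicates, so the queue stays O(n)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand from it
                for k in j_neighbors:
                    if k not in queued and (labels[k] is UNVISITED or labels[k] == NOISE):
                        queued.add(k)
                        queue.append(k)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The trade-off is time: without an index, every neighborhood query scans all points, so the run time is quadratic in the number of points. For hundreds of thousands of sentences, a spatial or approximate nearest-neighbor index behind `neighbors` would likely be needed.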
Don't expect these methods to be able to cluster "by intent". There is no way an unsupervised algorithm can infer what is intended. Most likely, the clusters will just be based on a few key words, for example whether people say "hi" or "hello". From an unsupervised point of view, this distinction gives two nice clusters (and some noise, maybe also another cluster "hola").
I suggest training a supervised feature extractor on a subset in which you label the intent.
I'm trying to implement a Convolutional Neural Network algorithm on Spark and I wanted to ask two questions before moving forward.
I need to implement my code so that it is highly integrated with Spark and follows the principles of Spark's machine learning algorithms. I found that Spark ML is an established base for machine learning code, with a specific foundation that all of its algorithms follow. Also, the implemented algorithms offload their heavy mathematical operations to third-party libraries such as BLAS to do calculations fast.
Now I wanted to ask:
1) Is ML the right place to start? If I follow the ML structure, will my code integrate well with the rest of the Spark ML ecosystem?
2) Am I right that, at the bottom layer, the ML code offloads its processing to another mathematical library? Does that mean I can change that layer to do the heavy processing in a customized fashion?
Would appreciate any suggestions.