I can't seem to find a way to visualize my RF model, obtained using Spark MLlib's RandomForestModel. The model, printed as a string, is just a bunch of nested IF statements; it seems natural to want to visualize it the way one can in R. I am using the Spark Python API and the Java API, and I am open to anything that will produce an R-like visualization of my RF model.
There is a library out there to help with this, EurekaTrees. Basically it just takes the debug string, builds a tree, and then displays it as a webpage using d3.js.
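For context, the input such tools work from is just the model's debug string. A minimal PySpark sketch of producing and saving that string (the data path and training parameters here are hypothetical):

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="rf-debug-string")

# Hypothetical training data in LIBSVM format.
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=10, featureSubsetStrategy="auto",
    impurity="gini", maxDepth=5, maxBins=32)

# The nested-IF text representation of the forest; this is the string
# that visualization tools such as EurekaTrees parse.
with open("rf_debug_string.txt", "w") as f:
    f.write(model.toDebugString())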
From Databricks (Oct 2015):
"The plots listed above as Scala-only will soon be available in Python notebooks as well. There are also other machine learning model visualizations on the way. Stay tuned for Decision Tree and Machine Learning Pipeline visualizations!"
Related
I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark data frames; conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert back to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries which cannot be used for distributed processing. Therefore DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
There are other libraries, like SystemML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.
Excerpt from the Apache Spark Streaming MLlib link:
" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.
"
Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be that the streaming algorithms keep evolving over time, while the offline-model-on-online-stream approach would only use the insights it learned earlier, with no possibility of online learning?
Am I getting this right? Please let me know if my understanding of both the points mentioned above is correct.
Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program?
Yes, you can train a model like Random Forest in batch mode and store it for predictions later. If you want to integrate this with a streaming application where values arrive continuously for prediction, you just need to load the model (which essentially holds the learned parameters) into memory once and keep predicting on each incoming record.
Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
Essentially, yes. Note that predictOnValues() itself belongs to the streaming algorithms (e.g. StreamingLinearRegressionWithSGD); with a batch-trained model such as a Random Forest you call its predict() on the records of each micro-batch, which amounts to the same thing.
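A minimal PySpark sketch of that pattern, assuming a Random Forest model saved earlier and feature records arriving as comma-separated lines on a socket (the model path, host and port are hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import RandomForestModel

sc = SparkContext(appName="offline-model-on-stream")
ssc = StreamingContext(sc, 5)

# Load the model trained offline and saved earlier (hypothetical path).
model = RandomForestModel.load(sc, "hdfs:///models/my_rf_model")

lines = ssc.socketTextStream("localhost", 9999)
features = lines.map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))

# PySpark mllib models call into the JVM, so predict on the whole RDD inside
# transform (which runs on the driver once per batch) rather than inside map.
predictions = features.transform(lambda rdd: model.predict(rdd))
predictions.pprint()

ssc.start()
ssc.awaitTermination()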
In this case, would the main difference between the existing Spark Streaming machine learning algorithms and this approach be that the streaming algorithms keep evolving over time, while the offline-model-on-online-stream approach would only use the insights it learned earlier, with no possibility of online learning?
Training a model does nothing more than updating the weight vector for the features; you still have to choose alpha (the learning rate) and lambda (the regularisation parameter). So when you use StreamingLinearRegression (or another streaming equivalent) you will have two DStreams, one for training and one for prediction, for obvious purposes.
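For contrast, a minimal sketch of the streaming variant, where one DStream keeps updating the weights and another is used for prediction (the socket sources, ports, three-feature initial weight vector and line format are all hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="streaming-lr")
ssc = StreamingContext(sc, 5)

def parse(line):
    # Expected line format: "label,f1 f2 f3" (hypothetical).
    label, features = line.split(",")
    return LabeledPoint(float(label), Vectors.dense([float(x) for x in features.split()]))

train_stream = ssc.socketTextStream("localhost", 9998).map(parse)
test_stream = ssc.socketTextStream("localhost", 9997).map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights(Vectors.dense([0.0, 0.0, 0.0]))

model.trainOn(train_stream)  # the weight vector keeps evolving with each batch
model.predictOnValues(test_stream.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()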
I have a predictive model (Logistic Regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it from multiple clients, with each client passing in a single data record. It seems that using a Spark job to run single records through would have way too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API to load a saved model requires the SparkContext, though, so I am looking for suggestions on how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea or alternative suggestions?
Non-distributed models (in practice the majority) from o.a.s.mllib don't require an active SparkContext for single-item predictions. If you check the API docs you'll see that LogisticRegressionModel provides a predict method with the signature Vector => Double. This means you can serialize the model using standard Java tools, read it back later, and perform predictions on a local o.a.s.mllib.Vector object.
Spark also provides limited PMML support (not for logistic regression), so you can share your models with any other library which supports this format.
Finally, non-distributed models are usually not that complex. For linear models all you need is the intercept, the coefficients, some basic math functions, and a linear algebra library (if you want decent performance).
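To make that concrete, a minimal Python sketch of the linear-model case, assuming you have exported the coefficients and intercept from the trained model (the numeric values below are made up):

import math

def predict_logistic(features, weights, intercept, threshold=0.5):
    # Raw margin = w . x + b, then the logistic function, then the 0.5 threshold
    # that mllib's binary LogisticRegressionModel uses by default.
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1.0 if probability > threshold else 0.0

# Exported once from the trained model, e.g. weights = list(model.weights),
# intercept = model.intercept; these particular numbers are hypothetical.
weights = [0.8, -1.2, 0.05]
intercept = 0.3

print(predict_logistic([1.0, 0.5, 2.0], weights, intercept))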
o.a.s.ml models are slightly harder to handle, but there are some external tools which try to address that. You can check the related discussion on the developers list (Deploying ML Pipeline Model) for details.
For distributed models there is really no good workaround. You'll have to start a full job on a distributed dataset one way or another.
I am using Apache Spark MLlib 1.4.1 (PySpark, the Python API for Spark) to generate a decision tree based on LabeledPoint data I have. The tree generates correctly and I can print it to the terminal (extract the rules, as this user calls it in "How to extract rules from decision tree spark MLlib") using:
model = DecisionTree.trainClassifier( ... )
print(model.toDebugString())
But what I want to do is visualize or plot the decision tree rather than printing it to the terminal. Is there any way I can plot the decision tree in PySpark or maybe I can save the decision tree data and use R to plot it? Thanks!
There is this project, Decision-Tree-Visualization-Spark, for visualizing decision tree models.
It has two steps:
Parse Spark Decision Tree output to a JSON format.
Use the JSON file as an input to a D3.js visualization.
For the parser, check Dt.py.
The input to the function def tree_json(tree) is your model's toDebugString() output.
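A usage sketch under those assumptions (the import path of Dt.py, the data path and the training parameters are hypothetical; adapt them to your project layout):

from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils
from Dt import tree_json  # Dt.py from the Decision-Tree-Visualization-Spark repo

sc = SparkContext(appName="dt-to-json")
training_data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  # hypothetical path

model = DecisionTree.trainClassifier(training_data, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity="gini", maxDepth=5, maxBins=32)

# tree_json() takes the model's debug string and produces the JSON
# that the project's D3.js page consumes.
tree_json(model.toDebugString())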
From an answer to a related question:
We just released dtreeviz version 1.1, with support for decision trees from Spark. You can visualize a lot of things, like the whole tree, just the prediction path, or leaf information such as the number of samples or the criterion.
You can check many example visualizations in this notebook.
Though this is a little old post, I am adding my answer so that others who come across it later may benefit.
Alternatively, you can use the "graphviz" Python package from PySpark. It will render the decision tree model as a neat tree structure rather than the usual nested-if text.
More details can be found at this link: https://pypi.python.org/pypi/graphviz
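A minimal sketch of the idea with the graphviz package; the node labels below are hypothetical stand-ins for the splits you would parse out of model.toDebugString():

from graphviz import Digraph

# Hypothetical nodes and edges standing in for the splits parsed out of
# model.toDebugString(); in practice you would build these programmatically.
dot = Digraph(comment="Spark MLlib decision tree")
dot.node("0", "feature 2 <= 0.5")
dot.node("1", "predict: 0.0")
dot.node("2", "feature 5 <= 1.3")
dot.node("3", "predict: 1.0")
dot.node("4", "predict: 0.0")
dot.edge("0", "1", label="true")
dot.edge("0", "2", label="false")
dot.edge("2", "3", label="true")
dot.edge("2", "4", label="false")

dot.format = "png"
dot.render("decision_tree", cleanup=True)  # writes decision_tree.png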
I would like to classify a bunch of documents using Apache Mahout with a naive Bayes classifier. I do all the pre-processing, convert my training data set into feature vectors, and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.
However, I'm under the impression that the pre-processing must be done on my to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios where I don't have the to-be-classified instances at the time I'm building my model?
How about Apache Spark? How do things work there? Can I build a classification model and then use it to classify unseen instances later?
As of Mahout 0.10.0, Mahout provides a Spark-backed Naive Bayes implementation which can be run from the CLI, the Mahout shell, or embedded into an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
which explains how to tokenize (using trivial native Java String methods), vectorize, and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be used from the Mahout-Samsara Environment's spark-shell; however, the basic idea can be adapted and embedded into an application.