PySpark Gradient-Boosted Trees (with predict_proba) Integration - apache-spark

I want to build a GBT-based binary-classification model in PySpark that can produce prediction probabilities (mandatory), preferably using a state-of-the-art GBT variant like XGBoost.
All I can find are unmaintained, unofficial, and unstable packages, which I find really hard to install and operate.
Can you please help me find a solution?
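If Spark's built-in GBT is acceptable (not XGBoost, but no extra packages to install), a minimal sketch follows, assuming Spark 2.3+, where pyspark.ml's GBTClassifier outputs per-class probabilities (an analogue of predict_proba); the DataFrame `df`, the feature column names, and the `label` column are illustrative assumptions.

```python
# A minimal sketch, assuming Spark 2.3+ where pyspark.ml's GBTClassifier
# outputs a `probability` column (analogue of predict_proba).
# `df`, the feature column names, and `label` are illustrative assumptions.
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
model = gbt.fit(train)

# Each row gets a vector of class probabilities in the `probability` column
model.transform(train).select("label", "probability", "prediction").show(5)
```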

Related

PyMC3/Edward/Pyro on Spark?

Has anyone tried using a python probabilistic programming library with Spark? Or does anyone have a good idea of what it would take?
I have a feeling Edward would be simplest because there are already tools connecting TensorFlow and Spark, but I'm still hazy about what low-level code changes would be required.
I know distributed MCMC is still an area of active research (see MC-Stan on Spark?), so is this even reasonable to implement? Thanks!
You can use the TensorFlow connectors with Edward since it is built on TensorFlow. One of the main drawbacks of MCMC is that it is very computationally intensive; you may want to try variational inference for your Bayesian models, which approximates the target distribution (this also applies to Pyro and PyMC3, I believe). You can also work with distributed TensorFlow.
I also recommend trying a library called Dask (https://dask.pydata.org/en/latest/): you can scale your model from your workstation to a cluster, and it also has TensorFlow connectors.
Hope this helps
I've seen people run Pyro+PyTorch in PySpark, but the use case was CPU-only and did not involve distributed training.
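To illustrate the variational-inference suggestion above, here is a minimal single-machine Pyro sketch (nothing here is Spark-distributed); the toy coin-flip model and data are assumptions for illustration only.

```python
# Minimal single-machine variational inference with Pyro (illustrative toy
# model; not Spark-distributed).
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

data = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

def model(obs):
    # Beta prior over a coin's bias, Bernoulli likelihood over the observations
    p = pyro.sample("p", dist.Beta(1.0, 1.0))
    with pyro.plate("data", len(obs)):
        pyro.sample("x", dist.Bernoulli(p), obs=obs)

guide = AutoNormal(model)                      # variational family
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

for step in range(1000):                       # each step tightens the ELBO
    svi.step(data)
```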

PySpark with scikit-learn

I have seen that we can use scikit-learn with PySpark to work on a single partition on a single worker.
But what if we want to work on a training dataset that is distributed, and the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it cannot run the algorithm on the entire dataset, only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does address your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to work on a training dataset that is distributed, and the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it cannot run the algorithm on the entire dataset, only on that particular partition.
In spark-sklearn, Spark is used as a replacement for the joblib library as a multithreading framework. So going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the "Auto-scaling scikit-learn with Spark" article:
no change is required in the code between the single-machine case and the cluster case.
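For concreteness, a sketch of the spark-sklearn drop-in grid search follows (it assumes spark-sklearn is installed and `sc` is an active SparkContext). Note that it parallelizes fitting many candidate models across the cluster; each individual model still trains on data that fits on one worker.

```python
# A sketch of spark-sklearn's distributed grid search; assumes the package is
# installed and `sc` is an active SparkContext. The API mirrors scikit-learn's.
from sklearn import svm, datasets
from spark_sklearn import GridSearchCV

iris = datasets.load_iris()
param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# Same call shape as sklearn's GridSearchCV, with the SparkContext prepended;
# each (params, fold) fit runs as a Spark task instead of a joblib job.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```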

Finding variables that contribute the most to a decision tree prediction in H2O

How can I find the variables that contribute the most to a particular prediction from a decision tree? For example, if there are features A, B, C, D, E and we build a decision tree on the dataset, then for a sample x, let's say variables C and D contribute the most to prediction(x). How do I find the variables that contributed the most to prediction(x) in H2O? I know H2O gives the global importance of variables once the decision tree is built. My question applies to the case where we use that particular tree to make a decision and want to find the variables that contributed to that specific decision. scikit-learn has functions to extract the rules that were used to predict a sample. Does H2O have any such functionality?
H2O has no support for doing this currently (as of Feb 2017, h2o 3.10.3.x); Erin opened a JIRA for it: https://0xdata.atlassian.net/browse/PUBDEV-4007
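For readers on newer H2O versions: more recent h2o-3 releases added per-row (SHAP) contribution output for tree-based models (GBM, DRF, XGBoost), which addresses this use case. A hedged sketch follows; the file path, column names, and estimator choice are illustrative assumptions.

```python
# A sketch assuming a recent h2o-3 release where tree-based models expose
# per-row SHAP contributions; the path and column names are illustrative.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
frame = h2o.import_file("path/to/data.csv")
train, test = frame.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=["A", "B", "C", "D", "E"], y="target", training_frame=train)

# One contribution column per feature plus a BiasTerm, one row per prediction;
# the largest-magnitude entries are the features that drove that prediction.
contributions = gbm.predict_contributions(test)
contributions.head()
```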

Can I extract significance values for Logistic Regression coefficients in pyspark

Is there a way to get the significance level of each coefficient we receive after we fit a logistic regression model on training data?
I was trying to find a way and could not figure it out myself.
I think I might get the significance level of each feature if I run a chi-squared test, but first, I'm not sure whether I can run the test on all features together, and second, my data is numeric, so whether it would give the right result remains a question as well.
Right now I am running the modeling part using statsmodels and scikit-learn, but I would certainly like to know how I can get these results from PySpark ML or MLlib itself.
If anyone can shed some light, it would be helpful.
I only use MLlib. I think that when you train a model, you can use the toPMML method to export it in PMML format (an XML file); you can then parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that helps.
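A different route than PMML export, if p-values are the goal: pyspark.ml's GeneralizedLinearRegression with a binomial family reports coefficient standard errors and p-values in its training summary. A sketch follows, assuming Spark 2.x and a DataFrame with `features`/`label` columns (those names are assumptions).

```python
# A sketch (Spark 2.x): fit logistic regression as a binomial GLM so the
# training summary exposes per-coefficient p-values. `train` is assumed to be
# a DataFrame with a `features` vector column and a 0/1 `label` column.
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="label", featuresCol="features")
model = glr.fit(train)

summary = model.summary
print(summary.coefficientStandardErrors)
print(summary.pValues)   # the last entry corresponds to the intercept
```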

Is there a way to visualize a Spark MLlib Random Forest model?

I can't seem to find a way to visualize my RF model, obtained using Spark MLlib's RandomForestModel. The model, printed as a string, is just a bunch of nested IF statements; it seems natural to want to visualize it the way you can in R. I am using the Spark Python API and Java API, and I'm open to using anything that will produce an R-like visualization of my RF model.
There is a library out there to help with this, EurekaTrees. Basically, it just takes the debug string, builds a tree, and then displays it as a webpage using d3.js. A minimal sketch of producing that debug string is shown below.
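The sketch uses the pyspark.mllib API and assumes an active SparkContext `sc` plus the sample libsvm file that ships with Spark; the hyperparameters are placeholders.

```python
# A minimal sketch (pyspark.mllib API): train a small forest and dump the
# nested-IF debug string that tools like EurekaTrees parse into a d3.js tree.
# Assumes an active SparkContext `sc` and the sample data file shipped with Spark.
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=3, featureSubsetStrategy="auto", impurity="gini", maxDepth=4)

print(model.toDebugString())   # write this string to a file for the visualizer
```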
From Databricks (Oct 2015):
"The plots listed above as Scala-only will soon be available in Python notebooks as well. There are also other machine learning model visualizations on the way. Stay tuned for Decision Tree and Machine Learning Pipeline visualizations!"
