I am working on a multiclass problem with six different classes and I am using OneVsRestClassifier.
I then performed hyperparameter tuning with GridSearchCV and obtained the optimized classifier via clf.best_estimator_.
As far as I understand, this returns a single set of hyperparameters for the aggregated model, i.e. the same hyperparameters for every base estimator.
Is there a way to perform hyperparameter tuning separately for each base estimator?
Sure, just reverse the order of the search and the multiclass wrapper:
one_class_clf = GridSearchCV(base_classifier, params, ...)
clf = OneVsRestClassifier(one_class_clf)
Fitting clf generates the one-vs-rest problems, and for each of those fits a copy of the grid-searched base_classifier.
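A runnable sketch of this, with SVC and the small C grid as placeholder choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=6, n_classes=6, random_state=0)

# Each one-vs-rest binary problem gets its own copy of the grid search,
# so every base estimator is tuned independently.
one_class_clf = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
clf = OneVsRestClassifier(one_class_clf)
clf.fit(X, y)

# One set of tuned hyperparameters per class:
for est in clf.estimators_:
    print(est.best_params_)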
Is it possible to split a dataset on a categorical feature, and train a different model for each value of the splitting feature using make_pipeline (or Pipeline)?
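To make the question concrete, here is a hand-rolled sketch of the behavior I'm after (toy data, with "color" as the splitting feature); I would like something like this loop to live inside a single Pipeline:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "color": ["red", "blue"] * 20,
    "x": range(40),
    "y": [0, 1, 1, 0] * 10,
})

# One independently fitted pipeline per value of the splitting feature.
models = {
    value: make_pipeline(StandardScaler(), LogisticRegression()).fit(group[["x"]], group["y"])
    for value, group in df.groupby("color")
}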
I would like to use the model I built with an sklearn pipeline in PySpark. The pipeline takes care of imputation, scaling, one-hot encoding, and random forest classification. I tried broadcasting the model and using a pandas UDF to predict, but it did not work: I got a Py4JJavaError.
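For reference, the failing attempt was along these lines (toy data, and a bare RandomForestClassifier standing in for the full pipeline):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from sklearn.ensemble import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Fit a stand-in sklearn model on the driver.
pdf = pd.DataFrame({"f1": [0.0, 1.0, 2.0, 3.0], "f2": [1.0, 0.0, 1.0, 0.0], "label": [0, 1, 0, 1]})
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(pdf[["f1", "f2"]], pdf["label"])

# Broadcast the fitted model to the executors.
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict_udf(f1: pd.Series, f2: pd.Series) -> pd.Series:
    X = pd.DataFrame({"f1": f1, "f2": f2})
    return pd.Series(bc_model.value.predict(X).astype(float))

sdf = spark.createDataFrame(pdf[["f1", "f2"]])
sdf.withColumn("prediction", predict_udf("f1", "f2")).show()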
Is it possible to train an XGBoost model in Python and use the saved model to predict in a Spark environment? That is, I want to train the XGBoost model using sklearn, save it, load the saved model in Spark, and predict in Spark. Is this possible?
edit:
Thanks all for the answers, but my question is really this: I see the issues below when I train and predict with different bindings of XGBoost.
During training I would be using XGBoost in Python, and when predicting I would be using XGBoost in MLlib.
I have to load the model saved from XGBoost in Python (e.g. an XGBoost.model file) to be used for prediction in Spark. Would this model be compatible with the predict function in MLlib?
The data input formats of XGBoost in Python and XGBoost in Spark MLlib are different: Spark takes a vector-assembled format, but with Python we can feed the DataFrame as is. So how do I feed the data when I am trying to predict in Spark with a model trained in Python? Can I feed the data without a VectorAssembler? Would the XGBoost predict function in Spark MLlib take non-vector-assembled data as input?
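For concreteness, the Python training side I have in mind is just this (toy data, file name matching the example above):

import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Train with the sklearn API, then save the underlying booster to the
# file I would like to load from Spark.
model = XGBClassifier(n_estimators=20)
model.fit(X, y)
model.get_booster().save_model("XGBoost.model")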
You can run your Python script on Spark using the spark-submit command; that way your Python code is executed on Spark, and you can produce the predictions there.
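For example, a minimal predict script you would launch with spark-submit predict_job.py (paths and column handling are placeholders):

# predict_job.py
import xgboost as xgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the model that was trained and saved in Python.
booster = xgb.Booster()
booster.load_model("XGBoost.model")

# For simplicity this scores on the driver; for parallel scoring you
# would broadcast the booster to the executors instead.
pdf = spark.read.parquet("test_data.parquet").toPandas()
print(booster.predict(xgb.DMatrix(pdf))[:10])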
You can:
1. load/munge the data using PySpark SQL,
2. bring the data to the local driver using collect()/toPandas() (a performance bottleneck),
3. train XGBoost on the local driver,
4. prepare the test data as an RDD,
5. broadcast the XGBoost model to each RDD partition, then predict the data in parallel (sketched below).
This can all be in one script that you spark-submit, but to keep things concise I recommend splitting train and predict into two scripts.
Because steps 2 and 3 happen at the driver level and do not use any cluster resources, your workers are not doing anything.
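Put together, a minimal sketch of that flow (table and column names are placeholders):

import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Steps 1-2: load/munge with Spark SQL, then pull to the driver.
train_pdf = spark.sql("SELECT f1, f2, label FROM train_table").toPandas()

# Step 3: train on the driver.
booster = xgb.train(
    {"objective": "binary:logistic"},
    xgb.DMatrix(train_pdf[["f1", "f2"]], label=train_pdf["label"]),
    num_boost_round=20,
)

# Steps 4-5: broadcast the model and predict each partition in parallel.
bc = spark.sparkContext.broadcast(booster)

def predict_partition(rows):
    pdf = pd.DataFrame(list(rows), columns=["f1", "f2"])
    if pdf.empty:
        return iter([])
    return iter(bc.value.predict(xgb.DMatrix(pdf)))

test_rdd = spark.sql("SELECT f1, f2 FROM test_table").rdd.map(tuple)
preds = test_rdd.mapPartitions(predict_partition).collect()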
Here is a similar implementation of what you are looking for. I have an SO post explaining the details, since I am still troubleshooting the errors described in that post to get the code in the notebook working:
XGBoost Spark One Model Per Worker Integration
The idea is to train the models using XGBoost, have Spark orchestrate each model onto a Spark worker, and then apply predictions via XGBoost's predict_proba() or Spark ML's predict().
In sklearn there is the make_scorer function that allows you to define custom scoring functions to be plugged into GridSearchCV. How can I do the same thing in Spark ML? For example, the MulticlassClassificationEvaluator in Spark has no provision for passing in a custom scoring function.
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html
Ideally, I would like the below interface. Has anyone done this before?
evaluator = MulticlassClassificationEvaluator(metricName=CustomScoringFunction)
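For reference, the sklearn pattern I am trying to replicate looks like this (the metric itself is just a stand-in):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Any function of (y_true, y_pred) can be wrapped into a scorer.
def custom_score(y_true, y_pred):
    return np.mean(y_true == y_pred)

scorer = make_scorer(custom_score, greater_is_better=True)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, scoring=scorer)
search.fit(np.random.rand(60, 3), np.random.randint(0, 3, size=60))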