Get feature importance with PySpark and XGBoost - apache-spark

I have trained a model using XGBoost and PySpark:
params = {
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10,
    'maxLeaves': 256,
    'growPolicy': 'depthwise',
    'objective': 'binary:logistic',
    'minChildWeight': 30.0,
    'lambda_': 1.0,
    'scalePosWeight': 2.0,
    'subsample': 1.0,
    'nthread': 1,
    'numRound': 100,
    'numWorkers': 1,
}
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)
model = classifier.fit(train_data)
When I try to get the feature importance using
model.nativeBooster.getFeatureScore()
It returns the following error:
Py4JError: An error occurred while calling o2167.getFeatureScore. Trace:
py4j.Py4JException: Method getFeatureScore([]) does not exist
Is there a correct way of getting feature importance when using XGBoost with PySpark?

I am a newbie in this field, but I ran into the same issue. You may want to try model.nativeBooster.getScore("", "gain") or model.nativeBooster.getFeatureScore('').
My 'model' is of type "sparkxgb.xgboost.XGBoostClassificationModel".
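For completeness, a minimal sketch of inspecting the result of that call. Depending on the wrapper version, the returned object may be a JVM map keyed either by the original column names or by positional names such as f0, f1, ..., so the conversion to a plain Python dict below is an assumption:
# Hedged sketch: gain-based importance from the native booster of the fitted model.
scores = model.nativeBooster.getScore("", "gain")
print(scores)                 # may print a JVM map; inspect the keys first
importance = dict(scores)     # assumption: py4j can convert the map to a Python dict
for name, gain in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(name, gain)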
Regards

Related

Get feature importance for the model trained using RandomizedSearchCV and Multinomial Naive Bayes

After fitting the model, I cannot get the feature importance. I have done the following steps:
model_bow = RandomizedSearchCV(MultinomialNB(class_prior=[0.5, 0.5]),
                               param_distributions={'alpha': alpha},
                               scoring='roc_auc', n_iter=10,
                               return_train_score=True)
model_bow.fit(X_train_en_bow, y_train)
model_bow.best_params_ gives me the best value: {'alpha': 0.5}.
Now if I try to get the feature importance using model_bow.estimator.feature_log_prob_, it gives the error below.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-40-cc27acc11117> in <module>
----> 1 model_bow.estimator.feature_log_prob_()
AttributeError: 'MultinomialNB' object has no attribute 'feature_log_prob_'
When I print model_bow it shows:
param_distributions={'alpha': [1e-05, 0.0001, 0.001, 0.01,
0.1, 0.5, 1, 5, 10, 50,
100]},
return_train_score=True, scoring='roc_auc')
Please advise where I am going wrong!
To get feature importance, use the best fitted model, which is available as model_bow.best_estimator_ (model_bow.best_params_ only holds the winning hyperparameters).
Then you can access coef_ (which will be deprecated soon) or feature_log_prob_ as attributes, not methods, to get the feature importance.
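A minimal sketch (assuming refit=True, the RandomizedSearchCV default, so that a refitted estimator is available):
# best_estimator_ is the MultinomialNB refitted on the whole training set
best_nb = model_bow.best_estimator_
log_probs = best_nb.feature_log_prob_     # attribute access, no parentheses
print(log_probs.shape)                    # (n_classes, n_features)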

I am using the SVR() function for regression. I am unable to optimize its parameters using #pso with #Pyswarm

Optimizing parameters of #SVR() using #pyswarm #PSO function.
I have a dataset of 200 samples with 9 features each, and I have to predict one output parameter. I already did it by calling the #SVR() function with its default parameters, but the results are not satisfactory. Now I want to optimize its parameters using the PSO algorithm, but I am unable to do it.
model = SVR()model.fit(Xtrain,ytrain)
pred_y = model.predict(Xtest)
param = {'kernel' : ('linear', 'poly', 'rbf', 'sigmoid'),'C':[1,5,10],'degree' : [3,8],'coef0' : [0.01,10,0.5],'gamma' : ('auto','scale')}
import pyswarms as ps
optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2,options=param)
best_cost, best_pos = optimizer.optimize(model, iters=100)
2019-08-13 12:19:48,551 - pyswarms.single.global_best - INFO - Optimize for 100 iters with {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10], 'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
pyswarms.single.global_best: 0%| |0/100
TypeError: 'SVR' object is not callable
There is an error in the first two lines: two lines of code got merged there by mistake. They should be:
1. model = SVR()
2. model.fit(Xtrain,ytrain)
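Beyond that, note that pyswarms' optimize() expects a cost function rather than a fitted estimator: the callable receives an array of candidate positions of shape (n_particles, dimensions) and must return one cost per particle, which is why passing model raises "'SVR' object is not callable". A rough sketch of that pattern, tuning only C and gamma as two continuous dimensions (the positive-value handling, the PSO coefficients and the 3-fold CV are illustrative assumptions; Xtrain and ytrain are the arrays from the question):
import numpy as np
import pyswarms as ps
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def svr_cost(positions):
    # positions has shape (n_particles, 2); column 0 -> C, column 1 -> gamma (assumed layout)
    costs = []
    for c, g in positions:
        est = SVR(C=abs(c) + 1e-6, gamma=abs(g) + 1e-6)   # keep parameters positive
        mse = -cross_val_score(est, Xtrain, ytrain, cv=3,
                               scoring='neg_mean_squared_error').mean()
        costs.append(mse)
    return np.array(costs)                                # one cost per particle

pso_options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}            # illustrative PSO coefficients
optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2, options=pso_options)
best_cost, best_pos = optimizer.optimize(svr_cost, iters=100)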

Exception during xgboost prediction: can not initialize DMatrix from DMatrix

I trained an XGBoost model in Python using the scikit-learn API and serialized it with the pickle library. I uploaded the model to ML Engine, but when I try to do online predictions, I get the following exception:
Prediction failed: Exception during xgboost prediction: can not initialize DMatrix from DMatrix
An example of the JSON I'm using for prediction is the following:
{
"instances":[
[
24.90625,
21.6435643564356,
20.3762376237624,
24.3679245283019,
30.2075471698113,
28.0947368421053,
16.7797359774725,
14.9262079299572,
17.9888028979966,
15.3333284503293,
19.6535308744024,
17.1501961307627,
0.0,
0.0,
0.0,
0.0,
0.0,
509.0,
497.0,
439.0,
427.0,
407.0,
1.0,
1.0,
1.0,
1.0,
1.0,
2.0,
23.0,
10.0,
58.0,
11.0,
20.0,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.9423076923077,
26.3082269243683,
23.6212606363851,
22.6752334301282,
27.4343583104833,
34.0090408101173,
11.1991944104063,
7.33420726455092,
8.15160392948917,
11.4119236389594,
17.9429092915607,
18.0573102225845,
32.8902876598084,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.0028328611898017,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0531491870801522
]
]
}
I use the following code to train my model:
def _train_model(X, y):
    clf = xgb.XGBClassifier(max_depth=6,
                            learning_rate=0.01,
                            n_estimators=100,
                            n_jobs=-1)
    clf.fit(X, y)
    return clf
Where X and y are both numpy.ndarray:
Type of X: <class 'numpy.ndarray'> Type of y: <class 'numpy.ndarray'>
Also I'm using xgboost 0.72.1, Python 3.5 and ML runtime 1.9.
Does anyone know what the source of the problem might be?
Thanks!
It seems like the issue is due to the pickling. I was able to reproduce it and am working on a fix, but in the meantime could you try exporting your classifier like below instead?
clf._Booster.save_model('./model.bst')
That should unblock you for now. If it doesn't, feel free to reach out to cloudml-feedback#google.com.
I also faced a similar problem (a feature mismatch) when I tried to score test data using a trained XGBoost model that had been dumped in .pkl format.
However, after saving the model in .bst format, I was able to score the same data without any issues. It looks like there is a difference between the .pkl and .bst serialization paths when it comes to XGBoost.
Going a little further, and answering kuza's question above on loading the saved model:
save model:
clf._Booster.save_model('./model.bst')
loading the saved model:
model = xgboost.Booster({'nthread': 4}) # initialize before loading model
model.load_model('./model.bst') # load model
This cleared up two issues that I had with using pickle on the model. Issue 1 was a weird exception: ValueError: feature_names mismatch:
Also check whether you are using predict_proba on the loaded model and getting a weird exception. The fix for that was simply to use the plain predict function instead of predict_proba.
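To make that concrete, a short sketch of the load-and-predict path (X_test is an assumed numpy array shaped like the instances above, and the 0.5 threshold is only illustrative):
import numpy as np
import xgboost as xgb

booster = xgb.Booster({'nthread': 4})
booster.load_model('./model.bst')

# The raw Booster has no predict_proba(); for a binary:logistic model,
# predict() already returns the probability of the positive class.
dtest = xgb.DMatrix(X_test)
proba = booster.predict(dtest)
labels = (proba > 0.5).astype(int)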

Get survival function with Spark ML

I am training an Accelerated Failure Time (AFT) model with PySpark (from pyspark.ml.regression import AFTSurvivalRegression).
Now I want to apply the model to new data and get the probability that the event will happen before time t (the survival function). Which method should I use? The documentation is not clear to me: https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.AFTSurvivalRegression
For example, if I do the following:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.25, 0.75]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show(truncate=False)
The output shows, for the first line, quantiles of 0.832 and 9.48 (full table omitted here). Does it mean that P(event happening between 0.832 and 9.48) = 50%?
Thanks
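For reference, the fitted model also exposes per-row methods that produce the same kind of values as the quantiles column; a minimal sketch reusing the first training row above (this is only an illustration of the API, not a confirmed answer to the survival-function question):
from pyspark.ml.linalg import Vectors

# Quantiles of the predicted survival time at the probabilities
# set via quantileProbabilities ([0.25, 0.75] above).
print(model.predictQuantiles(Vectors.dense(1.560, -0.605)))

# Point prediction of the survival time for the same row.
print(model.predict(Vectors.dense(1.560, -0.605)))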

How to get all parameters of estimator in PySpark

I have a RandomForestRegressor and a GBTRegressor and I'd like to get all of their parameters. The only way I found it could be done is with several get methods like:
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor
est = RandomForestRegressor()
est.getMaxDepth()
est.getSeed()
But RandomForestRegressor and GBTRegressor have different parameters, so it's not a good idea to hardcode all those methods.
A workaround could be something like this:
get_methods = [method for method in dir(est) if method.startswith('get')]
params_est = {}
for method in get_methods:
    try:
        key = method[3:]
        params_est[key] = getattr(est, method)()
    except TypeError:
        pass
Then output will be like this:
params_est
{'CacheNodeIds': False,
'CheckpointInterval': 10,
'FeatureSubsetStrategy': 'auto',
'FeaturesCol': 'features',
'Impurity': 'variance',
'LabelCol': 'label',
'MaxBins': 32,
'MaxDepth': 5,
'MaxMemoryInMB': 256,
'MinInfoGain': 0.0,
'MinInstancesPerNode': 1,
'NumTrees': 20,
'PredictionCol': 'prediction',
'Seed': None,
'SubsamplingRate': 1.0}
But I think there should be a better way to do that.
extractParamMap can be used to get all params from every estimator, for example:
>>> est = RandomForestRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'numTrees': 20, 'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'minInstancesPerNode': 1, 'seed': -5851613654371098793, 'maxDepth': 5, 'featureSubsetStrategy': 'auto', 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'maxBins': 32}
>>> est = GBTRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'stepSize': 0.1, 'minInstancesPerNode': 1, 'seed': -6363326153609583521, 'maxDepth': 5, 'maxIter': 20, 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'lossType': 'squared', 'maxBins': 32}
As described in How to print best model params in pyspark pipeline, you can get any model parameter that is available in the original JVM object of any model using the following structure:
<yourModel>.stages[<yourModelStage>]._java_obj.<getYourParameter>()
All get-parameters are available here
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/RandomForestClassificationModel.html
For example, if you want to get the MaxDepth of your RandomForest after cross-validation (getMaxDepth is not available in PySpark), you use:
cvModel.bestModel.stages[-1]._java_obj.getMaxDepth()
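Combining the two answers, here is a small hedged helper (the function name is made up) that dumps the current parameter values of any PySpark ML estimator or fitted model, since extractParamMap is defined on every Params subclass:
from pyspark.ml.regression import GBTRegressor, RandomForestRegressor

def params_as_dict(params_obj):
    """Return {param name: current value} for any PySpark ML Params object."""
    return {p.name: v for p, v in params_obj.extractParamMap().items()}

print(params_as_dict(RandomForestRegressor()))
print(params_as_dict(GBTRegressor(maxDepth=7)))
# Also works on a fitted/best model, e.g. the last stage of a pipeline:
# print(params_as_dict(cvModel.bestModel.stages[-1]))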
