How to get all parameters of an estimator in PySpark

I have a RandomForestRegressor and a GBTRegressor and I'd like to get all of their parameters. The only way I found is calling the individual get methods, like:
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor
est = RandomForestRegressor()
est.getMaxDepth()
est.getSeed()
But RandomForestRegressor and GBTRegressor have different parameters, so it's not a good idea to hardcode all those methods.
A workaround could be something like this:
get_methods = [method for method in dir(est) if method.startswith('get')]
params_est = {}
for method in get_methods:
    try:
        key = method[3:]
        params_est[key] = getattr(est, method)()
    except TypeError:
        pass
Then the output looks like this:
params_est
{'CacheNodeIds': False,
'CheckpointInterval': 10,
'FeatureSubsetStrategy': 'auto',
'FeaturesCol': 'features',
'Impurity': 'variance',
'LabelCol': 'label',
'MaxBins': 32,
'MaxDepth': 5,
'MaxMemoryInMB': 256,
'MinInfoGain': 0.0,
'MinInstancesPerNode': 1,
'NumTrees': 20,
'PredictionCol': 'prediction',
'Seed': None,
'SubsamplingRate': 1.0}
But I think there should be a better way to do that.

extractParamMap can be used to get all params from any estimator, for example:
>>> est = RandomForestRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'numTrees': 20, 'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'minInstancesPerNode': 1, 'seed': -5851613654371098793, 'maxDepth': 5, 'featureSubsetStrategy': 'auto', 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'maxBins': 32}
>>> est = GBTRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'stepSize': 0.1, 'minInstancesPerNode': 1, 'seed': -6363326153609583521, 'maxDepth': 5, 'maxIter': 20, 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'lossType': 'squared', 'maxBins': 32}
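For reuse, the same one-liner can be wrapped in a small helper that works for any estimator (a minimal sketch; the function name is just illustrative):

def get_params(est):
    """Return a plain dict of {param name: current or default value}."""
    return {p.name: value for p, value in est.extractParamMap().items()}

get_params(GBTRegressor())  # works the same way for RandomForestRegressor, etc.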

As described in "How to print best model params in pyspark pipeline", you can get any model parameter that is available in the original JVM object of any model using the following structure:
<yourModel>.stages[<yourModelStage>]._java_obj.<getYourParameter>()
All get-parameters are available here
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/RandomForestClassificationModel.html
For example, if you want to get MaxDepth of your RandomForest after cross-validation (getMaxDepth is not available in PySpark), you use
cvModel.bestModel.stages[-1]._java_obj.getMaxDepth()
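In context, that pattern fits into a cross-validation workflow roughly like this (a minimal sketch, assuming a Pipeline whose last stage is the RandomForestRegressor and an existing train_df DataFrame):

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor()
pipeline = Pipeline(stages=[rf])  # any feature stages would come before rf
grid = ParamGridBuilder().addGrid(rf.maxDepth, [3, 5, 7]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(),
                    numFolds=3)
cvModel = cv.fit(train_df)  # train_df is assumed to exist
cvModel.bestModel.stages[-1]._java_obj.getMaxDepth()  # maxDepth of the winning model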

Related

Tune hyperparameters of XGBRanker

I'm trying to optimize the hyperparameters of my XGBoost Ranker model, but I can't.
Here is what my table (df in the code) looks like:
query  relevance  features
1      5          5.4.7....
1      3          6........
2      5          3........
2      3          8........
3      2          1........
Then I split my table into train and test sets, with only one query in the test set:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)
train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance
test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance
and build groups, which holds the number of rows per query:
groups = train_data.groupby('query').size().to_frame('size')['size'].to_numpy()
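For the sample table above, groups just counts rows per query in the training data. As an illustration (assuming query 3 ends up as the single held-out test query), it would look like this:

groups
# array([2, 2])  -- queries 1 and 2, two training rows each
# groups.sum() must equal len(train_data); the error below complains exactly about that mismatch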
And then I run my model and try to optimize the hyperparameters with a RandomizedSearchCV:
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import sklearn.metrics
import xgboost as xgb

param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }
scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)
model = xgb.XGBRanker(
    tree_method='hist',
    booster='gbtree',
    objective='rank:ndcg')
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train, group=groups)
Then I get the following error message, which seems to be related to my construction of groups, but I don't see why (knowing that without the random search the model works):
Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
Same problem as here: (Tuning XGBRanker produces error for groups)

Get feature importance with PySpark and XGBoost

I have trained a model using XGBoost and PySpark:
params = {
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10,
    'maxLeaves': 256,
    'growPolicy': 'depthwise',
    'objective': 'binary:logistic',
    'minChildWeight': 30.0,
    'lambda_': 1.0,
    'scalePosWeight': 2.0,
    'subsample': 1.0,
    'nthread': 1,
    'numRound': 100,
    'numWorkers': 1,
}

classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)
model = classifier.fit(train_data)
model = classifier.fit(train_data)
When I try to get the feature importance using
model.nativeBooster.getFeatureScore()
It returns the following error:
Py4JError: An error occurred while calling o2167.getFeatureScore. Trace:
py4j.Py4JException: Method getFeatureScore([]) does not exist
Is there a correct way of getting feature importance when using XGBoost with PySpark?
I am a newbie in this field, but I happened to encounter what you are experiencing. You may want to try model.nativeBooster.getScore("", "gain") or model.nativeBooster.getFeatureScore('').
My model is of type sparkxgb.xgboost.XGBoostClassificationModel.
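For reference, a minimal sketch of turning the gain map into named importances; it assumes the returned map behaves like a Python dict via py4j and that keys look like 'f0', 'f1', ... when no feature map was supplied (both are assumptions, not verified against a specific xgboost4j-spark version):

# Hedged sketch: gain-based importance from the native booster.
# Assumes keys look like 'f0', 'f1', ... and `features` is the list of column
# names passed to setFeaturesCols above.
scores = model.nativeBooster.getScore("", "gain")
importance = {features[int(k[1:])]: float(v) for k, v in scores.items()}
for name, gain in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(name, gain)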

Hyperopt tuning parameters get stuck

I'm testing tuning the parameters of an SVM with the hyperopt library.
Often, when I execute this code, the progress bar stops and the code gets stuck.
I do not understand why.
Here is my code:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn import svm
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import normalize

X_train = normalize(X_train)

def hyperopt_train_test(params):
    if 'decision_function_shape' in params:
        if params['decision_function_shape'] == "ovo":
            params['break_ties'] = False
    clf = svm.SVC(**params)
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    return precision_recall_fscore_support(y_test, y_pred, average='macro')[0]

space4svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'degree': hp.uniform('degree', 10, 30),
    'gamma': hp.uniform('gamma', 10, 30),
    'coef0': hp.uniform('coef0', 15, 30),
    'shrinking': hp.choice('shrinking', [True, False]),
    'probability': hp.choice('probability', [True, False]),
    'tol': hp.uniform('tol', 0, 3),
    'decision_function_shape': hp.choice('decision_function_shape', ['ovo', 'ovr']),
    'break_ties': hp.choice('break_ties', [True, False])
}

def f(params):
    print(params)
    precision = hyperopt_train_test(params)
    return {'loss': -precision, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4svm, algo=tpe.suggest, max_evals=35, trials=trials)
print('best:')
print(best)
I would suggest restricting the space of your parameters and seeing if that works. Fix the probability parameter to False and see if the model trains. Also, gamma needs to be {'scale', 'auto'} according to the documentation.
Also, print out your params at every iteration to better understand which combination is causing the model to get stuck.
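A minimal sketch of such a restricted space, reusing the hp import from the question (the exact ranges are illustrative, not tuned):

# Restricted space: probability fixed to False, gamma limited to 'scale'/'auto',
# and narrower ranges overall (values are illustrative).
space4svm = {
    'C': hp.uniform('C', 0.1, 20),
    'kernel': hp.choice('kernel', ['linear', 'rbf']),
    'gamma': hp.choice('gamma', ['scale', 'auto']),
    'probability': hp.choice('probability', [False]),
    'tol': hp.uniform('tol', 1e-4, 1e-1),
}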

I am using the SVR() function for regression. I am unable to optimize its parameters using #PSO with #pyswarms

Optimizing the parameters of #SVR() using the #pyswarms #PSO function.
I have 200 inputs in the dataset with 9 features each, and I have to predict one output parameter. I already did it by calling the SVR() function with its default parameters, but the results are not satisfactory. Now I want to optimize its parameters using the PSO algorithm but am unable to do it.
model = SVR()model.fit(Xtrain,ytrain)
pred_y = model.predict(Xtest)
param = {'kernel' : ('linear', 'poly', 'rbf', 'sigmoid'),'C':[1,5,10],'degree' : [3,8],'coef0' : [0.01,10,0.5],'gamma' : ('auto','scale')}
import pyswarms as ps
optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2, options=param)
best_cost, best_pos = optimizer.optimize(model, iters=100)
2019-08-13 12:19:48,551 - pyswarms.single.global_best - INFO - Optimize for 100 iters with {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10], 'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
pyswarms.single.global_best: 0%| |0/100
TypeError: 'SVR' object is not callable
There is an error in the first two lines. Two lines of code got mixed together there by mistake:
1. model = SVR()
2. model.fit(Xtrain,ytrain)

Training a Random Forest on Tensorflow

I am trying to train a TensorFlow-based random forest regression on numerical and continuous data.
When I try to fit my estimator, it begins with the message below:
INFO:tensorflow:Constructing forest with params =
INFO:tensorflow:{'num_trees': 10, 'max_nodes': 1000, 'bagging_fraction': 1.0, 'feature_bagging_fraction': 1.0, 'num_splits_to_consider': 10, 'max_fertile_nodes': 0, 'split_after_samples': 250, 'valid_leaf_threshold': 1, 'dominate_method': 'bootstrap', 'dominate_fraction': 0.99, 'model_name': 'all_dense', 'split_finish_name': 'basic', 'split_pruning_name': 'none', 'collate_examples': False, 'checkpoint_stats': False, 'use_running_stats_method': False, 'initialize_average_splits': False, 'inference_tree_paths': False, 'param_file': None, 'split_name': 'less_or_equal', 'early_finish_check_every_samples': 0, 'prune_every_samples': 0, 'feature_columns': [_NumericColumn(key='Average_Score', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='lat', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='lng', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)], 'num_classes': 1, 'num_features': 2, 'regression': True, 'bagged_num_features': 2, 'bagged_features': None, 'num_outputs': 1, 'num_output_columns': 2, 'base_random_seed': 0, 'leaf_model_type': 2, 'stats_model_type': 2, 'finish_type': 0, 'pruning_type': 0, 'split_type': 0}
Then the process breaks down and I get the value error below:
ValueError: Shape must be at least rank 2 but is rank 1 for 'concat' (op: 'ConcatV2') with input shapes: [?], [?], [?], [] and with computed input tensors: input[3] = <1>.
This is the code I am using:
import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources
import pandas as pd
from tensorflow.contrib.tensor_forest.client import random_forest
from tensorflow.python.estimator.inputs import numpy_io
import numpy as np
def getFeatures():
    Average_Score = tf.feature_column.numeric_column('Average_Score')
    lat = tf.feature_column.numeric_column('lat')
    lng = tf.feature_column.numeric_column('lng')
    return [Average_Score, lat, lng]

# Import hotel data
Hotel_Reviews = pd.read_csv("./DataMining/Hotel_Reviews.csv")
Hotel_Reviews_Filtered = Hotel_Reviews[(Hotel_Reviews.lat.notnull() |
                                        Hotel_Reviews.lng.notnull())]
Hotel_Reviews_Filtered_Target = Hotel_Reviews_Filtered[["Reviewer_Score"]]
Hotel_Reviews_Filtered_Features = Hotel_Reviews_Filtered[["Average_Score", "lat", "lng"]]

# Preprocess the data
x = Hotel_Reviews_Filtered_Features.to_dict('list')
for key in x:
    x[key] = np.array(x[key])
y = Hotel_Reviews_Filtered_Target.values

# Specify params
params = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
    feature_colums=getFeatures(),
    num_classes=1,
    num_features=2,
    regression=True,
    num_trees=10,
    max_nodes=1000)

# Build the graph
graph_builder_class = tensor_forest.RandomForestGraphs
est = random_forest.TensorForestEstimator(
    params, graph_builder_class=graph_builder_class)

# Define input function
train_input_fn = numpy_io.numpy_input_fn(
    x=x,
    y=y,
    batch_size=1000,
    num_epochs=1,
    shuffle=True)
est.fit(input_fn=train_input_fn, steps=500)
The variable x is a dict of numpy arrays, each of shape (512470,):
{'Average_Score': array([ 7.7, 7.7, 7.7, ..., 8.1, 8.1, 8.1]),
'lat': array([ 52.3605759, 52.3605759, 52.3605759, ..., 48.2037451,
48.2037451, 48.2037451]),
'lng': array([ 4.9159683, 4.9159683, 4.9159683, ..., 16.3356767,
16.3356767, 16.3356767])}
The variable y is a numpy array of shape (512470, 1):
array([[ 2.9],
[ 7.5],
[ 7.1],
...,
[ 2.5],
[ 8.8],
[ 8.3]])
Force each array in x to be 2-dimensional using ndmin=2. Then the shapes should match and concat should be able to operate.
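A minimal sketch of that fix, applied to the preprocessing loop from the question (whether the arrays then need transposing depends on how numpy_input_fn batches them, so treat the orientation as an assumption):

# Make every feature array at least rank 2 so the internal concat sees rank-2 inputs.
for key in x:
    x[key] = np.array(x[key], ndmin=2)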
