Xgboost on Spark Validation Indicator Column and Evaluation Metric - apache-spark

I am using the xgboost PySpark API. This API is experimental, but it supports most of the features of the regular xgboost API.
As per the documentation below, the eval_set parameter is not supported; the validationIndicatorCol parameter should be used instead.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.spark
https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
xgb = XgboostClassifier(
    featuresCol="features",
    labelCol="label",
    num_workers=40,
    random_state=1,
    missing=None,
    objective='binary:logistic',
    validationIndicatorCol='isVal',
    eval_metric='aucpr',
    n_estimators=best_n_estimators,
    max_depth=best_max_depth,
    learning_rate=best_learning_rate,
)
pipeline = Pipeline(stages=[vectorAssembler,xgb])
pipelineModel = pipeline.fit(sampled_df)
It seems to be running without any errors, which is great.
How do you print and inspect the evaluation results? Traditional xgboost has an evals_result() method, but pipelineModel.stages[-1].evals_result() doesn't seem to work in the PySpark API. This method should normally work, since the PySpark API documentation doesn't say otherwise. Any idea on how to make it work?
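For reference, a minimal sketch of how the boolean isVal column referenced by validationIndicatorCol is typically populated before fitting; the random 80/20 split and the seed are assumptions, not part of the original code:
from pyspark.sql import functions as F

# Assumed split: roughly 20% of rows are flagged True and used as the validation set.
sampled_df = sampled_df.withColumn("isVal", F.rand(seed=1) < 0.2)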

Related

pyspark: Stage failure due to One hot encoding

I am facing the below error while fitting my model. I am trying to run a model with cross-validation, with a pipeline inside it.
Below is the code snippet for data transformation:
from pyspark.ml.feature import (QuantileDiscretizer, StandardScaler, StringIndexer,
                                OneHotEncoder, VectorAssembler)

qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize vector
scaler = StandardScaler() \
    .setInputCol("vectorized_features") \
    .setOutputCol("features")

# Encoder for VesselTypeGroupName
encoder = StringIndexer(handleInvalid='skip') \
    .setInputCols(["type"]) \
    .setOutputCols(["type_enc"])

# One-hot encode categorical variables
encoder1 = OneHotEncoder() \
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"]) \
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble variables
assembler = VectorAssembler(handleInvalid="keep") \
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"]) \
    .setOutputCol("vectorized_features")
The total number of features after one hot encoding will not exceed 200. The model code is below:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        weightCol='classWeightCol')
pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Create Logistic Regression parameter grid for parameter tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])        # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic net parameter (ridge = 0)
                .addGrid(lr.maxIter, [1, 10, 20])              # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)
The .fit method throws the below error:
I have tried increasing the driver memory but am still facing the same error. Please suggest what might be causing this issue.
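Since the actual error text did not come through above, here is a debugging sketch (not from the original post) that fits and applies the transformer stages one at a time on a small sample; this can help narrow down whether a specific stage, rather than the cross-validator, triggers the failure:
from pyspark.ml import Estimator

# Hypothetical isolation step: run each stage in turn on a small cached sample.
sample = train_df.limit(1000).cache()
df = sample
for stage in [qd, encoder, encoder1, assembler, scaler]:
    fitted = stage.fit(df) if isinstance(stage, Estimator) else stage
    df = fitted.transform(df)
    print(stage.__class__.__name__, df.count())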

Saving a model that uses tensorflow.lookup.StaticVocabularyTable in .pb format in Tensorflow 2

I am building a model that accepts a 2D array of string tokens as input, then uses a lookup table to get the assigned indices of the input tokens in the vocabulary. The model then uses those indices to compute an embedded representation of the input tokens by fetching the associated token embeddings and adding them together. The combined embedding is then compared against another matrix using a nearest-neighbors lookup, and the indices of the top-k most similar entries are returned.
The model is saved in .pb format and is then used in a container running the TensorFlow Serving image for inference.
At the moment I have something that works just fine in TensorFlow 1.15; however, I am trying to migrate my code to TensorFlow 2.4 and can't find a way to make it work.
Here is a slightly modified version of the code I am working with at the moment in TensorFlow 1.15:
import numpy as np
import tensorflow as tf

graph = tf.get_default_graph()
session = tf.Session()
session.run(tf.global_variables_initializer())

vocabulary = ['one', 'two', 'three', 'four', 'five', 'six']
embedding_dimension = 512
n_tokens = len(vocabulary)
token_embeddings = np.random.random((n_tokens, embedding_dimension))
matrix = np.random.random((100, embedding_dimension))

lookup_table_initializer = tf.lookup.KeyValueTensorInitializer(vocabulary, np.arange(n_tokens))
lookup_table = tf.lookup.StaticVocabularyTable(lookup_table_initializer, num_oov_buckets=1)

# Extra zero row so out-of-vocabulary tokens map to a zero embedding
token_embeddings_with_oov_token = np.vstack([token_embeddings, np.zeros(embedding_dimension)])
token_embeddings_tensor = tf.convert_to_tensor(token_embeddings_with_oov_token, dtype=tf.float32)
matrix = tf.convert_to_tensor(matrix, dtype=tf.float32)

model_input = tf.placeholder(tf.string, [None, None], name="input")
input_tokens_indices = lookup_table.lookup(model_input)
input_token_indices_one_hot = tf.one_hot(
    input_tokens_indices, tf.dtypes.cast(lookup_table.size(), dtype=tf.int32))
encoded_text = tf.math.reduce_sum(input_token_indices_one_hot, axis=1, keepdims=True)
embedded_text = tf.linalg.matmul(encoded_text, token_embeddings_tensor)
embedded_text_pooled = tf.math.reduce_sum(embedded_text, axis=1)
embedded_text_normed = tf.divide(embedded_text_pooled, tf.norm(embedded_text_pooled, ord=2))
neighbors = tf.linalg.matmul(embedded_text_normed, matrix, transpose_b=True, name="output")

tf.saved_model.simple_save(
    session,
    "model.pb",
    inputs={"input": model_input},
    outputs={"output": neighbors},
    legacy_init_op=tf.tables_initializer(),
)
The issue I am facing is in converting the above code to TensorFlow 2. First of all, tf.placeholder no longer exists, and I have read suggestions in other posts to replace it with tf.keras.layers.Input((), dtype=tf.dtypes.string). However, I then get an error when I try to carry out the lookup_table.lookup() step, as apparently I cannot pass a symbolic tensor to that function. As a result I am stuck and do not know how to proceed to make my model compatible with TF2, and after hours of searching online for solutions I can't seem to find anything that works.
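Not part of the original question, but one commonly used TF2 pattern for this situation is to wrap the table in a tf.Module and expose a @tf.function whose input_signature plays the role of the old placeholder, then export with tf.saved_model.save. Below is a minimal sketch covering only the lookup step; the class name, export directory, and signature handling are assumptions, and the rest of the graph would need to move inside the same function:
import numpy as np
import tensorflow as tf

vocabulary = ['one', 'two', 'three', 'four', 'five', 'six']
initializer = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocabulary),
    values=tf.constant(np.arange(len(vocabulary)), dtype=tf.int64))

class TokenLookup(tf.Module):
    def __init__(self):
        super().__init__()
        self.table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=1)

    # The input_signature replaces the TF1 placeholder: a 2D string tensor.
    @tf.function(input_signature=[tf.TensorSpec([None, None], tf.string, name="input")])
    def __call__(self, tokens):
        return self.table.lookup(tokens)

module = TokenLookup()
tf.saved_model.save(module, "exported_model",
                    signatures=module.__call__.get_concrete_function())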

SparkML LogisticRegression vs Sklearn's: Different coefficients and intercepts

I may be missing some initialization parameters or something like that.
I created a LogisticRegression in pyspark and then one using scikit-learn, and trained them on the same data. When I got the resulting models I compared their parameters: the coefficients were very different. The intercept in pyspark turned out to be a single number, and it is also very different from that of sklearn.
The predictions are the same in terms of labels; however, the probability values are different.
pyspark==2.3.2
scikit-learn==0.20.2
pyspark LR creation:
from pyspark.ml.classification import LogisticRegression

data = self.spark.read.format("libsvm").load(input_path)
lr = LogisticRegression(maxIter=10, tol=0.0001)
model = lr.fit(data)
predicted = model.transform(data)
sklearn creation:
from sklearn import linear_model

sk_model = linear_model.LogisticRegression()
sk_model.fit(np_x, np_y)
sk_expected = [sk_model.predict(np_x), sk_model.predict_proba(np_x)]
Data conversion from spark to numpy:
import numpy
import pandas
from pyspark.sql.types import ArrayType, FloatType

self.spark.udf.register("sparseToArray", lambda x: x.toArray().tolist(),
                        ArrayType(elementType=FloatType(), containsNull=False))
np_y = data.select("label").toPandas().label.values.astype(numpy.float32)
np_x = data.selectExpr("sparseToArray(features) as features").toPandas().features.apply(pandas.Series).values.astype(numpy.float32)
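One common source of such differences (not stated in the question) is the default regularization: sklearn's LogisticRegression applies an L2 penalty with C=1.0 by default, whereas Spark's LogisticRegression defaults to regParam=0.0 (no penalty) and standardizes features internally. A rough sketch of how the two setups could be brought closer for comparison; the specific values are assumptions:
# Spark side: explicitly no regularization, no internal standardization (assumed settings).
lr = LogisticRegression(maxIter=100, tol=0.0001, regParam=0.0, standardization=False)

# sklearn side: effectively disable the default L2 penalty with a very large C.
sk_model = linear_model.LogisticRegression(C=1e8, solver='lbfgs', max_iter=100)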

Keras 1.0: getting intermediate layer output

I am currently trying to visualize the output of an intermediate layer in Keras 1.0 (which I could do with Keras 0.3), but it does not work anymore.
import theano

x = model.input
y = model.layers[3].output
f = theano.function([x], y)
But I get the following error:
MissingInputError: ("An input of the graph, used to compute DimShuffle{x,x,x,x}(keras_learning_phase), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.", keras_learning_phase)
Prior to Keras 1.0, with my graph model, I could just do:
x = graph.inputs['input'].input
y = graph.nodes[layer].get_output(train=False)
f = theano.function([x], y, allow_input_downcast=True)
So I suspect the issue comes from the "train=False" parameter, which I don't know how to set in the new version.
Thank you for your help.
Try this. In the import statements, first add:
from keras import backend as K
from theano import function
and then:
get_3rd_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                  [model.layers[3].output])
# output in test mode = 0
layer_output = get_3rd_layer_output([X_test, 0])[0]
# output in train mode = 1
layer_output = get_3rd_layer_output([X_train, 1])[0]
This was just answered by François Chollet on github:
Your model apparently has a different behavior in training and test mode, and so needs to know what mode it should be using.
Use
iterate = K.function([input_img, K.learning_phase()], [loss, grads])
and pass 1 or 0 as value for the learning phase, based on whether you want the model in training mode or test mode.
https://github.com/fchollet/keras/issues/2417
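As a usage illustration (not from the original answer): assuming input_img, loss, and grads are defined as in that thread, and input_img_data is a hypothetical numpy batch matching input_img, the learning-phase flag is simply appended to the input list when calling the function:
iterate = K.function([input_img, K.learning_phase()], [loss, grads])
loss_value, grads_value = iterate([input_img_data, 0])  # 0 = test mode, 1 = training mode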

python feature selection in pipeline: how to determine feature names?

I used pipeline and grid_search to select the best parameters and then used these parameters to fit the best pipeline ('best_pipe'). However, since the feature selection (SelectKBest) step is inside the pipeline, no fit has been applied to SelectKBest directly.
I need to know the feature names of the 'k' selected features. Any ideas on how to retrieve them? Thank you in advance.
from sklearn import (cross_validation, feature_selection, pipeline,
                     preprocessing, linear_model, grid_search)

folds = 5
split = cross_validation.StratifiedKFold(target, n_folds=folds, shuffle=False, random_state=0)
scores = []

for k, (train, test) in enumerate(split):
    X_train, X_test, y_train, y_test = X.ix[train], X.ix[test], y.ix[train], y.ix[test]
    top_feat = feature_selection.SelectKBest()
    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              ('clf', linear_model.LogisticRegression())])
    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']
    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]
    scoring = 'precision'
    gs = grid_search.GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)
    best_score = gs.best_score_
    scores.append(best_score)
    print "Fold: {} {} {:.4f}".format(k + 1, scoring, best_score)
    print gs.best_params_

best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])
best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)
You can access the feature selector by name in best_pipe:
features = best_pipe.named_steps['feat']
Then you can call transform() on an index array to get the names of the selected columns:
X.columns[features.transform(np.arange(len(X.columns)))]
The output here will be the eighty column names selected in the pipeline.
Jake's answer totally works. But depending on what feature selector you're using, there's another option that I think is cleaner. This one worked for me:
X.columns[features.get_support()]
It gave me an answer identical to Jake's. You can see more about it in the docs: get_support returns an array of true/false values indicating whether or not each column was used. Also, it's worth noting that X must have the same shape as the training data used on the feature selector.
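Putting the two answers together on the fitted pipeline from the question (a small usage sketch; best_pipe and X are the objects defined above):
mask = best_pipe.named_steps['feat'].get_support()  # boolean mask, one entry per column of X
selected_feature_names = X.columns[mask]            # names of the k selected features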
This could be an instructive alternative: I encountered a similar need to what the OP asked about. If you want to get the indices of the k best features directly from GridSearchCV:
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
And via index manipulation, you can get your finalFeatureList:
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
