I'm trying to extract the feature importances of a random forest classifier model I have trained using PySpark. I referred to the following article to get the feature importance scores for the random forest model I trained.
PySpark & MLLib: Random Forest Feature Importances
However, when I use the method described in this article, I get the following error:
'CrossValidatorModel' object has no attribute 'featureImportances'
Here is the code I used to train my model:
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, VectorAssembler,
                                VectorIndexer, IndexToString)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

cols = new_data.columns
stages = []
label_stringIdx = StringIndexer(inputCol='Bought_Fibre', outputCol='label')
stages += [label_stringIdx]
numericCols = new_data.schema.names[1:-1]
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(new_data)
new_data = new_data.fillna(0, subset=cols)  # fillna returns a new DataFrame; assign the result
new_data = pipelineModel.transform(new_data)
new_data = new_data.fillna(0, subset=cols)
new_data.printSchema()
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 1045)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()
train_sampled = train_initial.sampleBy("label", fractions={0: 0.1, 1: 1.0}, seed=0)
train_sampled.groupBy("label").count().orderBy("label").show()
labelIndexer = StringIndexer(inputCol='label',
outputCol='indexedLabel').fit(train_sampled)
featureIndexer = VectorIndexer(inputCol='features',
outputCol='indexedFeatures',
maxCategories=2).fit(train_sampled)
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])
paramGrid = ParamGridBuilder() \
    .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000]) \
    .addGrid(rf_model.impurity, ['entropy', 'gini']) \
    .addGrid(rf_model.maxDepth, [2, 3, 4, 5]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=5)
train_model = crossval.fit(train_sampled)
Please help me resolve the above error and extract the feature importances.
That's because CrossValidatorModel doesn't have a featureImportances attribute, but the underlying RandomForestClassificationModel does.
Since you are using a Pipeline and a CrossValidator to fit your data, you'll need to get that stage out of the best fitted model:
# '2' is the index of the RandomForestClassificationModel inside the Pipeline
# (stages=[labelIndexer, featureIndexer, rf_model, labelConverter])
your_model = train_model.bestModel.stages[2]
var_imp = your_model.featureImportances
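If you also want to map each score back to a column name, here is a small sketch; it assumes the numericCols list that fed the VectorAssembler in the question, whose column order the importances vector preserves:
import pandas as pd

# featureImportances is a Spark ML Vector; toArray() yields the scores
# in the same order as the VectorAssembler's inputCols (numericCols here)
importances = pd.DataFrame({
    "feature": numericCols,
    "importance": var_imp.toArray(),
}).sort_values("importance", ascending=False)
print(importances.head(10))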
I am facing the below error while fitting my model. I am trying to run a model with cross-validation, with a pipeline inside it.
Below is the code snippet for data transformation:
from pyspark.ml.feature import (QuantileDiscretizer, StandardScaler,
                                StringIndexer, OneHotEncoder, VectorAssembler)

qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize the assembled feature vector
scaler = StandardScaler()\
    .setInputCol("vectorized_features")\
    .setOutputCol("features")

# Encoder for VesselTypeGroupName
encoder = StringIndexer(handleInvalid='skip')\
    .setInputCols(["type"])\
    .setOutputCols(["type_enc"])

# One-hot encode the categorical variables
encoder1 = OneHotEncoder()\
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"])\
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble the variables into a single feature vector
assembler = VectorAssembler(handleInvalid="keep")\
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])\
    .setOutputCol("vectorized_features")
The total number of features after one-hot encoding will not exceed 200. The model code is below:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        weightCol='classWeightCol')
pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Logistic regression parameter grid for tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])        # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic net mixing (0 = ridge)
                .addGrid(lr.maxIter, [1, 10, 20])              # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)
The .fit method throws the below error:
I have tried increasing the driver memory but am still facing the same error. Please suggest what might be the cause of this issue.
I am really struggling to make things work with the new spaCy v3 version. The documentation is comprehensive; however, I am trying to run a training loop in a script.
(I am also not able to perform text classification training with the CLI approach.)
The data are publicly available here.
import pandas as pd
import spacy
from spacy.training import Example

TRAIN_DATA = pd.read_json('data.jsonl', lines=True)

# Build the pipeline once: a blank English model with a textcat component
nlp = spacy.blank("en")
config = {
    "threshold": 0.5,
}
textcat = nlp.add_pipe("textcat", config=config, last=True)
labels = [str(label) for label in TRAIN_DATA['label'].unique()]
for label in labels:
    textcat.add_label(label)
nlp.initialize()  # v3 replacement for the deprecated begin_training()

# Loop for 100 iterations
for itn in range(100):
    losses = {}
    # Shuffle the training data
    TRAIN_DATA = TRAIN_DATA.sample(frac=1)
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
        # v3 updates from Example objects rather than text/annotation tuples
        examples = [
            Example.from_dict(
                nlp.make_doc(text),
                {"cats": {l: 1.0 if l == str(label) else 0.0 for l in labels}},
            )
            for text, label in batch
        ]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
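Once this trains, a quick sanity check might look like the following; the input sentence is just a made-up placeholder:
# doc.cats maps each label added above to a predicted score
doc = nlp("a made-up example sentence to score")
print(doc.cats)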
I fine-tuned a GPT-2 language model, and I am generating text from it using the following lines of code:
generator = pipeline('text-generation', tokenizer='gpt2', model='data/out')
print(generator('Once upon a time', max_length=40)[0]['generated_text'])
Now I want to predict only the next word, together with its probability. I thought we could do this with 'fill-mask', but I don't know how. When I put 'fill-mask' in place of 'text-generation', I get this error:
"Unrecognized configuration class <class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'> for this kind of AutoModel: AutoModelForMaskedLM.
Model type should be one of BigBirdConfig, Wav2Vec2Config, ConvBertConfig, LayoutLMConfig, DistilBertConfig, AlbertConfig, BartConfig, MBartConfig, CamembertConfig, XLMRobertaConfig, LongformerConfig, RobertaConfig, SqueezeBertConfig, BertConfig, MobileBertConfig, FlaubertConfig, XLMConfig, ElectraConfig, ReformerConfig, FunnelConfig, MPNetConfig, TapasConfig, DebertaConfig, DebertaV2Config, IBertConfig.".
generator = pipeline('fill-mask', tokenizer='gpt2', model='data/out')  # this line gives the error above
Please let me know how I can fix this issue. Any kind of help would be greatly appreciated.
Thanks in advance.
Here is the whole code, for better understanding:
from transformers import (
GPT2Tokenizer,
DataCollatorForLanguageModeling,
TextDataset,
GPT2LMHeadModel,
TrainingArguments,
Trainer,
pipeline)
train_path = 'parsed_data.txt'
test_path = 'parsed_data.txt'
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_path, block_size=128)
test_dataset = TextDataset(tokenizer=tokenizer, file_path=test_path, block_size=128)
model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(output_dir='data/out', overwrite_output_dir=True,
                                  per_device_train_batch_size=32, per_device_eval_batch_size=32,
                                  learning_rate=5e-5, num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, data_collator=data_collator,
                  train_dataset=train_dataset, eval_dataset=test_dataset)
trainer.train()
trainer.save_model()
generator = pipeline('fill-mask', tokenizer='gpt2', model='data/out')  # raises the error quoted above
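The error is telling you that fill-mask only supports masked language models, and GPT-2 is a causal one. A minimal sketch of getting next-word probabilities directly from the fine-tuned causal LM instead, reusing the 'data/out' directory from the question:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model (path from the question) and its tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("data/out")
model.eval()

inputs = tokenizer("Once upon a time", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The next-word distribution is the softmax over the last position's logits
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely next tokens with their probabilities
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(token_id))), float(prob))
Note that GPT-2's BPE tokens usually carry a leading space, hence the repr() when printing.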
I used cross-validation to train a linear regression model, using the following code:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LinearRegression(maxIter=maxIteration)
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(training)
Now I want to draw the ROC curve. I used the following code, but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each iteration. I know that I can get it at the end:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
but I want to get it at each iteration; how can I do this?
Moreover, I want to evaluate the model on the test data. How can I do that?
prediction = cvModel.transform(test)
I know for the training data set I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
but how can I get these metrics for the testing data set?
1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
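For contrast, here is a sketch of what an AUC evaluation looks like when it does apply; clf_predictions is a hypothetical DataFrame produced by a fitted binary classification model, not by your regression:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Only meaningful for binary classification output (rawPrediction + label columns)
auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(clf_predictions)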
2) The objectiveHistory for each iteration is only available when the solver argument in the regression is l-bfgs (documentation); here is a toy example:
spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.4),
(Vectors.dense([0.5]), 1.9),
(Vectors.dense([0.6]), 0.9),
(Vectors.dense([1.2]), 1.0)] * 10,
["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])  # defined as in the question, though crossval below uses lr directly
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
3) You have already defined a RegressionEvaluator which you can use for evaluating your test set but, if used without arguments, it assumes the RMSE metric; here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):
test = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.1),
(Vectors.dense([0.5]), 0.9),
(Vectors.dense([0.6]), 1.0)],
["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
Cross-validation outside the pipeline:
val naiveBayes = ... // NaiveBayes estimator (definition elided in the original)
val indexer = ...    // label/feature indexer (definition elided in the original)
val pipeLine = new Pipeline().setStages(Array(indexer, naiveBayes))
val paramGrid = new ParamGridBuilder()
.addGrid(naiveBayes.smoothing, Array(1.0, 0.1, 0.3, 0.5))
.build()
val crossValidator = new CrossValidator().setEstimator(pipeLine)
.setEvaluator(new MulticlassClassificationEvaluator)
.setNumFolds(2).setEstimatorParamMaps(paramGrid)
val crossValidatorModel = crossValidator.fit(trainData)
val predictions = crossValidatorModel.transform(testData)
Cross-validation inside the pipeline:
val naiveBayes = ... // same estimator as above (definition elided)
val indexer = ...    // same indexer as above (definition elided)
// param grid for multiple parameter
val paramGrid = new ParamGridBuilder()
.addGrid(naiveBayes.smoothing, Array(0.35, 0.1, 0.2, 0.3, 0.5))
.build()
// validator for naive bayes
val crossValidator = new CrossValidator().setEstimator(naiveBayes)
.setEvaluator(new MulticlassClassificationEvaluator)
.setNumFolds(2).setEstimatorParamMaps(paramGrid)
// pipeline to execute compound transformation
val pipeLine = new Pipeline().setStages(Array(indexer, crossValidator))
// pipeline model
val pipeLineModel = pipeLine.fit(trainData)
// transform data
val predictions = pipeLineModel.transform(testData)
So I want to know which way is better, and what its pros and cons are.
I am getting the same result and accuracy from both approaches, and the second approach is even a little faster than the first.
As per a training I attended, this should be the best practice:
cv = CrossValidator(estimator=lr, ...)
pipeline = Pipeline(stages=[idx, assembler, cv])
pipeline_model = pipeline.fit(train)
This way the upstream stages of your pipeline are fit only once, not re-fit for every parameter combination in the grid, which makes it run faster.
Hope this helps!
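A fuller sketch of that pattern, with idx, assembler, and train standing in as hypothetical placeholders for your own stages and data:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# idx and assembler are fit once on the training data; only the
# CrossValidator stage re-fits per fold and parameter combination
pipeline = Pipeline(stages=[idx, assembler, cv])
pipeline_model = pipeline.fit(train)
The trade-off is that the upstream stages are fit on the full training set, so their statistics are shared across folds rather than re-estimated per fold.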