Failed to log_explanation from MLflow XGBoost model

model_m is a model loaded through mlflow.pyfunc.
X_pred is the input DataFrame with the related input columns.
But when I run
mlflow.shap.log_explanation(model_m.predict, X_pred)
it returns:
training data did not have the following fields:


Azure: Do I need an Azure ML resource to use AutoML in an Azure Databricks notebook?

If I want to use AutoML to train models within a Python Databricks notebook, do I need an Azure Machine Learning resource? It seems like this would be an unnecessary resource if Databricks has its own compute.
If I understand your question correctly, yes: AutoML and the Databricks ML libraries are completely different things.
Generic Random Forest Regression:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")
# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])
# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
rfModel = model.stages[1]
print(rfModel) # summary only
Generic Random Forest Classification:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
rfModel = model.stages[2]
print(rfModel) # summary only
Please check out the resource below for more info.

How to read a csv to use in pyspark MLlib?

I have a csv file that I'm trying to use as input of a KMeans algorithm in pyspark. I'm using the code from MLlib documentation.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("P.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
I'm getting the error:
java.lang.NumberFormatException: For input string: "-6.71,-1.14"
I tried to read the file as
dataset = spark.read.format("csv").load("P.txt")
But I get another error:
java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1
I'm a beginner in pyspark. I tried to look for tutorials on this but didn't find any.
I found the problem. The DataFrame passed to kmeans.fit() needs to have a "features" column, as the error java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1 indicated.
To build that column we need a VectorAssembler, but first we need to convert the columns to a numeric type; otherwise we get the error java.lang.IllegalArgumentException: Data type string of column _c0 is not supported.
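from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
df = spark.read.csv('P.txt')
# Convert columns to float
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
# Combine the numeric columns into a single "features" vector column
assembler = VectorAssembler(
    inputCols=["_c0", "_c1"],
    outputCol="features")
df = assembler.transform(df)
# Drop the original columns, keeping only "features"
df = df.drop("_c0")
df = df.drop("_c1")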
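The first error in the question comes from the file format rather than from KMeans itself: the libsvm reader expects each line to look like "label index:value index:value ...", while P.txt holds comma-separated numbers. The plain-Python sketch below imitates that line format to show why a CSV row cannot be parsed. parse_libsvm_line is a hypothetical helper written for illustration, not Spark's actual parser.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # the first token must be a plain number
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# A well-formed libsvm line parses cleanly.
print(parse_libsvm_line("1 1:-6.71 2:-1.14"))

# A CSV row is a single token, so converting it to a number fails.
try:
    parse_libsvm_line("-6.71,-1.14")
except ValueError as err:
    print("cannot parse CSV row as libsvm:", err)
```

The ValueError raised here plays the role of the java.lang.NumberFormatException reported by Spark for the same input string.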
Check this method for reading CSV files:
df = spark.read.csv('csvFile.csv')
Available fields: _c0, _c1
Check the first row of your data file. There is a high probability that you didn't use the header=True parameter when saving it to HDFS when the file was created.

Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark ML

I am trying to implement a decision tree in Spark MLlib with the help of the Coursera course "Machine learning for big data". I got the error below:
<class ''>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/", line 65, in <module>
File "C:\spark\python\lib\\pyspark\ml\", line 64, in fit
File "C:\spark\python\lib\\pyspark\ml\", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#missing value dealing
#create a categorical variable to denote whether humidity is not low (we deal here with the relative_humidity_3pm column). If the value is
#less than 25% the categorical value is 0; if higher it is 1. Using Binarizer solves this
#we transform the whole weatherdata into the Binarizer categorical value
#aggregating the features that will be used to make predictions into a single column
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
#splitting train and test data by calling randomSplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
#create decision trees Model
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
#creating model by training the decision tree, pipeline solve this
#predicting test data
#showing predictedvalue'prediction','label').show(5)
The course uses Spark 1.6 in the Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.
stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Wrap the estimator in a list when constructing the Pipeline.
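The traceback ends inside the pipeline's _fit, which loops over its stages, and that is why a bare estimator triggers the "not iterable" error. The sketch below reproduces the behavior with plain-Python stand-ins. MiniPipeline and FakeEstimator are hypothetical names for illustration only, not Spark classes, and only the iteration over stages mirrors what Spark does.

```python
class FakeEstimator:
    """Stand-in for an estimator such as DecisionTreeClassifier."""

    def fit(self, data):
        return "model fitted on %d rows" % len(data)


class MiniPipeline:
    """Toy pipeline: only the loop over stages mirrors Spark's Pipeline."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        # Iterating fails if stages is a single object instead of a list.
        return [stage.fit(data) for stage in self.stages]


rows = [(1.0, 0), (2.0, 1)]
dt = FakeEstimator()

try:
    MiniPipeline(stages=dt).fit(rows)        # single estimator: not iterable
except TypeError as err:
    print("TypeError:", err)

print(MiniPipeline(stages=[dt]).fit(rows))   # list of stages: works
```

With the real classes the equivalent change is to pass stages=[dt] rather than stages=dt, where dt stands for whatever name the DecisionTreeClassifier instance has in the original script.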

How to train a SparkML gradient boosting classifier given an RDD

Given the following rdd
training_rdd =
# Categorical features
col('device_os'), # 'ios', 'android'
# Numeric features
np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
# label
I am confused about the syntax to train a gradient boosting classifier.
I am following this tutorial.
However, I am unsure how to get my 4 feature columns into a vector, because VectorIndexer assumes that all the features are already in one column.
You can use VectorAssembler to generate the feature vector. Please note that you will have to convert your rdd to a DataFrame first.
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
And consequently, you will need to put vectorizer as the first stage into the Pipeline:
pipeline = Pipeline(stages=[vectorizer, ...])
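To make the idea concrete, the following plain-Python sketch shows what happens to each row when a string column is first turned into a numeric index and the numeric columns are then collected into one feature vector. It is not Spark code: index_categories and assemble are hypothetical helpers that only loosely mirror what a string-indexing stage and VectorAssembler do, and the sample rows are made up.

```python
from collections import Counter


def index_categories(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, input_cols):
    """Collect the named numeric columns of one row into a feature vector."""
    return [row[c] for c in input_cols]


# Made-up sample rows using two column names from the question.
rows = [
    {"device_os": "ios", "30day_click_through_rate": 0.10},
    {"device_os": "android", "30day_click_through_rate": 0.25},
    {"device_os": "ios", "30day_click_through_rate": 0.05},
]

os_index = index_categories(r["device_os"] for r in rows)
for r in rows:
    r["device_os_index"] = os_index[r["device_os"]]
    r["features"] = assemble(r, ["device_os_index", "30day_click_through_rate"])
    print(r["features"])
```

The point of the sketch is the ordering: categorical strings have to become numbers before they can sit in the same vector as the numeric features.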

pyspark Model interpretation from pipeline model

I am implementing DecisionTreeClassifier in pyspark using the Pipeline module as I have several feature engineering steps to perform on my dataset.
The code is similar to the example from Spark documentation:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
print(treeModel) # summary only
The question is: how do I perform model interpretation on this? The pipeline model object does not have a toDebugString() method like the one available through DecisionTree.trainClassifier.
And I cannot use DecisionTree.trainClassifier in my pipeline, because trainClassifier() takes the training data as a parameter,
whereas the pipeline accepts the training data as an argument to fit() and the test data as an argument to transform().
Is there a way to use the pipeline and still perform the model interpretation & find attribute importance?
Yes, I have used the method below in almost all my model interpretations in pyspark. The line below uses the naming conventions from your code excerpt.
dtm = model.stages[-1] # your estimator is the last stage in the pipeline
# hence the DecisionTreeClassificationModel will be the last transformer in the PipelineModel object
Now you have access to all the methods of the DecisionTreeClassificationModel. All the available methods and attributes can be found here. Code was not tested on your example.
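Once dtm holds the fitted tree, attributes such as toDebugString and featureImportances (available on the fitted tree model in Spark 2.0 and later) can be read from it. The importances come back as a vector indexed by feature position, so it helps to pair them with names. The helper below is a minimal plain-Python sketch of that pairing. rank_importances, the sample values, and the feature names are hypothetical and only illustrate the idea.

```python
def rank_importances(importances, feature_names):
    """Pair each importance with its feature name, largest first."""
    if len(importances) != len(feature_names):
        raise ValueError("expected one name per importance value")
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical values; with Spark these would come from something like
# list(dtm.featureImportances.toArray()) and your own list of input columns.
importances = [0.0, 0.7, 0.3]
names = ["f0", "f1", "f2"]

for name, score in rank_importances(importances, names):
    print("%s: %.2f" % (name, score))
```

Keeping the pairing in a small function makes it easy to reuse for any tree-based stage pulled out of a PipelineModel.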
