failed to log_explanation from mlflow xgboost model - mlflow

model_m is an mlflow.pyfunc.loaded_model.
X_pred is the input DataFrame with the related input columns.
But when I run
mlflow.shap.log_explanation(model_m.predict, X_pred)
it returns:
training data did not have the following fields:
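The message above is cut off, so the cause can only be guessed at. One common trigger for this XGBoost error is that the prediction function receives a bare NumPy array instead of a DataFrame with the training column names, which can happen when an explainer calls the function internally. A minimal sketch of a workaround under that assumption follows; `with_column_names` and `StubModel` are hypothetical names invented for illustration, and the stub only imitates the failure.

```python
import numpy as np
import pandas as pd


def with_column_names(predict_fn, columns):
    """Wrap predict_fn so that bare arrays are turned back into a named DataFrame."""
    columns = list(columns)

    def wrapped(data):
        # Rebuild the DataFrame only when the column names have been lost.
        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(np.asarray(data), columns=columns)
        return predict_fn(data)

    return wrapped


class StubModel:
    """Hypothetical stand-in for model_m: rejects input without the expected names."""

    expected = ["f0", "f1"]

    def predict(self, data):
        if not isinstance(data, pd.DataFrame) or list(data.columns) != self.expected:
            raise ValueError("training data did not have the following fields: f0, f1")
        return data["f0"].to_numpy() + data["f1"].to_numpy()


X_pred = pd.DataFrame({"f0": [1.0, 2.0], "f1": [10.0, 20.0]})
safe_predict = with_column_names(StubModel().predict, X_pred.columns)

# A bare NumPy array is accepted because the names are restored first.
result = safe_predict(X_pred.to_numpy())

# With the real objects the call would presumably become:
# mlflow.shap.log_explanation(with_column_names(model_m.predict, X_pred.columns), X_pred)
```

If the error persists with the wrapper in place, the mismatch is more likely between the columns of X_pred and the columns used at training time.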

Related

Azure: Do I need an Azure ML resource to use AutoML in an Azure Databricks notebook?

If I want to use AutoML to train models within a Python Databricks notebook, do I need an Azure Machine Learning resource? It seems like this would be an unnecessary resource if Databricks has its own compute.
If I understand your question correctly, yes: AutoML and the Databricks ML libraries are completely different things.
Generic Random Forest Regression:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")
# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])
# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
rfModel = model.stages[1]
print(rfModel) # summary only
Generic Random Forest Classification:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
rfModel = model.stages[2]
print(rfModel) # summary only
Please check out the resource below for more info.
https://spark.apache.org/docs/latest/ml-classification-regression.html

How to read a csv to use in pyspark MLlib?

I have a csv file that I'm trying to use as input of a KMeans algorithm in pyspark. I'm using the code from MLlib documentation.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("P.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
I'm getting the error:
java.lang.NumberFormatException: For input string: "-6.71,-1.14"
I tried to read the file as
dataset = spark.read.format("csv").load("P.txt")
But I get another error:
java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1
I'm a beginner in pyspark. I tried to look for tutorials on this but I didn't find any.
I found the problem. The DataFrame passed to kmeans.fit needs to have a field "features", as the error java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1 was indicating.
To create it we need a VectorAssembler, but first we need to convert the columns to a numeric type, otherwise we get the error java.lang.IllegalArgumentException: Data type string of column _c0 is not supported.
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
df = spark.read.csv('P.txt')
# Convert columns to float
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
assembler = VectorAssembler(
inputCols=["_c0", "_c1"],
outputCol="features")
df = assembler.transform(df)
df = df.drop("_c0")
df = df.drop("_c1")
df.show()
Try this method for reading CSV files:
df = spark.read.options(header=True).csv('csvFile.csv')
df.show()
Available fields: _c0, _c1
Check the first row of your data file. There is a high probability that you didn't use the header=True option when saving it to HDFS.

Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark ML lib

I am trying to implement a decision tree in Spark MLlib with the help of the Coursera course "Machine learning for big data". I got the error below:
<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
model=modelpipeline.fit(traindata)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)
weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)
#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))
#create a categorical variable to denote whether humidity is not low (we deal here with the relative_humidity_3pm column). If the value is
#less than 25% the categorical value is 0, and if higher it is 1. Using Binarizer solves this
binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)
#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)
#aggregating the features that will be used to make predictions into a single column
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)
#assembled.select("features").show(1)
#spliting Train and Test data by calling randomsplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
print(traindata.count(),testdata.count())
#create decision trees Model
#----------------------------------
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))
#creating model by training the decision tree, pipeline solve this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)
#predicting test data
predictions=model.transform(testdata)
#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)
The course uses Spark 1.6 in the Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.
stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Replace:
Pipeline(stages=decisiontree)
with
Pipeline(stages=[decisiontree])

How to train a SparkML gradient boosting classifier given an RDD

Given the following rdd
training_rdd = rdd.select(
# Categorical features
col('device_os'), # 'ios', 'android'
# Numeric features
col('30day_click_count'),
col('30day_impression_count'),
np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
# label
col('did_click').alias('label')
)
I am confused about the syntax to train a gradient boosting classifier.
I am following this tutorial:
https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier
However, I am unsure how to get my 4 feature columns into a vector, because VectorIndexer assumes that all the features are already in one column.
You can use VectorAssembler to generate the feature vector. Please note that you will have to convert your rdd to a DataFrame first.
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(["device_os",
"30day_click_count",
"30day_impression_count",
"30day_click_through_rate"])
vectorizer.setOutputCol("features")
Consequently, you will need to put vectorizer as the first stage of the Pipeline. Note that Pipeline takes its stages as a keyword argument:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorizer, ...])

pyspark Model interpretation from pipeline model

I am implementing DecisionTreeClassifier in pyspark using the Pipeline module as I have several feature engineering steps to perform on my dataset.
The code is similar to the example from Spark documentation:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)
The question is: how do I perform model interpretation on this? The pipeline model object does not have a toDebugString() method like the one on the model returned by DecisionTree.trainClassifier.
And I cannot use DecisionTree.trainClassifier in my pipeline because trainClassifier() takes the training data as a parameter,
whereas the pipeline accepts the training data as an argument to the fit() method and calls transform() on the test data.
Is there a way to use the pipeline and still perform the model interpretation and find attribute importance?
Yes, I have used the method below in almost all my model interpretations in pyspark. The lines below use the naming conventions from your code excerpt.
dtm = model.stages[-1] # your estimator is the last stage in the pipeline
# hence the DecisionTreeClassificationModel will be the last transformer in the PipelineModel object
dtm.explainParams()
Now you have access to all the methods of the DecisionTreeClassificationModel. All the available methods and attributes can be found in the PySpark API documentation for pyspark.ml.classification. Code was not tested on your example.
