How to train SparkML gradient boosting classifer given a RDD - apache-spark

Given the following rdd
training_rdd = rdd.select(
# Categorical features
col('device_os'), # 'ios', 'android'
# Numeric features
col('30day_click_count'),
col('30day_impression_count'),
np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
# label
col('did_click').alias('label')
)
I am confused about the syntax to train a gradient boosting classifer.
I am following the this tutorial.
https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier
However, I am unsure about how to get my 4 feature columns into a vector. Because VectorIndexer assumes that all the features are already in one column.

You can use VectorAssembler to generate the feature vector. Please note that you will have to convert your rdd to a DataFrame first.
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(["device_os",
"30day_click_count",
"30day_impression_count",
"30day_click_through_rate"])
vectorizer.setOutputCol("features")
And consequently, you will need to put vectorizer as the first stage into the Pipeline:
pipeline = Pipeline([vectorizer, ...])

Related

Azure: Do I need a Azure ML resource to use AutoML in an Azure databricks notebook?

If I want to use AutoML to train models within a python Databricks notebook, do I need an Azure Machine Learning resource? It seems like this would be an unnecessary resource if Databricks has its own compute
If I understand your question correctly, yes AutoML and Databricks ML libraries are completely different things.
Generic Random Forest Regression:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")
# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])
# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
rfModel = model.stages[1]
print(rfModel) # summary only
Generic Random Forest Classification:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
rfModel = model.stages[2]
print(rfModel) # summary only
Please check out the resource below for more info.
https://spark.apache.org/docs/latest/ml-classification-regression.html

how do I standardize test dataset using StandardScaler in PySpark?

I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the models features after applying the VectorAssembler class to 3 separate columns.
When I try to transform the test data using the StandardScaler as below, I get an error saying it doesn't have the transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler on the training data only once and use those parameters to transform the test set, so it is not accurate to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?
Here is the correct syntax to use the scaler. You need to call transform on a fitted model, not on the scaler itself.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)

Getting TypeErrror:DecisionTreeClassifier' object is not iterable in sparkml lib

I am trying to implement a decision tree in spark Mllib with help of Coursera "Machine learning for big data". I have got below error
<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
model=modelpipeline.fit(traindata)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)
weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)
#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))
#create a categorical variable to denote if humid is not low(we weill deal heare relative_humidity_3pm column).if value is
#less than 25% then categorical value is 0 or if higher it will be 1. using binarizer will solve this
binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)
#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)
#aggregating the fetures that will be used to make prediction into single columns
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)
#assembled.select("features").show(1)
#spliting Train and Test data by calling randomsplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
print(traindata.count(),testdata.count())
#create decision trees Model
#----------------------------------
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))
#creating model by training the decision tree, pipeline solve this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)
#predicting test data
predictions=model.transform(testdata)
#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)
The course is using spark 1.6 in cloud era VM. but i have integrated spark 2.1.0 with PyCharm.
stages should a sequence of PipelineStages (Transofmers or Esitmators), not a single Estimator. Replace:
Pipeline(stages=decisiontree)
with
Pipeline(stages=[decisiontree])

pyspark Model interpretation from pipeline model

I am implementing DecisionTreeClassifier in pyspark using the Pipeline module as I have several feature engineering steps to perform on my dataset.
The code is similar to the example from Spark documentation:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)
The question is how do I perform the model interpretation on this? The pipeline model object does not have the method toDebugString() similar to the method in the DecisionTree.trainClassifier class
And I cannot use the DecisionTree.trainClassifier in my pipeline because the trainclassifier() takes the training data as a parameter.
Whereas the pipeline accepts the training data as an argument in the fit() method and transform() on the test data
Is there a way to use the pipeline and still perform the model interpretation & find attribute importance?
Yes, I have used the method below in almost all my model interpretations in pyspark. The line below uses the naming conventions from your code excerpt.
dtm = model.stages[-1] # you estimator is the last stage in the pipeline
# hence the DecisionTreeClassifierModel will be the last transformer in the PipelineModel object
dtm.explainParams()
Now you have access to all the methods of the DecisionTreeClassifierModel. All the available methods and attributes can be found here. Code was not tested on your example.

How to convert type Row into Vector to feed to the KMeans

when i try to feed df2 to kmeans i get the following error
clusters = KMeans.train(df2, 10, maxIterations=30,
runs=10, initializationMode="random")
The error i get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a dataframe created as follow:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
latitude| longitude|
60.1643075| 24.9460844|
60.4686748| 22.2774728|
how can i convert this two columns to Vector and feed it to KMeans?
ML
The problem is that you missed the documentation's example, and it's pretty clear that the method train requires a DataFrame with a Vector as features.
To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:
from pyspark.sql.functions import *
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
outputCol="features")
# For your special case that has string instead of doubles you should cast them first.
expr = [col(c).cast("Double").alias(c)
for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
Besides, you should also normalize your features using the class MinMaxScaler to obtain better results.
MLLib
In order to achieve this using MLLib you need to use a map function first, to convert all your string values into Double, and merge them together in a DenseVector.
rdd = df2.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point you can train your MLlib's KMeans model using the rdd variable.
I got PySpark 2.3.1 to perform KMeans on a DataFrame as follows:
Write a list of the columns you want to include in the clustering analysis:
feat_cols = ['latitude','longitude']`
You need all of the columns to be numeric values:
expr = [col(c).cast("Double").alias(c) for c in feat_cols]
df2 = df2.select(*expr)
Create your features vector with mllib.linalg.Vectors:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=feat_cols, outputCol="features")
df3 = assembler.transform(df2).select('features')
You should normalize your features as normalization is not always required, but it rarely hurts (more about this here):
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(
inputCol="features",
outputCol="scaledFeatures",
withStd=True,
withMean=False)
scalerModel = scaler.fit(df3)
df4 = scalerModel.transform(df3).drop('features')\
.withColumnRenamed('scaledFeatures', 'features')
Turn your DataFrame object df4 into a dense vector RDD:
from pyspark.mllib.linalg import Vectors
data5 = df4.rdd.map(lambda row: Vectors.dense([x for x in row['features']]))
Use the obtained RDD object as input for KMeans training:
from pyspark.mllib.clustering import KMeans
model = KMeans.train(data5, k=3, maxIterations=10)
Example: classify a point p in your vector space:
prediction = model.predict(p)

Resources