cross validation in pyspark - apache-spark

I used cross validation to train a linear regression model using the following code:
from pyspark.ml.evaluation import RegressionEvaluator
lr = LinearRegression(maxIter=maxIteration)
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
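modelEvaluator = RegressionEvaluator()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator)
cvModel = crossval.fit(training)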
Now I want to draw the ROC curve. I used the following code, but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each iteration. I know that I can get it at the end:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
But I want to get it at each iteration; how can I do this?
Moreover, I want to evaluate the model on the test data. How can I do that?
prediction = cvModel.transform(test)
I know for the training data set I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
But how can I get these metrics for the test data set?

1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
2) The objectiveHistory for each iteration is only available when the solver argument in the regression is l-bfgs (documentation); here is a toy example:
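spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder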
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.4),
(Vectors.dense([0.5]), 1.9),
(Vectors.dense([0.6]), 0.9),
(Vectors.dense([1.2]), 1.0)] * 10,
["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
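modelEvaluator = RegressionEvaluator()
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator)
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2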
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
3) You have already defined a RegressionEvaluator which you can use for evaluating your test set but, if used without arguments, it assumes the RMSE metric; here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):
test = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.1),
(Vectors.dense([0.5]), 0.9),
(Vectors.dense([0.6]), 1.0)],
["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
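To make the two test-set metrics above concrete, here is a minimal plain-Python sketch of what RMSE and r2 measure. It is not Spark's implementation; the helper names `rmse` and `r2` and the constant predictions are made up for illustration, and the labels are the ones from the toy test set. It also shows why r2 can be negative on test data: that happens whenever the model does worse than always predicting the mean label.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [0.2, 1.1, 0.9, 1.0]        # labels of the toy test set above
mean_predictions = [0.8] * 4         # hypothetical model: always predict the mean label
worse_predictions = [1.0] * 4        # hypothetical model: a worse constant guess

r2_mean = r2(labels, mean_predictions)     # close to zero
r2_worse = r2(labels, worse_predictions)   # negative
rmse_worse = rmse(labels, worse_predictions)
```

In Spark itself the evaluators shown above do this work on the DataFrame returned by cvModel.transform(test).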


Model selection & Selecting the number of active components in Bayesian Gaussian Mixture Models

I have generated 2 groups of 1-D data points which are visually clearly separable and I want to use a Bayesian Gaussian Mixture Model (BGMM) to ideally recover 2 clusters.
Since BGMMs maximize a lower bound on the model evidence (ELBO) and given that the ELBO is supposed to combine notions of accuracy and complexity, I would expect more complex models to be penalized.
However, when running Grid Search over the number of clusters, I often get a solution with more than 2 clusters. More specifically, I often get the maximal number of clusters on my grid search. In the example below, I would expect the best model to define 2 clusters. Instead, the best model defines 4 but assigns minimal weights to 2 out of 4 clusters.
I am really surprised, since 2 out of 4 clusters are therefore adding little information and this more complex model still gets selected as the best model.
Why is the BGMM then picking 4 clusters for the best model?
If this is indeed the behavior a BGMM should show, how can I then assess how many active components I actually have in my model? Visually? By defining an arbitrary threshold on the weights?
I have added the code to reproduce my example below.
# Import statements
import itertools
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from joblib import Parallel, delayed
from sklearn.mixture import BayesianGaussianMixture
from sklearn.utils import shuffle
def fitmodel(x, params):
    '''
    Instantiates and fits Bayesian GMM
    Used in the parallel for loop
    '''
    # Gaussian mixture model
    clf = BayesianGaussianMixture(**params)
    # Fit
    clf = clf.fit(x, y=None)
    return clf
def plot_results(X, means, covariances, title):
    plt.plot(X, np.random.uniform(low=0, high=1, size=len(X)),'o', alpha=0.1, color='cornflowerblue', label='data points')
    for i, (mean, covar) in enumerate(zip(
            means, covariances)):
        # Get normal PDF
        n_sd = 2.5
        x = np.linspace(mean - n_sd*covar, mean + n_sd*covar, 300)
        x = x.ravel()
        y = stats.norm.pdf(x, mean, covar).ravel()
        if i == 0:
            label = 'Component PDF'
        else:
            label = None
        plt.plot(x, y, color='darkorange', label=label)
    plt.title(title)
# Generate data
g1 = np.random.uniform(low=-1.5, high=-1, size=(1,100))
g2 = np.random.uniform(low=1, high=1.5, size=(1,100))
X = np.append(g1, g2)
# Shuffle data
X = shuffle(X)
X = X.reshape(-1, 1)
# Define parameters for grid search
parameters = {
    'n_components': [1, 2, 3, 4],
}
# Create permutations of parameter settings
keys, values = zip(*parameters.items())
param_grid = [dict(zip(keys, v)) for v in itertools.product(*values)]
# Run GridSearch using parallel for loop
list_clf = [None] * len(param_grid)
num_cores = multiprocessing.cpu_count()
list_clf = Parallel(n_jobs=num_cores)(delayed(fitmodel)(X, params) for params in param_grid)
# Print best model (based on lower bound on model evidence)
lower_bounds = [x.lower_bound_ for x in list_clf] # Extract lower bounds on model evidence
idx = int(np.argmax(lower_bounds)) # Find best model
best_estimator = list_clf[idx]
print(f'Parameter setting of best model: {param_grid[idx]}')
print(f'Components weights: {best_estimator.weights_}')
# Plot data points and gaussian components
ax = plt.subplot(2, 1, 1)
if best_estimator.weight_concentration_prior_type == 'dirichlet_process':
    prior_label = 'Dirichlet process'
elif best_estimator.weight_concentration_prior_type == 'dirichlet_distribution':
    prior_label = 'Dirichlet distribution'
plot_results(X, best_estimator.means_, best_estimator.covariances_,
             f'Best Bayesian GMM | {prior_label} prior')
# Plot histogram of weights
ax = plt.subplot(2, 1, 2)
for k, w in enumerate(best_estimator.weights_):
    plt.bar(k, w)
    plt.text(k, w + 0.01, "%.1f%%" % (w * 100.))
ax.yaxis.grid(True, alpha=0.7)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)
plt.ylabel('Component weight')
plt.ylim(0, np.max(best_estimator.weights_)+0.25*np.max(best_estimator.weights_))
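The weight-threshold idea raised in the question can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `active_components`, the 1% cutoff, and the example weights are all hypothetical, and the cutoff is exactly the kind of arbitrary choice the question asks about. With a fitted model, the input would be best_estimator.weights_.

```python
def active_components(weights, threshold=0.01):
    """Return the indices of mixture components whose weight exceeds the threshold."""
    return [k for k, w in enumerate(weights) if w > threshold]

# Hypothetical weights resembling the situation described above:
# four components, two of which carry almost no mass.
weights = [0.498, 0.497, 0.003, 0.002]

active = active_components(weights)
n_active = len(active)
```

The result depends directly on the threshold, so it is worth checking how stable the count is over a range of cutoffs before reading anything into it.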

Run.get_context() gives the same run id

I am submitting the training through a script file. Following is the content of the script. Azure ML is treating all these as one run (instead of run per alpha value as coded below) as Run.get_context() is returning the same Run id.
from azureml.opendatasets import Diabetes
from azureml.core import Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import joblib # sklearn.externals.joblib was removed in scikit-learn 0.23
import math
import os
import logging
# Load dataset
dataset = Diabetes.get_tabular_dataset()
df = dataset.to_pandas_dataframe()
# Split X (independent variables) & Y (target variable)
x_df = df.dropna() # Remove rows that have missing values
y_df = x_df.pop("Y") # Y is the label/target variable
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)
print('Original dataset size:', df.size)
print("Size after dropping 'na':", x_df.size)
print("Training split size: ", x_train.size)
print("Test split size: ", x_test.size)
# Training
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
# Create and log interactive runs
output_dir = os.path.join(os.getcwd(), 'outputs')
for hyperparam_alpha in alphas:
    # Get the experiment run context
    run = Run.get_context()
    print("Started run: ", run.id)
    run.log("train_split_size", x_train.size)
    run.log("test_split_size", x_test.size)
    run.log("alpha_value", hyperparam_alpha)
    # Train
    print("Train ...")
    model = Ridge(hyperparam_alpha)
    model.fit(X = x_train, y = y_train)
    # Predict
    print("Predict ...")
    y_pred = model.predict(X = x_test)
    # Calculate & log error
    rmse = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
    run.log("rmse", rmse)
    print("rmse", rmse)
    # Serialize the model to local directory
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir, exist_ok=True)
    print("Save model ...")
    model_name = "model_alpha_" + str(hyperparam_alpha) + ".pkl" # Pickle file
    file_path = os.path.join(output_dir, model_name)
    joblib.dump(value = model, filename = file_path)
    # Upload the model
    run.upload_file(name = model_name, path_or_stream = file_path)
    # Complete the run
    run.complete()
Experiments view
Authoring code (i.e. control plane)
import os
from azureml.core import Workspace, Experiment, RunConfiguration, ScriptRunConfig, VERSION, Run
ws = Workspace.from_config()
exp = Experiment(workspace = ws, name = "diabetes-local-script-file")
# Create new run config obj
run_local_config = RunConfiguration()
# This means that when we run locally, all dependencies are already provided.
run_local_config.environment.python.user_managed_dependencies = True
# Create new script config
script_run_cfg = ScriptRunConfig(
source_directory = os.path.join(os.getcwd(), 'code'),
script = '',
run_config = run_local_config)
run = exp.submit(script_run_cfg)
Short Answer
Option 1: create child runs within run
run = Run.get_context() assigns the run object of the run that you're currently in to run. So in every iteration of the hyperparameter search, you're logging to the same run. To solve this, you need to create child (or sub-) runs for each hyperparameter value. You can do this with run.child_run(). Below is the template for making this happen.
run = Run.get_context()
for hyperparam_alpha in alphas:
    # Create a child run for this hyperparameter value
    run_child = run.child_run()
    print("Started run: ", run_child.id)
    run_child.log("train_split_size", x_train.size)
On the diabetes-local-script-file Experiment page, you can see that Run 9 was the parent run and Runs 10-19 were the child runs if you select "Include child runs" on that page. There is also a "Child runs" tab on the Run 9 details page.
Long answer
I highly recommend abstracting the hyperparameter search away from the data plane (i.e. the training script) and into the control plane (i.e. the "authoring code"). This becomes especially valuable as training time increases: you can parallelize arbitrarily and also choose hyperparameters more intelligently by using Azure ML's Hyperdrive.
Option 2 Create runs from control plane
Remove the loop from your script and add code like the following (full data and control plane code here):
import argparse
from pprint import pprint
parser = argparse.ArgumentParser()
parser.add_argument('--alpha', type=float, default=0.5)
args = parser.parse_args()
print("all args:")
pprint(vars(args))
# use the variable like this
model = Ridge(args.alpha)
Below is how to submit a run using a script argument. To submit multiple runs, just use a loop in the control plane.
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
list_rcs = [ScriptRunConfig(
    source_directory = os.path.join(os.getcwd(), 'code'),
    script = '',
    arguments = ['--alpha', a],
    run_config = run_local_config) for a in alphas]
list_runs = [exp.submit(rc) for rc in list_rcs]
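How the control-plane loop and the script's argument parsing fit together can be sketched without Azure ML at all. The snippet below is only an illustration: `parse_alpha` is a hypothetical helper that mirrors the argparse code above, and the argument lists stand in for what each submitted run would receive on its command line.

```python
import argparse

def parse_alpha(argv):
    """Parse --alpha the same way the training script above does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--alpha', type=float, default=0.5)
    return parser.parse_args(argv).alpha

alphas = [0.1, 0.2, 0.3]

# One argument list per run, as the control-plane loop would build them.
argument_lists = [['--alpha', str(a)] for a in alphas]

# What each run's script would see after parsing its own arguments.
received = [parse_alpha(argv) for argv in argument_lists]
default_alpha = parse_alpha([])
```

Each run gets exactly one alpha, so the script no longer needs a loop, and every value shows up as its own run.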
Option 3 Hyperdrive (IMHO the recommended approach)
This way you outsource the hyperparameter search to Hyperdrive. The UI will also report results exactly how you want them, and via the API you can easily download the best model. Note that you can't use this locally anymore and must use AMLCompute, but to me it is a worthwhile trade-off. This is a great overview. Excerpt below (full code here)
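param_sampling = GridParameterSampling( {
    "alpha": choice(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
    }
)
estimator = Estimator(
    source_directory = os.path.join(os.getcwd(), 'code'),
    entry_script = '',
    environment_definition=Environment.get(workspace=ws, name="AzureML-Tutorial")
)
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
    hyperparameter_sampling=param_sampling,
    primary_metric_name="rmse", # assumed: the metric the script logs
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE, # assumed: lower RMSE is better
    max_total_runs=10) # assumed: one run per alpha value
run = exp.submit(hyperdrive_run_config)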

best-found PCA estimator to be used as the estimator in RFECV

This works (mostly from the demo sample at sklearn):
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])
# Plot the PCA spectrum
plt.figure(1, figsize=(16, 9))
plt.axes([.2, .2, .7, .7])
pca.fit(data_num)
plt.plot(pca.explained_variance_, linewidth=2)
# Prediction
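n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50).astype(int)
# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator_pca = GridSearchCV(pipe,
                             dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)
plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen ' +
            str(estimator_pca.best_estimator_.named_steps['pca'].n_components))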
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)
And this works:
from sklearn.feature_selection import RFECV
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num, data_labels)
print("Selected number of features : %d" % selector.n_features_)
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
But this gives me the error "RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes" on the line "selector1 = selector1.fit(data_num, data_labels)":
pca_est = estimator_pca.best_estimator_
selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num, data_labels)
print("Selected number of features : %d" % selector1.n_features_)
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
How do I get my best-found PCA estimator to be used as the estimator in RFECV?
This is a known issue in pipeline design. Refer to the github page here:
Accessing fitted attributes:
Moreover, some fitted attributes are used by meta-estimators;
AdaBoostClassifier assumes its sub-estimator has a classes_ attribute
after fitting, which means that presently Pipeline cannot be used as
the sub-estimator of AdaBoostClassifier.
Either meta-estimators such as AdaBoostClassifier need to be
configurable in how they access this attribute, or meta-estimators
such as Pipeline need to make some fitted attributes of sub-estimators accessible.
The same goes for other attributes like coef_ and feature_importances_: they belong to the last estimator, so the pipeline does not expose them.
You can follow the last paragraph there and work around this by exposing them on the pipeline yourself, with something like this:
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
Then use this new pipeline class in your code instead of the original Pipeline.
This should work in most cases, but not in yours. You are doing feature reduction with PCA inside the pipeline, but you also want to do feature selection with RFECV. In my opinion, this is not a good combination.
RFECV will keep decreasing the number of features to be used, but n_components in the best PCA selected by the grid search above is fixed. It will therefore throw an error again once the number of features drops below n_components, and there is nothing you can do in that case.
So I would advise you to rethink your use case and code.
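The property-delegation idea behind Mypipeline can be seen in isolation with a small scikit-learn-free sketch. All three classes below are hypothetical stand-ins: `TinyPipeline` only mimics the fact that a pipeline holds its last step without exposing that step's fitted attributes, and `ExposingPipeline` adds the property in the same way Mypipeline does.

```python
class FinalStep:
    """Stand-in for a fitted estimator that has a coef_ attribute."""
    def __init__(self):
        self.coef_ = [0.5, -1.2]

class TinyPipeline:
    """Stand-in for a pipeline: holds (name, step) pairs, exposes nothing fitted."""
    def __init__(self, steps):
        self.steps = steps

    @property
    def _final_estimator(self):
        return self.steps[-1][1]

class ExposingPipeline(TinyPipeline):
    """Forwards coef_ from the last step, as Mypipeline does."""
    @property
    def coef_(self):
        return self._final_estimator.coef_

plain = TinyPipeline([('regress', FinalStep())])
exposing = ExposingPipeline([('regress', FinalStep())])

plain_has_coef = hasattr(plain, 'coef_')
exposed_coef = exposing.coef_
```

A meta-estimator that looks for coef_ on the object it was given would find it on the subclass but not on the plain pipeline, which is the behaviour behind the RuntimeError in the question.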

Ax = b solver on coordinate matrix Apache Spark

How can I solve the Ax = b problem using Apache Spark? My input is a coordinate matrix:
import numpy as np
import scipy
from scipy import sparse
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
A = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
#take the first column of A
b = A.tocsc()[:, 0]
#Solve Ax = b
Now I want to solve for x in Ax = b using the Python libraries of the Apache Spark framework. The solution should be [1, 0, 0, 0], since b is the first column of A.
Below is the Apache Spark linear regression. Now, how do I set up the problem such that the input is a coordinate matrix (A) and coordinate vector (b)?
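from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)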
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
How can I solve the Ax = b problem using Apache spark.
Directly (analytically), you cannot: Spark doesn't provide a linear algebra library for this.
Indirectly, you can use pyspark.ml.regression.LinearRegression to approximately solve the OLS problem. You can refer to:
API docs.
MLlib guide
for details regarding expected input and required steps.
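One way to frame the OLS route is to treat each row of A as a feature vector and the matching entry of b as its label. The plain-Python sketch below only shows that regrouping step for the coordinate data from the question; `coo_to_labeled_rows` is a hypothetical helper, and b is written out densely by hand.

```python
from collections import defaultdict

def coo_to_labeled_rows(rows, cols, values, b, n_rows):
    """Group COO triples by row: one (label, {column: value}) record per row of A."""
    features = defaultdict(dict)
    for r, c, v in zip(rows, cols, values):
        features[r][c] = float(v)
    return [(float(b[i]), dict(features[i])) for i in range(n_rows)]

rows = [0, 3, 1, 0]
cols = [0, 3, 1, 2]
values = [4, 5, 7, 9]
b = [4, 0, 0, 0]  # first column of A, written out densely

records = coo_to_labeled_rows(rows, cols, values, b, n_rows=4)
```

In Spark, each record would presumably become a DataFrame row with a label and a sparse feature vector, and the regression would need fitIntercept=False and regParam=0.0 to correspond to plain least squares. Note also that row 2 of this A is empty, so the matrix is singular and [1, 0, 0, 0] is only one of several exact solutions.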

Wrong intercept in Spark linear regression

I am starting with Spark Linear Regression. I am trying to fit a line to a linear dataset. It seems that the intercept is not adjusting correctly, or I am probably missing something.
With intercept=False:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=False)
This seems normal. But when I use intercept=True:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=True)
The model that I get in the last case is exactly:
(weights=[0.0353471289751], intercept=1.0005127185289888)
I have tried different datasets, step sizes and iterations, but whenever the model converges, the intercept is about 1.
EDIT - This is the code I am using:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
import numpy as np
import matplotlib.pyplot as plt
from pyspark import SparkContext
sc = SparkContext("local", "regression")
# Generate data
SIZE = 300
SLOPE = 0.1
BASE = -30
NOISE = 10
x = np.arange(SIZE)
delta = np.random.uniform(-NOISE,NOISE, size=(SIZE,))
y = BASE + SLOPE*x + delta
data = zip(range(len(y)), y) # zip with index
dataRDD = sc.parallelize(data)
# Normalize data
# mean = np.mean(data)
# std = np.std(data)
# dataRDD = dataRDD.map(lambda r: (r[0], (float(r[1])-mean)/std))
labeledData = dataRDD.map(lambda r: LabeledPoint(float(r[1]), [float(r[0])]))
# Create linear model
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=1000, step=0.0002, intercept=True, convergenceTol=0.000001)
print(linear_model)
true_vs_predicted = labeledData.map(lambda p: (p.label, linear_model.predict(p.features))).collect()
fig = plt.figure()
ax = fig.add_subplot(111)
y_real = [x[0] for x in true_vs_predicted]
y_pred = [x[1] for x in true_vs_predicted]
plt.plot(range(len(y_real)), y_real, 'o', markersize=5, c='b')
plt.plot(range(len(y_pred)), y_pred, 'o', markersize=5, c='r')
This is because the number of iterations and the step size are both too small. As a result, the optimization stops before it reaches the optimum.
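The effect can be reproduced outside Spark with a small plain-Python sketch. This is full-batch gradient descent written for illustration, not Spark's SGD implementation; `fit_gd`, the noise-free data, the step sizes and the iteration counts are all assumptions chosen to mirror the setup in the question. With the raw feature (values up to 299) only a tiny step is stable, and the intercept barely moves from its starting value; after standardizing the feature, a much larger step is stable and the intercept is recovered.

```python
def fit_gd(xs, ys, step, iterations):
    """Full-batch gradient descent for y ~ w*x + b, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Noise-free version of the data in the question: y = -30 + 0.1 * x
xs = [float(i) for i in range(300)]
ys = [-30.0 + 0.1 * x for x in xs]

# Raw feature, small step, few iterations: the intercept stays near its start.
w_raw, b_raw = fit_gd(xs, ys, step=1e-5, iterations=100)

# Standardized feature: a larger step is stable and the fit converges.
mean_x = sum(xs) / len(xs)
std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5
xs_std = [(x - mean_x) / std_x for x in xs]
w_std, b_std = fit_gd(xs_std, ys, step=0.1, iterations=200)

# Map the standardized fit back to the original feature scale.
slope = w_std / std_x
intercept = b_std - w_std * mean_x / std_x
```

So besides raising the iteration count and step size, scaling the feature (the normalization that is commented out in the question's code) is one way to let the intercept converge.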
