Spark Cross Validation with Training, Testing and Validation sets

I want to do two cross-validation processes in Spark using randomSplit, along these lines:
CV_global: split the data into a training set (90%) and a testing set (10%).
1.1. CV_grid: grid search on half of the training set, i.e. 45% of the data.
1.2. Fit model: fit on the training set (90%) using the best settings from CV_grid.
1.3. Test model: evaluate on the testing set (10%).
Report average per-fold metrics (10-fold) and global metrics.
The problem is I only find examples using CV and Grid search on the whole training set.
How can I get the parameters of the best performing model from CV_grid?
How to do CV without grid search but get stats per fold? e.g.
sklearn.cross_validation.cross_val_score

You have things like
crossval.setEstimatorParamMaps(paramGrid)
and then
cvModel = crossval.fit(trainingSetDF).bestModel
For single models (at least for some of them) there are functions like explainParams(). It's available in Spark 1.6 (maybe it goes back to 1.4.2, I'm not sure).
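Putting those pieces together, here is a minimal PySpark sketch; the LogisticRegression estimator, the evaluator, and trainingSetDF are assumed names, and exactly which methods a fitted model exposes varies a bit by Spark version:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()  # hypothetical estimator; swap in your own
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.1])
             .build())
crossval = (CrossValidator()
            .setEstimator(lr)
            .setEstimatorParamMaps(paramGrid)
            .setEvaluator(BinaryClassificationEvaluator())
            .setNumFolds(10))
cvModel = crossval.fit(trainingSetDF)   # trainingSetDF: the 90% training split
bestModel = cvModel.bestModel
print(bestModel.explainParams())        # human-readable dump of the best model's parameters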
Hope this helps

You have three questions in one. The answers to each:
1. The problem is I only find examples using CV and Grid search on the whole training set.
If you need just a portion of your training dataset, sample it at the desired fraction, e.g.
training = training.sample(false, .45, 78L)
2. How can I get the parameters of the best performing model from CV_grid?
crossValidatedModel.bestModel.extractParamMap()
From there you can read the parameter names and then their values (see the sketch at the end of this answer).
3. How to do CV without grid search but get stats per fold? e.g.
duplicate of How can I access computed metrics for each fold in a CrossValidatorModel
Take a look here: Spark CrossValidatorModel access other models than the bestModel?
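As a hedged PySpark sketch of point 2 (cvModel is an assumed fitted CrossValidatorModel whose bestModel is a plain model, not a PipelineModel; the exact API surface differs slightly between Scala and Python):

best = cvModel.bestModel
for param, value in best.extractParamMap().items():
    print(param.name, value)  # parameter name and the value the best model was fitted with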

Related

when to normalize data with zscore (before or after split)

I was taking a Udemy course which made a strong case for normalizing only the train data (after the split from the test data), since the model will typically be used on fresh data with features on the scale of the original set. And if you scale the test data, then you are not scoring the model properly.
On the other hand, what I found was that my two-class logistic regression model (created with Azure Machine Learning Studio) was getting terrible results after Z-Score scaling only the train data.
a. Is this a problem only with Azure's tools?
b. What is a good rule of thumb for when feature data needs to be scaled (one, two, or three orders of magnitude in difference)?
Not scoring the model properly due to normalized test set doesn't seem to make sense:
you would presumably also normalize data that you use for predictions in the future.
I found this similar question on Data Science Stack Exchange, and the top answer suggests not only that the test data has to be normalized, but that you need to apply the exact same scaling as you applied to the training data, because the scale of your data is also taken into account by your model: differently scaled test/prediction data would potentially lead to over- or under-exaggeration of a feature.
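A minimal scikit-learn sketch of that advice (X_train, X_test and X_new are assumed names): fit the scaler on the training data only, then reuse the same fitted scaler for the test set and for any future prediction data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std; do not refit
# Future prediction data gets the same treatment:
# X_new_scaled = scaler.transform(X_new)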

How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a random forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics across all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I only found examples where the evaluation is performed later over an independent test dataset and using the best model from the Cross-Validation. I'm not planning to use a train and test set, but to use all the dataframe (df) for the cross validation, let it make the splits, and then take the average metrics.
paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, I obtain the best model's metric with the last line of the above code, evaluator.evaluate(model.transform(df)), and I'm not totally sure that I'm doing it correctly.
In Spark 2.x, it is possible to get the average metrics using model.avgMetrics. This returns an array of doubles, with one averaged metric per parameter map used to train your cross-validation model.
For MulticlassClassificationEvaluator, the metric is one of f1, weightedPrecision, weightedRecall or accuracy (as documented here); it can be changed as needed using the corresponding setter on the evaluator class.
If you also need to get the best model parameters chosen by the cross-validator, please see my answer here.
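A hedged sketch of that, reusing the crossval and df names from the question (Spark 2.x PySpark):

cvModel = crossval.fit(df)
# One averaged value per entry in paramGrid, computed with the evaluator's
# currently configured metric (f1 by default for MulticlassClassificationEvaluator).
print(cvModel.avgMetrics)
# To average a different metric, set it on the evaluator before fitting, e.g.:
# evaluator.setMetricName("weightedRecall")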

Sklearn overfitting

I have a dataset containing 1000 points, each with 2 inputs and 1 output. It has been split 80% for training and 20% for testing. I am training a support vector regressor from sklearn on it. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism that guarantees your split is fair and random. So you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects, making it a matter of configuring a YAML file and letting the tool do all the work for you. It is available on my GitHub account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
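A small sketch of that idea (train_features/test_features are the assumed names from above):

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
train_features_poly = poly.fit_transform(train_features)  # adds x1*x2, x1**2, x2**2, ...
test_features_poly = poly.transform(test_features)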
Try fitting your data using sklearn's K-Fold cross-validation on the training split; this gives you a fair split of the data and a better model, though at a cost in runtime, which should hardly matter for a small dataset where the priority is accuracy.
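For example, a hedged sketch of scoring an SVR with 5-fold cross-validation on the training split only (modern sklearn.model_selection API; names assumed):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVR(kernel='rbf', C=1.0), train_features, train_target, cv=cv)
print(scores.mean(), scores.std())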
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it; a smaller C corresponds to stronger regularization of the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note on @shahins' answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
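One hedged way to combine scaling with the grid search above, so the scaler is fit only on the training portion of each CV split (names assumed):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
parameters = {'svr__kernel': ('linear', 'rbf'), 'svr__C': (0.1, 1, 10)}
clf = GridSearchCV(pipe, parameters)
clf.fit(train_features, train_target)
print(clf.best_params_)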

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I got one dataset where I have multiple values I want to predict. All of them use different algorithms, and when I try to train multiple models within one experiment it says "the train model can only predict one value", and there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output so I don't have to have multiple APIs?
What you would want to do is to train each model and save them as already trained models.
So create a new experiment, train your models, and save them by right-clicking on each model; they will then show up in the left nav bar in the Studio. Now you can drag your saved models onto the canvas and have them score predictions, eventually merging everything into the same output, as I have done in my example through the "Add Columns" module. I made this example for Ronaldo (the Real Madrid CF player), predicting how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, check out Raymond Langaeian's (MSFT) answer in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict, then add all those predicted columns together to get a single output for the web service.
The algorithms available in Azure ML are only capable of predicting a single variable at a time based on the inputs they get.

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score, so I would like the cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle
Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, use the original sample idx as the stratification info, and combine that with a custom score function that ignores the non-original samples in the model evaluation.
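A hedged sketch of that suggestion (all names are assumptions: X_exp/y_exp are the 10000 expanded rows and their labels, orig_idx maps each row back to its original sample id, and is_original flags the one untouched row per original sample):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_rows, test_rows in skf.split(X_exp, orig_idx):   # stratify on the original sample id
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_exp[train_rows], y_exp[train_rows])
    keep = test_rows[is_original[test_rows]]                # evaluate on original rows only
    scores.append(clf.score(X_exp[keep], y_exp[keep]))
print(np.mean(scores))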
