accord.net svm incremental training - svm

I am using SVM in Accord.net for time series modeling. I train it once with available data (say 5000). After that, I get new data every second, I want to update my SVM machine in incremental fashion using one data every second, is that possible?

While it is now possible to learn SVMs incrementally in Accord.NET, this functionality is unfortunately only available for linear machines. Since you are working with time series, I am assuming you are working with a kernel SVM using the Dynamic Time Warping kernel.
If you are able to extract fixed-length features from your sequences, then what you ask may be possible if you feed those features to a linear machine instead and train it using Stochastic Gradient Descent or Averaged Stochastic Gradient Descent, as shown below:
// In this example, we will learn a multi-class SVM using the one-vs-one (OvO)
// approach. The OvO approacbh can decompose decision problems involving multiple
// classes into a series of binary ones, which can then be solved using SVMs.
// Ensure we have reproducible results
Accord.Math.Random.Generator.Seed = 0;
// We will try to learn a classifier
// for the Fisher Iris Flower dataset
var iris = new Iris();
double[][] inputs = iris.Instances; // get the flower characteristics
int[] outputs = iris.ClassLabels; // get the expected flower classes
// We will use mini-batches of size 32 to learn a SVM using SGD
var batches = MiniBatches.Create(batchSize: 32, maxIterations: 1000,
shuffle: ShuffleMethod.EveryEpoch, input: inputs, output: outputs);
// Now, we can create a multi-class teaching algorithm for the SVMs
var teacher = new MulticlassSupportVectorLearning<Linear, double[]>
{
// We will use SGD to learn each of the binary problems in the multi-class problem
Learner = (p) => new StochasticGradientDescent<Linear, double[], LogisticLoss>()
{
LearningRate = 1e-3,
MaxIterations = 1 // so the gradient is only updated once after each mini-batch
}
};
// The following line is only needed to ensure reproducible results. Please remove it to enable full parallelization
teacher.ParallelOptions.MaxDegreeOfParallelism = 1; // (Remove, comment, or change this line to enable full parallelism)
// Now, we can start training the model on mini-batches:
foreach (var batch in batches)
{
teacher.Learn(batch.Inputs, batch.Outputs);
}
// Get the final model:
var svm = teacher.Model;
// Now, we should be able to use the model to predict
// the classes of all flowers in Fisher's Iris dataset:
int[] prediction = svm.Decide(inputs);
// And from those predictions, we can compute the model accuracy:
var cm = new GeneralConfusionMatrix(expected: outputs, predicted: prediction);
double accuracy = cm.Accuracy; // should be approximately 0.973

Related

cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not be able to find the correct way to pass this to the sklearn GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = custom_cv, n_jobs = -1, verbose = 0,scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X and y are respectivelly the features matrix and y is the vector of labels.
Example: Supose that I only have one hyperparameter alpha that belongs to the set{1,2,3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients os a regression) using the samples associated with listTrain and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error.
EDIT: Actual answer to question. Try passing cv command a generator of the indices:
def index_gen(listTrain, listTest):
yield listTrain, listTest
grid_search = GridSearchCV(estimator = model, param_grid =
param_grid,cv = index_gen(listTrain, listTest), n_jobs = -1,
verbose = 0,scoring=errorType)
EDIT: Before Edits:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = 5,
n_jobs = -1, verbose = 0,scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain]
Now, your training set will be split into 5 (you can choose the number here) folds, trained using 4 of those folds on a specific set of hyperparameters, and tested the fold that was left out. This is repeated 5 times, till all of your training examples have been part of a left out set. This whole procedure is done for each hyperparameter setting you are testing (5x3 in this case)
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain],
y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at a wrong method. Cross validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate prediction for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as #user3490622. If we can only use cross_val_predict on training and testing sets, why y (target) is None as the default value? (sklearn page)
To partially achieve the desired results of multiple predicted probability, one could use the fit then predict approach repeatedly to mimic the cross-validation.

How to tune weights in Voting Classifier (Sklearn)

I am trying to do the following:
vc = VotingClassifier(estimators=[('gbc',GradientBoostingClassifier()),
('rf',RandomForestClassifier()),('svc',SVC(probability=True))],
voting='soft',n_jobs=-1)
params = {'weights':[[1,2,3],[2,1,3],[3,2,1]]}
grid_Search = GridSearchCV(param_grid = params, estimator=vc)
grid_Search.fit(X_new,y)
print(grid_Search.best_Score_)
In this, I want to tune the parameter weights. If I use GridSearchCV, it is taking a lot of time. Since it needs to fit the model for each iteration. Which is not required, I guess. Better would be use something like prefit used in SelectModelFrom function from sklearn.model_selection.
Is there any other option or I am misinterpreting something?
The following code (in my repo) would do this.
It contains a class VotingClassifierCV. It first makes cross-validated predictions for all classifiers. Then loops over all weights, choosing the best combination, and using pre-calculated predictions.
A compute friendlier way would be to first parameter tune each classifier separately on your training data. Then weight each classifier proportional to your target metric (say accuracy_score) from your validate data.
# parameter tune
models = {
'rf': GridSearchCV(rf_params, RandomForestClassifier()).fit(X_trian, y_train),
'svc': GridSearchCV(svc_params, SVC()).fit(X_train, y_train),
}
# relative weights
model_scores = {
name: sklearn.metrics.accuracy_score(
y_validate,
model.predict(X_validate),
normalized=True
)
for name, model in models.items()
}
total_score = sum(model_scores.values())
# combine the parts
combined_model = VotingClassifier(
list(models.items()),
weights=[
model_scores[name] / total_score
for name in models.keys()
]
).fit(X_learn, y_learn)
Finally, you may fit the combined model with your learning (train + validate) data & evaluate with your test data.

How to apply random forest properly?

I am new to machine learning and python. Now I am trying to apply random forest to predict binary results of a target. In my data I have 24 predictors (1000 observations) where one of them is categorical(gender) and all the others numerical. Among numerical ones, there are two types of values which are volume of money in euros (very skewed and scaled) and numbers (number of transactions from an atm). I have transformed the big scale features and did the imputation. Last, I have checked correlation and collinearity and based on that removed some features (as a result I had 24 features.) Now when I implement RF it is always perfect in the training set while the ratios not so good according to crossvalidation. And even applying it in the test set it gives very very low recall values. How should I remedy this?
def classification_model(model, data, predictors, outcome):
# Fit the model:
model.fit(data[predictors], data[outcome])
# Make predictions on training set:
predictions = model.predict(data[predictors])
# Print accuracy
accuracy = metrics.accuracy_score(predictions, data[outcome])
print("Accuracy : %s" % "{0:.3%}".format(accuracy))
# Perform k-fold cross-validation with 5 folds
kf = KFold(data.shape[0], n_folds=5)
error = []
for train, test in kf:
# Filter training data
train_predictors = (data[predictors].iloc[train, :])
# The target we're using to train the algorithm.
train_target = data[outcome].iloc[train]
# Training the algorithm using the predictors and target.
model.fit(train_predictors, train_target)
# Record error from each cross-validation run
error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
# Fit the model again so that it can be refered outside the function:
model.fit(data[predictors], data[outcome])
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score = True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)
In Random Forest it is very easy to overfit. To resolve this you need to do parameter search a little more rigorously to know the best parameter to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
) is the link on how to do this: (from the scikit doc).
It is overfitting and you need to search for the best parameter that will work work on the model. The link provides implementation for Grid and Randomized search for hyper parameter estimation.
And it will also be fun to go through this MIT Artificial Intelligence lecture to get get deep theoretical orientation: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.
Hope this helps!

SVM overfitting in scikit learn

I am building digit recognition classification using SVM. I have 10000 data and I split them to training and test data with a ratio 7:3. I use linear kernel.
The results turn out that training accuracy is always 1 when change training example numbers, however the test accuracy is just around 0.9 ( I am expecting a much better accuracy, at least 0.95). I think the results indicates overfitting. However, I worked on the parameters, like C, gamma, ... they don't change the results very much.
Can anyone help me out with how to deal with overfitting in SVM? Thanks very much in advance for your time and help.
The following is my code:
from sklearn import svm, cross_validation
svc = svm.SVC(kernel = 'linear',C = 10000, gamma = 0.0, verbose=True).fit(sample_X,sample_y_1Num)
clf = svc
predict_y_train = clf.predict(sample_X)
predict_y_test = clf.predict(test_X)
accuracy_train = clf.score(sample_X,sample_y_1Num)
accuracy_test = clf.score(test_X,test_y_1Num)
#conduct cross-validation
cv = cross_validation.ShuffleSplit(sample_y_1Num.size, n_iter=10,test_size=0.2, random_state=None)
scores = cross_validation.cross_val_score(clf,sample_X,sample_y_1Num,cv = cv)
score_mean = mean(scores)
One way to reduce the overfitting is by adding more training observations. Since your problem is digit recognition, it easy to synthetically generate more training data by slightly changing the observations in your original data set. You can generate 4 new observations from each of your existing observations by shifting the digit images one pixel left, right, up, and down. This will greatly increase the size of your training data set and should help the classifier learn to generalize, instead of learning the noise.

Resources