How to Remove the Automatically Created Legend on an SVM Plot

I have created an SVM model and have successfully mapped out the contour lines in two dimensions. See the example below, where I define two variables "x" and "y" which are used to determine the classification of "indicator" using SVM. For simplicity, the training set is 50% of the original data and the test set is the other 50%.
My issue is with the SVM plot. I just want to get rid of (or possibly move) the legend that appears to the right of the plot, since it throws off the scaling when I try to add other points to the plot later on.
library(e1071)    # provides svm() and plot.svm()
# simulate the data; x and y are used to predict indicator
x = c(rep(1,10), rep(10,10), rep(20,10))
y = c(rep(2,10), rep(12,10), rep(24,10))
indicator = c(rep("low", 10), rep("med",10), rep("high",10))
df = data.frame(x,y,as.factor(indicator))
# create train and test sets, 50% of data each
set.seed(123)
end = length(df$x)
i = 1:end
train = sample(i, (.5 * end), replace = FALSE)
train = sort(train)
test = setdiff(i,train)
TRAINSET = df[train,]
TESTSET = df[test,]
# use SVM to create a predictive model
svm.model = svm(TRAINSET[,3] ~ ., data = TRAINSET[,c(1,2)], cost = 1000, gamma = .5)
# plot SVM
plot(svm.model, TRAINSET)
I can use the legend() function to add legends, but I cannot figure out how to remove the one that appears automatically when plotting an SVM model.
Thanks -- I appreciate any help with this!

Related

How can I use VIPS for image normalization?

I want to normalize the exposure and color palettes of a set of images. For context, this is for training a neural net for image classification on medical images. I'm also doing this for hundreds of thousands of images, so efficiency is very important.
So far I've been using VIPS, specifically PyVIPS, and would prefer a solution using that library. After finding this answer and looking through the documentation, I tried
x = pyvips.Image.new_from_file('test.ndpi')
x = x.hist_norm()
x.write_to_file('test_normalized.tiff')
but that seems to always produce a pure-white image.
You need hist_equal for histogram equalisation.
The main docs are here:
https://libvips.github.io/libvips/API/current/libvips-histogram.html
However, that will be extremely slow for large slide images. It will need to scan the whole slide once to build the histogram, then scan again to equalise it. It would be much faster to find the histogram of a low-res layer, then use that to equalise the high-res one.
For example:
#!/usr/bin/env python3
import sys
import pyvips
# open the slide image and get the number of layers ... we are not fetching
# pixels, so this is quick
x = pyvips.Image.new_from_file(sys.argv[1])
levels = int(x.get("openslide.level-count"))
# find the histogram of the highest level ... again, this should be quick
x = pyvips.Image.new_from_file(sys.argv[1], level=levels - 1)
hist = x.hist_find()
# from that, compute the transform for histogram equalisation
equalise = hist.hist_cum().hist_norm()
# and use that on the full-res image
x = pyvips.Image.new_from_file(sys.argv[1])
x = x.maplut(equalise)
x.write_to_file(sys.argv[2])
Another factor is that histogram equalisation is non-linear, so it will distort lightness relationships. It can also distort colour relationships and make noise and compression artifacts look crazy. I tried that program on an image I have here:
$ ~/try/equal.py bild.ndpi[level=7] y.jpg
The stripes are from the slide scanner and the ugly fringes from compression.
I think I would instead find image max and min from the low-res level, then use them to do a simple linear stretch of pixel values.
Something like:
x = pyvips.Image.new_from_file(sys.argv[1])
levels = int(x.get("openslide.level-count"))
x = pyvips.Image.new_from_file(sys.argv[1], level=levels - 1)
mn = x.min()
mx = x.max()
x = pyvips.Image.new_from_file(sys.argv[1])
x = (x - mn) * (256 / (mx - mn))
x.write_to_file(sys.argv[2])
Did you find the new Region feature in pyvips? It makes generating patches for training MUCH faster, up to 100x faster in some cases:
https://github.com/libvips/pyvips/issues/100#issuecomment-493960943
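For reference, a rough sketch of what Region-based patch extraction can look like in recent pyvips versions (the file name, patch size and coordinates here are only illustrative; see the linked issue and the pyvips docs for the exact API in your version):
import numpy as np
import pyvips

# open the slide; Region needs random access, so no sequential hint
image = pyvips.Image.new_from_file("slide.ndpi")

# a Region is a small pixel cache attached to one image
reg = pyvips.Region.new(image)

# fetch a 512 x 512 patch at (1024, 1024) as a raw byte buffer
patch_bytes = reg.fetch(1024, 1024, 512, 512)

# wrap the buffer as a numpy array: height x width x bands, uint8
patch = np.ndarray(buffer=patch_bytes,
                   dtype=np.uint8,
                   shape=[512, 512, image.bands])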

Cross Validation in Sklearn Using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = custom_cv, n_jobs = -1, verbose = 0,scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X is the feature matrix and y is the vector of labels.
Example: Suppose that I only have one hyperparameter alpha, which belongs to the set {1, 2, 3}. I would like to set alpha = 1, estimate the parameters of the model (for instance, the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha = 2 and finally for alpha = 3, and choose the alpha that minimizes the error.
EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest),
                           n_jobs=-1, verbose=0, scoring=errorType)
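An equivalent option is sklearn's built-in PredefinedSplit. This is only a sketch, assuming listTrain and listTest together cover every row of X, with model, param_grid and errorType defined as in the question:
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 marks rows that always stay in the training fold,
# 0 marks the rows that form the single held-out fold
test_fold = np.full(len(X), -1, dtype=int)
test_fold[listTest] = 0

custom_cv = PredefinedSplit(test_fold)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=custom_cv, n_jobs=-1, verbose=0,
                           scoring=errorType)
grid_search.fit(X, y)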
EDIT: original answer, before the edits above:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now, your training set will be split into 5 folds (you can choose the number here); the model is trained on 4 of those folds with a specific set of hyperparameters and tested on the fold that was left out. This is repeated 5 times, until every training example has been part of a left-out fold. The whole procedure is done for each hyperparameter setting you are testing (5 folds x 3 values of alpha in this case).
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

How to use Principal Component Analysis while predicting?

Suppose my original dataset has 8 features and I apply PCA with n_components = 3 (I am using sklearn.decomposition.PCA). Then I train my model using those 3 PCA components (which are now my new features).
Do I need to apply PCA while predicting as well ?
Do I need to do that even if I am predicting only one data point?
What confuses me is that when I do prediction, every data point is a row in a 2D matrix (consisting of all the data points I want to predict). So if I apply PCA to just one data point, then the corresponding row vector will be converted to a zero vector.
If you fitted your model on the first three principal components, you have to transform any new data with the same PCA. For example, consider this code taken from here:
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
In the code, they first fit PCA on the training set. Then they transform both the training and test sets, and then they apply the model (in their case, an SVM) to the transformed data.
Even if your X_test consists of only 1 data point, you can still use PCA. Just shape your data as a 2D matrix. For example, if your data point is [1, 2, 0, 5], then X_test = [[1, 2, 0, 5]], i.e. a 2D matrix with 1 row.
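A minimal sketch of that last point, assuming pca and clf were fitted as in the code above and that the point has the 8 features from the question (the feature values are made up):
import numpy as np

new_point = np.array([1, 2, 0, 5, 3, 7, 1, 4])   # one sample with 8 features
new_point_2d = new_point.reshape(1, -1)          # shape (1, 8): a 2D matrix with 1 row

# reuse the PCA fitted on the training data -- do not refit it on the new point
new_point_pca = pca.transform(new_point_2d)      # shape (1, 3)
prediction = clf.predict(new_point_pca)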

How to apply random forest properly?

I am new to machine learning and Python. I am trying to apply a random forest to predict a binary target. My data has 24 predictors (1000 observations); one of them is categorical (gender) and all the others are numerical. Among the numerical ones there are two kinds of values: volumes of money in euros (very skewed and on a large scale) and counts (e.g. the number of transactions at an ATM). I have transformed the large-scale features and done the imputation. Finally, I checked correlation and collinearity and removed some features based on that (as a result I had 24 features). Now, when I run the random forest, it is always perfect on the training set, while the cross-validation scores are much worse, and applying it to the test set gives very low recall values. How should I remedy this?
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])
    # Make predictions on the training set:
    predictions = model.predict(data[predictors])
    # Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train_idx, test_idx in kf.split(data):
        # Filter training data
        train_predictors = data[predictors].iloc[train_idx, :]
        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train_idx]
        # Train the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)
        # Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test_idx, :], data[outcome].iloc[test_idx]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score = True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)
In Random Forest it is very easy to overfit. To resolve this you need to do the hyperparameter search a little more rigorously to find the best parameters to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html) is the link on how to do this (from the scikit-learn docs).
It is overfitting, and you need to search for the parameters that work best for the model. The link provides implementations of grid search and randomized search for hyperparameter estimation; a sketch is given below.
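For illustration, a minimal randomized-search sketch, assuming the train DataFrame and the 'Sold' target from the code above (the parameter ranges are just starting points, not recommendations):
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": [3, 5, 10, 20, None],
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=50, cv=5, scoring="recall",
                            random_state=0, n_jobs=-1)
search.fit(train.drop('Sold', axis=1), train['Sold'])
print(search.best_params_, search.best_score_)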
It will also be fun to go through this MIT Artificial Intelligence lecture to get a deeper theoretical orientation: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.
Hope this helps!

How to SVM Train my Edge images using Java code

I have a set of images on which I performed edge detection using OpenCV 3.1. The edges are stored in OpenCV Mat objects. Can someone help me with Java SVM training and testing code for this set of images?
Following the discussion in the comments, I am providing an example from a project I built for Android Studio a while back. It was used to classify images based on the Lab color space.
//1.a Assign the parameters for SVM training here
double nu = 0.999D;
double gamma = 0.4D;
double epsilon = 0.01D;
double coef0 = 0;
//kernel types are Linear(0), Poly(1), RBF(2), Sigmoid(3), Chi2(4), Inter(5)
//For Poly(1) set degree and gamma
double degree = 2;
int kernel_type = 4;             // Chi2 kernel
//1.b Create an SVM object
SVM B_channel_svm = SVM.create();
B_channel_svm.setType(104);      // 104 is the value of SVM.NU_SVR
B_channel_svm.setNu(nu);
B_channel_svm.setCoef0(coef0);
B_channel_svm.setKernel(kernel_type);
B_channel_svm.setDegree(degree);
B_channel_svm.setGamma(gamma);
B_channel_svm.setTermCriteria(new TermCriteria(2, 10, epsilon));
// Repeat Step 1.b for the number of SVMs.
//2. Train the SVM
// Note: training_data - If your image has n rows and m columns, you have to make a matrix of size (n*m, o), where o is the number of labels.
// Note: Label_data is same as above, n rows and m columns, make a matrix of size (n*m, o) where o is the number of labels.
// Note: Very Important - Train the SVM for the entire data as training input and the specific column of the Label_data as the Label. Here, I train the data using B, G and R channels and hence, the name B_channel_SVM. I make 3 different SVM objects separately but you can do this by creating only one object also.
B_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(0));
G_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(1));
R_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(2));
// Now after training we "predict" the outcome for a sample from the trained SVM. But first, lets prepare the Test data.
// As above for the training data, make a matrix of (n*m, o) and use the columns to predict. So, since I created 3 different SVMs, I will input three separate matrices for the three SVMs of size (n*m, 1).
//3. Predict the testing data outcome using the trained SVM.
B_channel_svm.predict(scene_ml_input, predicted_final_B, StatModel.RAW_OUTPUT);
G_channel_svm.predict(scene_ml_input, predicted_final_G, StatModel.RAW_OUTPUT);
R_channel_svm.predict(scene_ml_input, predicted_final_R, StatModel.RAW_OUTPUT);
//4. Here, predicted_final_ are matrices which gives you the final value as in Label(0,1,2... etc) for the input data (edge profile in your case)
Now, I hope you have an idea for how SVM works. You basically need to do these steps:
Step 1: Identify labels - In your case Gestures from edge profile.
Step 2: Assign values to the labels - For example, if you are trying to classify haptic gestures - Open Hand = 1, Closed Hand/Fist = 2, Thumbs up = 3 and so on.
Step 3: Prepare the training data (edge profiles) and Labels (1,2,3) etc. according to the process above.
Step 4: Prepare data for prediction using the transformation calculated using SVM.
Very important for SVM on OpenCV: normalize your data and make sure all your matrices are of the same type (CvType).
Hope it helps. Feel free to ask questions if you have any doubts and post what you have tried. I can solve the problem for you if you send me some images but then you won't learn anything right? ;)

Resources