If the RMSE for train and test is 53 and 51 respectively for an SVM model, what should the model be called? - svm

My RMSE is coming out very high. Should I build the model with a different algorithm, or should I be satisfied with the existing results?
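A quick way to judge results like these is to compute the train and test RMSE explicitly and compare both against the spread of the target, since "high" is only meaningful relative to the scale of y. A minimal sketch, assuming scikit-learn; the synthetic data, SVR settings and variable names are illustrative placeholders rather than anything from the question:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Placeholder data standing in for the asker's dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svr = SVR(kernel="rbf", C=10.0).fit(X_train, y_train)
rmse_train = np.sqrt(mean_squared_error(y_train, svr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, svr.predict(X_test)))

# Similar train and test RMSE suggests the model is not overfitting; whether
# the absolute value is acceptable depends on the spread of the target.
print(f"train RMSE: {rmse_train:.1f}, test RMSE: {rmse_test:.1f}, std of y: {np.std(y):.1f}")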

Related

How to handle an extreme outlier in the predictions of a Random Forest Regression model

I have done prediction using Random Forest Regression and got an R² score of 90%, which I guess is good for this dataset. But when I checked the MSE I got a value of more than 1500. I then compared the actual and predicted values and found that my model predicted only one value very poorly, which is why the MSE is so high. How do I handle the one value that is predicted very poorly by my RFR model?
My training set contains similar data, so my model should have learnt from it, but somehow my prediction for that one value is still poor. What is the problem?
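One way to confirm this diagnosis is to locate the worst residual and measure how much of the MSE it alone accounts for; because errors are squared, a single extreme point can dominate the MSE while the MAE stays moderate. A minimal sketch, assuming scikit-learn; y_true and y_pred are made-up placeholders standing in for the actual and predicted values from the question:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder values: one row (250.0 vs 12.0) is an extreme outlier.
y_true = np.array([10.0, 12.0, 11.5, 9.8, 250.0, 10.4])
y_pred = np.array([10.2, 11.8, 11.3, 10.1, 12.0, 10.6])

residuals = np.abs(y_true - y_pred)
worst = residuals.argmax()                      # index of the worst prediction
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# The squared error of the single worst row dominates the MSE almost entirely,
# whereas the MAE (or median absolute error) is far less affected.
print("worst row:", worst, "residual:", residuals[worst])
print(f"MSE: {mse:.1f}, MAE: {mae:.2f}")
print("share of MSE from worst row:", (y_true[worst] - y_pred[worst]) ** 2 / len(y_true) / mse)

Inspecting that one row (is it an extreme outlier, mislabeled, or outside the range the trees were trained on?) is usually the next step, and reporting MAE alongside MSE makes the evaluation less sensitive to it.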

Evaluate Model node in Azure ML Studio does not include all the rows of the dataset in the confusion matrix

I have this dataset in which the positive class consists of component failures for a specific component of the APS system.
I am doing Predictive Maintenance using Microsoft Azure Machine Learning Studio.
As you can see from the pictures below, I am using 4 algorithms: Logistic Regression, Random Forest, Decision Tree and SVM. The output dataset of the Score Model node consists of 16k rows. However, when I look at the output of the Evaluate Model node, the confusion matrix contains only 160 observations for Logistic Regression, while Random Forest shows the correct number, 16k. I have the same problem, only 160 observations, for the Decision Tree and SVM models. The same problem recurs in other experiments, for example after feature selection, normalization, etc.: some Evaluate Model nodes do not use all the rows of the test dataset, while others do.
How can I fix this problem? I am interested in the real number of false positives and false negatives.
The output metrics shown are based on the validation set (e.g. "validation metric", "val-accuracy"). All of the metrics are computed and displayed on the validation set, not on the original training set; otherwise we would inflate the model's performance by evaluating it on data already used to train it.

Is there any way to evaluate RMSE and MAE on out-of-bag data for a random forest regressor

I am working with a random forest regression model for my thesis. My data set is small (about 3000 samples and 20 features) and the model is overfitting on the training data. Since the data set is small, I don't want to split it into train (train + OOB) and test sets, so I am using a bagging regressor to avoid the overfitting problem.
I'm trying to evaluate performance metrics for the regression model. I can calculate RMSE and MAE values for the training set, but I don't know how to check these metrics on the out-of-bag data. Suggestions are welcome.
Thanks in advance
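If the model is scikit-learn's RandomForestRegressor (an assumption; the question does not name the library), fitting with oob_score=True makes the forest store the averaged out-of-bag prediction for every training sample in oob_prediction_, so RMSE and MAE on the OOB data can be computed directly against the training targets. A minimal sketch with a synthetic data set of roughly the same size:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder data: ~3000 samples, 20 features, as in the question.
X_train, y_train = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)

# oob_score=True: each tree tracks the samples it never saw during bagging,
# and the forest averages those trees' predictions per sample.
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_pred = rf.oob_prediction_           # out-of-bag prediction for every row
oob_rmse = np.sqrt(mean_squared_error(y_train, oob_pred))
oob_mae = mean_absolute_error(y_train, oob_pred)
print(f"OOB RMSE: {oob_rmse:.3f}, OOB MAE: {oob_mae:.3f}")

BaggingRegressor exposes the same oob_prediction_ attribute when constructed with oob_score=True, so the identical calculation works for the bagging model as well.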

Spark, MLlib: Adjusting classifier discrimination threshold

I am trying to use the Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to build a model that discriminates between two classes represented by sets whose cardinalities differ quite a lot.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.
After training both the LR and RF classifiers with default parameters I get very similar results for both; for example, on the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0
Classifier detects:
truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
It looks like the classifier cannot detect any positive instances at all.
Also, no matter how the data is split into train and test sets, the classifier produces a number of false positives exactly equal to the number of positives the test set actually contains.
The LR classifier's default threshold is 0.5. Setting the threshold to 0.8 does not make any difference.
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
Questions:
1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to the class with a tiny fraction of positive instances versus the class with a huge number of negative instances?
2) Any other MLlib classifiers to solve this problem?
3) What does the intercept parameter do in the Logistic Regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
Well, I think what you have here is a very unbalanced data set problem:
150 000 000 Class1
50 000 Class2, i.e. 3000 times smaller.
So if you train a classifier that assumes everything is Class1, you get 0.999666 accuracy, so the "best" classifier will always predict that everything is Class1. This is what your model is learning here.
There are different ways to address these cases. In general you can downsample the larger class or up-sample the smaller class; with random forests there are also other options, for example sampling in a balanced (stratified) way or adding class weights:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Other methods such as SMOTE, etc. (also based on resampling) exist as well; for more details you can read here:
https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
The threshold you can change for your logistic regression is the probability threshold; you can try playing with "probabilityCol" in the parameters of the logistic regression example here:
http://spark.apache.org/docs/latest/ml-guide.html
But a problem with MLlib at the moment is that not all classifiers return a probability; I asked them about this and it is on their roadmap.
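For completeness, here is a minimal sketch of the two levers mentioned above (per-class weights and a lower decision threshold), written against the DataFrame-based pyspark.ml API rather than the RDD-based MLlib API from the question; the toy data, column names and weighting scheme are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("imbalanced-lr").getOrCreate()

# Toy training data: label 1.0 is the rare positive class.
train = spark.createDataFrame(
    [(0.0, 0.1), (0.0, 0.2), (0.0, 0.15), (0.0, 0.05), (1.0, 0.9), (1.0, 0.95)],
    ["label", "x"])

# Give each row a weight so the minority class counts more during training.
pos_frac = train.filter(F.col("label") == 1.0).count() / train.count()
train = train.withColumn(
    "weight", F.when(F.col("label") == 1.0, 1.0 - pos_frac).otherwise(pos_frac))
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(train)

# weightCol makes the optimizer pay more attention to the minority class;
# setThreshold lowers the probability needed to predict the positive class.
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight").setThreshold(0.3)
model = lr.fit(train)
model.transform(train).select("label", "probability", "prediction").show()

spark.stop()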

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 ratio of data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using a genetic programming technique, so the number of features, modeling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would be the same, as MSE is non-negative and the minimum of MSE coincides with the minimum of RMSE, assuming the optimizer is good enough to find the minimum.)
Tks
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so you can have larger train and test sets. I'm not surprised by the low performance on your test set, as it seems your model wasn't able to learn from so little data. As a rule of thumb, in machine learning you can never have enough data. Is it possible to gather a larger dataset?
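To see why a single 4-point test set is hard to interpret, it helps to repeat the split many times and look at how much the test RMSE fluctuates. A minimal sketch, assuming scikit-learn and using a plain linear model as a stand-in for the GP model; the synthetic 24-point data set is an illustrative assumption:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 24 placeholder data points, matching the size mentioned in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=24)

test_rmses = []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_rmses.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# With only 4 test points the estimate is extremely noisy, so a test RMSE
# below the training RMSE can easily happen by chance on one particular split.
print(f"test RMSE over 200 splits: mean={np.mean(test_rmses):.3f}, std={np.std(test_rmses):.3f}, min={np.min(test_rmses):.3f}, max={np.max(test_rmses):.3f}")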
