How to evaluate and explain a trained machine learning model?

I am new to machine learning. I ran a test but do not know how to explain and evaluate the results.
Case 1:
I first randomly divide the data (data A, about 8000 words) into 10 groups (a1..a10). Within each group, I use 90% of the data to build an ngram model, which is then tested on the other 10% of the same group. The result is below 10% accuracy. The other 9 groups are handled the same way (each builds its own model and tests it on its own remaining 10%). All results are about 10% accuracy. (Is this 10-fold cross-validation?)
Case 2:
I first build an ngram model on the entire data set (data A) of about 8000 words. Then I divide A into 10 groups (a1, a2, ..., a10), randomly of course. I then use this ngram model to test a1, a2, ..., a10 respectively. I find the model is almost 96% accurate on all groups.
How can I explain these situations?
Thanks in advance.

Yes, Case 1 is 10-fold cross-validation.
Case 2 has the common flaw of testing on the training set. That is why the accuracy is inflated. It is unrealistic because, in real life, your test instances are novel and previously unseen by the system.
N-fold cross-validation is a valid evaluation method used in many works.

You need to read up on the topic of overfitting.
The situation you describe suggests that your ngram model is heavily overfitted: it can "memorize" 96% of the training data, but when trained on a proper subset it achieves only about 10% accuracy on unseen data.
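A hedged illustration of the gap between the two setups, using a decision tree on synthetic data as a stand-in for the ngram model:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=8000, random_state=0)   # stand-in for data A
memorizer = DecisionTreeClassifier(random_state=0).fit(X, y)
print('scored on its own training data:', memorizer.score(X, y))   # close to 1.0
print('10-fold cross-validation:',
      cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())
# the cross-validated score is markedly lower, as in your Case 1 vs. Case 2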

This is called 10-fold cross-validation.

Related

Large dataset - ANN

I am trying to classify around 400K data points with 13 attributes. I used Python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not suitable for large-dataset classification. Then I used the (sklearn) ANN via the following MLPClassifier:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1,
                    activation='relu', max_iter=500)
and trained the system on 200K samples, then tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained, i.e. overfit. Can you please guide me on the number of hidden layers and node counts so that I can make sure there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it OK to use the default implementation as is?)
To know whether you are overfitting, you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated these scores, compare them. If the training set score is much better than your test set score, you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
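For example, a minimal sketch with stand-in data (substitute your own 400K x 13 arrays):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=13, random_state=1)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

clf = MLPClassifier(solver='adam', alpha=1e-5, random_state=1,
                    activation='relu', max_iter=500).fit(X_train, y_train)
print('training accuracy:', clf.score(X_train, y_train))
print('test accuracy:', clf.score(X_test, y_test))
# a large gap between the two scores indicates overfitting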
If you are overfitting with neural networks, you probably have to reduce the number of layers and the number of neurons per layer. There is no strict rule giving the number of layers or neurons you need for a given dataset size; every dataset can behave completely differently even at the same size.
So, to conclude: if you are overfitting, evaluate your model's accuracy with different numbers of layers and neurons per layer, and observe which values give the best results. There are methods to find the best parameters, such as GridSearchCV (sketched below).
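Continuing the sketch above, a hypothetical grid search over network size and regularization strength (the grid values are illustrative, not prescriptive):

from sklearn.model_selection import GridSearchCV

param_grid = {'hidden_layer_sizes': [(25,), (50,), (100,), (50, 25)],
              'alpha': [1e-5, 1e-4, 1e-3]}
search = GridSearchCV(MLPClassifier(solver='adam', activation='relu',
                                    max_iter=500, random_state=1),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)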

Neural network regression evaluation based on target range

I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the data set: samples with targets in the 1-3 range are quite underrepresented (they account for only around 5% of the data). Yet they are of great interest, since the low range of the target is the critical range.
Is there any way to know how my model predicts these low-range samples in particular? I know that when doing multiclass classification I can examine the recall to get a sense of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
Until now I have used typical metrics like MAE, MSE, and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test metrics (MSE, RMSE) for the whole test set, i.e. over the whole range of target values (1-10). Then, of course, I would compute them separately for the specific range you consider critical (say 1-3) and compare the divergence of the two populations. You can even run statistics on the significance of the difference between the two populations (Wilcoxon tests, etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can even compare MSE and RMSE.
What you need to do is find identifiers for these critical samples; often row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever automatic metric over those filtered samples (see the sketch below). I hope this answers your question.
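A minimal sketch of that filtering idea, with stand-in arrays in place of real predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.random.uniform(1, 10, 500)            # stand-in targets
y_pred = y_true + np.random.normal(0, 1, 500)     # stand-in predictions

mask = y_true <= 3                                # selects the critical 1-3 range
print('overall MAE:', mean_absolute_error(y_true, y_pred))
print('critical MAE:', mean_absolute_error(y_true[mask], y_pred[mask]))
print('critical RMSE:', np.sqrt(mean_squared_error(y_true[mask], y_pred[mask])))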

Best Way to Overcome Early Convergence for Machine Learning Model

I have built a machine learning model that predicts weather data; in this case it makes a binary Yes/No prediction of whether or not it will rain tomorrow.
The dataset has about 50 input variables and 65,000 entries.
I am currently running an RNN with a single hidden layer of 35 nodes. I am using PyTorch's NLLLoss as my loss function and Adaboost for the optimization function. I've tried many different learning rates, and 0.01 seems to work fairly well.
After running for 150 epochs, the model converges to around 0.80 accuracy on my test data. I would like this to be even higher, but the model seems to be stuck oscillating around some sort of saddle point or local minimum. (A graph of this is below.)
What are the most effective ways to get out of this "valley" that the model seems to be stuck in?
I am not sure exactly why you are using only one hidden layer, or what the shape of your history data is, but here are things you can try:
Try more than one hidden layer.
Experiment with LSTM and GRU layers, and with combinations of these layers together with the RNN.
Reconsider the shape of your data, i.e. how much history you look at to predict the weather.
Make sure your features are scaled properly, since you have about 50 input variables (a minimal scaling sketch follows this list).
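A minimal scaling sketch (stand-in data, assuming plain numeric features):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(65000, 50)               # stand-in for the 65,000 x 50 dataset
scaler = StandardScaler().fit(X)            # in practice, fit on the training split only
X_scaled = scaler.transform(X)              # each feature now has mean 0, std 1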
Your question is a little ambiguous, since you mention an RNN with a single hidden layer. Also, without knowing the entire neural network architecture it is tough to say how you can bring in improvements. So I would like to add a few points.
You mention that you are using "Adaboost" as the optimization function, but PyTorch doesn't have any such optimizer. Did you try the SGD or Adam optimizers, which are very widely used?
Do you have any regularization term in the loss function? Are you familiar with dropout? Did you check the training performance? Does your model overfit?
Do you have a baseline model/algorithm so that you can judge whether 80% accuracy is good or not?
150 epochs for a binary classification task looks like a lot. Why don't you start from an off-the-shelf classifier model? You can find several examples of regression and classification in this tutorial.
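To make the optimizer and dropout suggestions concrete, here is a hedged sketch of a small PyTorch RNN binary classifier trained with Adam (all shapes and values are illustrative, not your actual architecture):

import torch
import torch.nn as nn

class RainRNN(nn.Module):
    def __init__(self, n_features=50, hidden=35):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden, batch_first=True)
        self.drop = nn.Dropout(0.3)            # regularization against overfitting
        self.fc = nn.Linear(hidden, 2)         # two classes: rain / no rain

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        out, _ = self.rnn(x)
        return torch.log_softmax(self.fc(self.drop(out[:, -1])), dim=1)

model = RainRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # Adam, not "Adaboost"
criterion = nn.NLLLoss()                                    # expects log-probabilities

x = torch.randn(32, 7, 50)                   # dummy batch: 7 days of history
y = torch.randint(0, 2, (32,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()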

Should sentiment analysis training data be evenly distributed?

If I am training a sentiment classifier on a tagged dataset where most documents are negative, say ~95%, should the classifier be trained with the same distribution of negative comments? If not, what would be other options to "normalize" the data set?
You don't say what type of classifier you have, but in general you don't have to normalize the distribution of the training set. Usually the more data the better, but you should always do blind tests to prevent overfitting.
In your case you will have a strong classifier for negative comments and, unless you have a very large sample size, a weaker positive classifier. If your sample size is large enough it won't really matter, since you will hit a point where you might start overfitting your negative data anyway.
In short, it's impossible to say for sure without knowing the actual algorithm, the size of the data sets, and the diversity within the dataset.
Your best bet is to carve off something like 10% of your training data (randomly) and see how the classifier performs after being trained on the 90% subset (see the sketch below).
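A minimal sketch of that blind split (stratified so the held-out slice keeps your 95/5 class ratio; the arrays are stand-ins):

import numpy as np
from sklearn.model_selection import train_test_split

docs = np.array(['bad service', 'great food', 'awful'] * 500)   # stand-in documents
labels = np.array([0, 1, 0] * 500)                              # ~95% negative in your case
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.10, stratify=labels, random_state=0)
# train on (X_train, y_train); check accuracy and per-class recall on the untouched split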

What is the purpose of cross-validation? [closed]

I am currently working my way through a book on machine learning.
Working on a NaiveBayesClassifier, the author is very much in favour of the cross-validation method.
He proposes to split the data into ten buckets (files) and train on nine of them, each time withholding a different bucket.
Up to now the only approach I am familiar with is to split the data into a training set and a test set at a 50%/50% ratio and simply train the classifier all at once.
Could someone explain what are possible advantages of using cross-validation?
Cross-validation is a way to address the tradeoff between bias and variance.
When you fit a model on a training set, you can drive the training error down by adding more terms, higher-order polynomials, etc., but doing so increases the model's variance.
Your true objective, however, is to predict outcomes for points that your model has never seen. That's what holding out the test set simulates.
You'll create your model on a training set, then try it out on a test set. You will find that there's a combination of bias and variance that minimizes the total error and gives your best results. The simplest model that achieves it should be your choice.
I'd recommend "An Introduction to Statistical Learning" or "The Elements of Statistical Learning" (both co-authored by Hastie and Tibshirani) for more details.
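A tiny illustration of that tradeoff, fitting polynomials of increasing degree to noisy synthetic data (everything here is a stand-in):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

for degree in (1, 3, 9, 12):
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
# training error keeps shrinking with degree; held-out error bottoms out and rises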
The general principle in machine learning is that the more data you have for training, the better the results. This is important to state before answering the question.
Cross-validation helps us avoid overfitting, and it also helps estimate the generalization accuracy, i.e. the accuracy of the model on unseen future points. If you only divide your dataset into Dtrain and Dtest, there is a problem: any quantity that must be determined after training cannot be tuned on the test data, or you can no longer claim that your accuracy on future unseen points will match the accuracy on your test data. Take k-NN as an example: the nearest neighbours are found using the training data, but if the value of k is chosen using the test data, the test data is no longer unseen.
If you use a separate CV set instead, k can be determined on the CV data, and your test data can genuinely be treated as unseen points.
Now suppose you divide your dataset into three parts: Dtrain (60%), Dcv (20%) and Dtest (20%). You then have only 60% of the data to train with. If you want to use the full 80% for training, you can do so with m-fold cross-validation. In m-fold CV you divide your data into two parts, Dtrain and Dtest (say 80% and 20%).
Say m = 4: divide the training data randomly into 4 equal parts (d1, d2, d3, d4). Train the model on d1, d2, d3 with d4 as the CV fold and compute the accuracy; next take d2, d3, d4 as the training folds and d1 as the CV fold; and so on through all m folds. This way you use your entire 80% Dtrain, and Dtest can still be treated as a future unseen dataset.
The advantages are better use of your Dtrain data, reduced overfitting, and a more trustworthy estimate of generalization accuracy.
On the downside, the time complexity is high.
In your case the value of m is 10.
Hope this helps.
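A sketch of the m-fold procedure described above (m = 4 on an 80/20 split), with k-NN's k as the value chosen by cross-validation; the data is a stand-in:

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(1000, 5)                  # stand-in dataset
y = np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

for k in (1, 3, 5, 7):                       # k is chosen by CV, never by the test data
    scores = []
    for tr, cv in KFold(n_splits=4, shuffle=True, random_state=0).split(X_train):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train[tr], y_train[tr])
        scores.append(knn.score(X_train[cv], y_train[cv]))
    print('k =', k, 'mean CV accuracy =', round(float(np.mean(scores)), 3))
# pick the best k above, then evaluate once on the untouched (X_test, y_test)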
The idea is to use the maximum number of points to train the model while still getting an unbiased estimate. Every data point chosen for the test set is excluded from the training set. We use the concept of k and k-1: first divide the dataset into k equal-sized bins, take one bin as the test set, and let the remaining k-1 bins form the training set. Repeat the process until every bin has been used once as the test set (k), with the rest as training (k-1). This way, no data point misses out on being used for training.
