Confusion Matrix - Not changing with predictive models (Sklearn) - python-3.x

I have 3 predictive models and I am evaluating their performance with a confusion matrix.
I am getting the same results for the confusion matrix for each of the 3 models.
I expect that the different models would perform differently and produce different confusion matrices. I am new to predictive modelling, so I suspect I am making a "rookie mistake". The full script I am using is sitting in a Jupyter notebook on GitHub here.
A screenshot of the code for the 3 models is below.
Can someone point out what is going wrong?
Cheers
Mike

As mentioned: make predictions on the test data. But keep in mind that your targets are skewed, so use StratifiedKFold or something similar (a sketch follows the list below).
Also, I suspect your data may be a bit corrupted: when all models show exactly the same result, there may be a bigger mistake underneath.
A few questions/suggestions:
1. Did you scale your data?
2. Did you use one-hot encoding?
3. Don't use Decision Trees; use Random Forests or XGBoost instead. It is easy to overfit with a single decision tree.
4. Don't use more than 2 hidden layers in a neural network, because that is also easy to overfit. Start with 2. Also, your architecture (30, 30, 30) with 2 target classes seems odd.
5. If you do wish to use more than 2 hidden layers, move to Keras or TensorFlow. You'll find many features there that help you avoid overfitting.
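A minimal sketch of stratified cross-validation, assuming your features and labels are already in NumPy arrays X and y (placeholder names) and using a decision tree as a stand-in for any of your three models:

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# X, y are placeholders for your feature matrix and (skewed) target vector
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = DecisionTreeClassifier(random_state=42)  # swap in any of your 3 models

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])  # predict on held-out data only
    print(f"Fold {fold}:")
    print(confusion_matrix(y[test_idx], preds))

Each fold keeps the class proportions of y, so the skew does not distort any single evaluation.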

That is simply because you are making predictions on the same training data. Since your models are already trained on the same data that you are making the predictions on, they will return the same results (and ultimately the same confusion matrix). You need to split your dataset into training and test sets, then train your classifier on the training set and make predictions on the test set.
You can use train_test_split in Sklearn to split your dataset into training and test sets, for example:
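A rough sketch, assuming X and y hold your features and labels and clf_1, clf_2, clf_3 are your three models (all placeholder names):

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# hold out 25% of the data that no model sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

for name, clf in [("model_1", clf_1), ("model_2", clf_2), ("model_3", clf_3)]:
    clf.fit(X_train, y_train)            # train on the training split only
    preds = clf.predict(X_test)          # evaluate on unseen data
    print(name)
    print(confusion_matrix(y_test, preds))  # the matrices can now differ per model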

Related

Why my LSTM for Multi-Label Text Classification underperforms?

I'm using a Windows 10 machine.
Libraries: Keras with TensorFlow 2.0
Embeddings: GloVe (100 dimensions)
I am trying to implement an LSTM architecture for multi-label text classification.
My problem is that no matter how much fine-tuning I do, the results are really bad.
I am not experienced in practical DL implementations, which is why I am asking for your advice.
Below I will state basic information about my dataset and my model so far.
I can't embed images since I am a new member, so they appear as links: the dataset, embeddings and train-test-split shapes; the dataset's label distribution; my LSTM implementation; the model's summary; the model's accuracy plot; and the model's loss plot.
As you can see, my dataset is really small (~6,000 examples) and maybe that's one reason why I cannot achieve better results. Still, I chose it because it's unbiased.
I'd like to know if there is any fundamental mistake in my code regarding the dimensions, shapes, activation functions, or loss function for multi-label text classification.
What would you recommend to achieve better results with my model? Any general advice regarding optimizers, methods, number of nodes, layers, dropout, etc. is also very welcome.
The best validation accuracy I have achieved so far is ~0.54, and even when I try to raise it, it seems stuck there.
There are many ways to get this wrong, but the most common mistake is letting your model overfit the training data.
I suspect that 0.54 accuracy means that your model selects the most common label (offensive) for almost all cases.
So, consider one of these simple solutions:
Create balanced training data, e.g. 400 samples from each class,
or sample balanced batches for training (exactly the same number of each label in every training batch).
In addition to tracking accuracy and loss, look at precision/recall/F1, or better, try plotting the area under the curve; maybe different classes need different activation thresholds. (If you are using a sigmoid on the last layer, one class might perform better with a threshold of 0.2 and another with 0.7.) A sketch of per-class metrics and thresholds is below.
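For illustration, a small sketch of per-class metrics with per-class thresholds, assuming y_true is your binary label matrix and probs holds the sigmoid outputs of your model (both placeholder names):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# probs: (n_samples, n_classes) sigmoid outputs; y_true: (n_samples, n_classes) 0/1 labels
thresholds = np.array([0.5, 0.2, 0.7])   # one threshold per class, tuned separately
y_pred = (probs >= thresholds).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)
for i, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"class {i}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")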
First, try a simple model: an embedding layer, one LSTM layer, then the classification layer (a minimal Keras sketch of such a baseline is below).
Check how you tokenize the text and whether your vocabulary size is large enough.
Try dice loss.
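A sketch of that baseline in Keras, assuming padded integer sequences of length MAX_LEN, a vocabulary of VOCAB_SIZE tokens, and N_CLASSES labels (all placeholder values; the GloVe weights can be passed to the Embedding layer once you have built the embedding matrix):

from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, N_CLASSES = 20000, 100, 3   # placeholders for your own sizes

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 100, input_length=MAX_LEN),
    layers.LSTM(64),
    layers.Dropout(0.3),
    layers.Dense(N_CLASSES, activation="sigmoid"),  # sigmoid per label for multi-label
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",           # per-label binary loss, not categorical
              metrics=["accuracy"])
model.summary()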

Large dataset - ANN

I am trying to classify around 400K data points with 13 attributes each. I used python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not suitable for large dataset classification. Then I used sklearn's ANN, with the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is overtrained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfitting? (I have learned that the default implementation has 100 hidden neurons. Is it OK to use the default implementation as is?)
To know whether you are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated these scores, compare them. If the training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions. For example:
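A small sketch of that comparison, assuming X_train, X_test, y_train, y_test hold your 200K/200K split (placeholder names):

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='adam', alpha=1e-5, random_state=1,
                    activation='relu', max_iter=500)
mlp.fit(X_train, y_train)

train_acc = mlp.score(X_train, y_train)   # accuracy on data the model has seen
test_acc = mlp.score(X_test, y_test)      # accuracy on held-out data
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
# a large gap (e.g. train 0.99 vs test 0.80) is a sign of overfitting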
If you are overfitting with neural networks, you probably have to reduce the number of layers and the number of neurons per layer. There isn't any strict rule that says how many layers or neurons you need for a given dataset size; every dataset can behave completely differently at the same size.
So, to conclude, if you are overfitting, you would have to evaluate your model's accuracy using different numbers of layers and neurons, and then observe which values give the best results. There are methods you can use to find the best parameters, like GridSearchCV (a sketch is below).
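A sketch of such a search, assuming X_train and y_train hold your training split (with 200K training rows this can be slow, so you may want to run it on a subsample first):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],  # candidate architectures
    'alpha': [1e-5, 1e-4, 1e-3],                      # L2 regularization strength
}
search = GridSearchCV(
    MLPClassifier(solver='adam', activation='relu', max_iter=500, random_state=1),
    param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)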

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15 hours and it seems to do its job quite well.
When I observed the training and validation loss, though, the training loss was somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and, as far as I understand, dropout is only applied during training, which might be the reason.
Now I am wondering whether I have trained a valid model.
It doesn't seem like the model is heavily underfitted, does it?
Thanks in advance for any advice,
cheers,
M
First, check whether you have a good data set; i.e., if it is a classification task, get an equal number of images for all classes, and get them from the same source rather than from different sources. Regularization and dropout are used for overfitting/high variance, so don't worry about those.
Beyond that, I think your model is doing well: when you trained it, the initial errors differed, but as you increased the epochs both losses settled onto a steady path, so it is good. The reason for the gap may be what I mentioned above, or you could try shuffling the data and then using train_test_split to get a better distribution of the training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
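To produce such a learning-curve plot yourself, here is a hedged matplotlib sketch using placeholder per-epoch losses (in Keras these would come from history.history['loss'] and history.history['val_loss']):

import matplotlib.pyplot as plt

# placeholders: replace with your own recorded per-epoch losses
train_loss = [0.90, 0.60, 0.45, 0.40, 0.39]
val_loss = [0.80, 0.55, 0.42, 0.39, 0.38]

plt.plot(train_loss, 'r', label='training loss')
plt.plot(val_loss, 'g', label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
# a good fit: both curves flatten out, with only a small gap between them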
Still, if you want to deal with high bias/underfitting, here are a few methods:
Train bigger models.
Train longer, and use better optimization techniques.
Try different neural network architectures and hyperparameters.
You can also use cross-validation or GridSearchCV to find a better optimizer or better hyperparameters, but it may take a really long time, because you have to train on different parameters each time; considering that your training already takes 15 hours, that could be very long, but you will find better parameters and can then train with them.
Above all, I think your model is doing okay.
If your model underfits, its performance will be lower, just as in the case of overfitting, because it cannot learn effectively enough to reach the optimal result, i.e. the proper function to fit the given distribution. So you would have to use less regularization, e.g. less dropout, to get the optimal result.
Furthermore, the sampling can also be crucial: there can be training/validation splits where your model performs well on the validation set and less well on the training set, and vice versa. This is one of the reasons why we use cross-validation and different sampling methods, e.g. stratified k-fold (a short sketch is below).
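A short sketch of stratified k-fold cross-validation with sklearn (with a model that takes 15 hours to train you would only do this on a smaller proxy; X, y and the estimator here are placeholders):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)      # stand-in for your own estimator
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv)   # each fold keeps the class ratio of y
print(scores.mean(), scores.std())           # the spread shows how sensitive you are to the split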

Best Way to Overcome Early Convergence for Machine Learning Model

I have built a machine learning model that tries to predict weather data; in this case I am predicting whether or not it will rain tomorrow (a binary Yes/No prediction).
The dataset has about 50 input variables and 65,000 entries.
I am currently running an RNN with a single hidden layer of 35 nodes. I am using PyTorch's NLLLoss as my loss function and Adaboost for the optimization function. I've tried many different learning rates, and 0.01 seems to work fairly well.
After running for 150 epochs, I notice that I start to converge around 0.80 accuracy on my test data. I would like this to be even higher, but it seems like the model is stuck oscillating around some sort of saddle point or local minimum. (A graph of this is below.)
What are the most effective ways to get out of this "valley" that the model seems to be stuck in?
I'm not sure why you are using only one hidden layer, or what the shape of your history data is, but here are some things you can try (a sketch follows the list):
Try more than one hidden layer.
Experiment with LSTM and GRU layers, and with combinations of these layers together with the RNN.
Reconsider the shape of your data, i.e. how much history you look at to predict the weather.
Make sure your features are scaled properly, since you have about 50 input variables.
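A PyTorch sketch of those ideas, stacking a GRU in front of the classifier head; the features are assumed to be scaled beforehand (e.g. with sklearn's StandardScaler), and all sizes and names here are assumptions rather than the asker's actual architecture:

import torch
import torch.nn as nn

class RainGRU(nn.Module):
    def __init__(self, n_features=50, hidden=35, n_layers=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=n_layers, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 2), nn.LogSoftmax(dim=1))

    def forward(self, x):              # x: (batch, seq_len, n_features), already scaled
        out, _ = self.gru(x)
        return self.head(out[:, -1])   # classify from the last time step

model = RainGRU()
x = torch.randn(8, 7, 50)              # e.g. a 7-day history window per sample
print(model(x).shape)                  # (8, 2) log-probabilities, suitable for NLLLoss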
Your question is a little ambiguous, since you mention an RNN with a single hidden layer; without knowing the entire neural network architecture it is tough to say how you can bring in improvements. So I would like to add a few points (a small optimizer sketch follows):
You mentioned that you are using "Adaboost" as the optimization function, but PyTorch doesn't have any such optimizer. Did you try the SGD or Adam optimizers, which are very useful?
Do you have any regularization term in the loss function? Are you familiar with dropout? Did you check the training performance? Does your model overfit?
Do you have a baseline model/algorithm so that you can compare whether 80% accuracy is good or not?
150 epochs just for a binary classification task looks like too much. Why don't you start from an off-the-shelf classifier model? You can find several examples of regression and classification in this tutorial.
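A small sketch of those optimizer/regularization points, assuming model is your network with log-softmax outputs (as NLLLoss expects) and train_loader is your DataLoader (both placeholders):

import torch
import torch.nn as nn

criterion = nn.NLLLoss()
# weight_decay adds L2 regularization on top of the Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(30):                      # start with far fewer than 150 epochs
    for x_batch, y_batch in train_loader:    # placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()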

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purposes. I am training it using sklearn's support vector regressor. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided by sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initialized with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping a third parameter, 'degree', set to 4. You are considering 'rbf' and 'linear' for the kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects and make it a matter of configuring a YAML file and letting the tool do all the work for you. It is available on my GitHub account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case, you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if that improves your model's quality (a sketch is below).
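A quick sketch, assuming train_features/test_features and train_target/test_target are the arrays from above:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

poly = PolynomialFeatures(degree=3, include_bias=False)  # x1, x2 -> x1, x2, x1^2, x1*x2, x2^2, ...
train_poly = poly.fit_transform(train_features)
test_poly = poly.transform(test_features)                # reuse the fitted transformer

svr = SVR(kernel='linear', C=1)
svr.fit(train_poly, train_target)
print(svr.score(test_poly, test_target))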
Try fitting your data using sklearn's K-Fold cross-validation on the training split. It gives you a fairer split of the data and a better model, though at a cost of training time, which should hardly matter for a small dataset where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plotted your data. Try either a scatter plot with alpha=0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it; a smaller C corresponds to more regularization of the estimation.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins's answer (I am not allowed to add comments), the two implementations are not equivalent: GridSearchCV is better since it performs cross-validation on the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data (a pipeline sketch is below).
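A sketch that combines the scaling hint with the grid search, assuming the same train_features/train_target arrays as above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scale', StandardScaler()),   # scaling is refit inside each CV fold, so no leakage
    ('svr', SVR()),
])
params = {'svr__kernel': ['linear', 'rbf'], 'svr__C': [0.1, 1, 10]}
search = GridSearchCV(pipe, params, cv=5)
search.fit(train_features, train_target)
print(search.best_params_, search.best_score_)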
