CV results of Decision Tree: test MSE 0.0000578 and train MSE 0 - decision-tree

I'm working on a small dataset (25 rows, 4 features). I trained a decision tree and used K-fold cross-validation (cv=3), which gave an r2 of 0.97. Because of that, I suspected overfitting and looked at the test and train MSE values: the test MSE is 0.0000578 and the train MSE is 0.0. How should I interpret this? Is there overfitting here, and what should I do?
I'm new to this topic :) Thank you in advance for your response.

Your dataset is really very small, but everything appears OK. Is the r2 you calculated for your test set? An r2 of 1 would correspond to your model's output exactly matching the held-out values, and so would an MSE of 0, so your r2 and MSE scores are consistent with one another. The fact that performance is not much worse on the test set than on the train set means you shouldn't be too worried about overfitting, assuming you are splitting your data correctly.
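For reference, a minimal sketch of the kind of check being discussed, using scikit-learn on synthetic stand-in data (the dataset and numbers below are placeholders, not the asker's actual data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for a tiny dataset: 25 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=25)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# An unpruned tree memorizes the training set, so train MSE is ~0 by
# construction; the test MSE and the CV r2 are the numbers that matter.
print("train MSE:", mean_squared_error(y_train, tree.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
print("3-fold CV r2:",
      cross_val_score(DecisionTreeRegressor(random_state=0),
                      X, y, cv=3, scoring="r2").mean())
```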

Related

How to handle extreme outlier in the prediction of Random Forest Regression Model

I did prediction using random forest regression and got an r2 score of 90%, which I guess is good for this dataset. But when I checked the MSE, it was more than 1500. I then looked at the difference between the actual and predicted values and found that my model predicted just one value very poorly, which is why the MSE is so high. How should I handle the one value that is predicted very poorly by my RFR model?
My training set contains similar data, so my model should have learned from it; why is the prediction for that one value still poor? What is the problem?
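As an aside, the kind of residual check described above might look like the following sketch (all data and names here are synthetic placeholders, not the asker's setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (the real dataset isn't shown in the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfr = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Inspect per-sample residuals to find the point(s) that dominate the MSE.
residuals = y_test - rfr.predict(X_test)
worst = np.argsort(-np.abs(residuals))[:3]
print("worst test indices:", worst, "residuals:", residuals[worst])

# One huge residual can dominate MSE; a robust summary such as the median
# absolute error shows whether the remaining predictions are fine.
print("MSE:", np.mean(residuals ** 2))
print("median abs error:", np.median(np.abs(residuals)))
```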

Loss function negative log likelihood giving loss despite perfect accuracy

I am debugging a sequence-to-sequence model and purposely tried to perfectly overfit a small dataset of ~200 samples (sentence pairs of length between 5 and 50). I am using negative log-likelihood loss in PyTorch. I get low loss (~1e-5), but the accuracy on the same dataset is only 33%.
I trained the model on 3 samples as well and obtained 100% accuracy, yet during training I still had loss (in the same region of ~1e-5). I was under the impression that negative log-likelihood only gives loss if there is a mismatch between the predicted and target labels?
Is a bug in my code likely?
There is no bug in your code.
The way things usually work in deep nets is that the network predicts logits (i.e., unnormalized log-probabilities). These logits are then transformed into probabilities using softmax (or a sigmoid function). Cross-entropy is finally evaluated based on the predicted probabilities.
The advantage of this approach is that it is numerically stable and easy to train with. On the other hand, because of the softmax you can never have "perfect" 0/1 probabilities for your predictions: even when your network has perfect accuracy, it will never assign probability 1 to the correct prediction, only something "close to one". As a result, the loss will always be positive (albeit small).
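A minimal PyTorch sketch of this effect: the prediction is correct, yet the cross-entropy loss stays strictly positive because softmax never outputs an exact 1.

```python
import torch
import torch.nn.functional as F

# One sample, three classes; the logits strongly favor the correct class 0.
logits = torch.tensor([[10.0, 0.0, 0.0]])
target = torch.tensor([0])

# Accuracy is perfect ...
print("correct:", (logits.argmax(dim=1) == target).item())  # True

# ... but softmax gives p < 1, so -log(p) is tiny yet never exactly zero.
print("loss:", F.cross_entropy(logits, target).item())  # ~9.1e-05
```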

Training loss curve decreases sharply

I am new to deep learning.
I have a dataset with 1001 samples of human upper-body pose. The model I trained has 4 conv layers and 2 fully connected layers with ReLU and dropout. This is the result I got after 200 iterations. Does anyone have any idea why the training loss curve decreases so sharply?
I think I probably need more data. Since my dataset consists of numerical values, what do you think is the best data augmentation method to use here?

For a regression model, why does the validation set passed to model.fit give a different metric result than model.evaluate?

I have a regression model with Euclidean distance as the loss function and RMSE as the evaluation metric (lower is better). When I passed my train and test sets to model.fit, I got train_rmse and test_rmse values that made sense to me. But when I pass the test set to model.evaluate after loading the weights of the trained model, I get different results, approximately twice those from model.fit. I am aware of the difference that should exist between the train and test evaluation, as the Keras FAQ says:
the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
But here I am talking about the result for the test set passed to model.fit, which I believed is evaluated on the final model. The Keras documentation says this about the validation_data argument, in which I am passing my test set:
validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data.
While searching, I found several related issues:
1- Some people (like here) report that this issue comes from the model itself if it has batch-normalization layers, or if you do transfer learning and freeze some BN layers (like here). My model has BN layers, and I did not freeze any layer. Also, I used the same model for a multi-class classification problem (not regression), and there the test-set results from model.fit and model.evaluate were the same.
2- Other people (like here) said it is related to either the prediction or the metric calculation: they found the difference comes from a dtype mismatch between y_true and y_pred (e.g., one float32 and the other float64), which changes the metric calculation. When they unified the dtypes, the problem was fixed.
I believed the last case applied to me, since in the regression task my labels are tf.float32. My y_true labels are already cast to tf.float32 via the TFRecord pipeline, so I tried casting y_pred to tf.float32 before the RMSE calculation as well, but the difference in results remained.
So my questions are:
Why is there this difference in results?
Which result should I rely on for the test set: model.fit or model.evaluate?
I know that for training loss and accuracy, Keras keeps a running average over the batches, and I know that model.evaluate computes these metrics in one pass over the whole dataset with the final model. But how are the validation loss and accuracy computed for a validation set passed to model.fit?
UPDATE:
The problem was a shape conflict between y_true and y_pred. I save the y_true label in TFRecords as a single float, so it ends up with shape [batch_size], while the regression model produces predictions of shape [batch_size, 1]. As a result, tf.subtract(y_true, y_pred) inside the RMSE equation broadcasts to a [batch_size, batch_size] matrix, and taking the mean of that gives a wrong RMSE; you would never guess it is wrong, because the code throws no error. I am still working on making the shapes consistent and haven't found a good solution yet.
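A minimal TensorFlow sketch of the broadcasting pitfall described in the update, with one possible fix (the tensors here are made up for illustration):

```python
import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0, 4.0])          # shape [batch_size]
y_pred = tf.constant([[1.1], [2.1], [2.9], [4.2]])  # shape [batch_size, 1]

# Broadcasting silently expands the subtraction to [batch_size, batch_size],
# so the "RMSE" averages all pairwise differences; no error is raised.
diff = tf.subtract(y_true, y_pred)
print(diff.shape)  # (4, 4)
wrong_rmse = tf.sqrt(tf.reduce_mean(tf.square(diff)))

# One possible fix: squeeze the prediction back to [batch_size] first.
diff_fixed = tf.subtract(y_true, tf.squeeze(y_pred, axis=-1))
right_rmse = tf.sqrt(tf.reduce_mean(tf.square(diff_fixed)))
print(wrong_rmse.numpy(), right_rmse.numpy())
```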

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data, with a 1:5 test-to-train split of the data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built with a genetic programming technique, so the number of features, the modeling framework, etc. vary as I minimize the training RMSE, regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would make no difference, since MSE is non-negative and the minimum of MSE coincides with the minimum of RMSE, assuming the optimizer is good enough to find the minimum.)
Thanks
So your model is trained on 20 of the 24 data points and tested on the 4 remaining ones?
To me it sounds like you need (much) more data, so you can have larger train and test sets. I'm not surprised by the result on your test set: with only 4 test points the test RMSE is very noisy, and a model trained on so little data can easily land well above or below the training error by chance. As a rule of thumb, for machine learning you can never have enough data. Is it possible to gather a larger dataset?
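To see how noisy a 4-point test RMSE is, one can repeat the split many times. A minimal sketch with a stand-in model and synthetic data (everything here, including the linear model, is made up for illustration, not the asker's GP setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for a 24-point dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=24)

rmses = []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4,
                                              random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# The spread across random 4-point test sets shows how easily the test RMSE
# can fall well below (or above) the training RMSE purely by chance.
print("min/median/max test RMSE:",
      np.min(rmses), np.median(rmses), np.max(rmses))
```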
