Scaling new data LSTM [duplicate] - keras

While applying min max scaling to normalize your features, do you apply min max scaling on the entire dataset before splitting it into training, validation and test data?
Or do you split first and then apply min max on each set, using the min and max values from that specific set?
Lastly , when making a prediction on a new input, should the features of that input be normalized using the min, max values from the training data before being fed into the network?

Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way.
To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.

Related

Regression analysis with All text data

I want to know what is the best approach to handle a regression analysis on all text data type. I have the following data set.
my feature columns are: Strength, area of development, leadership, satisfactory
values of these columns are predefined set of texts eg. "Continuous Improvement,Self-Development,Coaching and Mentoring,Creativity,Adaptability"
based on the value in these columns I want to predict the label (overall Performance) - Outstanding or Exceeding Expectation or Meeting Expectation.
what should be the best approach to deal with this dataset ?

Standardize or subtract constant to data for regression

I am attempting to create a prediction model using multiple linear regression.
One of the predictor variables I want to use is a percentage, so it ranges from 0 - 100. I hypothesize that when it’s <50% there will be a negative effect on the target variable and when >50% a positive effect.
The mean of the predictor variable isn’t exactly 50 in my data set so I am unsure if I centre or Standardize this variable, or just subtract 50 from it to create the split I am looking for.
I am very new to statistics and self teaching myself at the moment, any help is greatly appreciated.

Arbitary choosen values as std/mean for normalizatio. why?

I have a question regarding the z-score normalization method.
This method uses the z-score to normalize the values of the dataset and needs a mean/std.
I know that you are normally supposed to use the mean/std of the dataset.
But I have seen multiple tutorials on pytorch.org and the net who just use the 0.5 for mean/std which seems completely arbitrary to me.
And I was wondering why they didn't use the mean/std of the dataset?
Example Tutorials where they just use 0.5 as mean/std:
https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
If you use the std/mean of your dataset to normalize the same dataset you are going to have after the normalization a mean of 0 and an std of 1.
Where the min/max values of the normalized dataset are not in a certain range.
If you use mean/std of 0.5 as a parameter for normalization of your dataset you are going to have a dataset in the range of -1 to 1.
And the mean of the normalized dataset will be close to zero and the std of the normalized dataset will be close to 0.5.
So to answer my question you use 0.5 as mean/std when you want that your dataset is in a range of -1 to 1.
Which would be beneficial when using, for example, a tanh activation function in a neural network.

Convert GMM-UBM scores to equicalent accuracy percent

I have constructed a GMM-UBM model for the speaker recognition purpose. The output of models adapted for each speaker some scores calculated by log likelihood ratio. Now I want to convert these likelihood scores to equivalent number between 0 and 100. Can anybody guide me please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentage of your samples will not be uniformly distributed.
Ideally you need to take a large dataset and collect statistics on what acceptance/rejection rate do you have for what score. Then once you build a histogram you can normalize the score difference by that spectrogram to make sure that 30% of your subjects are accepted if you see the certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See for example How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
This problem is rarely solved in speaker identification systems because confidence intervals is not what you want actually want to display. You need a simple accept/reject decision and for that you need to know the amount of false rejects and accept rate. So it is enough to find just a threshold, not build the whole distribution.

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classificaiton. Each document basically is a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy? How can this be explained?
If you are querying your learner with the same dataset you have trained on with k=1, the output values should be perfect barring you have data with the same parameters that have different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset as you trained with, the query will come in for each learner with some given parameter values. Because that point exists in the learner from the dataset you trained with, the learner will match that training point as closest to the parameter values and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
The possibilities are:
The data training with data tests are the same data
Data tests have high similarity with the training data
The boundaries between classes are very clear
The optimal value for K is depends on the data. In general, the value of k may reduce the effect of noise on the classification, but makes the boundaries between each classification becomes more blurred.
If your result variable contains values of 0 or 1 - then make sure you are using as.factor, otherwise it might be interpreting the data as continuous.
Accuracy is generally calculated for the points that are not in training dataset that is unseen data points because if you calculate the accuracy for unseen values (values not in training dataset), you can claim that my model's accuracy is the accuracy that is been calculated for the unseen values.
If you calculate accuracy for training dataset, KNN with k=1, you get 100% as the values are already seen by the model and a rough decision boundary is formed for k=1. When you calculate the accuracy for the unseen data it performs really bad that is the training error would be very low but the actual error would be very high. So it would be better if you choose an optimal k. To choose an optimal k you should be plotting a graph between error and k value for the unseen data that is the test data, now you should choose the value of the where the error is lowest.
To answer your question now,
1) you might have taken the entire dataset as train data set and would have chosen a subpart of the dataset as the test dataset.
(or)
2) you might have taken accuracy for the training dataset.
If these two are not the cases than please check the accuracy values for higher k, you will get even better accuracy for k>1 for the unseen data or the test data.

Resources