What is the difference between scale and fit in sklearn? - scikit-learn

i am new to datascience and when i was going through one of the kaggle blog, i saw that the user is using both scale and fit on the data set. i tried to understand the difference by going through the documentation but was not able to understand

It's hard to understand the source of your confusion without any code. Inside the link you provided, the data is first scaled with sklearn.preprocessing.scale() and then fit to a sklearn.ensemble.GradientBoostingRegressor.
So the scaling operation transforms data such that all the features are represented on the same scale, and the fitting operation trains the model with the said data.
From your question it sounds like you thought these two operations were mutually exclusive, or somehow equivalent, but they are actually logical consecutive steps.
In general, before model is trained, data is somehow preprocessed (with .scale() in this case), then trained. In sklearn the .fit() methods are for training (fitting functions/models to the data).
Hope it makes sense!

Scale is a data normalization technique and it is used when data in different features are of not similar values like in one feature you have values ranging from 1 to 10 and in other features you have values ranging from 1000 to 10000.
Where as fit is the function that actually starts your model training

Scaling is conversion of data, a method used to normalize the range of independent variables or features of data. The fit method is a training step.

Related

Using discretization before or after splitting data?

I am new to data mining concepts and have a question regarding implementation of a technique.
I am using the a dataset with large continuous values.
Now, I am trying to code an algorithm where I need to discretize data (not scale as it makes no impact on data along with the fact that algorithm is not a distance based one, hence no scaling needed).
Now for discretization, I have a similar question with regards to scaling and train test split.
For scaling, I know we should split data and then fit transform the train and transform the test based on what we fit from train.
But what do we do for discretization? I am using scikit learns KBinsDiscretizer and trying to make sense of whether I should split first and discretize the same way we normally scale or discretize first then scale.
The issue came up because I used the 17 bins, uniform strategy (0-16 value range)
With split then discretize, I get (0-16) range throughout in train but not in test.
With discretize and split, I get (0-16) range in both.
With former strategy, my accuracy is around 85% but with the latter, its a whopping 97% which leads me to believe I have definitely overfit the data.
Please advise on what I should be doing for discretization and whether the data interpretation was correct.

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of some large-scale tasks will be extremely bigger. The figure below is part of my overall targets. You can certainly find that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted loss techniques to balance the concentration of my model when it does gradient backpropagation. The result shows it didn't perform well.
Secondly, I have seen tremendous discussions regarding normalizing the input data, but hardly discovered any particular talking about normalizing the labels. It's partly because most of the people's problems are classification type and a single task. I do know pytorch provides a convenient approach to normalize the vision dataset by transform.normalize, which is still operated on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting but in vain. So I'd be grateful if you could give me a little clue about your studies.(If there is any progress)
Thanks.

when to normalize data with zscore (before or after split)

I was taking a udemy course, which made a strong case for normalizing only the train data (after the split from test data) since the model will typically used by fresh data, with features of the scale of the original set. And if you scale the test data, then you are not scoring the model properly.
On the other hand, what I found was that my two-class logistic regression model (created with Azure Machine Learning Studio) was getting terrible results after Z-Score scaling only the train data.
a. Is this a problem only with Azure's tools?
b. What is a good rule of thumb for when feature data needs to be scaled (one, two, or three orders of magnitude in difference)?
Not scoring the model properly due to normalized test set doesn't seem to make sense:
you would presumably also normalize data that you use for predictions in the future.
I found this similar question in datascience stackexchange and the top answer suggests not only that test data has to be normalized, but you need to apply the exact same scaling as you have done to the training data, because the scale of your data is also taken into account by your model: differently scaled test/prediction data would potentially lead to over/under-exaggeration of a feature.

How to handle shared data between samples and batches in Keras

I'm using Keras for timeseries prediction and I want to create a model that is based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is 1 problems with this.
There is a lot of unnecessary duplication of data where sample n has basically the same timesteps and features as sample n+1, only shifted 1 to the left.
How would you handle this assuming you dataset is very large?
I am not very familiar with this, but if your issue is "I have too many replicated data" I think you can solve your problem devising a generator for your data, and then pass the generator as input for the Keras/TensorFlow fit function (according to TensorFlow APIs specification, it is stated that it supports generators as input).
If your question is related to the logic behind the model, I do not see the issue. It is like that you have a sliding window, for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you argue a little more about your concern?

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
If you have lots and lots and lots and lots of data (say an internet full of good examples) Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to try and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

Resources