Questions about standardizing and scaling - scikit-learn

I am trying to generate a model that uses several physico-chemical properties of a molecule (incl. number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS Regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The type and range for the features varies. Some are int64 while other are float. Some features generally have small (positive or negative) values, while other have very large value. I have tried using various scalers (e.g. standard scaler, normalize, minmax scaler, etc.). Yet, the R2/Q2 are still low. I have a few questions:
Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?
If yes, if I identify some important features (by expert knowledge), is it OK to scale other features but those? Or scale the important features only?
Some of the features, although not always correlated, have values that are in a similar range (e.g. 100-400), compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that are within the same range?

The whole idea of scaling is to make models more robust to analysis on features space. For example, if you have 2 features as 5 Kg and 5000 gm, we know both are same, but for some algorithm, which are sensitive to metric space such as KNN, PCA etc, they will be more weighted towards second features, so scaling must be done for these algos.
Now coming to your question,
Scaling doesn't effect the significance of features. As i explained above, it helps in better analysis of data.
No, you should not do, reason explained above.
If you want to include domain knowledge in your model, you can use it as prior information. In short, for linear model, this is same as regularization. It has very good features. if you think, you have many useless-features, you can use L1 regularization, which creates sparse effect on features space, which is nothing but assign 0 weight to useless features. Here is the link for more-info.
One more point, some method such as tree based model doesn't need scaling, In last, it mostly depend on the model, you choose.

Lose significance? Yes. Contribute less? No.
No, it's not OK. It's either all or nothing.
No. The idea of scaling is not to decrease / increase significance / effect of a variable. It's to transform all variables to a common scale that can be interpreted.

Related

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggest, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and the two sub-samples (before and after break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
I hope this helps!

Feature scaling and its affect on various algorithm

Despite going through lots of similar question related to this I still could not understand why some algorithm is susceptible to it while others are not.
Till now I found that SVM and K-means are susceptible to feature scaling while Linear Regression and Decision Tree are not.Can somebody please elaborate me why? in general or relating to this 4 algorithm.
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, uses an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on dimension scale, since there is no continuous distance metric, we again have in-variance.
Somewhat different reasons for each, but I hope that helps.

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole. (i.e. is it normal).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (100's or even 1000's). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range, tend to fall more towards one side or the other of that range. If they tend to the high range, I'd predict that they get lopped off at the end of the range and have a long tail to the low side, and vice versa if they tend to the low range.

I need a function that describes a set of sequences of zeros and ones?

I have multiple sets with a variable number of sequences. Each sequence is made of 64 numbers that are either 0 or 1 like so:
Set A
sequence 1: 0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0
sequence 2:
0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
sequence 3:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0
...
Set B
sequence1:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
sequence2:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0
...
I would like to find a mathematical function that describes all possible sequences in the set, maybe even predict more and that does not contain the sequences in the other sets.
I need this because I am trying to recognize different gestures in a mobile app based on the cells in a grid that have been touched (1 touch/ 0 no touch). The sets represent each gesture and the sequences a limited sample of variations in each gesture.
Ideally the function describing the sequences in a set would allow me to test user touches against it to determine which set/gesture is part of.
I searched for a solution, either using Excel or Mathematica, but being very ignorant about both and mathematics in general I am looking for the direction of an expert.
Suggestions for basic documentation on the subject is also welcome.
It looks as if you are trying to treat what is essentially 2D data in 1D. For example, let s1 represent the first sequence in set A in your question. Then the command
ArrayPlot[Partition[s1, 8]]
produces this picture:
The other sequences in the same set produce similar plots. One of the sequences from the second set produces, in response to the same operations, the picture:
I don't know what sort of mathematical function you would like to define to describe these pictures, but I'm not sure that you need to if your objective is to recognise user gestures.
You could do something much simpler, such as calculate the 'average' picture for each of your gestures. One way to do this would be to calculate the average value for each of the 64 pixels in each of the pictures. Perhaps there are 6 sequences in your set A describing gesture A. Sum the sequences element-by-element. You will now have a sequence with values ranging from 0 to 6. Divide each element by 6. Now each element represents a sort of probability that a new gesture, one you are trying to recognise, will touch that pixel.
Repeat this for all the sets of sequences representing your set of gestures.
To recognise a user gesture, simply compute the difference between the sequence representing the gesture and each of the sequences representing the 'average' gestures. The smallest (absolute) difference will direct you to the gesture the user made.
I don't expect that this will be entirely foolproof, it may well result in some user gestures being ambiguous or not recognisable, and you may want to try something more sophisticated. But I think this approach is simple and probably adequate to get you started.
In Mathematica the following expression will enumerate all the possible combinations of {0,1} of length 64.
Tuples[{1, 0}, {64}]
But there are 2^62 or 18446744073709551616 of them, so I'm not sure what use that will be to you.
Maybe you just wanted the unique sequences contained in each set, in that case all you need is the Mathematica Union[] function applied to the set. If you have a the sets grouped together in a list in Mathematica, say mySets, then you can apply the Union operator to every set in the list my using the map operator.
Union/#mySets
If you want to do some type of prediction a little more information might be useful.
Thanks you for the clarifications.
Machine Learning
The task you want to solve falls under the disciplines known by a variety of names, but probably most commonly as Machine Learning or Pattern Recognition and if you know which examples represent the same gestures, your case would be known as supervised learning.
Question: In your case do you know which gesture each example represents ?
You have a series of examples for which you know a label ( the form of gesture it is ) from which you want to train a model and use that model to label an unseen example to one of a finite set of classes. In your case, one of a number of gestures. This is typically known as classification.
Learning Resources
There is a very extensive background of research on this topic, but a popular introduction to the subject is machine learning by Christopher Bishop.
Stanford have a series of machine learning video lectures Standford ML available on the web.
Accuracy
You might want to consider how you will determine the accuracy of your system at predicting the type of gesture for an unseen example. Typically you train the model using some of your examples and then test its performance using examples the model has not seen. The two of the most common methods used to do this are 10 fold Cross Validation or repeated 50/50 holdout. Having a measure of accuracy enables you to compare one method against another to see which is superior.
Have you thought about what level of accuracy you require in your task, is 70% accuracy enough, 85%, 99% or better?
Machine learning methods are typically quite sensitive to the specific type of data you have and the amount of examples you have to train the system with, the more examples, generally the better the performance.
You could try the method suggested above and compare it against a variety of well proven methods, amongst which would be Random Forests, support vector machines and Neural Networks. All of which and many more are available to download in a variety of free toolboxes.
Toolboxes
Mathematica is a wonderful system, is infinitely flexible and my favourite environment, but out of the box it doesn't have a great deal of support for machine learning.
I suspect you will make a great deal of progress more quickly by using a custom toolbox designed for machine learning. Two of the most popular free toolboxes are WEKA and R both support more than 50 different methods for solving your task along with methods for measuring the accuracy of the solutions.
With just a little data reformatting, you can convert your gestures to a simple file format called ARFF, load them into WEKA or R and experiment with dozens of different algorithms to see how each performs on your data. The explorer tool in WEKA is definitely the easiest to use, requiring little more than a few mouse clicks and typing some parameters to get started.
Once you have an idea of how well the established methods perform on your data you have a good starting point to compare a customised approach against should they fail to meet your criteria.
Handwritten Digit Recognition
Your problem is similar to a very well researched machine learning problem known as hand written digit recognition. The methods that work well on this public data set of handwritten digits are likely to work well on your gestures.

What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have set of 200 data rows(implies a small set of data). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are the potential algos for the purpose? Accuracy is a matter of concern.
I am very new to Stats, so need help in very basic algos.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means setting the top and bottom x% to the xth and 1-xth percentile value respectively.
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu (original link is dead, this is sourced from archive.org).
You may have heard the expression 'six sigma'.
This refers to plus and minus 3 sigma (ie, standard deviations) around the mean.
Anything outside the 'six sigma' range could be treated as an outlier.
On reflection, I think 'six sigma' is too wide.
This article describes how it amounts to "3.4 defective parts per million opportunities."
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.
And it's very simple to conceptualize and explain. On the other hand, it's non deterministic, which may cause problems depending on the application.
Compute the standard deviation on the set, and exclude everything outside of the first, second or third standard deviation.
Here is how I would go about it in SQL Server
The query below will get the average weight from a fictional Scale table holding a single weigh-in for each person while not permitting those who are overly fat or thin to throw off the more realistic average:
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join ( select d.Gender, Avg(d.Weight) as AvgWeight,
2*STDDEVP(d.Weight) StdDeviation
from ScaleData d
group by d.Gender
) d
on w.Gender = d.Gender
and w.Weight between d.AvgWeight-d.StdDeviation
and d.AvgWeight+d.StdDeviation
group by w.Gender
There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.
NOTE: the above removes the top and bottom 5% of outliers out of the picture for purpose of the Average. You can adjust the number of outliers removed by adjusting the 2* in the 2*STDDEVP as per: http://en.wikipedia.org/wiki/Standard_deviation
If you want to just analyse it, say you want to compute the correlation with another variable, its ok to exclude outliers. But if you want to model / predict, it is not always best to exclude them straightaway.
Try to treat it with methods such as capping or if you suspect the outliers contain information/pattern, then replace it with missing, and model/predict it. I have written some examples of how you can go about this here using R.

Resources