citation for square root transformation for negatively skewed data - statistics

I have a dataset with a couple of variables that are moderately to severely negatively skewed. I have tried a number of different transformations, and a couple of them work well for me in terms of the distribution of the residuals:
sqrt(max(x+1) - x)
1/(max(x+1) - x)
I found out about these transformations from https://www.datanovia.com/en/lessons/transform-data-to-normal-distribution-in-r/
I was wondering whether these are acceptable transformations. Are there any references that can be cited when using them?
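For what it's worth, a minimal sketch of what these reflect-and-transform formulas do in practice (assuming x is a NumPy array holding the skewed variable; the toy data is made up):

import numpy as np

x = np.array([2.0, 7.5, 8.0, 8.5, 9.0, 9.2, 9.5, 9.8])   # negatively skewed toy values

# Reflect so the long tail points to the right; max(x + 1) - x == max(x) + 1 - x >= 1
reflected = np.max(x) + 1 - x

sqrt_t = np.sqrt(reflected)    # suggested for moderate negative skew
recip_t = 1.0 / reflected      # suggested for more severe negative skew

Note that because of the reflection, high values of x map to low transformed values under the square-root version, which is worth remembering when interpreting results.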

Related

Differences between QuantileTransformer and PowerTransformer

In sklearn, the documentation for QuantileTransformer says:
This method transforms the features to follow a uniform or a normal distribution
while the documentation for PowerTransformer says:
Apply a power transform featurewise to make data more Gaussian-like
It seems both of them can transform features to a Gaussian/normal distribution. What are the differences in this respect, and when should each be used?
The terminology is confusing, because "Gaussian" and "normal" refer to the same distribution.
QuantileTransformer and PowerTransformer are both non-linear.
To answer your question about what exactly the difference is, here is how https://scikit-learn.org puts it:
"QuantileTransformer provides non-linear transformations in which distances between marginal outliers and inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal distribution to stabilize variance and minimize skewness. "
Source and more info here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
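To make the quoted difference concrete, a small sketch (the data here is made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))             # strongly skewed feature

pt = PowerTransformer(method="yeo-johnson")   # parametric: one learned power per feature
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100, random_state=0)

X_pt = pt.fit_transform(X)   # less skewed, but usually not perfectly Gaussian
X_qt = qt.fit_transform(X)   # ranks pushed through the normal quantile function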
The main difference is that PowerTransformer() is parametric while QuantileTransformer() is non-parametric. Box-Cox or Yeo-Johnson will make your data look more 'normal' (i.e. less skewed and more centered), but it's often still far from a perfect Gaussian. QuantileTransformer(output_distribution='normal') usually produces results that look much closer to Gaussian, at the cost of distorting linear relationships somewhat more. I believe there's no rule of thumb to decide which one will work better in a given case, but it's worth noting that you can select an optimal scaler in a pipeline when doing e.g. GridSearchCV(), as sketched below.
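A sketch of that last point, with a pipeline that lets the grid search choose the transformer (the Ridge estimator and the grid values are arbitrary placeholders):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.linear_model import Ridge

pipe = Pipeline([("scale", PowerTransformer()), ("model", Ridge())])

param_grid = {
    "scale": [PowerTransformer(),
              QuantileTransformer(output_distribution="normal", n_quantiles=100)],
    "model__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X, y)   # X, y: your training data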

Questions about standardizing and scaling

I am trying to generate a model that uses several physico-chemical properties of a molecule (incl. number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The types and ranges of the features vary. Some are int64 while others are float. Some features generally have small (positive or negative) values, while others have very large values. I have tried using various scalers (e.g. standard scaler, normalize, min-max scaler, etc.), yet the R2/Q2 are still low. I have a few questions:
Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?
If yes, and I identify some important features (by expert knowledge), is it OK to scale all features except those? Or to scale only the important features?
Some of the features, although not always correlated, have values that are in a similar range (e.g. 100-400), compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that are within the same range?
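For reference, a minimal sketch of the kind of setup described above (the column indices and the number of PLS components are placeholders, not recommendations):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import PLSRegression

# Usual approach: scale every feature, then fit PLS.
pls_all_scaled = Pipeline([("scale", StandardScaler()),
                           ("pls", PLSRegression(n_components=5))])

# Question 3 above: scale only a chosen group of columns (here columns 0-3).
partial_scaling = ColumnTransformer([("scaled", StandardScaler(), [0, 1, 2, 3])],
                                    remainder="passthrough")
pls_partial = Pipeline([("scale", partial_scaling),
                        ("pls", PLSRegression(n_components=5))])
# pls_all_scaled.fit(X, y)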
The whole idea of scaling is to make models more robust to the analysis of the feature space. For example, if you have two features given as 5 kg and 5000 g, we know both are the same, but algorithms that are sensitive to the metric space, such as KNN, PCA, etc., will be weighted towards the second feature, so scaling must be done for those algorithms.
Now, coming to your questions:
Scaling doesn't affect the significance of features. As explained above, it enables a better analysis of the data.
No, you should not, for the reason explained above.
If you want to include domain knowledge in your model, you can use it as prior information; in short, for a linear model this is the same as regularization. If you think you have many useless features, you can use L1 regularization, which creates a sparse effect on the feature space, i.e. it assigns zero weight to the useless features (see the short sketch after this answer). Here is the link for more info.
One more point: some methods, such as tree-based models, don't need scaling. In the end, it mostly depends on the model you choose.
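A minimal sketch of the L1 idea mentioned above (Lasso is just one possible L1-regularized linear model; the alpha value is a placeholder):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Scale first, then fit an L1-penalized linear model; useless features get weight 0.
model = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.1))])
# model.fit(X, y)
# model.named_steps["lasso"].coef_   # zeros mark the features the penalty dropped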
Lose significance? Yes. Contribute less? No.
No, it's not OK. It's either all or nothing.
No. The idea of scaling is not to decrease or increase the significance or effect of a variable. It's to transform all variables to a common scale that can be interpreted.

Feature scaling and its effect on various algorithms

Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to it while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these four algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on dimension scale, and there is no continuous distance metric, we again have invariance.
Somewhat different reasons for each, but I hope that helps.
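One quick way to see the distance-metric point is to run K-means with and without scaling on toy data where the two features live on very different scales (purely illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 0 in grams (thousands), feature 1 in metres (around 1-2).
X = np.column_stack([rng.normal(5000, 1000, 200), rng.normal(1.7, 0.1, 200)])

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
# Without scaling, the clustering is driven almost entirely by the gram-scale feature;
# after scaling, both features contribute to the distances.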

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries - Spark MLlib and Spark ML. They do somewhat overlap in what is implemented, but as I understand (as a person new to the whole Spark ecosystem) Spark ML is the way to go and MLlib is still around mostly for backward compatibility.
My question is very concrete and related to PCA. In the MLlib implementation there seems to be a limitation on the number of columns:
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
Also, if you look at the Java code example, there is this:
The number of columns should be small, e.g, less than 1000.
On the other hand, if you look at ML documentation, there are no limitations mentioned.
So, my question is - does this limitation also exist in Spark ML? And if so, why is there a limitation, and is there any workaround to be able to use this implementation even when the number of columns is large?
PCA consists of finding a set of decorrelated random variables that you can represent your data with, sorted in decreasing order with respect to the amount of variance they retain.
These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.
When X is large, say of dimensions n x d, you can compute X^T X by computing the outer product of each row of the matrix by itself, then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is. That's because the outer product of each row by itself is a d x d matrix, which will have to be manipulated in main memory by each worker. That's why you might run into trouble when handling many columns.
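In NumPy terms, the map-reduce idea described above boils down to something like this (a toy illustration, not the actual Spark code):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 50))            # n rows, d columns, d small
Xc = X - X.mean(axis=0)                     # mean-center

# "map": outer product of each row with itself; "reduce": sum the partial results.
gram = np.zeros((Xc.shape[1], Xc.shape[1]))
for row in Xc:
    gram += np.outer(row, row)              # each partial result is only d x d

eigvals, eigvecs = np.linalg.eigh(gram)     # eigenvectors of X^T X give the principal directions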
If the number of columns is large (and the number of rows not so much so) you can indeed compute PCA. Just compute the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There's your orthogonal subspace.
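A NumPy sanity check of that trick, written via the equivalent eigendecomposition of the small n x n matrix X X^T rather than the huge d x d one (again only a sketch, not what spark.ml does):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5000))            # few rows, many columns
Xc = X - X.mean(axis=0)

small = Xc @ Xc.T                           # n x n instead of d x d
eigvals, V = np.linalg.eigh(small)          # eigenvectors of X X^T
keep = eigvals > 1e-10
directions = Xc.T @ V[:, keep] / np.sqrt(eigvals[keep])   # eigenvectors of X^T X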
Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If they check the dimensions of the input dataset to decide whether they should go for the second approach, then you won't have problems dealing with large numbers of columns if the number of rows is small.
Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling by themselves, rather than suggesting a limitation that may not apply for some. That might be the reason why they decided not to mention the limitation in the new docs.
Update: The source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535, and at 10,000 they issue a warning.
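For completeness, the spark.ml PCA call itself looks roughly like this (assuming an existing SparkSession named spark; the data and column names are placeholders):

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 0.0, 7.0]),),
        (Vectors.dense([2.0, 1.0, 5.0]),),
        (Vectors.dense([4.0, 3.0, 2.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
result = model.transform(df).select("pca_features")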

What are "Factor Graphs" and what are they useful for?

A friend is using Factor Graphs to do text mining (identifying references to people in text), and it got me interested in this tool, but I'm having a hard time finding an intuitive explanation of what Factor Graphs are and how to use them.
Can anyone provide an explanation of Factor Graphs that isn't math heavy, and which focusses on practical applications rather than abstract theory?
They are used extensively for breaking down a problem into pieces. One very interesting application of factor graphs (and message passing on them) is the XBox Live TrueSkill algorithm. I wrote extensively about it on my blog where I tried to go for an introductory explanation rather than an overly academic one.
A factor graph is the graphical representation of the dependencies between variables and factors (parts of a formula) that are present in a particular kind of formula.
Suppose you have a function f(x_1,x_2,...,x_n) and you want to compute the marginalization of this function for some argument x_i, i.e. summing over all assignments to the remaining variables. Further, f can be broken into factors, e.g.
f(x_1,x_2,...,x_n)=f_1(x_1,x_2)f_2(x_5,x_8,x_9)...f_k(x_1,x_10,x_11)
Then, in order to compute the marginalization of f for some of the variables, you can use a special algorithm called sum-product (or message passing) that breaks the problem into smaller computations. For this algorithm, it is very important which variables appear as arguments to which factor. This information is captured by the factor graph.
A factor graph is a bipartite graph with factor nodes and variable nodes, and there is an edge between a factor node and a variable node if the variable appears as an argument of that factor. In our example there would be an edge between the factor f_2 and the variable x_5, but not between f_2 and x_1.
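A tiny sketch of that structure in Python, with a brute-force marginalization just to show where the factorization enters (real code would use the sum-product algorithm instead of enumerating every assignment):

import itertools

# Toy factor graph over three binary variables:
# f(x1, x2, x3) = f1(x1, x2) * f2(x2, x3)
# Each factor node lists the variable nodes it is connected to (the edges of the graph).
factors = {
    "f1": (("x1", "x2"), lambda a: 1.0 if a["x1"] == a["x2"] else 0.5),
    "f2": (("x2", "x3"), lambda a: 2.0 if a["x2"] != a["x3"] else 1.0),
}
variables = ["x1", "x2", "x3"]

# Brute-force (unnormalized) marginal of x1: sum the product of all factors
# over every assignment of the remaining variables.
marginal = {0: 0.0, 1: 0.0}
for values in itertools.product([0, 1], repeat=len(variables)):
    a = dict(zip(variables, values))
    weight = 1.0
    for scope, factor in factors.values():
        weight *= factor(a)
    marginal[a["x1"]] += weight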
There is a great article: Factor graphs and the sum-product algorithm.
A factor graph is a mathematical model, and it can only be fully explained with math. In a nutshell, it is a way to describe complex relations between the variables of interest in your model. Example: A is temperature, B is pressure, components C, D, E depend on B and A in some way, and component K also depends on B and A. You want to predict the value of K based on A and B, but you only observe the visible states. Basic ML libraries don't let you model such a structure; neural networks do it somewhat better, and factor graphs solve exactly that problem.
Factor graphs are in a similar spirit to deep learning: when it is impossible to describe the model directly in terms of features and an output, factor models allow you to build hidden states, layers, and complex structures of variables to fit real-world behavior. Examples are machine translation alignment, fingerprint recognition, coreference resolution, etc.
