fitting for offset in a patsy model - python-3.x

Using patsy, I understand how to turn intercepts on or off. But I haven't managed to get horizontal offsets. For instance, I would like to be able to fit, in essence
y = alpha + beta * abs(x_opt - x_obs)
with x_opt free in the fit. I tried to write this like so:
y ~ 1 + np.abs(y - x)
using a constant column for y. Inside the np.abs() parentheses, however, patsy "turns off," and y - x is simply evaluated as a number: if I shift the constant y to 1 or 20, I get different answers.
A similar question applies to, e.g., np.pow(1 - x, 2) or a sine wave. Being able to fit for the x offset would be extremely helpful. Is this possible? Or is this precisely what is meant by patsy not handling non-linear models?

patsy and most of statsmodels only handle models that are linear in parameters, or more precisely, models where the design matrix and the estimated parameters are combined in a linear way, x * beta.
Polynomials and splines are nonlinear in the underlying variables but have a linear representation in terms of basis functions and are therefore linear in parameters.
The only non-linearities in the models that are currently implemented in statsmodels are predefined nonlinearities like link functions in GLM or discrete models, shape parameters in models like NegativeBinomial, or covariances in mixed models and GEE.
The best Python package for nonlinear least squares is currently lmfit https://pypi.python.org/pypi/lmfit/
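For example, here is a minimal sketch of this kind of fit using scipy.optimize.curve_fit (lmfit's Model interface wraps the same idea with extra conveniences); the data, the true offset of 1.5, and the starting values are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit
def model(x, alpha, beta, x_opt):
    # y = alpha + beta * |x_opt - x|, with the horizontal offset x_opt as a free parameter
    return alpha + beta * np.abs(x_opt - x)
rng = np.random.default_rng(0)
x_obs = np.linspace(-5.0, 5.0, 50)
y_obs = 2.0 + 0.7 * np.abs(1.5 - x_obs) + rng.normal(scale=0.1, size=x_obs.size)
# p0 gives starting values for (alpha, beta, x_opt)
params, _ = curve_fit(model, x_obs, y_obs, p0=[1.0, 1.0, 0.0])
alpha_hat, beta_hat, x_opt_hat = params
print(alpha_hat, beta_hat, x_opt_hat)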

Related

Determining the Distance between two matrices using numpy

I am developing my own architecture search algorithm using Python's numpy. Currently I am trying to develop a cost function that measures the distance between two matrices, X and Y.
I'd like to reduce the difference between the two, to a meaningful scalar value.
Ideally between 0 and 1, so that if both sets of elements within the matrices are the same numerically and positionally, a 0 is returned.
In the example below, I have the output of my algorithm X. Both X and Y are the same shape. I tried to sum the difference between the two matrices; however I'm not sure that using summation will work in all conditions. I also tried returning the mean. I don't think that either approach will work though. Aside from looping through both matrices and comparing elements directly, is there a way to capture the degree of difference in a scalar?
import numpy as np
Y = np.arange(25).reshape(5, 5)
for i in range(1000):
    X = algorithm(Y)
    # try to reduce the difference between the two matrices to a scalar value
    cost = np.sum(X - Y)
There are many ways to calculate a scalar "difference" between two matrices. Here are just two examples.
The root-mean-square error:
((m1 - m2) ** 2).mean() ** 0.5
The max absolute error:
np.abs(m1 - m2).max()
The choice of the metric depends on your problem.
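For concreteness, here is a small sketch computing both metrics, plus one ad hoc way (an assumption, not a standard formula) to squash the result into [0, 1] as the question asks:
import numpy as np
m1 = np.arange(25, dtype=float).reshape(5, 5)
m2 = m1 + np.random.randn(5, 5) * 0.1
rmse = np.sqrt(((m1 - m2) ** 2).mean())   # root-mean-square error
max_abs = np.abs(m1 - m2).max()           # maximum absolute error
# ad hoc squashing into [0, 1]: 0 for identical matrices, approaching 1 as the error grows
score = rmse / (rmse + 1.0)
print(rmse, max_abs, score)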

Is elastic net equivalent in scikit-learn and glmnet?

In particular, the glmnet docs imply it creates a "Generalised Linear Model" of the gaussian family for regression, while the scikit-learn docs imply no such thing (i.e., it looks like a plain linear regression, not a generalised one). But I'm not sure about this.
In the documentation you link to, there is an optimization problem which shows exactly what is optimized in GLMnet:
1/(2N) * sum_i (y_i - beta_0 - x_i^T beta)^2 + lambda * [(1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1]
Now take a look here, where you will find the same formula written as the optimization of a euclidean norm. Note that the docs have omitted the intercept w_0, equivalent to beta_0, but the code does estimate it.
Please also note that glmnet's lambda becomes alpha in scikit-learn, and glmnet's alpha becomes l1_ratio (called rho in older scikit-learn versions)...
The "Gaussian family" aspect probably refers to the fact that an L2-loss is used, which corresponds to assuming that the noise is additive Gaussian.

scikit-learn Standard Scaler - get the standard deviation in the original unscaled space for GMM

Before running a GMM clustering model, I use a StandardScaler to transform my data into a zero-mean, unit-variance dataset.
Having then performed clustering, I am interested in representing the learned clusters back in the original space rather than the zero-mean, unit-variance one, where the feature values make more sense.
Is it then correct to do the following:
1) Get the mean by multiplying the mean of each GMM cluster by the scaler.mean_ parameters.
2) Get the standard deviation by multiplying the square of the diagonal covariance matrix by the scaler.std_ parameters.
I'd appreciate any feedback,
Thank you!
For the cluster centers you can use scaler.inverse_transform() directly (because they live in the same space as your data). It adds the column means back and scales each column back up by its standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.random.randn(10, 3)
scaler = StandardScaler()
scaler.fit(X)
You will then see that
scaler.inverse_transform(scaler.transform(X)) - X
is equal or extremely close to 0, so the two are essentially equal. To automate your pipeline, you should also take a look at sklearn.pipeline.Pipeline, which lets you chain your processing steps and invoke their transform and inverse_transform methods.
As for the rescaling of the covariance, you should multiply your cluster covariance matrices by np.diag(scaler.std_) on the left and on the right.
To answer your questions:
1) You obtain the mean by multiplying the cluster means by scaler.std_ and adding scaler.mean_ back.
2) You rescale the cluster covariances by multiplying on the left and on the right by np.diag(scaler.std_), i.e. rescaled_cov = np.diag(scaler.std_).dot(cov).dot(np.diag(scaler.std_))
Note: If your covariance matrices are rather large, you may not want to create another (diagonal, but dense) matrix of the same size. The operation scaler.std_[:, np.newaxis] * cov * scaler.std_ is equivalent mathematically to 2) but does not require creating the diagonal matrix.
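Putting the two steps together, here is a minimal sketch using the current scikit-learn names (GaussianMixture rather than GMM, scaler.scale_ rather than scaler.std_); the synthetic data and the choice of two components are only for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
X = np.random.randn(500, 3) * [2.0, 5.0, 0.5] + [10.0, -3.0, 1.0]   # made-up data
scaler = StandardScaler().fit(X)
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(scaler.transform(X))
# cluster centers back in the original units
means_orig = scaler.inverse_transform(gmm.means_)
# covariances back in the original units: scale on the left and right by the stds
scale = scaler.scale_   # current name for what is called scaler.std_ above
covs_orig = scale[np.newaxis, :, np.newaxis] * gmm.covariances_ * scale[np.newaxis, np.newaxis, :]
# per-feature standard deviations of each component, in original units
stds_orig = np.sqrt(np.diagonal(covs_orig, axis1=1, axis2=2))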

Fit a GMM to a 3D histogram in scikit-learn

The mixture model code in scikit-learn works on a list of individual data points, but what if you have a histogram? That is, I have a density value for every voxel, and I want the mixture model to approximate it. Is this possible? I suppose one solution would be to sample values from this histogram, but that shouldn't be necessary.
Scikit-learn has extensive utilities and algorithms for kernel density estimation, which is specifically centered around inferring distributions from things like histograms. See the documentation here for some examples. If you have no expectations for the distribution of your data, KDE might be a more general approach.
For a 2D histogram Z (your 2D array of voxel counts):
import numpy as np
from sklearn.mixture import GMM   # called GaussianMixture in recent scikit-learn
# create the coordinate values
X, Y = np.mgrid[0:Z.shape[0], 0:Z.shape[1]]
# artificially create a list of points from your histogram:
# add the data point / voxel (x, y) as many times as it occurs in the histogram
data_points = []
for x, y, z in zip(X.ravel(), Y.ravel(), Z.ravel()):
    data_points.extend([(x, y)] * int(z))
# now fit your GMM
gmm = GMM()
gmm.fit(data_points)
Though, as Kyle Kastner points out, there are better methods for achieving this. For a start, your histogram is binned, which already loses you some resolution. Can you get hold of the raw data from before it was binned?
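If the explicit loop is too slow for a large histogram, a vectorized equivalent (a sketch, assuming the counts in Z are non-negative integers) repeats each coordinate pair by its count with np.repeat:
import numpy as np
# X, Y are the coordinate grids and Z is the histogram, as above
X, Y = np.mgrid[0:Z.shape[0], 0:Z.shape[1]]
coords = np.column_stack([X.ravel(), Y.ravel()])   # one (x, y) pair per voxel
counts = Z.ravel().astype(int)                     # occurrences of each voxel
data_points = np.repeat(coords, counts, axis=0)    # repeat each pair by its count
# gmm.fit(data_points) then works exactly as before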

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.clustering.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression that clustering works better with this scaling. However, this scaling "standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed over a square area, say 100x100, I see no problem with the scaling. However, if they are distributed over a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. Doesn't this deteriorate the clustering? Or am I misunderstanding something?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances, whereas a uniform (non-distorting) scaling is equivalent to just using a different epsilon value.
Note that in the first example, apparently a similarity and not a distance matrix is processed: S = 1 - (D / np.max(D)) is a heuristic that converts the distance matrix D into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum distance observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
In the second example, by contrast, fit(X) actually processes the raw input data, not a distance matrix. IMHO it is an ugly hack to overload the method this way: it's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't "fit" a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration and needs O(n^2) memory, which DBSCAN usually would not.
In general, you need to make sure that your distance works: if your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance means (e.g. geographic data; standardizing this obviously does not make sense, and neither does Euclidean distance!).
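To make the "set epsilon in the original units" advice concrete, here is a minimal sketch (the toy coordinates and the 300 m threshold are made-up values) that feeds DBSCAN an explicit distance matrix instead of a similarity matrix:
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN
# toy planar coordinates in meters, spread over an 800x200 area (illustrative only)
X = np.random.rand(100, 2) * [800.0, 200.0]
# pairwise Euclidean distances, in the same units as the data
D = distance.squareform(distance.pdist(X))
# eps is now a real distance threshold (here 300 meters); no rescaling of the data needed
db = DBSCAN(eps=300.0, min_samples=10, metric='precomputed').fit(D)
labels = db.labels_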