sklearn customized standarization of data - scikit-learn

Suppose I have a 2D numpy array:
X = np.array[
[..., ...],
[..., ...]]
And I want to standardize the data either with:
X = StandardScaler().fit_transform(X)
or:
X = (X - X.mean())/X.std()
The results are different. Why are they different?

Assuming X is a feature matrix of shape (n x m) (n instances and m features). We want to scale each feature so its instances are distributed with a mean of zero and with unit variance.
To do this you need to calculate the mean and standard deviation of each feature for the provided instances (column of X) and then calculate the scaled feature vectors. Currently you are calculating the mean and standard deviation of the whole dataset and scaling the data using these values: this will give you meaningless results in all but a few special cases (i.e., X = np.ones((100,2)) is such a special case).
Practically, to calculate these statistics for each feature you will need to set the axis parameter of the .mean() or .std() methods to 0. This will perform the calculations along the columns and return a (1 x m) shaped array (actually a (m,) array, but thats another story), where each value is the mean or standard deviation for the given column. You can then use numpy broadcasting to correctly scale the feature vectors.
The below example shows how you can correctly implement it manually. x1 and x2 are 2 features with 100 training instances. We store them in a feature matrix X.
x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]
# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)
# scale the data taking mean and std along columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)
If you print the two you will see they match exactly, explicitly:
print(np.sum(X_scaled-X_scaled_manual))
returns 0.0.

Related

taking the norm of 3 vectors in python

This is probably a stupid question, but for some reason I can't get the norm of three matrices of vectors.
Each vector in the x matrix represents the x coordinate of a sensor (8 sensors total) for three different experiments. Same for y and z.
ex:
x = [array([ 2.239, 3.981, -8.415, 33.895, 48.237, 52.13 , 60.531, 56.74 ]), array([ 2.372, 6.06 , -3.672, 3.704, -5.926, -2.341, 35.667, 62.097])]
y = [array([ 18.308, -17.83 , -22.278, -99.67 , -121.575, -116.794,-123.132, -127.802]), array([ -3.808, 0.974, -3.14 , 6.645, 2.531, 7.312, -129.236, -112. ])]
z = [array([-1054.728, -1054.928, -1054.928, -1058.128, -1058.928, -1058.928, -1058.928, -1058.928]), array([-1054.559, -1054.559, -1054.559, -1054.559, -1054.559, -1054.559, -1057.959, -1058.059])]
I tried doing:
norm= np.sqrt(np.square(x)+np.square(y)+np.square(z))
x = x/norm
y = y/norm
z = z/norm
However, I'm pretty sure its wrong. When I then try and sum the components of let's say np.sum(x[0]) I don't get anywhere close to 1.
Normalization does not make the sum of the components equal to one. Normalization makes the norm of the vector equal to one. You can check if your code worked by taking the norm (square root of the sum of the squared elements) of the normalized vector. That should equal 1.
From what I can tell, your code is working as intended.
I made a mistake - your code is working as intended, but not for your application. You could define a function to normalize any vector that you pass to it, much as you did in your program as follows:
def normalize(vector):
norm = np.sqrt(np.sum(np.square(vector)))
return vector/norm
However, because x, y, and z each have 8 elements, you can't normalize x with the components from x, y, and z.
What I think you want to do is normalize the vector (x,y,z) for each of your 8 sensors. So, you should pass 8 vectors, (one for each sensor) into the normalize function I defined above. This might look something like this:
normalized_vectors = []
for i in range(8):
vector = np.asarray([x[i], y[i],z[i]])
normalized_vectors.append = normalize(vector)

How to calculate Covariance and Correlation in Python without using cov and corr?

How can we calculate the correlation and covariance between two variables without using cov and corr in Python3?
At the end, I want to write a function that returns three values:
a boolean that is true if two variables are independent
covariance of two variables
correlation of two variables.
You can find the definition of correlation and covariance here:
https://medium.com/analytics-vidhya/covariance-and-correlation-math-and-python-code-7cbef556baed
I wrote this part for covariance:
'''
ans=[]
mean_x , mean_y = x.mean() , y.mean()
n = len(x)
Cov = sum((x - mean_x) * (y - mean_y)) / n
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
'''
For the covariance, just subtract the respective means and multiply the vectors together (using the dot product). (Of course, make sure whether you're using the sample covariance or population covariance estimate -- if you have "enough" data the difference will be tiny, but you should still account for it if necessary.)
For the correlation, divide the covariance by the standard deviations of both.
As for whether or not two columns are independent, that's not quite as easy. For two random variables, we just have that $\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right] = 0$, where $\mu_X, \mu_Y$ are the means of the two variables. But, when you have a data set, you are not dealing with the actual probability distributions; you are dealing with a sample. That means that the correlation will very likely not be exactly $0$, but rather a value close to $0$. Whether or not this is "close enough" will depend on your sample size and what other assumptions you're willing to make.

Solving vector second order differential equation while indexing into an array

I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocity in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of theta, C is a 2x1 vector that is a function of theta and theta' and B is a 2x2 matrix of constants. m(t) is a 2*1001 array containing the torques applied to each of the two joints at the 1001 time steps and I would like to calculate the evolution of the angles as a function of those 1001 time steps.
I've transformed it to standard form such that :
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first order linear system of equations:
y_2 = y_1'
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
phi_1 = np.array([theta_array[0], theta_array[1]])
phi_2 = np.array([theta_array[2], theta_array[3]])
def M_func(phi):
M = np.array([[a_1+2.*a_2*np.cos(phi[1]), a_3+a_2*np.cos(phi[1])],[a_3+a_2*np.cos(phi[1]), a_3]])
return np.linalg.inv(M)
def C_func(phi, phi_dot):
return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2. * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])
dphi_2dt = M_func(phi_1) # (torques[:, t] - C_func(phi_1, phi_2) - B # phi_2)
return dphi_2dt, phi_2
t = np.linspace(0,1,1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args = (torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense, however I am not sure how to have it use the current value of the torques at each time step.
I also tried putting odeint command in a for loop and only evaluating it at one time step at a time, using the solution of the function as the initial conditions for the next loop, however the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form but I can't work out what it is. It would be preferable however to not have to call the odeint solver in a for loop every time, and rather do it all as one.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses an internal stepping that is problem adapted. The given time list is a list of points where the internal solution gets interpolated for output samples. The internal and external time lists are in no way related, the internal list only depends on the given tolerances.
There is no actual natural relation between array indices and sample times.
The translation of a given time into an index and construction of a sample value from the surrounding table entries is called interpolation (by a piecewise polynomial function).
Torque as a physical phenomenon is at least continuous, a piecewise linear interpolation is the easiest way to transform the given function value table into an actual continuous function. Of course one also needs the time array.
So use numpy.interp1d or the more advanced routines of scipy.interpolate to define the torque function that can be evaluated at arbitrary times as demanded by the solver and its integration method.

Defining new kernels to use in approximate_kernel.py

I am trying to test a new kernel method in Kernel Ridge Regression and want to do this by implementing the Fastfood transformation (https://arxiv.org/abs/1408.3060). I can write a function which computes this transform but it isn't playing nicely with the kernel ridge regression function in sklearn. As a result I have gone to the source code for sklearn kernel ridge regression (https://insight.io/github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_ridge.py) and approximate_kernel.py (https://insight.io/github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_approximation.py) in order to try and define this new kernel as a class definition in approximate_kernel.py. The problem is that I have no idea how to convert my construction to something which will work in the approximate_kernel KernelRidge programs. Would anybody be able to advise how best to do this please?
My construction for the fastfood transform is:
def fastfood_product(d):
'''
Constructs the fastfood matrix composition V = const*S*H*G*Pi*B where
S is a scaling matrix
H is Hadamard transform
G is a diagonal random Gaussian
Pi is a permutation matrix
B is a diagonal Rademacher matrix.
Inputs: n - dimensionality of the feature vectors for the kernel.
must be a power of two and be divisible by d. If not then can
pad the matrix with zeros but for simplicity assume this condition
is always met.
Output: V'''
S = np.zeros(shape=(d,d))
G = np.zeros_like(S)
B = np.zeros_like(S)
H = hadamard(d)
Pi = np.eye(d)
np.random.shuffle(Pi) # Permutation matrix
# Construct the simple matrices
np.fill_diagonal(B, 2*np.random.randint(low=0,high=2,size=(d,1)).flatten() - 1)
np.fill_diagonal(G, np.random.randn(G.shape[0],1)) # May want to change standard normal to arbitrary which will affect the scaling for V
np.fill_diagonal(S, np.linalg.norm(G,'fro')**(-0.5))
#print('Shapes of B {}, S {}, G {}, H{}, Pi {}'.format(B.shape, S.shape, G.shape, H.shape, Pi.shape))
V = d**(-0.5)*S.dot(H).dot(G).dot(Pi).dot(H).dot(B)
return V
def fastfood_feature_map(X, n):
'''Given a matrix X of data compute the fastfood transformation and feature mapping.
Input: X data of dimension d by m, n = the number of nonlinear basis functions to choose (power of 2)
Outputs: Phi - matrix of random features for fastfood kernel approximation.
Usage: Phi must be transposed for computation in the kernel ridge regression.
i.e solve ||Phi.T * w - b || + regulariser
Comments: This only uses a standard normal distribution but this could
be altered with different hyperparameters.'''
d,m = A.shape
V = fastfood_product(d)
Phi = n**(-0.5)*np.exp(1j*np.dot(V, X))
return Phi
I think the imports numpy as np and from linalg import hadamard will be necessary for the above.

Calculating p-values with pnorm ( ). What makes p-values differ if data is transformed?

I am comparing two alternatives for calculating p-values with R's pnorm() function.
xbar <- 2.1
mu <- 2
sigma <- 0.25
n = 35
# z-transformation
z <- (xbar - mu) / (sigma / sqrt(n))
# Alternative I using transformed values
pval1 <- pnorm(q = z)
# Alternative II using untransformed values
pval2 <- pnorm(q = xbar, mean = mu, sd = sigma)
How come the two calculated p-values are not the same? Should not they?
They are different because you use two different estimates of the standard deviation.
In the z-transformation calculation you use sigma / sqrt(n) as the standard deviation, but in the untransformed calculation you use sd = sigma, ignoring n.

Resources