I would like to change the distance used by KNeighborsClassifier from sklearn, i.e. the distance in feature space that determines which points are a given point's neighbors. More specifically, I want to use the following distance:
d(X1,X2) = 0.1 * |X1[0] - X2[0]| + 0.9*|X1[1] - X2[1]|
Thank you.
Just define your custom metric like this:
def mydist(X1, X2):
    return 0.1 * abs(X1[0] - X2[0]) + 0.9 * abs(X1[1] - X2[1])
Then initialize your KNeighborsClassifier using the metric parameter, like this:
clf = KNeighborsClassifier(n_neighbors=3, metric=mydist)
You can read more about distances available in sklearn and custom distance measures here
Just be sure that, per the official documentation, your custom metric satisfies the following properties:
Non-negativity: d(x, y) >= 0
Identity: d(x, y) = 0 if and only if x == y
Symmetry: d(x, y) = d(y, x)
Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)
Here is an example of a custom metric as well.
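For illustration, here is a minimal end-to-end sketch I'm adding (the toy data is made up purely for demonstration; the custom metric is the one defined above):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mydist(X1, X2):
    # weighted absolute-difference distance over the two features
    return 0.1 * abs(X1[0] - X2[0]) + 0.9 * abs(X1[1] - X2[1])

# toy two-feature data, purely illustrative
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.1], [1.1, 1.2]])
y = np.array([0, 1, 0, 1])

clf = KNeighborsClassifier(n_neighbors=3, metric=mydist)
clf.fit(X, y)
print(clf.predict([[0.2, 0.9]]))  # two of the three nearest neighbors under this metric are class 1

Keep in mind that a Python callable metric is typically much slower than the built-in metrics, since it is evaluated pairwise in Python.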
I'm trying to implement a custom MLE for the binomial distribution (for learning purposes) and I'm stuck on the implementation of the binomial coefficient in Google JAX: there is no analog of scipy.special.binom() implemented.
What should I use instead?
The binomial coefficient for general real-valued inputs can be computed in terms of the gamma function, which is available in JAX via jax.scipy.special.gammaln. Here's one way you could define it:
import jax.numpy as jnp
from jax.scipy.special import gammaln

def binom(x, y):
    return jnp.exp(gammaln(x + 1) - gammaln(y + 1) - gammaln(x - y + 1))
Here is a (sequential) integer implementation using JAX.
import jax
import jax.numpy as jnp

# note: uint64 requires 64-bit mode, i.e. jax.config.update("jax_enable_x64", True)
def binom_int_seq(x: int, y: int):
    def scan_body(carry, values):
        n, d = values
        carry = (carry * n) // d
        return carry, None

    y = max(y, x - y)
    nd = jnp.concatenate(
        (jnp.arange(y + 2, x + 1, dtype='u8')[:, None],
         jnp.arange(2, x - y + 1, dtype='u8')[:, None]),
        axis=1,
    )
    bc, *_ = jax.lax.scan(scan_body, jnp.array(y + 1, dtype='u8'), nd)
    return bc

binom_int_seq_jit = jax.jit(binom_int_seq, static_argnums=(0, 1))
which gives
import scipy as sp
import scipy.special  # makes sp.special available

x, y = 60, 31
bc_ref = sp.special.comb(x, y, exact=True)
# 114449595062769120

binom_int_seq(x, y) - bc_ref
# DeviceArray(0, dtype=uint64)

# using the logarithmic-gamma-based implementation above
binom(x, y) - bc_ref
# DeviceArray(496., dtype=float64, weak_type=True)
Keep in mind the binom_int_seq implementation is only correct if
(x-max(x-y, y))*sp.special.comb(x, y, exact=True) < jnp.iinfo(jnp.uint64).max
Unlike the real-valued version, the error will be sudden and catastrophic if this condition is not satisfied.
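As an illustration (not part of the original answer), a small hypothetical helper could check that condition before calling the integer implementation:

import numpy as np
import scipy.special

def binom_int_seq_is_safe(x: int, y: int) -> bool:
    # conservative overflow check taken from the condition above;
    # exact=True keeps the reference value as an arbitrary-precision Python int
    bound = np.iinfo(np.uint64).max
    return (x - max(x - y, y)) * scipy.special.comb(x, y, exact=True) < bound

print(binom_int_seq_is_safe(60, 31))  # True for the example above
print(binom_int_seq_is_safe(80, 40))  # False: C(80, 40) is ~1e23, far above the uint64 limit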
There may be other ways to relax this constraint, such as carrying out cancellations based on prime factorisation, without resorting to wider unsigned integers (or arbitrary precision).
A monoidal version could be implemented that computes the numerator and denominator reductions separately and then integer-divides, but this places stricter constraints on the maximum arguments.
I'm trying to estimate the parameters (v, n, k) defined in fit_func. I tried the default least-squares fit, but I couldn't recover the parameters successfully.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def fit_func(x, v, n, k):
    return v * x ** n / (k ** n + x ** n)

x = np.array([2.5, 2.71317829, 4.08, 4.18604651, 5.19379845, 6.92,
              7.98449612, 8.94, 9.92248062, 9.94, 12.36, 13.48837209])
y = np.array([0.16054661, 0.14643943, 0.11639118, 0.11796543, 0.15609638, 0.29527088,
              0.40774818, 0.51331307, 0.6163489, 0.61807529, 0.78372639, 0.78643515])

popt, pcov = curve_fit(fit_func, x, y)
print(popt)

plt.plot(x, y, '*')
plt.plot(x, fit_func(x, *popt), 'r')
plt.show()
I get the following error:
raise RuntimeError("Optimal parameters not found: " + errmsg)
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
I'm not sure if I have selected the right method.
Suggestions on alternate methods that I could use to estimate the parameters will be really helpful.
The function y(x) = v * x ** n / (k ** n + x ** n) = v / (k ** n * x ** (-n) + 1) is strictly monotonic: it is either increasing or decreasing, never both. That is not a convenient shape for this data (see the sketch below), and it can be one cause of the poor fit.
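Since the figure from the original answer isn't reproduced here, this short sketch (my addition; the Hill-curve parameters are purely illustrative) shows the mismatch between the monotonic model and the data, which dips before it rises:

import numpy as np
import matplotlib.pyplot as plt

x_dense = np.linspace(2.5, 13.5, 200)
plt.plot(x, y, '*', label='data (dips, then rises)')
plt.plot(x_dense, fit_func(x_dense, 0.8, 4.0, 9.0), label='monotonic fit_func (illustrative parameters)')
plt.legend()
plt.show()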
Another possible cause of failure might be the initial values of the parameters v, k, n, which have to be set in order to start the iterative computation for non-linear regression.
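For instance (a hedged sketch, not part of the original answer; the starting values are only illustrative guesses), curve_fit accepts an initial guess via p0 and a larger evaluation budget via maxfev:

p0 = [0.8, 4.0, 9.0]  # illustrative guesses for v, n, k, in the order of fit_func's signature
popt, pcov = curve_fit(fit_func, x, y, p0=p0, maxfev=5000)
print(popt)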
Just looking at the graph of the point distribution, one can see that a cubic function would be more convenient. This is much simpler because the regression is linear and doesn't require an initial guess of the parameters. The fit is very good.
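As an illustration of that suggestion (a sketch I'm adding, using numpy's standard polynomial fit rather than anything from the original answer):

import numpy as np
import matplotlib.pyplot as plt

# ordinary least squares on a cubic polynomial; no starting values needed
coeffs = np.polyfit(x, y, deg=3)
x_dense = np.linspace(min(x), max(x), 200)

plt.plot(x, y, '*')
plt.plot(x_dense, np.polyval(coeffs, x_dense), 'r')
plt.show()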
I'm trying to implement vectorized logistic regression on the Iris dataset, following Andrew Ng's YouTube series on deep learning. My best accuracy with this method has been 81%, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. Also, I can't seem to get proper outputs from my cost function. I suspect the issue is in computing the gradients of the weights and bias with respect to the cost function, though the course provides all of the necessary equations (unless something in the actual exercise, which I don't have access to, is being left out). My code is as follows:
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1, 1))

for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)  # 1 / (1 + e ** (-Z))
    J = -1/m * np.sum(y * np.log(A) + (1 - y) * (1 - np.log(A)))  # cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw
    b = b - 0.01 * db
    if epoch % 100 == 0:
        print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Where as sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization, but even with it turned off the result is still far from the correct values. Any help would be appreciated. Thanks
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem; y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross-entropy cost function expects y to be either one or zero. One way to handle this is the one-vs-all method.
You can handle each class by making y a boolean indicating whether the iris is in that class or not. For example, here you can make y indicate membership in class 1:
y = (iris.target == 1).astype(int)
With that your cost function and gradient descent should work, but you'll need to run it multiple times and pick the best score for each example. Andrew Ng's class talks about this method.
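A rough sketch of that one-vs-all loop (my illustration; train_binary is a hypothetical helper wrapping the same gradient-descent update used below, not something from the course):

import numpy as np
from sklearn import datasets

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def train_binary(X, y_binary, lr=0.1, epochs=1000):
    # X: (m, n) samples; y_binary: (m, 1) zeros/ones for "this class vs the rest"
    m, n = X.shape
    W, b = np.zeros((n, 1)), 0.0
    for _ in range(epochs):
        A = sigmoid(X @ W + b)
        dz = A - y_binary
        W -= lr * (X.T @ dz) / m
        b -= lr * dz.mean()
    return W, b

iris = datasets.load_iris()
X = iris.data
models = []
for c in np.unique(iris.target):  # one binary classifier per class
    y_c = (iris.target == c).astype(int).reshape(-1, 1)
    models.append(train_binary(X, y_c))

# predict the class whose classifier gives the highest probability
scores = np.hstack([sigmoid(X @ W + b) for W, b in models])
pred = scores.argmax(axis=1)
print((pred == iris.target).mean())  # training accuracy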
EDIT:
It's not clear what you are starting with for data. When I do this, I don't reshape the inputs, so you should double-check that all your multiplications deliver the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
Here's code that converges for me using the dataset from sklearn:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Iris is a multiclass problem. Here, just calculate the probability that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a column of 0/1 labels indicating whether each sample is a member of iris_class

# initialize W and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1           # learning rate
m = Y.shape[0]    # number of samples

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

for i in range(1000):
    Z = np.dot(X, W) + b
    A = sigmoid(Z)
    dz = A - Y
    dw = 1/m * np.dot(X.T, dz)
    db = np.mean(dz)
    W -= a * dw
    b -= a * db
    cost = -1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    if i % 100 == 0:
        print(cost)
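As a quick follow-up check (my addition, thresholding the final activations at 0.5):

# classify each sample as iris_class when the predicted probability exceeds 0.5
predictions = (sigmoid(np.dot(X, W) + b) > 0.5).astype(int)
print("training accuracy:", (predictions == Y).mean())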
I created a linear regression algorithm following a tutorial and applied it to the dataset provided, and it works fine. However, the same algorithm does not work on another, similar dataset. Can somebody tell me why this happens?
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost

alpha = 0.01
iters = 1000
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)
On running the algorithm on this dataset I get the output matrix([[ nan, nan]]) and the following warnings:
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars
However, this dataset works just fine and outputs matrix([[-3.24140214, 1.1272942 ]]).
Both datasets are similar. I have been over the code many times but can't figure out why it works on one dataset and not the other. Any help is welcome.
Edit: Thanks Mark_M for editing tips :-)
[Much better question, btw]
It's hard to know exactly what's going on here, but basically your cost is going the wrong direction and spiraling out of control, which results in an overflow when you try to square the value.
I think in your case it boils down to your step size (alpha) being too big, which can cause gradient descent to diverge. You need to watch the cost during gradient descent and make sure it's always going down; if it isn't, either something is broken or alpha is too large.
Personally, I would reevaluate the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example:
import numpy as np
from numpy import genfromtxt

# this is your 'bad' dataset from github
my_data = genfromtxt('testdata.csv', delimiter=',')

def computeCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    for i in range(iters):
        # you don't need the extra loop - this can be vectorized,
        # making it much faster and simpler
        theta = theta - (alpha / len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
        cost = computeCost(X, y, theta)
        if i % 10 == 0:  # just look at the cost every ten loops for debugging
            print(cost)
    return (theta, cost)

# notice the small alpha value
alpha = 0.0001
iters = 100

# here x is a single column
X = my_data[:, 0].reshape(-1, 1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])

# theta is a row vector
theta = np.array([[1.0, 1.0]])

# y is a column vector
y = my_data[:, 1].reshape(-1, 1)

g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
Another useful technique is to normalize your data before doing the regression. This is especially useful when you have more than one feature.
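For example (a sketch I'm adding, not part of the original answer), a simple z-score normalization of the single feature column might look like this:

# z-score normalize the raw feature column before adding the bias column
raw = my_data[:, 0]
X_norm = (raw - raw.mean()) / raw.std()
X = np.hstack([np.ones([X_norm.shape[0], 1]), X_norm.reshape(-1, 1)])
# with normalized features a larger alpha (e.g. 0.01) usually converges comfortably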
As a side note: if your step size is right, you shouldn't get overflows no matter how many iterations you do, because the cost will decrease with every iteration and the rate of decrease will slow.
After 1000 iterations I arrived at a theta and cost of:
[[ 1.03533399 1.45914293]] 56.041973778
after 100:
[[ 1.01166889 1.45960806]] 56.0481988054
You can use this to look at the fit in an iPython notebook:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')
I have a problem fitting with LinearRegressionWithSGD in Spark's MLlib. I used their example for fitting from here https://spark.apache.org/docs/latest/mllib-linear-methods.html (using Python interface).
In their example all features are roughly scaled, with mean around 0 and standard deviation around 1. Now if I un-scale one of them by a factor of 10, the regression breaks (gives NaNs or very large coefficients):
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return LabeledPoint(values[0], values[1:])

data = sc.textFile(spark_home + "data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print "Model coefficients:", str(model)
So I guess I need to do feature scaling. If I pre-scale the features it works (because I'm back to scaled features). However, I now don't know how to get the coefficients in the original space.
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.feature import StandardScalerModel
from numpy import array

# Load and parse the data
def parseToDenseVector(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return Vectors.dense(values[0:])

# Turn a scaled vector back into a labeled point
def parseToLabel(values):
    return LabeledPoint(values[0], values[1:])

data = sc.textFile(spark_home + "data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parseToDenseVector)

scaler = StandardScaler(True, True)
scaler_model = scaler.fit(parsedData)
parsedData_scaled = scaler_model.transform(parsedData)
parsedData_scaled_transformed = parsedData_scaled.map(parseToLabel)

# Build the model
model = LinearRegressionWithSGD.train(parsedData_scaled_transformed)

# Evaluate the model on training data
valuesAndPreds = parsedData_scaled_transformed.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print "Model coefficients:", str(model)
So here I have all the coefficients in the transformed space. Now how do I get back to the original space? I also have scaler_model, which is a StandardScalerModel object, but I can't get either the means or the variances out of it. The only public method this class has is transform, which maps points from the original space to the transformed one; I can't get the reverse.
I just ran into this problem. The models cannot even learn f(x) = x if x is high (>3) in the training data. So terrible.
I think that, rather than scaling the data, another option is to change the step size. This is discussed in SPARK-1859. To paraphrase from there:
The step size should be smaller than 1 over the Lipschitz constant L.
For quadratic loss and GD, the best convergence happens at stepSize = 1/(2L). Spark has a (1/n) multiplier on the loss function.
Let's say you have n = 5 data points and the largest feature value is 1500. So L = 1500 * 1500 / 5. The best convergence happens at stepSize = 1/(2L) = 10 / (1500 ^ 2).
The last equality doesn't even make sense (how did we get a 2 in the numerator?), but I've never heard of a Lipschitz constant before, so I'm not qualified to fix it. Anyway, I think we can just try different step sizes until it starts to work.
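For example (a sketch I'm adding, reusing parsedData from the question; LinearRegressionWithSGD.train accepts a step argument, and the candidate values below are only illustrative):

# sweep a few step sizes and keep the one with the lowest training MSE
best = None
for step in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=step)
    preds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
    mse = preds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
    if best is None or mse < best[1]:
        best = (step, mse)
print(best)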
To rephrase your question, you want to find the intercept I and coefficients C_1 and C_2 that solve the equation: Y = I + C_1 * x_1 + C_2 * x_2 (where x_1 and x_2 are unscaled).
Let i be the intercept that mllib returns. Likewise let c_1 and c_2 be the coefficients (or weights) that mllib returns.
Let m_1 be the unscaled mean of x_1 and m_2 be the unscaled mean of x_2.
Let s_1 be the unscaled standard deviation of x_1 and s_2 be the unscaled standard deviation of x_2.
Then C_1 = (c_1 / s_1), C_2 = (c_2 / s_2), and
I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2
This can easily be extended to 3 input variables:
C_3 = (c_3 / s_3) and I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2 - c_3 * m_3 / s_3
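In general form (a sketch I'm adding, assuming you have the scaled model's weights and intercept plus the per-feature means and standard deviations as numpy arrays; the numbers below are illustrative only):

import numpy as np

def unscale_coefficients(weights_scaled, intercept_scaled, means, stds):
    # weights_scaled: c_1..c_k from the model trained on standardized features
    # means, stds: per-feature mean m_j and standard deviation s_j of the raw data
    weights_orig = weights_scaled / stds                                  # C_j = c_j / s_j
    intercept_orig = intercept_scaled - np.sum(weights_scaled * means / stds)
    return weights_orig, intercept_orig

w, i0 = np.array([0.5, -1.2]), 0.3
m, s = np.array([10.0, 2.0]), np.array([4.0, 0.5])
print(unscale_coefficients(w, i0, m, s))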
As you pointed out, the StandardScalerModel object in pyspark doesn't expose the std and mean attributes. There is an issue tracking this: https://issues.apache.org/jira/browse/SPARK-6523
You can easily calculate them yourself:
import numpy as np
from pyspark.mllib.stat import Statistics

# `features` is the RDD of (unscaled) feature vectors, e.g. parsedData above
summary = Statistics.colStats(features)
mean = summary.mean()
std = np.sqrt(summary.variance())
These are the same mean and std that your scaler uses. You can verify this by peeking at the underlying Java model:
print scaler_model.__dict__.get('_java_model').std()
print scaler_model.__dict__.get('_java_model').mean()