Maximizing variance of bounded variables - statistics
Let x_1, ..., x_i, ..., x_p be p real numbers such that 0 <= x_i <= b for all i.
That is, each x_i can take any value between 0 and b.
I'd like to find the values of {x_i}'s that maximize the variance among them.
Do you have any hints?
I'd like to use this result for my example code.
Or is this question not well-defined?
At first I thought of something like x_1 = 0, x_2 = ... = x_p = b, but then I found that this does not maximize the variance once p gets even moderately large.
Thanks
Following the comments, I did some trials on a numerical check for your problem. There's still some work to do, but I hope it puts you on the right track. Also, I've used Python; I have no idea whether that's OK for you or not, but you can surely find equivalent ways to do it in MATLAB and R.
I use the well-known identity variance = E[X^2] - E[X]^2 to make the derivatives easier (if you have doubts, check the Wikipedia article on variance).
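If you want a quick numerical sanity check of that identity, here is a minimal standalone sketch (using the population variance, which is what np.var computes by default):

import numpy as np

x = np.array([0.0, 3.0, 7.0, 10.0])
print(np.var(x))                      # 14.5, population variance
print(np.mean(x**2) - np.mean(x)**2)  # 14.5, E[X^2] - E[X]^2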
The Python package scipy.optimize has a function minimize for numerically minimizing a function. You can select the algorithm used to solve the problem; I'm not very familiar with all the available algorithms, and I was looking for the well-known plain gradient descent (well, at least I hope you know it). I think the closest match could be SLSQP, which also supports the bounds used below, but honestly I'm not 100% sure of the details.
And finally, I didn't verify that the function being minimized is convex, nor did I work out whether it has local minima, but the results look fine.
I give you the Python code below in case it is useful, but the bottom line is that I'd suggest you:
Choose a language/package you are familiar with
Choose an algorithm for the optimization
It would be nice to prove that the function is convex (so that you know the solver converges)
Set the parameters (b and p) for which you want to run the check
Code below. Hope it helps.
I'm not going to post the algebra for the derivatives; I hope you can work them out yourself. And you must take into account that you are maximizing and not minimizing, so you have to multiply by -1, as explained (I hope quite clearly) here (look for "maximizing").
Setup,
In [1]:
from scipy.optimize import minimize
import numpy as np
The function you are maximizing, that is, the variance (remember the trick E[X^2] - E[X]^2, and the -1),
In [86]:
def func(x):
    return (-1) * (np.mean([xi**2 for xi in x]) - np.mean(x)**2)
The derivative of that function with respect to each x_i of the vector x (I hope you can differentiate it and arrive at the same result),
In [87]:
def func_deriv(x):
    n = len(x)
    l = []
    for i in range(n):
        res = (2 * x[i] / n) - ((2 / (n**2)) * (x[i] + sum([x[j] for j in range(n) if j != i])))
        l += [(-1) * res]
    return np.array(l)
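(For checking purposes, a compact sketch of that algebra, using the population variance over the p values: the objective is (1/n) * sum_j x_j^2 - ((1/n) * sum_j x_j)^2, and its partial derivative with respect to x_i is 2*x_i/n - 2 * ((1/n) * sum_j x_j) * (1/n) = (2*x_i/n) - (2/n^2) * sum_j x_j, which is the res computed above; the final -1 accounts for the maximization.)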
Actually, I made quite a few mistakes when writing this function, both in the derivative itself and in the Python implementation. But there is a trick that helps a lot, which is to check the derivative numerically, by adding and subtracting a small epsilon in every dimension and calculating the slope of the curve (see the Wikipedia article on numerical differentiation). This would be the function that approximates the derivative,
In [72]:
def func_deriv_approx(x, epsilon=0.00001):
    l = []
    for i in range(len(x)):
        x_plus = [x[j] + ((j == i) * epsilon) for j in range(len(x))]
        x_minus = [x[j] - ((j == i) * epsilon) for j in range(len(x))]
        # central-difference approximation of d(func)/dx_i; no extra sign flip is
        # needed here, because func already carries the factor of -1
        res = (func(x_plus) - func(x_minus)) / (2 * epsilon)
        l += [res]
    return l
And then I've checked func_deriv_approx versus func_deriv for a bunch of values.
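In case it is useful, a minimal sketch of how such a comparison could look (the random points and the tolerance are just my choice):

for _ in range(5):
    x_test = np.random.uniform(0, 10, size=6)
    print(np.allclose(func_deriv(x_test), func_deriv_approx(x_test), atol=1e-6))  # expect True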
And the minimization itself. If I initialize the values at the solution we suspect is right, it works fine; it only iterates once and gives the expected result,
In [99]:
res = minimize(func, [0, 0, 10, 10], jac=func_deriv, bounds=[(0,10) for i in range(4)],
method='SLSQP', options={'disp': True})
Optimization terminated successfully. (Exit mode 0)
Current function value: -25.0
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
In [100]:
print(res.x)
[ 0. 0. 10. 10.]
(Note that you could use any length you want, since func and func_deriv are written in a way that accepts any length.)
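As a quick sanity check of that objective value: with half the points at 0 and half at b, the population variance is (b/2)^2, which for b = 10 matches the 25.0 above,

print(np.var([0, 0, 10, 10]))      # 25.0
print(np.var([0]*50 + [10]*50))    # 25.0, the true maximum for p = 100 as well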
You could initialize randomly like this,
In [81]:
import random
xinit = [random.randint(0, 10) for i in range(4)]
In [82]:
xinit
Out[82]:
[1, 2, 8, 7]
And then the maximization is,
In [83]:
res = minimize(func, xinit, jac=func_deriv, bounds=[(0,10) for i in range(4)],
method='SLSQP', options={'disp': True})
Optimization terminated successfully. (Exit mode 0)
Current function value: -25.0
Iterations: 3
Function evaluations: 3
Gradient evaluations: 3
In [84]:
print(res.x)
[ 1.27087156e-13 1.13797860e-13 1.00000000e+01 1.00000000e+01]
Or finally for length = 100,
In [85]:
import random
xinit = [random.randint(0, 10) for i in range(100)]
In [91]:
res = minimize(func, xinit, jac=func_deriv, bounds=[(0,10) for i in range(100)],
method='SLSQP', options={'disp': True})
Optimization terminated successfully. (Exit mode 0)
Current function value: -24.91
Iterations: 23
Function evaluations: 22
Gradient evaluations: 22
In [92]:
print(res.x)
[ 2.49143492e-16 1.00000000e+01 1.00000000e+01 -2.22962789e-16
-3.67692105e-17 1.00000000e+01 -8.83129256e-17 1.00000000e+01
7.41356521e-17 3.45804774e-17 -8.88402036e-17 1.31576404e-16
1.00000000e+01 1.00000000e+01 1.00000000e+01 1.00000000e+01
-3.81854094e-17 1.00000000e+01 1.25586928e-16 1.09703896e-16
-5.13701064e-17 9.47426071e-17 1.00000000e+01 1.00000000e+01
2.06912944e-17 1.00000000e+01 1.00000000e+01 1.00000000e+01
-5.95921560e-17 1.00000000e+01 1.94905365e-16 1.00000000e+01
-1.17250430e-16 1.32482359e-16 4.42735651e-17 1.00000000e+01
-2.07352528e-18 6.31602823e-17 -1.20809001e-17 1.00000000e+01
8.82956806e-17 1.00000000e+01 1.00000000e+01 1.00000000e+01
1.00000000e+01 1.00000000e+01 3.29717355e-16 1.00000000e+01
1.00000000e+01 1.00000000e+01 1.00000000e+01 1.00000000e+01
1.43180544e-16 1.00000000e+01 1.00000000e+01 1.00000000e+01
1.00000000e+01 1.00000000e+01 2.31039883e-17 1.06524134e-16
1.00000000e+01 1.00000000e+01 1.00000000e+01 1.00000000e+01
1.77002357e-16 1.52683194e-16 7.31516095e-17 1.00000000e+01
1.00000000e+01 3.07596508e-17 1.17683979e-16 -6.31665821e-17
1.00000000e+01 2.04530928e-16 1.00276075e-16 -1.20572493e-17
-3.84144993e-17 6.74420338e-17 1.00000000e+01 1.00000000e+01
-9.66066818e-17 1.00000000e+01 7.47080743e-17 4.82924982e-17
1.00000000e+01 -9.42773478e-17 1.00000000e+01 1.00000000e+01
1.00000000e+01 1.00000000e+01 1.00000000e+01 5.01810185e-17
-1.75162038e-17 1.00000000e+01 6.00111991e-17 1.00000000e+01
1.00000000e+01 7.62548028e-17 -6.90706135e-17 1.00000000e+01]
Related
Optimizing Numpy Operations
I am trying to train a multi-class classifier with multinomial logistic regression and gradient descent. Specifically, the model will have a trained weights matrix w with shape (C, D), where C is the number of classes and D is the number of features of each input. Also, we will have a bias vector b with dimension (C,). We have an (N, D) input matrix X, where N is the number of training inputs, and a vector y with shape (N,), where each entry in y is a number from 0 to C - 1, indicating which class the input belongs to. I have written the following code:

for _ in range(max_iterations):
    z = np.apply_along_axis(lambda v: v - max(v), 1, X @ w.T + b)
    probs = np.exp(z)
    denom = np.sum(probs, axis=1)
    for i in range(C):
        for j in range(N):
            if i == y[j]:
                w[i] -= (step_size / N) * ((probs[j][i] / denom[j]) - 1) * X[j]
                b[i] -= (step_size / N) * ((probs[j][i] / denom[j]) - 1)
            else:
                w[i] -= (step_size / N) * (probs[j][i] / denom[j]) * X[j]
                b[i] -= (step_size / N) * (probs[j][i] / denom[j])

This produces the correct weights and bias that I want, but clearly it doesn't take advantage of numpy's operations to speed things up. So I tried to speed some of it up with the following code:

for _ in range(max_iterations):
    z = np.apply_along_axis(lambda v: v - max(v), 1, X @ w.T + b)
    probs = np.exp(z)
    denom = np.sum(probs, axis=1)
    s = np.zeros((N, C))
    for i in range(N):
        s[i] = probs[i] / denom[i]
    for i in range(N):
        s[i][y[i]] += -1
    for c in range(C):
        grad_w = s.T[c] @ X
        w[c] += (step_size / N) * grad_w
        b[c] += (step_size / N) * sum(s.T[c])

I was hoping that this would produce the same results as in the previous part while being faster... and it managed to be faster, but with incorrect results. So I have a couple of questions. First, why is my second piece of code not producing the right results, and what would be a fix for it? Second, and more importantly, how would I optimize this further? This is mainly for me to learn how to take advantage of numpy's vectorized operations.
This may help with some of the iterations. Start with a small 2d array:

In [251]: probs = np.arange(12).reshape(3,4)
In [252]: denom = np.sum(probs, axis=1)
In [253]: denom
Out[253]: array([ 6, 22, 38])

To divide a (3,4) array by a (3,), we need to make the latter (3,1):

In [254]: probs/denom[:,None]
Out[254]:
array([[0.        , 0.16666667, 0.33333333, 0.5       ],
       [0.18181818, 0.22727273, 0.27272727, 0.31818182],
       [0.21052632, 0.23684211, 0.26315789, 0.28947368]])

Read, and reread, the numpy documentation on broadcasting if that doesn't make sense. Another way to get the required 2d denom is:

In [255]: denom = np.sum(probs, axis=1, keepdims=True)
In [256]: denom
Out[256]:
array([[ 6],
       [22],
       [38]])

In [257]: probs/denom
Out[257]:
array([[0.        , 0.16666667, 0.33333333, 0.5       ],
       [0.18181818, 0.22727273, 0.27272727, 0.31818182],
       [0.21052632, 0.23684211, 0.26315789, 0.28947368]])

The same should work for the max subtraction that you use with apply_along_axis. apply_along_axis is not a speed tool, and is not superior to simple iteration.

In [258]: np.max(probs, axis=1, keepdims=True)
Out[258]:
array([[ 3],
       [ 7],
       [11]])

In [259]: probs - _
Out[259]:
array([[-3, -2, -1,  0],
       [-3, -2, -1,  0],
       [-3, -2, -1,  0]])
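To make that last point concrete, here is a rough sketch of how the broadcasting would replace the apply_along_axis call in your code (variable names and shapes as in the question: X is (N, D), w is (C, D), b is (C,)):

scores = X @ w.T + b                              # shape (N, C)
z = scores - scores.max(axis=1, keepdims=True)    # broadcasted max subtraction
probs = np.exp(z)
probs = probs / probs.sum(axis=1, keepdims=True)  # broadcasted normalization, replaces denom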
Error converting covariance to correlation using scipy
I am trying to convert a covariance matrix (from scipy.optimize.curve_fit) to a correlation matrix using the method here: https://math.stackexchange.com/questions/186959/correlation-matrix-from-covariance-matrix My test data is from here: https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html My code is here:

import numpy as np

S = [[1.0, 1.0, 8.1],
     [1.0, 16.0, 18.0],
     [8.1, 18.0, 81.0]]
S = np.array(S)
diag = np.sqrt(np.diag(np.diag(S)))
gaid = np.linalg.inv(diag)
corl = gaid * S * gaid
print(corl)

I was expecting to see

[[1.   0.25 0.9 ]
 [0.25 1.   0.5 ]
 [0.9  0.5  1.  ]]

but instead get

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

I am obviously doing something silly but just not sure what, so all suggestions gratefully received - thanks!
You've probably figured it out by now, but you have to use the @ operator for matrix multiplication in numpy. The * operator does element-wise multiplication. So corl = gaid @ S @ gaid gives the answer you are looking for.
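For completeness, a minimal runnable version of the corrected computation, using the test matrix from the question:

import numpy as np

S = np.array([[1.0, 1.0, 8.1],
              [1.0, 16.0, 18.0],
              [8.1, 18.0, 81.0]])
diag = np.sqrt(np.diag(np.diag(S)))   # diagonal matrix of standard deviations
gaid = np.linalg.inv(diag)
corl = gaid @ S @ gaid                # matrix products, not element-wise
print(corl)
# [[1.   0.25 0.9 ]
#  [0.25 1.   0.5 ]
#  [0.9  0.5  1.  ]]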
Perform matrix multiplication with cosine similarity function
I have two lists:

list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
          ['scent', 'scents', 'aroma', 'smell', 'odor'],
          ['mental_illness', 'mental_disorders', 'bipolar_disorder'],
          ['romance', 'romances', 'romantic', 'budding_romance']]

list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
          ['also', 'like', 'buy', 'perfumes'],
          ['suffer', 'from', 'clinical', 'depression'],
          ['really', 'love', 'my', 'wife']]

I would like to compute the cosine similarity between the two lists above, in such a way that the cosine similarity between the first sub-list in list_1 and all sub-lists of list_2 is measured, then the same thing with the second sub-list in list_1 and all sub-lists in list_2, etc. The goal is to create a len(list_2) by len(list_1) matrix, where each entry is a cosine similarity score. Currently I've done this the following way:

import gensim
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)

similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)

However, I'd like to implement this with matrix multiplication and no for loops. My two questions are:

Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similarity() method to generate the required matrix?

Would it be more efficient and faster using the current method or matrix multiplication?

I hope my question was clear enough; please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block. I've changed the input strings to give more exact word matches, and given the two lists different dimensions to make it a bit clearer:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
          ['scent', 'my', 'aroma', 'smell', 'odor'],
          ['mental_illness', 'mental_disorders', 'bipolar_disorder'],
          ['romance', 'romances', 'romantic', 'budding_romance']]

list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
          ['also', 'like', 'buy', 'perfumes'],
          ['suffer', 'from', 'clinical', 'depression'],
          ['really', 'love', 'my', 'wife'],
          ['flavor', 'taste', 'romantic', 'aroma', 'what']]

cnt = CountVectorizer()

# Combine each sublist into a single str, and join everything into a corpus
combined_lists = ([' '.join(item) for item in list_1] +
                  [' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()

# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1), ]
count_matrix_2 = count_matrix[len(list_1):, ]

match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)

Output of match_matrix:

array([[0, 0, 0, 0, 2],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1]], dtype=int64)

You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on. So the first part of the calculation (the dot product) has been done. Now we need the magnitudes:

magnitudes = np.array([np.linalg.norm(count_matrix[i, :]) for i in range(len(count_matrix))])

Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):

divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes), 1),
                           magnitudes.reshape(1, len(magnitudes)))

Now, since we didn't compare every single sublist, but only the list_1 sublists with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:

divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]

Output:

array([[4.89897949, 4.        , 4.        , 4.        , 4.47213595],
       [5.47722558, 4.47213595, 4.47213595, 4.47213595, 5.        ],
       [4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
       [4.89897949, 4.        , 4.        , 4.        , 4.47213595]])

Now we can calculate the final matrix of cosine similarity scores:

cos_sim = match_matrix / divisor_matrix

Output:

array([[0.       , 0.       , 0.       , 0.       , 0.4472136],
       [0.       , 0.       , 0.       , 0.2236068, 0.2      ],
       [0.       , 0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 0.2236068]])

Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
There are two problems in the code, in the second-to-last and last lines:

import gensim
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)

Answers to your questions:

1. You are already using a direct function to calculate the similarity between two sentences (L1 and L2), which are first converted to two vectors, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't do any kind of matrix multiplication there. If you want to do your own matrix multiplication, then instead of using n_similarity() directly, calculate the vectors of the sentences and then apply matrix multiplication while calculating the cosine similarity.

2. As I said in (1), everything is done inside n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
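To illustrate the suggestion in (1), here is a rough sketch (under the assumption that every word is in the model's vocabulary) of building one mean vector per sub-list and producing the whole similarity matrix with a single matrix product. It mirrors what n_similarity does (cosine similarity of the mean word vectors), so it is worth spot-checking a few entries against model.n_similarity:

def mean_vectors(list_of_sublists, model):
    # one averaged word vector per sub-list
    return np.array([np.mean([model[w] for w in sub], axis=0) for sub in list_of_sublists])

v1 = mean_vectors(list_1, model)                     # shape (len(list_1), 300)
v2 = mean_vectors(list_2, model)                     # shape (len(list_2), 300)
v1 = v1 / np.linalg.norm(v1, axis=1, keepdims=True)  # unit-normalize each row
v2 = v2 / np.linalg.norm(v2, axis=1, keepdims=True)
similarity_mat = v2 @ v1.T                           # shape (len(list_2), len(list_1))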
Affinity Propagation (sklearn) - strange behavior
Trying to use affinity propagation for a simple clustering task:

from sklearn.cluster import AffinityPropagation

c = [[0], [0], [0], [0], [0], [0], [0], [0]]
af = AffinityPropagation(affinity='euclidean').fit(c)
print(af.labels_)

I get this strange result:

[0 1 0 1 2 1 1 0]

I would expect to have all samples in the same cluster, like in this case:

c = [[0], [0], [0]]
af = AffinityPropagation(affinity='euclidean').fit(c)
print(af.labels_)

which indeed puts all samples in the same cluster:

[0 0 0]

What am I missing? Thanks
I believe this is because your problem is essentially ill-posed (you pass lots of the same point to an algorithm which is trying to find similarity between different points). AffinityPropagation is doing matrix math under the hood, and your similarity matrix (which is all zeros) is nastily degenerate. In order to not error out, the implementation adds a small random matrix to the similarity matrix, preventing the algorithm from quitting when it encounters two of the same point.
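To see the degeneracy being described, you can look at the negative squared euclidean similarities that the first example starts from; a minimal sketch (and I believe this is, up to that small random jitter, the matrix the algorithm works with):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

c = np.array([[0], [0], [0], [0], [0], [0], [0], [0]])
S = -euclidean_distances(c, squared=True)   # similarity matrix: all zeros
print(np.all(S == 0))                       # True: every point is indistinguishable from every other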
Sci-kit learn pairwise_distances is imprecise?
The scikit-learn function pairwise_distances provides the distance matrix from an array X. However, for some inputs the results seem not to be precise. Example:

from sklearn.metrics.pairwise import pairwise_distances

X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.55215782]]
print pairwise_distances(X)

gives the following output:

[[ 0.  0.]
 [ 0.  0.]]

although there is a distance of 0.00000002. 2nd example:

X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]

gives

[[ 0.00000000e+00  2.10734243e-08]
 [ 2.10734243e-08  0.00000000e+00]]

Here there is a distance, but it is only correct up to the first digit. For my application it is undesirable if the output can be zero although there is a distance. Is there a good way to increase the precision?
I didn't dig into why scikit-learn gives such an imprecise result, but it seems scipy gives better precision. Try this:

from scipy.spatial.distance import pdist, squareform
squareform(pdist(X))

For example,

X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]

gives

array([[ 0.00000000e+00,  2.10000000e-08],
       [ 2.10000000e-08,  0.00000000e+00]])
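As a side note, I believe the imprecision comes from scikit-learn computing euclidean distances via the expansion ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2, which is fast but suffers from cancellation when the points are nearly identical; computing the differences directly keeps full precision. A minimal sketch:

import numpy as np

X = np.array([[-0.903858372568, -0.5521578],
              [-0.903858372568, -0.552157821]])

diffs = X[:, None, :] - X[None, :, :]        # pairwise differences, shape (2, 2, 2)
dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise euclidean distances from the differences
print(dists)                                 # off-diagonal entries are 2.1e-08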