How to vectorize a function of two matrices in numpy? - python-3.x

Say, I have a binary (adjacency) matrix A of dimensions nxn and another matrix U of dimensions nxl. I use the following piece of code to compute a new matrix that I need.
import numpy as np
from numpy import linalg as LA
new_U = np.zeros_like(U)
for idx, a in np.ndenumerate(A):
diff = U[idx[0], :] - U[idx[1], :]
if a == 1.0:
new_U[idx[0], :] += 2 * diff
elif a == 0.0:
norm_diff = LA.norm(U[idx[0], :] - U[idx[1], :])
new_U[idx[0], :] += -2 * diff * np.exp(-norm_diff**2)
return new_U
This takes quite a lot of time to run even when n and l are small. Is there a better way to rewrite (vectorize) this code to reduce the runtime?
Edit 1: Sample input and output.
A = np.array([[0,1,0], [1,0,1], [0,1,0]], dtype='float64')
U = np.array([[2,3], [4,5], [6,7]], dtype='float64')
new_U = np.array([[-4.,-4.], [0,0],[4,4]], dtype='float64')
Edit 2: In mathematical notation, I am trying to compute the following:
where u_ik = U[i, k],u_jk = U[j, k], and u_i = U[i, :]. Also, (i,j) \in E corresponds to a == 1.0 in the code.

Leveraging broadcasting and np.einsum for the sum-reductions -
# Get pair-wise differences between rows for all rows in a vectorized manner
Ud = U[:,None,:]-U
# Compute norm L1 values with those differences
L = LA.norm(Ud,axis=2)
# Compute 2 * diff values for all rows and mask it with ==0 condition
# and sum along axis=1 to simulate the accumulating behaviour
p1 = np.einsum('ijk,ij->ik',2*Ud,A==1.0)
# Similarly, compute for ==1 condition and finally sum those two parts
p2 = np.einsum('ijk,ij,ij->ik',-2*Ud,np.exp(-L**2),A==0.0)
out = p1+p2
Alternatively, use einsum for computing squared-norm values and using those to get p2 -
Lsq = np.einsum('ijk,ijk->ij',Ud,Ud)
p2 = np.einsum('ijk,ij,ij->ik',-2*Ud,np.exp(-Lsq),A==0.0)

Related

Numpy.linalg.eig is giving different results than numpy.linalg.eigh for Hermitian matrices

I have one hermitian matrix (specifically, a Hamiltonian). Though phase of a singe eigenvector can be arbitrary, the quantities I am calculating is physical (I reduced the code a bit keeping just the reproducible part). eig and eigh are giving very different results.
import numpy as np
import numpy.linalg as nlg
import matplotlib.pyplot as plt
def Ham(Ny, Nx, t, phi):
h = np.zeros((Ny,Ny), dtype=complex)
for ii in range(Ny-1):
h[ii+1,ii] = t
h[Ny-1,0] = t
h=h+np.transpose(np.conj(h))
u = np.zeros((Ny,Ny), dtype=complex)
for ii in range(Ny):
u[ii,ii] = -t*np.exp(-2*np.pi*1j*phi*ii)
u = u + 1e-10*np.eye(Ny)
H = np.kron(np.eye(Nx,dtype=int),h) + np.kron(np.diag(np.ones(Nx-1), 1),u) + np.kron(np.diag(np.ones(Nx-1), -1),np.transpose(np.conj(u)))
H[0:Ny,Ny*(Nx-1):Ny*Nx] = np.transpose(np.conj(u))
H[Ny*(Nx-1):Ny*Nx,0:Ny] = u
x=[]; y=[];
for jj in range (1,Nx+1):
for ii in range (1,Ny+1):
x.append(jj); y.append(ii)
x = np.asarray(x)
y = np.asarray(y)
return H, x, y
def C_num(Nx, Ny, E, t, phi):
H, x, y = Ham(Ny, Nx, t, phi)
ifhermitian = np.allclose(H, np.transpose(np.conj(H)), rtol=1e-5, atol=1e-8)
assert ifhermitian == True
Hp = H
V,wf = nlg.eigh(Hp) ##Check. eig gives different result
idx = np.argsort(np.real(V))
wf = wf[:, idx]
normmat = wf*np.conj(wf)
norm = np.sqrt(np.sum(normmat, axis=0))
wf = wf/(norm*np.sqrt(len(H)))
wf = wf[:, V<=E] ##Chose a subset of eigenvectors
V01 = wf*np.exp(1j*x)[:,None]; V12 = wf*np.exp(1j*y)[:,None]
V23 = wf*np.exp(1j*x)[:,None]; V30 = wf*np.exp(1j*y)[:,None]
wff = np.transpose(np.conj(wf))
C01 = np.dot(wff,V01); C12 = np.dot(wff,V12); C23 = np.dot(wff,V23); C30 = np.dot(wff,V30)
F = nlg.multi_dot([C01,C12,C23,C30])
ifhermitian = np.allclose(F, np.transpose(np.conj(F)), rtol=1e-5, atol=1e-8)
assert ifhermitian == True
evals, efuns = nlg.eig(F) ##Check eig gives different result
C = (1/(2*np.pi))*np.sum(np.angle(evals));
return C
C = C_num(16, 16, 0, 1, 1/8)
print(C)
Changing both nlg.eigh to nlg.eig, or even changing only the last one, giving very different results.
As I mentioned elsewhere, the eigenvalue and eigenvector are not unique.
The only thing that is true is that for each eigenvalue $A v = lambda v$, the two matrices returned by eig and eigh describe those solutions, it is natural that eig inexact but approximate results.
You can see that both the solutions will triangularize your matrix in different ways
H, x, y = Ham(16, 16, 1, 1./8)
D, V = nlg.eig(H)
Dh, Vh = nlg.eigh(H)
Then
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 7))
plt.subplot(121);
plt.imshow(abs(np.conj(Vh.T) # H # Vh))
plt.title('diagonalized with eigh')
plt.subplot(122);
plt.imshow(abs(np.conj(V.T) # H # V))
plt.title('diagonalized with eig')
Plots this
That both diagonalizations were successfull, but the eigenvalues are indifferent order.
If you sort the eigenvalues you see they match
plt.plot(np.diag(np.real(np.conj(Vh.T) # H # Vh)))
plt.plot(np.diag(np.imag(np.conj(Vh.T) # H # Vh)))
plt.plot(np.sort(np.diag(np.real(np.conj(V.T) # H # V))))
plt.title('eigenvalues')
plt.legend(['real eigh', 'imag eigh', 'sorted real eig'], loc='upper left')
Since many eigenvalues are repeated, the eigenvector associated with a given eigenvalue is not unique as well, the only thing we can guarantee is that the eigenvectors for a given eigenvalue must span the same subspace.
The diagonalization test is the best in my opinion.
Is eigh always better than eig?
If you search for the eigenvalues in the lapack routines you will have many options. So it is I cannot discuss each possible implementation here. The common sense says that we can expect that the symmetric/hermitian routines to perform better, otherwise ther would be no reason to add one more routine that is more limited. But I never tested carefully the behavior of eig vs eigh.
To have an intuition compare the equation for tridiagonalization for symmetric matrices, and the equation for reduction of a general matrix to its Heisenberg form found here.

Speed Up a for Loop - Python

I have a code that works perfectly well but I wish to speed up the time it takes to converge. A snippet of the code is shown below:
def myfunction(x, i):
y = x + (min(0, target[i] - data[i, :]x))*data[i]/(norm(data[i])**2))
return y
rows, columns = data.shape
start = time.time()
iterate = 0
iterate_count = []
norm_count = []
res = 5
x_not = np.ones(columns)
norm_count.append(norm(x_not))
iterate_count.append(0)
while res > 1e-8:
for row in range(rows):
y = myfunction(x_not, row)
x_not = y
iterate += 1
iterate_count.append(iterate)
norm_count.append(norm(x_not))
res = abs(norm_count[-1] - norm_count[-2])
print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))
I am relatively new in Python. I will appreciate any hint/assistance.
Ax=b is the problem we wish to solve. Here, 'A' is the 'data' and 'b' is the 'target'
Ugh! After spending a while on this I don't think it can be done the way you've set up your problem. In each iteration over the row, you modify x_not and then pass the updated result to get the solution for the next row. This kind of setup can't be vectorized easily. You can learn the thought process of vectorization from the failed attempt, so I'm including it in the answer. I'm also including a different iterative method to solve linear systems of equations. I've included a vectorized version -- where the solution is updated using matrix multiplication and vector addition, and a loopy version -- where the solution is updated using a for loop to demonstrate what you can expect to gain.
1. The failed attempt
Let's take a look at what you're doing here.
def myfunction(x, i):
y = x + (min(0, target[i] - data[i, :] # x)) * (data[i] / (norm(data[i])**2))
return y
You subtract
the dot product of (the ith row of data and x_not)
from the ith row of target,
limited at zero.
You multiply this result with the ith row of data divided my the norm of that row squared. Let's call this part2
Then you add this to the ith element of x_not
Now let's look at the shapes of the matrices.
data is (M, N).
target is (M, ).
x_not is (N, )
Instead of doing these operations rowwise, you can operate on the entire matrix!
1.1. Simplifying the dot product.
Instead of doing data[i, :] # x, you can do data # x_not and this gives an array with the ith element giving the dot product of the ith row with x_not. So now we have data # x_not with shape (M, )
Then, you can subtract this from the entire target array, so target - (data # x_not) has shape (M, ).
So far, we have
part1 = target - (data # x_not)
Next, if anything is greater than zero, set it to zero.
part1[part1 > 0] = 0
1.2. Finding rowwise norms.
Finally, you want to multiply this by the row of data, and divide by the square of the L2-norm of that row. To get the norm of each row of a matrix, you do
rownorms = np.linalg.norm(data, axis=1)
This is a (M, ) array, so we need to convert it to a (M, 1) array so we can divide each row. rownorms[:, None] does this. Then divide data by this.
part2 = data / (rownorms[:, None]**2)
1.3. Add to x_not
Finally, we're adding each row of part1 * part2 to the original x_not and returning the result
result = x_not + (part1 * part2).sum(axis=0)
Here's where we get stuck. In your approach, each call to myfunction() gives a value of part1 that depends on target[i], which was changed in the last call to myfunction().
2. Why vectorize?
Using numpy's inbuilt methods instead of looping allows it to offload the calculation to its C backend, so it runs faster. If your numpy is linked to a BLAS backend, you can extract even more speed by using your processor's SIMD registers
The conjugate gradient method is a simple iterative method to solve certain systems of equations. There are other more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.
Given your system
data # x_not = target
Let's define some variables:
A = data.T # data
b = data.T # target
And we'll solve the system A # x = b
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
p = resid
while (np.abs(resid) > tolerance).any():
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
x = x + alpha * p
resid_new = resid - alpha * Ap
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
To contrast the fully vectorized approach with one that uses iterations to update the rows of x and resid_new, let's define another implementation of the CG solver that does this.
def solve_loopy(data, target, itermax = 100, tolerance = 1e-8):
A = data.T # data
b = data.T # target
rows, columns = data.shape
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
resid_new = b - A # x
p = resid
niter = 0
while (np.abs(resid) > tolerance).any() and niter < itermax:
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
for i in range(len(x)):
x[i] = x[i] + alpha * p[i]
resid_new[i] = resid[i] - alpha * Ap[i]
# resid_new = resid - alpha * A # p
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
niter += 1
return x
And our original vector method:
def solve_vect(data, target, itermax = 100, tolerance = 1e-8):
A = data.T # data
b = data.T # target
rows, columns = data.shape
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
resid_new = b - A # x
p = resid
niter = 0
while (np.abs(resid) > tolerance).any() and niter < itermax:
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
x = x + alpha * p
resid_new = resid - alpha * Ap
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
niter += 1
return x
Let's solve a simple system to see if this works first:
2x1 + x2 = -5
−x1 + x2 = -2
should give a solution of [-1, -3]
data = np.array([[ 2, 1],
[-1, 1]])
target = np.array([-5, -2])
print(solve_loopy(data, target))
print(solve_vect(data, target))
Both give the correct solution [-1, -3], yay! Now on to bigger things:
data = np.random.random((100, 100))
target = np.random.random((100, ))
Let's ensure the solution is still correct:
sol1 = solve_loopy(data, target)
np.allclose(data # sol1, target)
# Output: False
sol2 = solve_vect(data, target)
np.allclose(data # sol2, target)
# Output: False
Hmm, looks like the CG method doesn't work for badly conditioned random matrices we created. Well, at least both give the same result.
np.allclose(sol1, sol2)
# Output: True
But let's not get discouraged! We don't really care if it works perfectly, the point of this is to demonstrate how amazing vectorization is. So let's time this:
import timeit
timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784
timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722
Nice! A ~2x speedup simply by avoiding a loop while updating our solution!
For larger systems, this will be even better.
for N in [10, 50, 100, 500, 1000]:
data = np.random.random((N, N))
target = np.random.random((N, ))
t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
print(N, t_loopy, t_vect, t_loopy/t_vect)
This gives us:
N t_loopy t_vect speedup
00010 0.002823 0.002099 1.345390
00050 0.051209 0.014486 3.535048
00100 0.260348 0.114601 2.271773
00500 0.980453 0.240151 4.082644
01000 1.769959 0.508197 3.482822

How to draw from two different gaussians conditionally on some Bernoulli distribution?

I have two gaussian distributions (I'm using multivariate_normal) and I'd like to draw from them with probability of p for the first gaussian and 1-p for the other one. I'd like to make n draws.
Is it possible to do that without a for loop? (for efficiency purposes)
Thanks
Yes, it is possible to perform this operation without a loop. Try:
import numpy as np
from scipy import stats
sample_size = 100
p = 0.25
# Flip a coin with P(HEADS) = p to determine which distribution to draw from
indicators = stats.bernoulli.rvs(p, size=sample_size)
# Draw from N(0, 1) w/ probability p and N(-1, 1) w/ probability (1-p)
draws = (indicators == 1) * np.random.normal(0, 1, size=sample_size) + \
(indicators == 0) * np.random.normal(-1, 1, size=sample_size)
You can accomplish the same thing using np.vectorize (caveat emptor):
def draw(x):
if x == 0:
return np.random.normal(-1, 1)
elif x == 1:
return np.random.normal(0, 1)
draw_vec = np.vectorize(draw)
draws = draw_vec(indicators)
If you need to extend the solution to a mixture of more than 2 distributions, you can use np.random.multinomial to assign samples to distributions and add additional cases to the if/else in draw.

Neural network numerical gradient check not working with matrices using Python-numpy

I'm trying to implement a simple numerical gradient check using Python 3 and numpy to be used for neural network.
It works well for simple 1D functions but fails when applied to matrices of parameters.
My guess is that either my cost function is not calculated well for a matrix or that the way I do the numerical gradient check is wrong somehow.
See code below and thanks for your help!
import numpy as np
import random
import copy
def gradcheck_naive(f, x):
""" Gradient check for a function f.
Arguments:
f -- a function that takes a single argument (x) and outputs the
cost (fx) and its gradients grad
x -- the point (numpy array) to check the gradient at
"""
rndstate = random.getstate()
random.setstate(rndstate)
fx, grad = f(x) # Evaluate function value at original point
#fx=cost
#grad=gradient
h = 1e-4
# Iterate over all indexes in x
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
ix = it.multi_index #multi-index number
random.setstate(rndstate)
xp = copy.deepcopy(x)
xp[ix] += h
fxp, gradp = f(xp)
random.setstate(rndstate)
xn = copy.deepcopy(x)
xn[ix] -= h
fxn, gradn = f(xn)
numgrad = (fxp-fxn) / (2*h)
# Compare gradients
reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
if reldiff > 1e-5:
print ("Gradient check failed.")
print ("First gradient error found at index %s" % str(ix))
print ("Your gradient: %f \t Numerical gradient: %f" % (
grad[ix], numgrad))
return
it.iternext() # Step to next dimension
print ("Gradient check passed!")
#sanity check with 1D function
exp_f = lambda x: (np.sum(np.exp(x)), np.exp(x))
gradcheck_naive(exp_f, np.random.randn(4,5)) #this works fine
#sanity check with matrices
#forward pass
W = np.random.randn(5,10)
x = np.random.randn(10,3)
D = W.dot(x)
#backpropagation pass
gradx = W
func_f = lambda x: (np.sum(W.dot(x)), gradx)
gradcheck_naive(func_f, np.random.randn(10,3)) #this does not work (grad check fails)
I figured it out! (my math teacher would be so proud...)
The short answer is that I was mixing up matrices dot product and element wise product.
When using an element wise product, the gradient is equal to:
W = np.array([[2,4],[3,5],[3,1]])
x = np.array([[1,7],[5,-1],[4,7]])
D = W*x #element-wise multiplication
gradx = W
func_f = lambda x: (np.sum(W*x), gradx)
gradcheck_naive(func_f, np.random.randn(3,2))
When using the dot product, the gradient becomes:
W = np.array([[2,4],[3,5]]))
x = np.array([[1,7],[5,-1],[5,1]])
D = x.dot(W)
unitary = np.array([[1,1],[1,1],[1,1]])
gradx = unitary.dot(np.transpose(W))
func_f = lambda x: (np.sum(x.dot(W)), gradx)
gradcheck_naive(func_f, np.random.randn(3,2))
I was also wondering how did the element wise product behave with matrices of not equal dimensions like below:
x = np.random.randn(10)
W = np.random.randn(3,10)
D1 = x*W
D2 = W*x
Turns out that D1=D2 (same dimension as W=3x10) and my understanding is that x is being broadcasted by numpy to be a 3x10 matrix to allow the element wise multiplication.
Conclusion: when in doubt, write it out with small matrices to figure out where the error is.

K means with a condition

I want to apply K means ( or any other simple clustering algorithm ) to data with two variables, but i want clusters to respect a condition : the sum of a third variable per cluster > some_value.
Is that possible?
Notations :
- K is the number of clusters
- let's say that the first two variables are point coordinnates (x,y)
- V denotes the third variable
- Ci : the sum of V over each cluster i
- S the total sum (sum Ci)
- and the threshold T
Problem definition :
From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.
Task 1 - group points by proximity to centroids [kmeans]
Task 2 - for each cluster i, Ci > T* [constraint]
Regular kmeans limitation for the constraint problem :
A regular kmeans, assign points to centroids by taking them in arbitrary order. In our case, this will lead to uncontrol growth of the Ci while adding points.
For exemple, with K=2, T=40 and 4 points with the third variables equal to V1=50, V2=1, V3=50, V4=50.
Suppose also that point P1, P3, P4 are closer to centroid 1. Point P2 is closer to centroid 2.
Let's run the assignement step of a regular kmeans and keep track of Ci :
1-- take point P1, assign it to cluster 1. C1=50 > T
2-- take point P2, assign it to cluster 2 C2=1
3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
Modified kmeans :
In the previous, we want to prevent C1 from growing too much and help C2 grow.
This is like pouring champagne into several glasses : if you see a glass with less champaigne, you go and fill it. You do that because you have constraints : limited amound of champaigne (S is bounded) and because you want every glass to have enough champaign (Ci>T).
Of course this is just a analogy. Our modified kmeans will add new poins to the cluster with minimal Ci until the constraint is achieved (Task2). Now in which order should we add points ? By proximity to centroids (Task1). After all constraints are achieved for all cluster i, we can just run a regular kmeans on remaining unassigned points.
Implementation :
Next, I give a python implementation of the modified algorithm. Figure 1 displays a reprensentation of the third variable using transparency for vizualizing large VS low values. Figure 2 displays the evolution clusters using color.
You can play with the accept_thresh parameter. In particular, note that :
For accept_thresh=0 => regular kmeans (constraint is reached immediately)
For accept_thresh = third_var.sum().sum() / (2*K), you might observe that some points that closer to a given centroid are affected to another one for constraint reasons.
CODE :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import time
nb_samples = 1000
K = 3 # for demo purpose, used to generate cloud points
c_std = 1.2
# Generate test samples :
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \
centers=K, cluster_std=c_std)
third_var_distribution = 'cubic_bycluster' # 'uniform'
if third_var_distribution == 'uniform':
third_var = np.random.random((nb_samples))
elif third_var_distribution == 'linear_bycluster':
third_var = np.random.random((nb_samples))
third_var = third_var * classes
elif third_var_distribution == 'cubic_bycluster':
third_var = np.random.random((nb_samples))
third_var = third_var * classes
# Threshold parameters :
# Try with K=3 and :
# T = K => one cluster reach cosntraint, two clusters won't converge
# T = 2K =>
accept_thresh = third_var.sum().sum() / (2*K)
def dist2centroids(points, centroids):
'''return arrays of ordered points to each centroids
first array is index of points
second array is distance to centroid
dim 0 : centroid
dim 1 : distance or point index
'''
dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
ord_dist_indices = np.argsort(dist, axis=1)
ord_dist_indices = ord_dist_indices.transpose()
dist = dist.transpose()
return ord_dist_indices, dist
def assign_points_with_constraints(inds, dists, tv, accept_thresh):
assigned = [False] * nb_samples
assignements = np.ones(nb_samples, dtype=int) * (-1)
cumul_third_var = np.zeros(K, dtype=float)
current_inds = np.zeros(K, dtype=int)
max_round = nb_samples * K
for round in range(0, max_round): # we'll break anyway
# worst advanced cluster in terms of cumulated third_var :
cluster = np.argmin(cumul_third_var)
if cumul_third_var[cluster] > accept_thresh:
continue # cluster had enough samples
while current_inds[cluster] < nb_samples:
# add points to increase cumulated third_var on this cluster
i_inds = current_inds[cluster]
closest_pt_index = inds[i_inds][cluster]
if assigned[closest_pt_index] == True:
current_inds[cluster] += 1
continue # pt already assigned to a cluster
assignements[closest_pt_index] = cluster
cumul_third_var[cluster] += tv[closest_pt_index]
assigned[closest_pt_index] = True
current_inds[cluster] += 1
new_cluster = np.argmin(cumul_third_var)
if new_cluster != cluster:
break
return assignements, cumul_third_var
def assign_points_with_kmeans(points, centroids, assignements):
new_assignements = np.array(assignements, copy=True)
count = -1
for asg in assignements:
count += 1
if asg > -1:
continue
pt = points[count, :]
distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
centroid = np.argmin(distances)
new_assignements[count] = centroid
return new_assignements
def move_centroids(points, labels):
centroids = np.zeros((K, 2), dtype=float)
for k in range(0, K):
centroids[k] = points[assignements == k].mean(axis=0)
return centroids
rgba_colors = np.zeros((third_var.size, 4))
rgba_colors[:, 0] = 1.0
rgba_colors[:, 3] = 0.1 + (third_var / max(third_var))/1.12
plt.figure(1, figsize=(14, 14))
plt.title("Three blobs", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)
# Initialize centroids
centroids = np.random.random((K, 2)) * 10
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
# Step 1 : order points by distance to centroid :
inds, dists = dist2centroids(points, centroids)
# Check if clustering is theoriticaly possible :
tv_sum = third_var.sum()
tv_max = third_var.max()
if (tv_max > 1 / 3 * tv_sum):
print("No solution to the clustering problem !\n")
print("For one point : third variable is too high.")
sys.exit(0)
stop_criter_eps = 0.001
epsilon = 100000
prev_cumdist = 100000
plt.figure(2, figsize=(14, 14))
ln, = plt.plot([])
plt.ion()
plt.show()
while epsilon > stop_criter_eps:
# Modified kmeans assignment :
assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)
# Kmeans on remaining points :
assignements = assign_points_with_kmeans(points, centroids, assignements)
centroids = move_centroids(points, assignements)
inds, dists = dist2centroids(points, centroids)
epsilon = np.abs(prev_cumdist - dists.sum().sum())
print("Delta on error :", epsilon)
prev_cumdist = dists.sum().sum()
plt.clf()
plt.title("Current Assignements", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
for k in range(0, K):
plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
plt.show()
plt.pause(1)
Improvements :
- use the distribution of the third variable for assignments.
- manage divergence of the algorithm
- better initialization (kmeans++)
One way to handle this would be to filter the data before clustering.
>>> cluster_data = df.loc[df['third_variable'] > some_value]
>>> from sklearn.cluster import KMeans
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)
If by sum you mean the sum of the third variable per cluster then you could use RandomSearchCV to find hyperparameters of KMeans that do or do not meet the condition.
K-means itself is an optimization problem.
Your additional constraint is a rather common optimization constraint, too.
So I'd rather approach this with an optimization solver.

Resources