updating centroids in k-means Python

I'm implementing the k-means algorithm in Python and I got stuck on the part where we are supposed to update the centroids.
I have created something that works, but it's really not Pythonic.
I know it can be written better and would love some suggestions, for example on how to improve the histogram that counts how many points are assigned to each centroid.
Here is my code:
import numpy as np

def updateCentroids(centroids, pixelList):
    k = len(centroids)
    centroidsCount = [0]*k  # counts how many pixels are classified to each centroid
    centroidsSum = np.zeros([k, 3])  # sum of the points assigned to each centroid
    for pixel in pixelList:
        index = 0
        # find which centroid the pixel's classification equals
        for centroid in centroids:
            if np.array_equal(pixel.classification, centroid):
                centroidsCount[index] += 1
                centroidsSum[index] += pixel.point
                break
            index += 1
    index = 0
    for centroid in centroidsSum:
        centroids[index] = centroid/centroidsCount[index]
        index += 1
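One possible way to shorten the counting and summing is sketched below. It is only a sketch under an assumption that differs from the code above: each point carries an integer label (the index of its centroid) instead of a copy of the centroid itself. The names `update_centroids`, `points` and `labels` are hypothetical.

```python
import numpy as np

def update_centroids(centroids, points, labels):
    """Return new centroids as the mean of the points assigned to each one.

    points: (n, d) array; labels: (n,) integer array with values in [0, k).
    Both are hypothetical inputs; the question stores them on pixel objects.
    """
    centroids = np.asarray(centroids, dtype=float)
    k = len(centroids)
    counts = np.bincount(labels, minlength=k)   # histogram of assignments
    sums = np.zeros_like(centroids)
    np.add.at(sums, labels, points)             # per-centroid sum of points
    new_centroids = centroids.copy()
    nonempty = counts > 0                       # keep the old centroid if nothing was assigned
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids

points = np.array([[0, 0, 0], [2, 2, 2], [10, 10, 10]])
labels = np.array([0, 0, 1])
old = np.array([[1.0, 0.0, 0.0], [9.0, 9.0, 9.0], [5.0, 5.0, 5.0]])
new = update_centroids(old, points, labels)
```

Keeping the old centroid for an empty cluster is a design choice made here to avoid dividing by zero; the original loop would divide by a zero count in that case.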

Related

What is the logic behind this index in Kruskal's minimum spanning tree algorithm?

I don't understand why we increment e (e += 1) when the parents are not the same. And why does the while loop stop based on e's value? Why do we need that index?
def kruskal(self):
    i, e = 0, 0
    ds = dst.disjointSet(self.nodes)
    self.graph = sorted(self.graph, key=lambda graph: graph[2])
    while e < self.v - 1:  # vertices start from zero that's why -1
        s, d, w = self.graph[i]
        i += 1
        x = ds.findParent(s)
        y = ds.findParent(d)
        if x != y:
            e += 1
            self.MST.append([s, d, w])
            ds.union(x, y)
    self.printSolution()
ds is a disjointSet object, which provides the findParent and union methods.

The variable e, which I'd rather call edges, counts the edges added to the tree so far. What is the minimum number of edges needed to connect n vertices? You can easily prove that it is n - 1, so we iterate until we have collected the n - 1 edges needed to connect all vertices. We check x != y to make sure that adding the edge does not create a cycle.
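The stopping rule can be seen in a small standalone version. This is a minimal sketch, not the asker's class: the function `kruskal`, its arguments and the inline union-find are hypothetical, and the graph is assumed to be connected with vertices numbered from 0.

```python
def kruskal(num_vertices, edges):
    """Return the MST edge list for a connected graph with vertices 0..num_vertices-1.

    edges is a list of (source, dest, weight) tuples.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the set representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for s, d, w in sorted(edges, key=lambda edge: edge[2]):
        if len(mst) == num_vertices - 1:   # a spanning tree has exactly n - 1 edges
            break
        x, y = find(s), find(d)
        if x != y:                         # different sets: this edge cannot close a cycle
            parent[x] = y
            mst.append((s, d, w))
    return mst

edges = [(0, 1, 4), (0, 2, 1), (1, 2, 2), (1, 3, 5), (2, 3, 8)]
mst = kruskal(4, edges)
```

Here `len(mst)` plays the role of e: it only grows when an edge is accepted, so the loop can stop before looking at the remaining, heavier edges.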

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each holding an X,Y location in 2D space. A window of length 10 moves through all 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
For each window of 10 points/rows, and for every diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than that diameter we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
Once the loop has gone through all the diameters in r_test for these first 10 points, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, so they are set to np.inf to be distinguished later.
Then the window moves one point down the rows and the process repeats until S_wind is complete.
What's a potentially better algorithm to solve this in Python 3.x than the one I'm using?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range (10,len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10 ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range (ii-10,ii):
            for j in range(ii-10,ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2+ (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 2
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops compare ten 2D points against each other in pairs. That's a natural fit for a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
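To see that the broadcast comparison counts the same pairs as the nested loops, here is a small self-contained check on a single window. It is only a sketch: the random window, the seed and the names `counts_loop` and `counts_vec` are made up for illustration, and it uses the per-radius count (H reset for every radius, so no cumsum and no scaling).

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.integers(2000, 6000, size=(10, 2)).astype(float)  # one window of N = 10 points
r_test = np.arange(80, 800, 5)
N = len(points)

# Loop version: for each radius, count ordered pairs (i, j) closer than that radius.
counts_loop = np.zeros(len(r_test), dtype=int)
for ind, radius in enumerate(r_test):
    for i in range(N):
        for j in range(N):
            dx, dy = points[i] - points[j]
            if dx * dx + dy * dy < radius * radius:   # compare squared distances
                counts_loop[ind] += 1

# Broadcast version: all pairwise differences at once, then all radii at once.
diff = points[None, :, :] - points[:, None, :]                   # shape (10, 10, 2)
sq_dist = (diff * diff).sum(axis=2)                              # shape (10, 10)
within = sq_dist[None, :, :] < (r_test * r_test)[:, None, None]  # shape (len(r_test), 10, 10)
counts_vec = within.sum(axis=(1, 2))
```

Both versions include the i == j pairs, so every count is at least 10; whether those self-pairs should be counted is a modelling decision the original code leaves open.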

Fastest way to determine if two points are closest to one another

My problem is the following: I am given two pairs of angles (in spherical coordinates), each consisting of two parts: an azimuth and a colatitude angle. If we extend each pair outward (increasing its radius without bound) to make a long line pointing in the direction given by that pair of angles, then my goal is to determine
if they intersect or come extremely close to one another, and
where exactly they intersect.
Currently, I have tried several methods:
The most obvious one is to iteratively compare each pair of radii until there is either a match or a small enough distance between the two. (When I say compare radii, I am referring to converting each spherical coordinate into Cartesian and then finding the Euclidean distance between the two.) However, the runtime is $O(n^{2})$, which is extremely slow if I am trying to scale this program.
The second most obvious method is to use the optimization package to find this distance. Unfortunately, I cannot use the optimization package iteratively, and after one instance the optimization algorithm repeats the same answer, which is not useful.
The least obvious method is to directly calculate (using calculus) the exact radii from the angles. While this is a fast method, it is not very accurate.
Note: while it might seem simple that the intersection is always at the zero-origin (0,0,0), this is not ALWAYS the case. Some points never intersect.
Code for Method (1)
import numpy as np

def match1(azimuth_recon_1, colatitude_recon_1, azimuth_recon_2, colatitude_recon_2, centroid_1, centroid_2):
    # Constants: tolerance factor and extremely large distance
    tol = 3e-2
    prevDist = 99999999
    # Initialize a list of radii to loop through
    # Checking iteratively for a solution
    for r1 in list(np.arange(0,5,tol)):
        for r2 in list(np.arange(0,5,tol)):
            # Get the estimates
            estimate_1 = np.array(spher2cart(r1,azimuth_recon_1,colatitude_recon_1)) + np.array(centroid_1)
            estimate_2 = np.array(spher2cart(r2,azimuth_recon_2,colatitude_recon_2)) + np.array(centroid_2)
            # Calculate the euclidean distance between them
            dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
            # Compare the distance to this tolerance
            if dist < tol:
                if dist == 0:
                    return estimate_1, [], True
                else:
                    return estimate_1, estimate_2, False
            ## If the distance is too big break out of the loop
            if dist > prevDist:
                prevDist = 9999999
                break
            prevDist = dist
    return [], [], False
Code for Method (3)
import math
import numpy as np

def match2(azimuth_recon_1, colatitude_recon_1, azimuth_recon_2, colatitude_recon_2, centroid_1, centroid_2):
    # Set a Tolerance factor
    tol = 3e-2
    def calculate_radius_2(azimuth_1,colatitude_1,azimuth_2,colatitude_2):
        """Return radius 2 using both pairs of angles (azimuth and colatitude). Equation is provided in the document"""
        return 1/((1-(math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
                   +math.cos(azimuth_1)*math.cos(azimuth_2))**2)
    def calculate_radius_1(radius_2,azimuth_1,colatitude_1,azimuth_2,colatitude_2):
        """Returns radius 1 using both pairs of angles (azimuth and colatitude) and radius 2.
        Equation provided in document"""
        return (radius_2)*((math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
                           +math.cos(azimuth_1)*math.cos(azimuth_2))
    # Compute radius 2
    radius_2 = calculate_radius_2(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
    # Compute radius 1
    radius_1 = calculate_radius_1(radius_2,azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
    # Get the estimates
    estimate_1 = np.array(spher2cart(radius_1,azimuth_recon_1,colatitude_recon_1)) + np.array(centroid_1)
    estimate_2 = np.array(spher2cart(radius_2,azimuth_recon_2,colatitude_recon_2)) + np.array(centroid_2)
    # Calculate the euclidean distance between them
    dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
    # Compare the distance to this tolerance
    if dist < tol:
        if dist == 0:
            return estimate_1, [], True
        else:
            return estimate_1, estimate_2, False
    else:
        return [], [], False
My question is two-fold:
Is there a faster and more accurate way to find the radii for both points?
If so, how do I do it?
EDIT: I am thinking about just creating two numpy arrays of the two radii and then comparing them via numpy boolean logic. However, I would still be comparing them iteratively. Is there a faster way to perform this comparison?

Use a kd-tree for such situations. It will easily look up the minimal distance:
import numpy as np
from scipy import spatial

def match(azimuth_recon_1, colatitude_recon_1, azimuth_recon_2, colatitude_recon_2, centroid_1, centroid_2):
    # r1, r2, outputs_list_1, outputs_list_2 and two_pair_mic_list are not defined in
    # this snippet; they are assumed to come from the surrounding code.
    cartesian_1 = np.array([np.cos(azimuth_recon_1)*np.sin(colatitude_recon_1),np.sin(azimuth_recon_1)*np.sin(colatitude_recon_1),np.cos(colatitude_recon_1)]) #[np.newaxis,:]
    cartesian_2 = np.array([np.cos(azimuth_recon_2)*np.sin(colatitude_recon_2),np.sin(azimuth_recon_2)*np.sin(colatitude_recon_2),np.cos(colatitude_recon_2)]) #[np.newaxis,:]
    # Re-center them via adding the centroid
    estimate_1 = r1*cartesian_1.T + np.array(centroid_1)[np.newaxis,:]
    estimate_2 = r2*cartesian_2.T + np.array(centroid_2)[np.newaxis,:]
    # Add them to the output list
    n = estimate_1.shape[0]
    outputs_list_1.append(estimate_1)
    outputs_list_2.append(estimate_2)
    # Reshape them so that they are in proper format
    a = np.array(outputs_list_1).reshape(len(two_pair_mic_list)*n,3)
    b = np.array(outputs_list_2).reshape(len(two_pair_mic_list)*n,3)
    # Get the difference
    c = a - b
    # Put into a KDtree
    tree = spatial.KDTree(c)
    # Find the indices where the radius (distance between the points) is 3e-3 or less,
    # i.e. where the difference vector lies within 3e-3 of the origin
    indices = tree.query_ball_point(np.zeros(3), 3e-3)
This will output a list of the indices where the distance is 3e-3 or less. Now all you will have to do is use the list of indices with the estimate list to find the exact points. And there you have it, this will save you a lot of time and space!
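For comparison with the lookup above, the closest approach of two lines can also be solved directly as a small least-squares problem, which is closer in spirit to Method (3) in the question. This is a different technique from the kd-tree, sketched here under assumptions: the function names `direction` and `closest_approach` are hypothetical, the angle convention is copied from the kd-tree snippet, and the lines are treated as infinite in both directions.

```python
import numpy as np

def direction(azimuth, colatitude):
    """Unit vector for the given angles, using the same convention as the snippet above."""
    return np.array([np.cos(azimuth) * np.sin(colatitude),
                     np.sin(azimuth) * np.sin(colatitude),
                     np.cos(colatitude)])

def closest_approach(az1, col1, az2, col2, centroid_1, centroid_2):
    """Least-squares radii that bring the two lines closest together.

    Returns (r1, r2, point_1, point_2, distance). A negative radius means the
    closest approach lies behind that centroid, not on the forward ray.
    """
    d1, d2 = direction(az1, col1), direction(az2, col2)
    c1 = np.asarray(centroid_1, dtype=float)
    c2 = np.asarray(centroid_2, dtype=float)
    # Solve r1*d1 - r2*d2 = c2 - c1 in the least-squares sense.
    A = np.column_stack((d1, -d2))
    (r1, r2), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1, p2 = c1 + r1 * d1, c2 + r2 * d2
    return r1, r2, p1, p2, np.linalg.norm(p1 - p2)

# Two rays in the z = 0 plane that should meet at (0.5, 0.5, 0).
r1, r2, p1, p2, dist = closest_approach(np.pi / 4, np.pi / 2,
                                        3 * np.pi / 4, np.pi / 2,
                                        [0, 0, 0], [1, 0, 0])
```

The returned distance can then be compared against the same tolerance the question uses to decide whether the two directions count as intersecting.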

How to set LpVariable and Objective Function in pulp for LPP as per the formula?

I want to calculate the maximised value for a particular user based on Interest, Popularity, or both, using the following Linear Programming Problem (LPP) equation with the pulp package in Python 3.7.
I have 4 lists
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
and 2 variable values as
e=0.5 ; e may take (0 or 1 or 0.5)
budget=20
and
i = 0 to n, where n is the length of the list,
meaning the summation is performed over all list values.
Here, e==0 means Interest will be 0; e==1 means Popularity will be 0; e==0.5 means both Interest and Popularity will be considered for the max value.
Also, xi takes the value 0 or 1: if xi==1 the user will be considered, and if xi==0 the user will not be considered.
My pulp code is below:
from pulp import *
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
int_vars = LpVariable.dicts("Interest", INTEREST,0,4,LpContinuous)
pop_vars = LpVariable.dicts("Popularity", POPULARITY,0,4,LpContinuous)
user_vars = LpVariable.dicts("User", USER,0,4,LpBinary)
#OBJECTIVE fUNCTION
prob += lpSum(USER(i)((INTEREST[i]*e for i in INTEREST) +
(POPULARITY[i]*(1-e) for i in POPULARITY)))
# CONSTRAINTS
prob += USER(i)cost(i) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
Now I am getting 2 errors:
1) line 714, in addInPlace for e in other:
2) line 23, in
prob += lpSum(INTEREST[i]e for i in INTEREST) +
lpSum(POPULARITY[i](1-e) for i in POPULARITY)
IndexError: list index out of range
What did I do wrong in my code? Please guide me to resolve this problem. Thanks in advance.

I think I finally understand what you are trying to achieve. I think the problem with your description is to do with terminology. In a linear program we reserve the term variable for those variables which we want to be selected or chosen as part of the optimisation.
If I understand your needs correctly your python variables e and budget would be considered parameters or constants of the linear program.
I believe this does what you want:
from pulp import *
import numpy as np
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
COST = [2,4,6,8,10]
N = len(COST)
set_user = range(N)
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
x = LpVariable.dicts("user_selected", set_user, 0, 1, LpBinary)
# OBJECTIVE fUNCTION
prob += lpSum([x[i]*(INTEREST[i]*e + POPULARITY[i]*(1-e)) for i in set_user])
# CONSTRAINTS
prob += lpSum([x[i]*COST[i] for i in set_user]) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
# Show which users selected
x_soln = np.array([x[i].varValue for i in set_user])
print("user_vars: ")
print(x_soln)
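This should return the following. With these particular parameters only the last user is selected for inclusion, although other selections that also cost exactly 10 (the first and fourth users, or the second and third) reach the same objective value of 22.5, so the solver may report one of those instead. The decision changes with the parameters: for example, if you increase the budget to 100, all users will be selected.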
Status : Optimal
The Max Value = 22.5
user_vars:
[0. 0. 0. 0. 1.]
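Because there are only five binary variables, the result can be cross-checked without a solver by trying all 32 selections. This is a plain brute-force check, not a pulp feature; the names `value`, `best_value` and `best_choices` are made up for the sketch.

```python
from itertools import product

INTEREST = [5, 10, 15, 20, 25]
POPULARITY = [4, 8, 12, 16, 20]
COST = [2, 4, 6, 8, 10]
e = 0.5
budget = 10

# Objective coefficient of each user: e*Interest + (1 - e)*Popularity.
value = [INTEREST[i] * e + POPULARITY[i] * (1 - e) for i in range(len(COST))]

best_value, best_choices = float("-inf"), []
for x in product([0, 1], repeat=len(COST)):       # every 0/1 assignment of the x_i
    total_cost = sum(xi * c for xi, c in zip(x, COST))
    if total_cost > budget:
        continue                                  # violates the budget constraint
    total_value = sum(xi * v for xi, v in zip(x, value))
    if total_value > best_value:
        best_value, best_choices = total_value, [x]
    elif total_value == best_value:
        best_choices.append(x)                    # record ties as well

print(best_value, best_choices)
```

Enumeration only works for a handful of variables; for larger problems the pulp model above is the practical route.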

How to start this "Number Density of Particles" homework in Python?

Part 2 - Determination of Number Density of Particles
If we say that q is the production rate of particles of a specific size, then in an interval dt the total number of particles produced is just q dt. To make things concrete in what follows, please adopt the case:
a = 0.9 a_max
q = 100000
Consider this number of particles at some distance r from the nucleus. The number density of particles will be number divided by volume, so to find number density we must compute the volume of a shell of radius r with a thickness that corresponds to how far the particles will travel in our time interval dt. Obviously that’s just the velocity of the particle at radius r times the time interval, v(r) dt, so the volume of our shell is:
Volume = Shell Surface Area × Shell Thickness = $4\pi r^{2} v(r)\,dt$
Therefore, the number density, n, at radius r is:
$n(r) = \frac{q\,dt}{4\pi r^{2} v(r)\,dt} = \frac{q}{4\pi r^{2} v(r)}$ (Equation 5)
You will note that our expression above will have a singularity for the number density of particles right at the surface of the nucleus, since at that position the outward velocity, v(R), is 0. Obviously this is an indication that we expect the particle density n to drop very rapidly as the dust is accelerated away from the surface. For now, let's not worry about this point (we don't need it later) and just graph how the number density varies with distance from the nucleus, starting with the first point after the surface value.
• Evaluate Equation 5 for all calculated points using the parameters for q and a given above.
• Make a log-log graph of the number density versus radius. You should find that, after terminal velocity is achieved, the number density decreases as $r^{-2}$, corresponding to a slope of -2 on a log-log plot.
Current code:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as pl
R = 2000 #Nucleus Radius (m)
GM_n = 667 #Nucleus gravitational parameter GM (m^3 s^-2)
Q = 7*10**27 #Gas Production Rate (molecules s^-1)
V_g = 1000 #Gas Velocity (m s^-1)
C_D = 4 #Drag Coefficient Dimensionless
p_d = 500 #Grain Density (kg m^-3)
M_h2o = .01801528/(6.022*10**23) #Mass of a water molecule (kg)
pi = np.pi
p_g_R = M_h2o*Q/(4*np.pi*R**2*V_g)
print ('Gas Density at the comets nucleus: ', p_g_R)
a_max = (3/8)*C_D*(V_g**2)*p_g_R*(1/p_d)*((R**2)/GM_n)
print ('Radius of Maximum Size Particle: ', a_max)
def drag_force(C_D,V_g,p_g_R,pi,a,v):
    drag = .5*C_D*((V_g - v)**2)*p_g_R*pi*a**2
    return drag
def grav_force(GM_n,M_d,r):
    grav = -(GM_n*M_d)/(r**2)
    return grav
def p_g_r(p_g_R,R,r):
    p_g_r = p_g_R*(R**2/r**2)
    return p_g_r
dt = 1
tfinal = 100000
v0 = 0
t = np.arange(0.,tfinal+dt,dt)
npoints = len(t)
r = np.zeros(npoints)
v = np.zeros(npoints)
r[0]= R
v[0]= v0
a = np.array([0.9,0.5,0.1,0.01,0.001])*a_max
for j in range(len(a)):
    M_d = 4/3*pi*a[j]**3*p_d
    for i in range(len(t)-1):
        rmid = r[i] + v[i]*dt/2.
        # acceleration = (gravity + drag) / grain mass
        vmid = v[i] + (grav_force(GM_n,M_d,r[i])+drag_force(C_D,V_g,p_g_r(p_g_R,R,r[i]),pi,a[j],v[i]))/M_d*dt/2.
        r[i+1] = r[i] + vmid*dt
        v[i+1] = v[i] + (grav_force(GM_n,M_d,rmid)+drag_force(C_D,V_g,p_g_r(p_g_R,R,rmid),pi,a[j],vmid))/M_d*dt
    pl.plot(r,v)
pl.show()
a_2= 0.9*a_max
q = 100000
I have never programmed anything like this before; my class is very difficult for me and I don't understand it. I developed the above code with the help of the professor, and I am nearly out of time to finish this project. I just want help understanding the problem.
How do I find v(r) when I only have v(t) and r(t)?
What do I do to calculate the r values, and which r values do I even use?

You have v as a known function of time and also r as another known function of time. You can invert these to get t vs. v and t vs. r. To get v as a function of r, eliminate t.
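In practice, eliminating t can be as simple as pairing the arrays: r[i] and v[i] belong to the same time step, so the pairs (r[i], v[i]) already sample v(r). The sketch below is one possible reading of the assignment, not the course's reference solution. It reuses the question's parameters, integrates only the a = 0.9 a_max grain, divides the net force by the grain mass to get an acceleration, and uses a hypothetical step of dt = 10 s to keep the loop short.

```python
import numpy as np

# Parameters copied from the question (SI units).
R, GM_n, Q, V_g, C_D, p_d = 2000.0, 667.0, 7e27, 1000.0, 4.0, 500.0
M_h2o = 0.01801528 / 6.022e23              # mass of one water molecule (kg)
q = 100000.0                               # particle production rate

p_g_R = M_h2o * Q / (4 * np.pi * R**2 * V_g)                # gas density at the surface
a_max = (3 / 8) * C_D * V_g**2 * p_g_R / p_d * R**2 / GM_n
a = 0.9 * a_max                                             # grain radius (m)
M_d = 4 / 3 * np.pi * a**3 * p_d                            # grain mass (kg)

def accel(r, v):
    """Net outward acceleration (drag minus gravity, divided by grain mass)."""
    rho = p_g_R * R**2 / r**2
    drag = 0.5 * C_D * (V_g - v)**2 * rho * np.pi * a**2
    grav = GM_n * M_d / r**2
    return (drag - grav) / M_d

dt, nsteps = 10.0, 10000                   # hypothetical step size and count
r = np.empty(nsteps + 1)
v = np.empty(nsteps + 1)
r[0], v[0] = R, 0.0
for i in range(nsteps):                    # same midpoint scheme as the question
    rmid = r[i] + v[i] * dt / 2
    vmid = v[i] + accel(r[i], v[i]) * dt / 2
    r[i + 1] = r[i] + vmid * dt
    v[i + 1] = v[i] + accel(rmid, vmid) * dt

# Equation 5 at every computed point except the surface, where v = 0.
r_pts, v_pts = r[1:], v[1:]
n = q / (4 * np.pi * r_pts**2 * v_pts)

# Slope of log n against log r over the last 10% of the points.
tail = slice(-len(r_pts) // 10, None)
slope = np.polyfit(np.log10(r_pts[tail]), np.log10(n[tail]), 1)[0]
```

Plotting `r_pts` against `n` with matplotlib's `loglog` gives the requested graph, and `slope` should come out close to -2 once the velocity has levelled off.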
