Kmeans clustering with map reduce in spark - apache-spark

Hello, can someone help me do map reduce with K-means using Spark? I can actually run K-means with Spark, but I don't know how to map and reduce it.
Thanks.

Below is a proposed pseudo-code for your exercise:
centroids = k random sampled points from the dataset
do
    Map:
        Given a point and the set of centroids
        Calculate the distance between the point and each centroid
        Emit the point and the closest centroid
    Reduce:
        Given the centroid and the points belonging to its cluster
        Calculate the new centroid as the arithmetic mean position of the points
        Emit the new centroid
    prev_centroids = centroids
    centroids = new_centroids
while prev_centroids - centroids > threshold
The mapper class calculates the distance between the data point and each centroid, then emits the index of the closest centroid and the data point:
class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)
The reducer calculates the new approximation of the centroid and emits it.
class REDUCER
    method REDUCE(centroid_index, list_of_partial_sums)
        point_sum = 0
        number_of_points = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / number_of_points
        EMIT(centroid_index, centroid_value)
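As a rough illustration of the mapper and reducer pseudo-code above, here is a minimal plain-Python sketch of a single iteration, with no Spark involved. The names map_point, combine and kmeans_iteration are made up for this example, points are assumed to be tuples of floats, and the distance is assumed to be Euclidean.

```python
import math

def map_point(point, centroids):
    """Map step: return (index of closest centroid, (point, 1))."""
    distances = [math.dist(point, c) for c in centroids]
    closest = distances.index(min(distances))
    return closest, (point, 1)

def combine(a, b):
    """Reduce step: add two (coordinate sum, count) partial sums."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b

def kmeans_iteration(points, centroids):
    """Run one map/reduce round and return the new centroids by index."""
    partial = {}
    for key, value in (map_point(p, centroids) for p in points):
        partial[key] = combine(partial[key], value) if key in partial else value
    # Divide each coordinate sum by the number of points in the cluster.
    return {k: tuple(x / n for x in s) for k, (s, n) in partial.items()}

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids = kmeans_iteration(points, centroids=[(1.0, 1.0), (9.0, 9.0)])
print(new_centroids)  # {0: (0.0, 1.0), 1: (10.0, 11.0)}
```

Carrying a (sum, count) pair instead of a running mean is what lets the partial results be merged in any order.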
The actual K-Means Spark implementation:
First, read the file with the points and generate the initial centroids by random sampling with takeSample(False, k): this function takes k random samples, without replacement, from the RDD, so the application generates the initial centroids in a distributed manner and avoids moving all the data to the driver. The RDD is reused at every iteration of the algorithm, so cache it in memory with cache() to avoid re-evaluating it every time an action is triggered:
import time

def init_centroids(dataset, k):
    start_time = time.time()
    # takeSample(False, k): k random points, sampled without replacement
    initial_centroids = dataset.takeSample(False, k)
    print("init centroid execution:", len(initial_centroids), "in",
          (time.time() - start_time), "s")
    return initial_centroids

points = sc.textFile(INPUT_PATH).map(Point).cache()
initial_centroids = init_centroids(points, k=parameters["k"])
After that, iterate the mapper and reducer stages until the stopping criterion is met or the maximum number of iterations is reached.
# Initialization implied by the loop below (not shown in the original snippet).
centroids_broadcast = sc.broadcast(initial_centroids)
n = 0
while True:
    print("--Iteration n. {itr:d}".format(itr=n+1), end="\r",
          flush=True)
    cluster_assignment_rdd = points.map(assign_centroids)
    sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
    centroids_rdd = sum_rdd.mapValues(lambda x:
        x.get_average_point()).sortByKey(ascending=True)
    new_centroids = [item[1] for item in centroids_rdd.collect()]
    stop = stopping_criterion(new_centroids, parameters["threshold"])
    n += 1
    if(stop == False and n < parameters["maxiteration"]):
        centroids_broadcast = sc.broadcast(new_centroids)
    else:
        break
The stopping condition is computed this way:
def stopping_criterion(new_centroids, threshold):
    old_centroids = centroids_broadcast.value
    for i in range(len(old_centroids)):
        check = old_centroids[i].distance(new_centroids[i],
                                          distance_broadcast.value) <= threshold
        if check == False:
            return False
    return True
To represent the points, a class Point has been defined. It is characterized by the following fields:
- a numpy array of components
- number of points: a point can be seen as the aggregation of many points, so this variable tracks the number of points represented by the object
It includes the following operations:
- distance (the type of distance can be passed as a parameter)
- sum
- get_average_point: this method returns a point whose components are the average of the actual components over the number of points represented by the object
import numpy as np
from numpy import linalg

class Point:
    def __init__(self, line):
        values = line.split(",")
        self.components = np.array([round(float(k), 5) for k in values])
        self.number_of_points = 1

    def sum(self, p):
        # Accumulate in place: this object becomes a partial sum of points.
        self.components = np.add(self.components, p.components)
        self.number_of_points += p.number_of_points
        return self

    def distance(self, p, h):
        # h is the order of the norm; a negative value falls back to Euclidean.
        if (h < 0):
            h = 2
        return linalg.norm(self.components - p.components, h)

    def get_average_point(self):
        self.components = np.around(np.divide(self.components,
                                              self.number_of_points), 5)
        return self
The mapper method is invoked, at each iteration, on the input file that contains the points of the dataset:
cluster_assignment_rdd = points.map(assign_centroids)
The assign_centroids function assigns the closest centroid to each point it is invoked on. The centroids are taken from the broadcast variable. The function returns the result as a tuple (id of the centroid, point):
def assign_centroids(p):
    min_dist = float("inf")
    centroids = centroids_broadcast.value
    nearest_centroid = 0
    for i in range(len(centroids)):
        distance = p.distance(centroids[i], distance_broadcast.value)
        if(distance < min_dist):
            min_dist = distance
            nearest_centroid = i
    return (nearest_centroid, p)
The reduce stage is done using two Spark transformations:
reduceByKey: for each cluster, computes the sum of the points belonging to it. It requires a function that takes two arguments and returns a single element; this function must be commutative and associative in the mathematical sense.
sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
mapValues: calculates the average point of each cluster at the end of each stage. The points are already grouped by key, and this transformation operates only on the value associated with each key. The results are sorted to make comparisons easier.
centroids_rdd = sum_rdd.mapValues(lambda x:
    x.get_average_point()).sortBy(lambda x: x[1].components[0])
The get_average_point() function returns the newly computed centroid.
def get_average_point(self):
    self.components = np.around(np.divide(self.components,
                                          self.number_of_points), 5)
    return self
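The requirement that the function passed to reduceByKey be associative and commutative is the reason the reduce stage sums points and only divides afterwards. The following plain-Python sketch (no Spark; add_partial and naive_average are names invented for this illustration) shows that merging (sum, count) pairs gives the same mean in any order, whereas averaging two values at a time does not.

```python
from functools import reduce

def add_partial(a, b):
    """Associative, commutative merge of (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def naive_average(a, b):
    """Averaging two values at a time: NOT associative."""
    return (a + b) / 2

values = [1.0, 2.0, 3.0, 6.0]
pairs = [(v, 1) for v in values]

# Same result whichever order the partial sums are merged in.
total, count = reduce(add_partial, pairs)
total_rev, count_rev = reduce(add_partial, reversed(pairs))
mean = total / count
mean_rev = total_rev / count_rev

# The pairwise average depends on the order of evaluation.
left = reduce(naive_average, values)
right = reduce(naive_average, reversed(values))

print(mean, mean_rev)  # 3.0 3.0
print(left, right)     # 4.125 2.125
```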

You don't need to write map-reduce. You can use spark dataframe API and use Spark ML library.
You can read more about it here.
https://spark.apache.org/docs/latest/ml-clustering.html

Related

What is the time complexity of Dijkstra's Algorithm? Am I implementing my code right?

I was trying to implement Dijkstra's SPA for LeetCode question #743, but I wasn't sure whether my algorithm was written in the most efficient manner. Could I get some feedback on how to improve my code? I would also like an explanation of my code's time complexity. Thanks!
The variables are:
times: list of directed edges in [u, v, w] format. u is the source node, v is the target node, and w is the weight of the edge.
n: number of nodes
k: initial node
Here is my code:
def listToDict(self, times):
    times_dict = defaultdict(list)
    for edge in times:
        u,v,w = edge
        times_dict[u].append((w,v))
    return times_dict
def networkDelayTime(self, times: List[List[int]], n: int, k: int) -> int:
    distances = [float("inf")] * (n)
    # O(E)
    graph_dict = self.listToDict(times)
    nodes_to_visit = [(0,k)]
    while nodes_to_visit:
        w, u = heapq.heappop(nodes_to_visit)
        if distances[u-1] > w:
            distances[u-1] = w
            for neighbor in graph_dict[u]:
                w2, v = neighbor
                if distances[v-1] == float("inf"):
                    heapq.heappush(nodes_to_visit, (w+w2, v))
    if float("inf") in distances:
        return -1
    else:
        return max(distances)

Writing user defined function to evaluate the Saha equation with for-loops for an expected output

I am trying to create a function that evaluates the Saha function for certain values of temperature and electron pressure. The question is a little in depth so I will provide as much detail as possible about past code used before this section.
Previous sections code
Evaluating the partition function (part 1):
k= 8.617333262145179e-05
T=10000.
g=1.0
Ca_ion_energies = np.array([6.1131554, 11.871719, 50.91316, 67.2732, 84.34]) #in eV
Ca_partition_values= []
def partfunc_E(chiI,T):
    for chiI in Ca_ion_energies:
        elem = 0
        for i in np.arange(chiI):
            elem = elem + (g*np.exp(-(i/(k*T))))
        Ca_partition_values.append(elem)
    return Ca_partition_values
print(partfunc_E(Ca_ion_energies,T))
Output:
[1.455902590894594, 1.45633321917395, 1.4563345239240013, 1.4563345239240013, 1.4563345239240013]
Evaluating the Boltzmann equation (part 2):
chiI = np.array([6.1131554, 11.871719, 50.91316, 67.2732, 84.34]) #in eV
k= 8.617333262145179e-05
T=10000.
def boltz_E(chiI,T,I,i):
    Z_1 = partfunc_E(chiI,T)
    ratio = np.exp(-i/(k*T)) / Z_1
    return ratio [I-1]
print(Ca_ion_energies)
print("i Fraction in level i for I=1 (neutral)")
print("- -------------------------------------")
for n in range(0,10):
    print(n,boltz_E(chiI,10000,1,n))
Output:
[ 6.1131554 11.871719 50.91316 67.2732 84.34 ]
i Fraction in level i for I=1 (neutral)
- -------------------------------------
0 0.6868591389658425
1 0.21522358567610525
2 0.06743914320048579
3 0.021131689732463026
4 0.006621500359539954
5 0.002074811222693332
6 0.0006501308428703751
7 0.0002037149733085943
8 6.383298193775377e-05
9 2.0001718660577703e-05
Question I need help with (and my code so far):
Evaluating the Saha equation (part 3):
The instructions for this section are as follows:
The simplest way to get this ratio is to set 𝑁_𝐼=1 (i.e. the neutral atom) to some value (e.g. unity), evaluate the next ionisation-stage populations successively from the Saha equation in a for loop, and at the end divide them by the sum of all the 𝑁 on the same scale. You will find the numpy np.sum function useful to get the total over all stages. We want temperature T to be 5000K and electron pressure Pe to be 100.0 N/m^2.
FYI: I is the ionisation stage, Z_1 is the partition function from part 1, Z_I is the partition function for stage I+1, Pe is the electron pressure, chiI are the ionisation energies (for Calcium in my code), T is temperature and the function that "fraction" is set equal to is the Saha equation.
It should start something like:
def saha_E(chiI,T,Pe,I):
    """compute Saha population fraction N_I/N
    input: ionisation energies, temperature, electron pressure, ion stage"""
    # Compute the partition functions
    # Loop over each ionisation stage that you have an energy for, computing the fraction via the saha equation. Note that the first stage should be set to 1.
    # Divide each stage by the total
    # Return the fraction of the requested stage
My code attempt:
k= 8.617333262145179e-05
T=10000.
g=1.0
Ca_ion_energies = np.array([6.1131554, 11.871719, 50.91316, 67.2732, 84.34])
N_I = 1
h = 6.626e-34
m = 9.11e-31
fractions = []
fraction_sum = []
def saha_E(chiI,T,Pe,I):
    Z_1 = partfunc_E(chiI,T)
    Z_I = partfunc_E(chiI+1,T)
    for I in Ca_ion_energies:
        fraction = (N_I*(Z_I/Z_1)*((2*k*T)/((h**3)*Pe))*((2*np.pi*m*k*T)**(3/2))*np.exp(-I/(k*T)))
        fractions.append(fraction)
        fraction_sum.append(np.sum(fractions))
    for i in fractions:
        i/fraction_sum
    return fraction
print("For ionisation energies (in eV) of:",chiI)
print()
print("I Fraction in stage I")
print("- -------------------")
for I in range(0,6):
    print(I,saha_E(chiI,5000,100.0,I))
I am instructed also that the output should be something similar to:
For ionisation energies (in eV) of: [ 6.11 11.87 50.91 67.27 84.34]
I Fraction in stage I
- -------------------
1 0.999998720736
2 1.27926351211e-06
3 7.29993420039e-52
4 1.3474665329e-113
5 1.54848994685e-192
Firstly, I don't think my code is correct but it is the best I can do which is why I need some help, but also, this code is giving me the following error:
TypeError: unsupported operand type(s) for /: 'list' and 'list'
If my code is totally wrong please tell me as I have spent so much time trying to figure this out already.
Edit
This question is still not completely answered, please keep commenting!
If I understood your problem well, my approach is to calculate the "fractions" and "fractions sums" in a single loop on the various energies, and normalize only once we are outside the loop.
Also, be careful with the scope of your code. I moved some variables you declared outside of the function inside of it, because there is no reason to keep them alive outside of the function's scope.
Be careful also not to use the same name twice. Your function takes an I argument but then has an I variable in a for loop.
As said in the chat, you want to write docstrings and comments so that you know where you are going even before touching any code. Here is a base to complete:
import numpy as np
# Constants.
k = 8.617333262145179e-05
g = 1.0
h = 6.626e-34
m = 9.11e-31
Ca_ion_energies = np.array([6.1131554, 11.871719, 50.91316, 67.2732, 84.34]) # in eV.
# Partition function.
def partfunc_E(chiI, T):
    """This function returns the partition of blablabla.
    args:
    ------
    :chiI: (array or list) the energy levels of a chosen ion.
    :T: (float) the temperature at which kT will be calculated."""
    Ca_partition_values = []
    for energy_level in chiI: # For each energy level.
        elem = 0
        for i in np.arange(energy_level): # From 0 to current energy level.
            elem += g*np.exp(-(i/(k*T)))
        Ca_partition_values.append(elem)
    return np.array(Ca_partition_values) # Conversion to numpy array to support operations later.
print(partfunc_E(Ca_ion_energies, T=10000))
# Boltzmann equation.
def boltz_E(chiI, T, I, i):
    Z_1 = partfunc_E(chiI, T)
    ratio = np.exp(-i/(k*T)) / Z_1
    return ratio[I-1]
print(Ca_ion_energies)
print("i Fraction in level i for I=1 (neutral)")
print("- -------------------------------------")
for n in range(0,10):
    print(n, boltz_E(Ca_ion_energies, T=10000, I=1, i=n))
# Saha equation.
def saha_E(chiI, T, Pe, i):
    p = partfunc_E(chiI, T)
    Z_ratios = np.array([p[n]/p[0] for n in range(len(chiI))])
    fractions = []
    fractions_sum = []
    for n, I in enumerate(chiI):
        fraction = Z_ratios[n]*((2*k*T)/((h**3)*Pe))*((2*np.pi*m*k*T)**(3/2))*np.exp(-I/(k*T))
        fractions.append(fraction)
        fractions_sum.append(np.sum(fractions))
    # Let's normalize the array before returning it.
    fractions = np.divide(fractions, fractions_sum)
    return fractions[i]
print("For ionisation energies (in eV) of:", Ca_ion_energies)
print()
print("I Fraction in stage n")
print("- -------------------")
for n in range(0, 4):
    print(n, saha_E(Ca_ion_energies, T=5000, Pe=100.0, i=n))
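The recipe quoted in the question (set the first stage to unity, build each next stage from the previous one, then divide everything by the total) can be sketched independently of the physics. This is only an illustration of the bookkeeping: stage_fractions is a hypothetical helper, and the ratios passed to it are made-up numbers standing in for the values the Saha equation would give, not calcium results.

```python
def stage_fractions(ratios):
    """Turn successive ratios N_(r+1)/N_r into fractions N_r/N.

    ratios[j] is assumed to be the population of stage j+2 divided by
    that of stage j+1 (stages numbered from 1), already computed elsewhere.
    """
    populations = [1.0]                  # first stage set to unity
    for ratio in ratios:
        # Each stage is built from the one before it.
        populations.append(populations[-1] * ratio)
    total = sum(populations)             # normalize once, after the loop
    return [n / total for n in populations]

fractions = stage_fractions([0.5, 0.5])  # made-up ratios, for illustration only
print(fractions)
print(sum(fractions))
```

Normalizing by the single total after the loop, rather than by a running sum inside it, is what makes the returned fractions add up to one.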

Grouping algorithm based on distance

I have a set of 200 points with x and y coordinates. I need to make batches of 20 such that each point in the batch is "n" centimeters away from the other 19 i.e. no two points in one batch are within "n" cm of the other. A point should only belong to one batch. How do I solve this?
I've used trees to draw out branches such that a new node is added only if it is "n" cm away from all other nodes in the branch. This works but is extremely slow.
Input.csv: |Point Name||X coordinate||Y coordinate|
Output: Lists of batches
From what I understand of your question, this should do it:
import math
from random import randrange
test_data = {(randrange(0, 1000), randrange(0, 1000)) for _ in range(200)}
inf = float("inf")
def distance(p1, p2):
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])
def batch(data: set, min_distance, max_distance=inf, count=20, remove=True):
    result = [next(iter(data))]
    candidates = {t for t in data if max_distance > distance(result[0], t) > min_distance}
    while len(result) < count and len(candidates) > 0:
        result.append(candidates.pop())
        candidates = {t for t in data if all(max_distance > distance(p, t) > min_distance for p in result)}
    if len(result)<count:
        raise ValueError("Not enough values in data that have great enough distances between each other")
    if remove:
        data.difference_update(result)
    return result
print(len(test_data))
i = 1
while test_data:
    print(i,batch(test_data, 10))
    i+=1
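To check that a batch produced this way really respects the spacing constraint, a small independent test can help. The sketch below is an addition for illustration (min_pairwise_distance and is_valid_batch are hypothetical helper names) and assumes points are (x, y) tuples and that "n" is expressed in the same unit as the coordinates.

```python
import math
from itertools import combinations

def min_pairwise_distance(points):
    """Smallest distance between any two points of a batch."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))

def is_valid_batch(points, n):
    """True if every pair of points is more than n apart."""
    return min_pairwise_distance(points) > n

good = [(0, 0), (30, 0), (0, 40)]
bad = [(0, 0), (3, 4), (100, 100)]   # (0, 0) and (3, 4) are only 5 apart
print(is_valid_batch(good, 10), is_valid_batch(bad, 10))  # True False
```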

K means with a condition

I want to apply K-means (or any other simple clustering algorithm) to data with two variables, but I want the clusters to respect a condition: the sum of a third variable per cluster > some_value.
Is that possible?
Notations :
- K is the number of clusters
- let's say that the first two variables are point coordinates (x,y)
- V denotes the third variable
- Ci : the sum of V over each cluster i
- S the total sum (sum Ci)
- and the threshold T
Problem definition :
From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.
Task 1 - group points by proximity to centroids [kmeans]
Task 2 - for each cluster i, Ci > T [constraint]
Regular kmeans limitation for the constraint problem :
A regular kmeans assigns points to centroids by taking them in arbitrary order. In our case, this leads to uncontrolled growth of the Ci while adding points.
For example, take K=2, T=40 and 4 points with the third variable equal to V1=50, V2=1, V3=50, V4=50.
Suppose also that points P1, P3, P4 are closer to centroid 1. Point P2 is closer to centroid 2.
Let's run the assignment step of a regular kmeans and keep track of Ci :
1-- take point P1, assign it to cluster 1. C1=50 > T
2-- take point P2, assign it to cluster 2. C2=1
3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
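The four steps above can be reproduced in a few lines of Python. This is only a bookkeeping sketch of the example: the dictionaries hold the values given in the text, and the nearest-centroid assignments are taken as stated rather than computed from coordinates.

```python
# Values of the third variable V for P1..P4 and each point's nearest
# centroid, exactly as in the example above.
third_var = {"P1": 50, "P2": 1, "P3": 50, "P4": 50}
nearest = {"P1": 1, "P2": 2, "P3": 1, "P4": 1}
T = 40

cluster_sums = {1: 0, 2: 0}
for name in ("P1", "P2", "P3", "P4"):   # arbitrary processing order
    cluster_sums[nearest[name]] += third_var[name]

print(cluster_sums)                                 # {1: 150, 2: 1}
print({c: s > T for c, s in cluster_sums.items()})  # {1: True, 2: False}
```

Cluster 1 ends far above the threshold while cluster 2 never reaches it, which is the imbalance the modified assignment is meant to avoid.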
Modified kmeans :
In the previous example, we want to prevent C1 from growing too much and help C2 grow.
This is like pouring champagne into several glasses : if you see a glass with less champagne, you go and fill it. You do that because you have constraints : a limited amount of champagne (S is bounded) and because you want every glass to have enough champagne (Ci>T).
Of course this is just an analogy. Our modified kmeans will add new points to the cluster with minimal Ci until the constraint is achieved (Task 2). Now in which order should we add points ? By proximity to centroids (Task 1). After the constraints are achieved for all clusters i, we can just run a regular kmeans on the remaining unassigned points.
Implementation :
Next, I give a python implementation of the modified algorithm. Figure 1 displays a representation of the third variable using transparency to visualize large vs. low values. Figure 2 displays the evolution of the clusters using color.
You can play with the accept_thresh parameter. In particular, note that :
For accept_thresh=0 => regular kmeans (constraint is reached immediately)
For accept_thresh = third_var.sum().sum() / (2*K), you might observe that some points that are closer to a given centroid are assigned to another one for constraint reasons.
CODE :
import sys  # needed for sys.exit() further down
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import time
nb_samples = 1000
K = 3 # for demo purpose, used to generate cloud points
c_std = 1.2
# Generate test samples :
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \
                                      centers=K, cluster_std=c_std)
third_var_distribution = 'cubic_bycluster' # 'uniform'
if third_var_distribution == 'uniform':
    third_var = np.random.random((nb_samples))
elif third_var_distribution == 'linear_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes
elif third_var_distribution == 'cubic_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes
# Threshold parameters :
# Try with K=3 and :
# T = K => one cluster reaches the constraint, two clusters won't converge
# T = 2K =>
accept_thresh = third_var.sum().sum() / (2*K)
def dist2centroids(points, centroids):
    '''return arrays of ordered points to each centroids
    first array is index of points
    second array is distance to centroid
    dim 0 : centroid
    dim 1 : distance or point index
    '''
    dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
    ord_dist_indices = np.argsort(dist, axis=1)
    ord_dist_indices = ord_dist_indices.transpose()
    dist = dist.transpose()
    return ord_dist_indices, dist
def assign_points_with_constraints(inds, dists, tv, accept_thresh):
    assigned = [False] * nb_samples
    assignements = np.ones(nb_samples, dtype=int) * (-1)
    cumul_third_var = np.zeros(K, dtype=float)
    current_inds = np.zeros(K, dtype=int)
    max_round = nb_samples * K
    for round in range(0, max_round): # we'll break anyway
        # worst advanced cluster in terms of cumulated third_var :
        cluster = np.argmin(cumul_third_var)
        if cumul_third_var[cluster] > accept_thresh:
            continue # cluster had enough samples
        while current_inds[cluster] < nb_samples:
            # add points to increase cumulated third_var on this cluster
            i_inds = current_inds[cluster]
            closest_pt_index = inds[i_inds][cluster]
            if assigned[closest_pt_index] == True:
                current_inds[cluster] += 1
                continue # pt already assigned to a cluster
            assignements[closest_pt_index] = cluster
            cumul_third_var[cluster] += tv[closest_pt_index]
            assigned[closest_pt_index] = True
            current_inds[cluster] += 1
            new_cluster = np.argmin(cumul_third_var)
            if new_cluster != cluster:
                break
    return assignements, cumul_third_var
def assign_points_with_kmeans(points, centroids, assignements):
    new_assignements = np.array(assignements, copy=True)
    count = -1
    for asg in assignements:
        count += 1
        if asg > -1:
            continue # point already assigned by the constrained step
        pt = points[count, :]
        distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
        centroid = np.argmin(distances)
        new_assignements[count] = centroid
    return new_assignements
def move_centroids(points, labels):
    centroids = np.zeros((K, 2), dtype=float)
    for k in range(0, K):
        # use the labels argument rather than the global assignements
        centroids[k] = points[labels == k].mean(axis=0)
    return centroids
rgba_colors = np.zeros((third_var.size, 4))
rgba_colors[:, 0] = 1.0
rgba_colors[:, 3] = 0.1 + (third_var / max(third_var))/1.12
plt.figure(1, figsize=(14, 14))
plt.title("Three blobs", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)
# Initialize centroids
centroids = np.random.random((K, 2)) * 10
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
# Step 1 : order points by distance to centroid :
inds, dists = dist2centroids(points, centroids)
# Check if clustering is theoretically possible :
tv_sum = third_var.sum()
tv_max = third_var.max()
if (tv_max > 1 / 3 * tv_sum):
    print("No solution to the clustering problem !\n")
    print("For one point : third variable is too high.")
    sys.exit(0)
stop_criter_eps = 0.001
epsilon = 100000
prev_cumdist = 100000
plt.figure(2, figsize=(14, 14))
ln, = plt.plot([])
plt.ion()
plt.show()
while epsilon > stop_criter_eps:
    # Modified kmeans assignment :
    assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)
    # Kmeans on remaining points :
    assignements = assign_points_with_kmeans(points, centroids, assignements)
    centroids = move_centroids(points, assignements)
    inds, dists = dist2centroids(points, centroids)
    epsilon = np.abs(prev_cumdist - dists.sum().sum())
    print("Delta on error :", epsilon)
    prev_cumdist = dists.sum().sum()
    plt.clf()
    plt.title("Current Assignements", fontsize='small')
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
    plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
    for k in range(0, K):
        plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
    plt.show()
    plt.pause(1)
Improvements :
- use the distribution of the third variable for assignments.
- manage divergence of the algorithm
- better initialization (kmeans++)
One way to handle this would be to filter the data before clustering.
>>> cluster_data = df.loc[df['third_variable'] > some_value]
>>> from sklearn.cluster import KMeans
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)
If by sum you mean the sum of the third variable per cluster, then you could use RandomizedSearchCV to search for KMeans hyperparameters and check which settings do or do not meet the condition.
K-means itself is an optimization problem.
Your additional constraint is a rather common optimization constraint, too.
So I'd rather approach this with an optimization solver.

How to initialize centroids in "k-means clustering" belonging to the domain of the data points?

How can I modify this code to initialize the centroids within the domain of the data points?
For example, if DATA = [[2.0, 5.0], [1.0, 5.0], [22.0, 55.0], [42.0, 12.0], [15.0, 16.0]]
then centroids (x, y) could take any value such that x belongs to [1, 42] and y belongs to [5, 55].
The centroids do not necessarily have to be data points.
Note: the data type of the data is float.
import random
import math
BIG_NUMBER = math.pow(10, 10)
data = []
centroids = []
class Centroid:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def set_x(self, x):
        self.x = x
    def get_x(self):
        return self.x
    def set_y(self, y):
        self.y = y
    def get_y(self):
        return self.y
def initialize_centroids(k,DATA):
    for j in range(k):
        x = random.choice(DATA)
        centroids.append(Centroid(x[0], x[1]))
    return
The usual way of initializing k-means uses randomly sampled data points.
Initialization by drawing random numbers from the data range does not improve results. This may seem like a good idea at first, but it is highly problematic, because it is built on the false assumption that the data is uniformly distributed. On the contrary, data is clustered, and the best centers are in the very middle of the cluster. In particular, you will see empty clusters very often, so this initialization is usually your worst choice.
If you insist, find the minimum and maximum on each axis, then draw random values from Uniform[min; max] each.
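A minimal sketch of that last suggestion, assuming the data is a list of [x, y] float pairs as in the question. The function name init_centroids_in_range is made up for this example, and it returns plain lists rather than the Centroid objects of the question.

```python
import random

def init_centroids_in_range(data, k, rng=random):
    """Draw k centroids uniformly from the bounding box of the data."""
    # Per-axis minimum and maximum of the data points.
    lows = [min(axis) for axis in zip(*data)]
    highs = [max(axis) for axis in zip(*data)]
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        for _ in range(k)
    ]

DATA = [[2.0, 5.0], [1.0, 5.0], [22.0, 55.0], [42.0, 12.0], [15.0, 16.0]]
centroids = init_centroids_in_range(DATA, k=3)
print(centroids)  # three [x, y] pairs with x in [1, 42] and y in [5, 55]
```

As noted above, this tends to produce empty clusters on clustered data, so it is shown only to answer the question as asked.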
Your current method is akin to the Forgy method of choosing initial centroids. Instead of looping and making random choices, you could use random.sample to choose k data points. This is generally a good method. However, your comment contradicts the question, in stating that the centroids must not be data points.
An alternative method is to assign each data point to an initial partition at random (for example, shuffle and then slice the data) and use the calculated centroids of the k randomly chosen partitions:
random.shuffle(data)
random_partitions = [data[i::k] for i in range(k)]
# centroid of a partition = per-axis mean of its points
centroids = [[sum(axis) / len(partition) for axis in zip(*partition)] for partition in random_partitions]
This method tends to put the centroids near the middle of the data, which may be desirable.
See https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods

Resources