I'm approaching a clustering problem using k-means on my laptop (using JupyterLab).
As a first step, after data pre-processing, I'm trying to find the optimal number of clusters by calculating silhouette and elbow measures.
So I'm running this code:
test = df_std
wcss_cust = []
sil_score = []
n_clusters = range(2, 9)
for i in n_clusters:
    clusterer = KMeans(n_clusters = i, init = 'k-means++', random_state = 42 , max_iter=150)
    cluster_labels = clusterer.fit_predict(test)
    silhouette_avg = silhouette_score(test, cluster_labels)
    wcss_cust.append(clusterer.inertia_/1000)
    sil_score.append(silhouette_avg*1000)
    print("For n_clusters =", i, "The average silhouette_score and wcss are : silhouette_score= ", silhouette_avg.round(decimals=2) , " , wcss: " , clusterer.inertia_ )
Note:
Please consider that my dataframe has 250k rows, each of which is a customer I want to cluster.
The number of features I'm using is not that high (just 4 numeric measures).
The problem I'm having is that the code runs weirdly slowly.
My Python is 64-bit, like my laptop.
Is there any way to increase the memory usage?
Thank you
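A possible direction, sketched under assumptions: with 250k rows the slow part is likely silhouette_score rather than KMeans itself, because the silhouette needs pairwise distances between samples. scikit-learn's silhouette_score accepts a sample_size argument that evaluates the score on a random subset. The random array X below is only a stand-in for df_std, and the subset size of 1000 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for df_std (assumption): 5000 rows, 4 standardized numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 4))

wcss = []
sil_scores = []
for k in range(2, 5):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=42, max_iter=150)
    labels = km.fit_predict(X)
    # Score a random subset instead of all rows to avoid the full
    # pairwise-distance computation.
    sil = silhouette_score(X, labels, sample_size=1000, random_state=42)
    wcss.append(km.inertia_)
    sil_scores.append(sil)
    print(f"k={k}  silhouette (sampled)={sil:.2f}  wcss={km.inertia_:.1f}")
```

The sampled silhouette is an estimate, so it can vary slightly with random_state; the inertia (WCSS) values are unaffected by the sampling.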
I have a simple pi-approximating script like so:
import numpy as np
import matplotlib.pyplot as plt
import time
start = 10
stop = 1000000
step = 100
exactsolution = np.pi
def montecarlopi(N=1000000):
    random_x = np.random.random(size = N)
    random_y = np.random.random(size = N)
    bod = np.array([random_x, random_y]).T
    square_area = N
    quarter_circle_area = np.count_nonzero(np.linalg.norm(bod, axis = 1)<=1)
    pi_approx = 4*quarter_circle_area/square_area
    return pi_approx
if __name__ == '__main__':
    times = []
    results = []
    attemps = np.arange(start = start, stop = stop, step = step)
    for i in attemps:
        start_time = time.time()
        results.append(montecarlopi(i))
        times.append(time.time()-start_time)
    absolute_errors = np.abs(np.array(results)-exactsolution)
and I want to know how long the calculation takes depending on the number of random attempts I use. As you can see, I use a for loop to get each of the calculation times I need, but this defeats the purpose of NumPy and slows down my code a lot. Effectively, I'd like to just call montecarlopi() on the whole attemps array, but then I wouldn't have the calculation times.
Is there a way to time each parallelized calculation NumPy does?
I used the timing code from the answer provided here:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
I only had to change labels to codes in the line:
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Then you can run your whole timing experiment using
timings.plot_times([montecarlopi], inputs=np.arange(10, 1000000, 1000), repeats=3)
and get an output like this:
[Figure: Timing linear]
Or, more clearly, using log spacing:
timings.plot_times([montecarlopi], inputs=np.logspace(1, 8, 8, dtype=int), repeats=3)
[Figure: Timing Logspace]
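If you do not want to depend on the linked timings module, the same idea can be sketched with the standard library only. This is a minimal sketch, not the linked implementation: time_for_inputs is a hypothetical helper name, it keeps the best of a few repeats per input size, and montecarlopi is rewritten here without the intermediate 2-D array so the block is self-contained.

```python
import time
import numpy as np

def montecarlopi(N=1_000_000):
    # Same estimator as in the question: fraction of random points in the
    # unit square that fall inside the quarter circle, times 4.
    x = np.random.random(size=N)
    y = np.random.random(size=N)
    inside = np.count_nonzero(x * x + y * y <= 1.0)
    return 4 * inside / N

def time_for_inputs(func, inputs, repeats=3):
    """Return the best wall-clock time in seconds of func(n) for each n."""
    best_times = []
    for n in inputs:
        runs = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            func(int(n))
            runs.append(time.perf_counter() - t0)
        best_times.append(min(runs))
    return np.array(best_times)

inputs = np.logspace(1, 6, 6, dtype=int)  # log-spaced sample counts
best = time_for_inputs(montecarlopi, inputs)
for n, t in zip(inputs, best):
    print(f"N={int(n):>8d}  best of 3: {t:.6f}s")
```

The Python loop here runs once per input size, which is cheap compared with the vectorized work inside montecarlopi for large N, so it does not defeat the purpose of NumPy.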
I have loaded a huge image as a NumPy array of dimensions H x W x 3. I want to split this single image into a 15 x 15 grid and transform it into a 225 x H/15 x W/15 x 3 NumPy array, where the ordering is either row-wise or column-wise. Note that H and W are exact multiples of 15.
I know that this can be done using two for loops as shown below,
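count = -1
for row in range(15):
    for col in range(15):
        count += 1
        h1, h2 = row*(H//15), (row+1)*(H//15)   # row bounds of this tile
        w1, w2 = col*(W//15), (col+1)*(W//15)   # column bounds of this tile
        subimage[count,:,:,:] = img[h1:h2, w1:w2, :]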
but this takes time (I have to repeat this process for 100,000 images, all of which are huge).
Is there a faster NumPy code to re-organize a single image into 225 sub-images as illustrated above?
It looks like most of the time is spent copying the hugeimage array values into the subimages array. The only way I've found to speed up the process is to keep the resulting subimages as a list of subarray references instead of a NumPy array. This speeds up the subimage creation a lot but has 2 drawbacks:
You'll need to adapt the code that follows to the new format.
The elements of the list are references to hugeimage, so modifying a subimageslist2[i] array will also alter the hugeimage array values.
Here is a small script that compares your version and the list version:
import numpy as np
import time
# Preparation of testdata
R, C = 15, 15
H, W, D = 400*R, 400*C, 3
hugeimage = np.random.randint(0,255,(H,W,D))
# For loop version
t_start = time.time()
subimages = np.zeros((R*C,H//R,W//C,D),dtype='int')
count = -1
for row in range(R):
    for col in range(C):
        count+=1
        h1, h2, w1, w2 = row*(H//R), (row+1)*(H//R), col*(W//C), (col+1)*(W//C)
        subimages[count,:,:,:] = hugeimage[h1:h2, w1:w2, :]
print(f'Timer 1: {time.time()-t_start}s')
# For loop list (no copy)
t_start = time.time()
subimageslist2 = []
for row in range(R):
    for col in range(C):
        h1, h2, w1, w2 = row*(H//R), (row+1)*(H//R), col*(W//C), (col+1)*(W//C)
        subimageslist2.append(hugeimage[h1:h2, w1:w2, :])
print(f'Timer 2: {time.time()-t_start}s')
subimages2 = np.array(subimageslist2)
print(f'Timer 2 bis: {time.time()-t_start}s')
print('Results 1&2 are equal' if np.linalg.norm(subimages-subimages2)==0 else 'Results 1&2 differ')
Output:
% python3 script.py
Timer 1: 0.38389086723327637s
Timer 2: 0.0003371238708496094s
Timer 2 bis: 0.3779451847076416s
Results 1&2 are equal
As you can see, adapting your code to work with the list subimageslist2 speeds up this portion of the code. You can then run subimages2 = np.array(subimageslist2) to turn the list of subarray references into a NumPy array, but this performs a copy and you lose the performance improvement (Timer 2 bis).
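A different technique from the list approach above, shown only as a sketch: the loops can be replaced by a reshape and an axis swap. split_into_grid is a hypothetical helper name. The intermediate 5-D array is a view of the image, but the final reshape to (rows*cols, h, w, D) still has to copy the data, so whether it beats the loop on your images is something to measure rather than assume.

```python
import numpy as np

def split_into_grid(img, rows=15, cols=15):
    """Split an H x W x D image into rows*cols tiles, ordered row-wise."""
    H, W, D = img.shape
    h, w = H // rows, W // cols
    # (H, W, D) -> (rows, h, cols, w, D): no copy, just a new view.
    blocks = img.reshape(rows, h, cols, w, D)
    # Bring the two grid axes next to each other: (rows, cols, h, w, D).
    blocks = blocks.swapaxes(1, 2)
    # Merge the grid axes into one tile index; this step copies the data.
    return blocks.reshape(rows * cols, h, w, D)

# Small check against the double loop from the question.
R, C = 15, 15
img = np.random.randint(0, 255, (2 * R, 3 * C, 3))
tiles = split_into_grid(img, R, C)

expected = np.zeros_like(tiles)
count = -1
for row in range(R):
    for col in range(C):
        count += 1
        expected[count] = img[row*2:(row+1)*2, col*3:(col+1)*3, :]

print(tiles.shape, np.array_equal(tiles, expected))
```

For column-wise ordering, one option is to apply transpose(1, 0, 2, 3, 4) to the swapped array before the final reshape.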
I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a NumPy ndarray, to group the colors while maintaining their relative position (the diagonal yellow line) and the labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(os.path.join(path, f), encoding="utf-8").read() for f in text_files if f.endswith('.txt')]  # join with path so files open from any working directory
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
    if f.endswith('.txt'):
        labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to the other elements, i.e. have similar colors.
I start with the element that has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose, from the remaining elements, the one that is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm
def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))] # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i)) # b
    return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
    _ax.set_xticks(ti)
    _ax.set_yticks(ti)
    _ax.set_xticklabels(la)
    _ax.set_yticklabels(la)
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image). The method works but is not perfect (see second image).
Because I use the similarity between two elements as an approximation of their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements, one can calculate a second-order similarity matrix, which measures how similar the similarities are (sorry).
This measure better describes what you are interested in: if two rows / columns have similar colors, they should be close to each other. The algorithm to sort the matrix is the same as before.
def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise
def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image),
but I guess there are cases where this approach also fails, so I would be happy to get feedback.
Edit
I tried to explain it and also linked the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second, more verbose explanation.
You have n elements and an n x n similarity matrix sm, where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is its similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore those elements by setting their values to -1.
You can use the function ticks to reorder the labels:
labels = np.array(labels) # make labels a NumPy array so it can be indexed with idx
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
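To make the verbose explanation concrete, here is a tiny hand-checkable example. The 4 x 4 matrix is made up for illustration; argsort_sim_mat is the function from above, repeated so the snippet runs on its own.

```python
import numpy as np

def argsort_sim_mat(sm):
    # [a] start with the element that has the most in common with all others
    idx = [int(np.argmax(np.sum(sm, axis=1)))]
    for _ in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1                    # ignore elements already in the list
        idx.append(int(np.argmax(sm_i)))  # [b] closest remaining element
    return np.array(idx)

# Made-up similarities: elements 0 and 2 are close, and so are 1 and 3.
sm = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.2, 0.8, 0.4, 1.0],
])

order = argsort_sim_mat(sm)
print(order)  # [2 0 3 1]
```

Element 2 has the largest row sum (2.6), so it comes first. Its closest remaining neighbour is 0 (0.9); from 0 the closest remaining one is 3 (0.2 vs 0.1), and 1 is the only element left.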
@scleronomic's solution is very elegant, but it has one shortcoming: we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables, some of which are weakly correlated:
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
    mask = belongs_to_cluster == ind
    # weaken the correlation for clusters with an even index
    if ind % 2 == 0:
        variables[mask] += latent[ind]*0.1
    else:
        variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, variables within clusters that have an even index are only weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, we may need to find another way.
This is based on this post, which is the accepted answer to the question "Clustering a correlation matrix". To sort a correlation matrix into blocks, we need to find blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same or at least similar. Therefore, I improved the solution with a new function, sort_corr_mat:
def sort_corr_mat(corr_mat,clusters_guess):
    def _swap_rows(corr_mat, var1, var2):
        rs = corr_mat.copy()
        rs[var2, :],rs[var1, :]= corr_mat[var1, :],corr_mat[var2, :]
        cs = rs.copy()
        cs[:, var2],cs[:, var1] = rs[:, var1],rs[:, var2]
        return cs
    # analysis
    max_iter = 500
    best_score,current_score,best_count = -1e8,-1e8,0
    num_minimua_to_visit = 20
    best_corr = corr_mat
    best_ordering = np.arange(n_variables)
    for i in range(max_iter):
        for row1 in range(n_variables):
            for row2 in range(n_variables):
                if row1 == row2: continue
                option_ordering = best_ordering.copy()
                option_ordering[row1],option_ordering[row2] = best_ordering[row2],best_ordering[row1]
                option_corr = _swap_rows(best_corr,row1,row2)
                option_score = score(option_corr,n_variables,clusters_guess)
                if option_score > best_score:
                    best_corr = option_corr
                    best_ordering = option_ordering
                    best_score = option_score
        if best_score > current_score:
            best_count += 1
            current_corr = best_corr
            current_ordering = best_ordering
            current_score = best_score
        if best_count >= num_minimua_to_visit:
            return best_corr#,best_ordering
    return best_corr#,best_ordering
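The function above calls score(option_corr, n_variables, clusters_guess), whose definition is not shown in this post. The following is only a hypothetical stand-in with the same signature, not the scoring function actually used for the plots: it assumes clusters_guess contiguous blocks of nearly equal size (with 1 < clusters_guess < n_variables) and rewards orderings where absolute correlations are high inside the blocks and low outside them.

```python
import numpy as np

def score(corr, n_variables, clusters_guess):
    """Hypothetical block score: mean |corr| inside blocks minus mean |corr| outside.

    Assumes 1 < clusters_guess < n_variables and nearly equal block sizes.
    """
    # Assign each row/column position to one of clusters_guess contiguous blocks.
    block_id = np.arange(n_variables) * clusters_guess // n_variables
    same_block = block_id[:, None] == block_id[None, :]
    off_diag = ~np.eye(n_variables, dtype=bool)  # skip the trivial diagonal
    inside = np.abs(corr[same_block & off_diag]).mean()
    outside = np.abs(corr[~same_block]).mean()
    return inside - outside

# Toy check: two perfect blocks {0, 1} and {2, 3}.
blocky = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
perm = [0, 2, 1, 3]                      # interleave the two blocks
shuffled = blocky[perm, :][:, perm]

ordered_score = score(blocky, 4, 2)
shuffled_score = score(shuffled, 4, 2)
print(ordered_score, shuffled_score)     # 1.0 -0.5
```

Because the blocks are assumed to be equally sized, this stand-in does not capture the handling of unequal block sizes described above.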
With this function and the corr_mat constructed in the first place, I compared the result obtained with my function (on the right) with the one obtained with @scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, @scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.
I'm trying to generate the initial population for a genetic algorithm. I need to generate 20 random binary strings of length 18. So far I have been able to generate just one string. My question is: how do I use another loop to generate the 20 strings that I need?
I think this could be solved using nested loops. I've tried to do that, but I don't know how to use them correctly.
import random
binaryString = []
for i in range(0, 18):
    x = str(random.randint(0, 1))
    binaryString.append(x)
print (''.join(binaryString))
import numpy as geek
num_bits = 18
individualsPer_pop = 20
#Defining the population size
# The population will have individualsPer_pop chromosomes, where each chromosome has num_bits genes.
pop_size = (individualsPer_pop,num_bits)
#Creating the initial population.
new_population = geek.random.randint(low = 0, high = 2, size = pop_size)
print(new_population)
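If you need actual strings rather than a 2-D integer array, the nested-loop idea from the question can be sketched with the standard library alone. random_binary_strings is a hypothetical helper name, and the seed parameter is added here only to make runs repeatable.

```python
import random

def random_binary_strings(count=20, length=18, seed=None):
    """Return `count` random strings of '0'/'1' characters, each `length` long."""
    rng = random.Random(seed)
    population = []
    for _ in range(count):                # outer loop: one pass per individual
        bits = []
        for _ in range(length):           # inner loop: one pass per gene
            bits.append(str(rng.randint(0, 1)))
        population.append("".join(bits))
    return population

population = random_binary_strings(20, 18, seed=0)
for individual in population:
    print(individual)
```

The important detail compared with the original attempt is that the list of bits is created inside the outer loop, so each individual starts from an empty list.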