Increasing memory usage python | jupyterlab - python-3.x

I'm approaching a clustering problem using k-means on my laptop (in JupyterLab).
As a first step, after data pre-processing, I'm trying to find the optimal number of clusters by calculating silhouette and elbow (WCSS) measures.
So I'm running this code:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

test = df_std  # df_std: the standardized dataframe from the pre-processing step
wcss_cust = []
sil_score = []
n_clusters = range(2, 9)
for i in n_clusters:
    clusterer = KMeans(n_clusters=i, init='k-means++', random_state=42, max_iter=150)
    cluster_labels = clusterer.fit_predict(test)
    silhouette_avg = silhouette_score(test, cluster_labels)
    wcss_cust.append(clusterer.inertia_ / 1000)
    sil_score.append(silhouette_avg * 1000)
    print("For n_clusters =", i,
          "the average silhouette_score and wcss are: silhouette_score =",
          silhouette_avg.round(decimals=2), ", wcss:", clusterer.inertia_)
Note:
Please consider that my dataframe has 250k rows, each of which is a customer I want to cluster.
The number of features I'm using is not that high (just 4 numeric measures).
The problem I'm having is that the code runs weirdly slow.
My Python is 64-bit, like my laptop.
Is there any way to increase the memory usage?
Thank you
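A note on the likely bottleneck (a suggestion, not part of the original question): silhouette_score computes pairwise distances, which is quadratic in the number of rows, so on 250k customers it costs far more than KMeans itself. Below is a minimal sketch of one common workaround, assuming the same df_std as above: estimate the silhouette on a random subsample via the sample_size argument, and optionally fit with MiniBatchKMeans.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

test = df_std  # same standardized data as above
wcss_cust, sil_score = [], []
for i in range(2, 9):
    # MiniBatchKMeans fits on small random batches instead of all 250k rows at once
    clusterer = MiniBatchKMeans(n_clusters=i, random_state=42, batch_size=1024, max_iter=150)
    cluster_labels = clusterer.fit_predict(test)
    # Estimate the silhouette on a 10k-row subsample instead of the full dataset
    silhouette_avg = silhouette_score(test, cluster_labels, sample_size=10000, random_state=42)
    wcss_cust.append(clusterer.inertia_ / 1000)
    sil_score.append(silhouette_avg * 1000)
    print("n_clusters =", i, "silhouette ~", round(float(silhouette_avg), 2), ", wcss:", clusterer.inertia_)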

Related

What is an efficient way to make a dataset and dataloader for high frequency time series with multiple individuals?

I'm trying to forecast high-frequency time series using LSTMs and the PyTorch library. I'm going through the PyTorch tutorial for creating custom datasets and models, and I figured out how to create my Dataset class and my DataLoader. They work fine, but they take too much time to generate one batch.
I want to generate batches of fixed size; each batch contains time series from different individuals, and the input window has the same length as the output window (multi-step prediction).
I think the issue is due to the fact that I'm verifying that the windows are correct.
My dataframe has a little more than 3M rows and 6 columns. I have some 100 individuals, and for each individual I have 4 different time series $y_{1}$, $y_{2}$, $y_{3}$ and $y_{4}$. I have no missing values at all and the time steps are consecutive. For each individual I have the same time steps.
My code is:
import numpy as np
import torch
from torch.utils.data import Dataset

class TSDataset(Dataset):
    def __init__(self, train_data, unique_column='unique_id', input_length=3840,
                 target_length=3840, targets=['y1', 'y2', 'y3', 'y4'], transform=None):
        self.train_data = train_data
        self.unique_column = unique_column
        self.input_length = input_length
        self.target_length = target_length
        self.total_window_length = input_length + target_length
        self.targets = targets

    def __len__(self):
        return len(self.train_data)

    def verify_time_steps(self, idx):
        change = False
        # Check that the window doesn't overlap several individuals
        num_individuals = self.train_data.iloc[np.arange(idx + self.total_window_length), :][self.unique_column].unique().shape[0]
        if num_individuals != 1:
            change = True
        if idx + self.total_window_length >= len(self.train_data):
            change = True
        return change

    def reshuffle(self):
        return np.random.randint(0, len(self.train_data))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        change = self.verify_time_steps(idx)
        if change == True:
            while change != False:
                idx = self.reshuffle()
                change = self.verify_time_steps(idx)
        sample = self.train_data.iloc[np.arange(idx, idx + self.input_length), :][self.targets].values
        labels = self.train_data.iloc[np.arange(idx + self.input_length, idx + self.input_length + self.target_length), :][self.targets].values
        sample = torch.from_numpy(sample)
        labels = torch.from_numpy(labels)
        return sample, labels
I've tried using the TimeSeriesDataset from PyTorchForecasting, but I had a hard time creating models that suit it.
I've also tried creating the dataset up front as a NumPy array, but my RAM can't handle it.
I hope you can help me figure out how to alleviate the computations.
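One pattern that usually removes the per-item verification cost (a sketch of an alternative, not the original code): precompute, once in __init__, every valid window start position, i.e. positions whose full window stays inside a single individual's rows, and let __getitem__ only slice. The class and attribute names below (PrecomputedTSDataset, valid_starts) are illustrative, and the sketch assumes each individual's rows are stored contiguously and sorted by time.
import numpy as np
import torch
from torch.utils.data import Dataset

class PrecomputedTSDataset(Dataset):
    # Sketch: enumerate valid window starts once, so __getitem__ only slices.
    def __init__(self, train_data, unique_column='unique_id',
                 input_length=3840, target_length=3840,
                 targets=('y1', 'y2', 'y3', 'y4')):
        self.values = train_data[list(targets)].to_numpy()  # shape (N, 4)
        self.input_length = input_length
        self.target_length = target_length
        total = input_length + target_length
        ids = train_data[unique_column].to_numpy()
        self.valid_starts = []
        for uid in np.unique(ids):
            pos = np.flatnonzero(ids == uid)   # rows of this individual (assumed contiguous)
            last_start = pos[0] + len(pos) - total
            if last_start >= pos[0]:
                self.valid_starts.extend(range(pos[0], last_start + 1))
        self.valid_starts = np.asarray(self.valid_starts)

    def __len__(self):
        return len(self.valid_starts)

    def __getitem__(self, idx):
        start = self.valid_starts[idx]
        mid = start + self.input_length
        sample = torch.from_numpy(self.values[start:mid])
        labels = torch.from_numpy(self.values[mid:mid + self.target_length])
        return sample, labels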

Extraction of N most frequent keywords per cluster in Hierarchical Clustering NLP

I want to extract the N most frequent keywords per cluster from the results of agglomerative hierarchical clustering.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def agglomerative_clustering(tfidf_matrix):
    cluster = AgglomerativeClustering(n_clusters=95, affinity='euclidean', linkage='ward')
    cluster.fit_predict(tfidf_matrix)
    print(cluster.n_clusters_)
    labels = cluster.labels_
    print("labels is " + str(labels.shape))
    #labels = list(labels)[0]
    print("test" + str(labels))
    return labels

def tfidf(data):
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(data)
    feature_names = vectorizer.get_feature_names()
    dense = vectors.todense()
    denselist = dense.tolist()
    df = pd.DataFrame(denselist, columns=feature_names)
    return vectors, feature_names

vectors, terms = tfidf(cleaned_documents)
labels = agglomerative_clustering(vectors.toarray())
lib['cleaned_documents'] = pd.Series(cleaned_documents)  # lib: an existing DataFrame (not shown here)
lib['clusterAgglomerative'] = pd.Series(labels)
X = pd.DataFrame(vectors.toarray(), lib['cleaned_documents'])  # columns argument is optional
X['Cluster'] = labels
# Add column corresponding to cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()
# To get the sorted list for a numbered cluster, in this case cluster 2
print("Top terms per cluster:")
print(word_frequencies_by_cluster.loc[2, :].sort_values(ascending=False))
The result I want is each cluster with its N most frequent keywords.
I tried this solution, but it doesn't seem efficient:
df_lib = pd.DataFrame(lib['cleaned_documents'],lib['clusterAgglomerative'])
print(df_lib)
grouped_df = df_lib.groupby("clusterAgglomerative")
grouped_lists = (grouped_df["cleaned_documents"]).agg(lambda column: ", ".join(set(column)))
print("keywords per cluster")
print(grouped_lists)
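For the actual goal, the N highest-weighted keywords per cluster, one hedged sketch is to keep the tf-idf terms as column names, sum the weights within each cluster, and take the N largest columns per cluster. It assumes the vectors, terms and labels produced above; top_n and top_terms_per_cluster are illustrative names.
import pandas as pd

top_n = 10  # illustrative choice

# One row per document, one column per term, weighted by tf-idf
tfidf_df = pd.DataFrame(vectors.toarray(), columns=terms)
tfidf_df['Cluster'] = labels

# Total tf-idf weight of every term within each cluster
cluster_term_weights = tfidf_df.groupby('Cluster').sum()

# For each cluster, list the top_n terms with the highest total weight
top_terms_per_cluster = {
    cluster: row.nlargest(top_n).index.tolist()
    for cluster, row in cluster_term_weights.iterrows()
}
for cluster, keywords in top_terms_per_cluster.items():
    print(cluster, keywords)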

Sort similarity matrix according to plot colors

I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a NumPy ndarray, so that similar colors are grouped together, while maintaining the relative positions (the diagonal yellow line) and the labels as well.
import os
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print(text_files)

tfidf_vectorizer = TfidfVectorizer()
# join the directory so the files open regardless of the working directory
documents = [open(os.path.join(path, f), encoding="utf-8").read()
             for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)

labels = []
for f in text_files:
    if f.endswith('.txt'):
        labels.append(f)

pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()

fig, ax = plt.subplots(figsize=(20, 20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to the other elements, i.e. have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose, from the remaining elements, the one which is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt

def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm

def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))]  # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i))  # b
    return np.array(idx)

n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx]  # apply reordering for rows and columns

# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)

def ticks(_ax, ti, la):
    _ax.set_xticks(ti)
    _ax.set_yticks(ti)
    _ax.set_xticklabels(la)
    _ax.set_yticklabels(la)

ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image). This method works, but it is not perfect (see second image).
Because I use the similarity between two elements as an approximation of their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements, one could calculate a second-order similarity matrix, which measures how similar the similarities are (sorry).
This measure describes better what you are interested in: if two rows / columns have similar colors, they should be close to each other. The algorithm to sort the matrix is the same as before.
def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise

def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)

sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)

idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]

# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image)
but I guess there exist cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and also linked the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second, more verbose explanation.
You have n elements and an n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is its similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore these elements by setting their values to -1.
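As a tiny worked illustration of this greedy ordering (the 3 x 3 matrix below is made up for the example, not taken from the question):
sm = np.array([[1.0, 0.2, 0.9],
               [0.2, 1.0, 0.4],
               [0.9, 0.4, 1.0]])
# Row sums are 2.1, 1.6, 2.3, so element 2 starts; its most similar remaining
# element is 0 (0.9), and then only 1 is left, giving the order [2, 0, 1].
print(argsort_sim_mat(sm))  # -> [2 0 1]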
You can use the function ticks to reorder the labels:
labels = np.array(labels)  # make labels a NumPy array, so it can be indexed with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
#scleronomic's solution is very elegant, but it has one shortcoming: we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables, some of which are weakly correlated:
import string
import numpy as np
import pandas as pd

n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)

names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0, n_clusters, n_variables)
latent = np.random.randn(n_clusters, n_samples)
variables = np.random.rand(n_variables, n_samples)
for ind in range(n_clusters):
    mask = belongs_to_cluster == ind
    # weakening the correlation
    if ind % 2 == 0: variables[mask] += latent[ind]*0.1
    variables[mask] += latent[ind]

df = pd.DataFrame({key: val for key, val in zip(names, variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, the variables within clusters that have an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks, what we need to find are blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat:
def sort_corr_mat(corr_mat, clusters_guess):

    def _swap_rows(corr_mat, var1, var2):
        rs = corr_mat.copy()
        rs[var2, :], rs[var1, :] = corr_mat[var1, :], corr_mat[var2, :]
        cs = rs.copy()
        cs[:, var2], cs[:, var1] = rs[:, var1], rs[:, var2]
        return cs

    # analysis
    max_iter = 500
    best_score, current_score, best_count = -1e8, -1e8, 0
    num_minimua_to_visit = 20
    best_corr = corr_mat
    best_ordering = np.arange(n_variables)
    for i in range(max_iter):
        for row1 in range(n_variables):
            for row2 in range(n_variables):
                if row1 == row2: continue
                option_ordering = best_ordering.copy()
                option_ordering[row1], option_ordering[row2] = best_ordering[row2], best_ordering[row1]
                option_corr = _swap_rows(best_corr, row1, row2)
                option_score = score(option_corr, n_variables, clusters_guess)
                if option_score > best_score:
                    best_corr = option_corr
                    best_ordering = option_ordering
                    best_score = option_score
        if best_score > current_score:
            best_count += 1
            current_corr = best_corr
            current_ordering = best_ordering
            current_score = best_score
        if best_count >= num_minimua_to_visit:
            return best_corr  # ,best_ordering
    return best_corr  # ,best_ordering
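Note that sort_corr_mat relies on a score function that is not shown in this excerpt. As a placeholder only, a hedged sketch of what such a block score might look like (my assumption, not the original author's definition): reward high average correlation inside clusters_guess equally sized diagonal blocks.
import numpy as np

def score(corr_mat, n_variables, clusters_guess):
    # Hypothetical block score (assumption): mean absolute correlation inside
    # clusters_guess equally sized diagonal blocks of the (reordered) matrix.
    block_ids = np.array_split(np.arange(n_variables), clusters_guess)
    total = 0.0
    for ids in block_ids:
        block = corr_mat[np.ix_(ids, ids)]
        total += np.abs(block).mean()
    return total / clusters_guess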
With sort_corr_mat and the corr_mat constructed above, I compared the result obtained with my function (on the right) with that obtained with #scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, #scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.

How to create an initial population of 20 chromosomes with Python 3

I'm trying to generate the initial population for a genetic algorithm. I need to generate 20 random binary strings of length 18. I have been able to generate just one string. My question is: how do I use another loop in order to generate the 20 strings that I need?
I think this could be solved using nested loops. I've tried to do that, but I don't know how to use them correctly.
import random

binaryString = []
for i in range(0, 18):
    x = str(random.randint(0, 1))
    binaryString.append(x)
print(''.join(binaryString))
import numpy as geek

num_bits = 18
individualsPer_pop = 20

# Defining the population size. The population will have individualsPer_pop
# chromosomes, where each chromosome has num_bits genes.
pop_size = (individualsPer_pop, num_bits)

# Creating the initial population.
new_population = geek.random.randint(low=0, high=2, size=pop_size)
print(new_population)
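Since the question asks specifically how to do it with another loop, here is a minimal plain-Python sketch (no NumPy) that nests the original loop inside a second one to build 20 strings of 18 random bits:
import random

population = []
for chromosome in range(20):      # 20 chromosomes in the population
    binaryString = []
    for gene in range(18):        # 18 random bits per chromosome
        binaryString.append(str(random.randint(0, 1)))
    population.append(''.join(binaryString))

for chromosome in population:
    print(chromosome)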

Simulating 10,000 Coinflips in Python Very Slow

I am writing a simulation that creates 10,000 periods of 25 sets, with each set consisting of 48 coin tosses. Something in this code is making it run very slowly. It has been running for at least 20 minutes and it is still working. A similar simulation in R runs in under 10 seconds.
Here is the python code I am using:
import pandas as pd
from random import choices

threshold = 17
all_periods = pd.DataFrame()
for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        # Data frame with 48 weeks as rows. Each run through the loop adds one
        # more year as a column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
    positives = simulated_period[simulated_period == 1].count(axis=1)
    negatives = simulated_period[simulated_period == -1].count(axis=1)
    # Combine positives and negatives that are more than the threshold into a single dataframe
    sig = pd.DataFrame([[sum(positives >= threshold), sum(negatives >= threshold)]],
                       columns=['positive', 'negative'])
    sig['total'] = sig['positive'] + sig['negative']
    # Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])
If it helps, here is the R script that is running quickly:
flip <- function(threshold=17){
  # threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative
  outcomes <- c(1, -1)
  trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
  trial <- as.data.frame(t(trial)) # 48 weeks in columns, 25 years in rows.
  summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
  summary <- as.data.frame(t(summary)) # use data frame so $pos/$neg can be used instead of [1,]/[2,]
  sig.pos <- length(summary$pos[summary$pos>=threshold])
  sig.neg <- length(summary$neg[summary$neg>=threshold])
  significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg)
  return(significant)
}
results <- do.call(rbind, lapply(1:10000, function(i) flip(threshold)))
results <- as.data.frame(results)
Can anyone tell me what I'm running in python that is slowing the process down? Thank you.
Why don't you generate the whole big set at once?
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                 names=('period', 'set'))
df = pd.DataFrame(data=np.random.choice([1, -1], (10000*25, 48)), index=idx)
Took about 120ms on my computer. And then the other operations:
positives = df.eq(1).sum(level=0).gt(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).sum(level=0).gt(17).sum(axis=1).to_frame(name='negatives')
all_periods = pd.concat( (positives, negatives), axis=1 )
all_periods['total'] = all_periods.sum(1)
take about 600ms extra.
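One caveat (my addition, not part of the original answer): DataFrame.sum(level=0) has since been removed from pandas, so on current versions the equivalent groupby form would be, assuming the same df and a threshold of 17:
# Count the +1 / -1 tosses per period (MultiIndex level 0), then count how many
# of the 48 columns exceed the threshold in each period
positives = df.eq(1).groupby(level=0).sum().gt(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).groupby(level=0).sum().gt(17).sum(axis=1).to_frame(name='negatives')
all_periods = pd.concat((positives, negatives), axis=1)
all_periods['total'] = all_periods.sum(axis=1)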
