I can't find a way to show the most frequent number in this list
a = [1,2,4,5,6,7,15,16,19,23,24,26,27,28,29,30,31,33,36,37,38,39,40,41,42,43,44,45,47,48,49,50,51,52,56,57,58,60]
b = [1,3,4,5,6,8,9,10,15,16,17,18,20,21,22,24,26,28,29,31,32,33,36,37,38,40,41,43,44,47,48,50,52,53,54,56,57,58,60]
c = [2,3,5,6,8,9,12,13,17,19,20,23,25,26,27,28,29,30,31,33,34,35,36,37,40,44,45,47,48,52,53,54,55,56,57,58,60]
d = [2,5,7,9,11,12,13,14,16,18,20,22,23,26,29,30,33,34,36,38,40,41,42,43,44,46,47,49,50,51,53,56,57,58,60]
list_1 = [a,b]
list_2 = [c,d]
lists = [list_1, list_2]
I have tried the collections library with the most_common() funtion but it does't seem to work. Same happens with numpy arrays.
It would be perfect if I could get the top 10 most common number too.
The reason for the list to be multi-dimensional is for easy comparison between months
Jan_22 = [jan_01, jan_02, jan_03, jan_04]
Fev_22 = [fev_01, fev_02, fev_03, fev_04]
months = [Fev_22, Jan_22]
Each month has 4 data sets, making those lists allows me to compare big chunks of data, Top 10 most common values from 2021, most common number in jan, fev, mar, April, may ,jun. Would make it easier and clear
Thanks
Maybe I don't fully understand your question, but I don't understand why the list needs to be multi-dimensional if you only want to know the frequency of a given value.
import pandas as pd
a = [1,2,4,5,6,7,15,16,19,23,24,26,27,28,29,30,31,33,36,37,38,39,40,41,42,43,44,45,47,48,49,50,51,52,56,57,58,60]
b = [1,3,4,5,6,8,9,10,15,16,17,18,20,21,22,24,26,28,29,31,32,33,36,37,38,40,41,43,44,47,48,50,52,53,54,56,57,58,60]
c = [2,3,5,6,8,9,12,13,17,19,20,23,25,26,27,28,29,30,31,33,34,35,36,37,40,44,45,47,48,52,53,54,55,56,57,58,60]
d = [2,5,7,9,11,12,13,14,16,18,20,22,23,26,29,30,33,34,36,38,40,41,42,43,44,46,47,49,50,51,53,56,57,58,60]
values = pd.Series(a + b + c + d)
print(values.value_counts().head(10))
print(values.value_counts().head(10).index.to_list())
I dont get why are you adding up lists in each step to get a 3D element, you could just use arrays or smth like that, but here is a function that does what you want in a 3d list (returns the x most common elements in your 3d list ,a.k.a list of lists):
import numpy as np
arr = [[1,2,4,5,6,7,15,16,19,23,24,26,27,28,29,30,31,33,36,37,38,39,40,41,42,43,44,45,47,48,49,50,51,52,56,57,58,60],
[1,3,4,5,6,8,9,10,15,16,17,18,20,21,22,24,26,28,29,31,32,33,36,37,38,40,41,43,44,47,48,50,52,53,54,56,57,58,60],
[2,3,5,6,8,9,12,13,17,19,20,23,25,26,27,28,29,30,31,33,34,35,36,37,40,44,45,47,48,52,53,54,55,56,57,58,60],
[2,5,7,9,11,12,13,14,16,18,20,22,23,26,29,30,33,34,36,38,40,41,42,43,44,46,47,49,50,51,53,56,57,58,60]]
def x_most_common(arr, x):
l = [el for l in arr for el in l]
output = list(set([(el, l.count(el)) for el in l]))
output.sort(key= lambda i: i[1], reverse=True)
return output[:x]
# test:
print(x_most_common(arr, 5))
output:
[(56, 4), (36, 4), (47, 4), (58, 4), (5, 4)]
I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a numpynd array, to group colors, while maintaining their relative position (diagonal yellow line), and labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar they should also be similar with respect to other elements ie have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a] and as next element I choose from the remaining elements the one which is closest to the current [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
sm = np.random.random((n, n))
sm = (sm + sm.T) / 2
sm[range(n), range(n)] = 1
return sm
def argsort_sim_mat(sm):
idx = [np.argmax(np.sum(sm, axis=1))] # a
for i in range(1, len(sm)):
sm_i = sm[idx[-1]].copy()
sm_i[idx] = -1
idx.append(np.argmax(sm_i)) # b
return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
_ax.set_xticks(ti)
_ax.set_yticks(ti)
_ax.set_xticklabels(la)
_ax.set_yticklabels(la)
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image) this method works but is not perfect (see second image).
Because I use the similarity between two elements as approximation to their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements one could calculate a second order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in. If two rows / columns have similar colors they should be close to each other. The algorithm to sort the matrix is the same as before
def add_cluster(sm, c=3):
idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
for ic in idx_cluster:
cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise
def get_sim_mat2(sm):
return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image)
but I guess there exist cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and did also link the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second more verbose explanation.
You have n elements and a n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is the similarity to the current element. If element i was added in the last step, I chose the element argmax(sm[i, :]) as next, ignoring the elements already added to the list. I ignore the elements by setting the values of those elements to -1.
You can use the function ticks to reorder the labels:
labels = np.array(labels) # make labels an numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
#scleronomic's solution is very elegant, but it also has one shortage, which is we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables, in which some of them are weakly correlated
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
mask = belongs_to_cluster == ind
# weakening the correlation
if ind % 2 == 0:variables[mask] += latent[ind]*0.1
variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction, however, variables within clusters that has an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, maybe we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks, what we need to find are blocks, where correlations within blocks are high and correlations between blocks are low. However, the solution provided by this accepted answer works best when we know how many blocks are there in the first place, and more importantly, the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat
def sort_corr_mat(corr_mat,clusters_guess):
def _swap_rows(corr_mat, var1, var2):
rs = corr_mat.copy()
rs[var2, :],rs[var1, :]= corr_mat[var1, :],corr_mat[var2, :]
cs = rs.copy()
cs[:, var2],cs[:, var1] = rs[:, var1],rs[:, var2]
return cs
# analysis
max_iter = 500
best_score,current_score,best_count = -1e8,-1e8,0
num_minimua_to_visit = 20
best_corr = corr_mat
best_ordering = np.arange(n_variables)
for i in range(max_iter):
for row1 in range(n_variables):
for row2 in range(n_variables):
if row1 == row2: continue
option_ordering = best_ordering.copy()
option_ordering[row1],option_ordering[row2] = best_ordering[row2],best_ordering[row1]
option_corr = _swap_rows(best_corr,row1,row2)
option_score = score(option_corr,n_variables,clusters_guess)
if option_score > best_score:
best_corr = option_corr
best_ordering = option_ordering
best_score = option_score
if best_score > current_score:
best_count += 1
current_corr = best_corr
current_ordering = best_ordering
current_score = best_score
if best_count >= num_minimua_to_visit:
return best_corr#,best_ordering
return best_corr#,best_ordering
With this function and the corr_mat constructed in the first place, I compared the result obtained with my function (on the right) with that obtained with #scleronomic's solution (in the middle)
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, #scleronomic's solution works much better and faster, but my solution offers more control to the pattern of the output.
I am trying to find conditional mutual information between three discrete random variable using pyitlib package for python with the help of the formula:
I(X;Y|Z)=H(X|Z)+H(Y|Z)-H(X,Y|Z)
The expected Conditional Mutual information value is= 0.011
My 1st code:
import numpy as np
from pyitlib import discrete_random_variable as drv
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.entropy_conditional(X,Z)
##print(a)
b=drv.entropy_conditional(Y,Z)
##print(b)
c=drv.entropy_conditional(X,Y,Z)
##print(c)
p=a+b-c
print(p)
The answer i am getting here is=0.4632245116328402
My 2nd code:
import numpy as np
from pyitlib import discrete_random_variable as drv
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.information_mutual_conditional(X,Y,Z)
print(a)
The answer i am getting here is=0.1583445441575102
While the expected result is=0.011
Can anybody help? I am in big trouble right now. Any kind of help will be appreciable.
Thanks in advance.
I think that the library function entropy_conditional(x,y,z) has some errors. I also test my samples, the same problem happens.
however, the function entropy_conditional with two variables is ok.
So I code my entropy_conditional(x,y,z) as entropy(x,y,z), the results is correct.
the code may be not beautiful.
def gen_dict(x):
dict_z = {}
for key in x:
dict_z[key] = dict_z.get(key, 0) + 1
return dict_z
def entropy(x,y,z):
x = np.array([x,y,z]).T
x = x[x[:,-1].argsort()] # sorted by the last column
w = x[:,-3]
y = x[:,-2]
z = x[:,-1]
# dict_w = gen_dict(w)
# dict_y = gen_dict(y)
dict_z = gen_dict(z)
list_z = [dict_z[i] for i in set(z)]
p_z = np.array(list_z)/sum(list_z)
pos = 0
ent = 0
for i in range(len(list_z)):
w = x[pos:pos+list_z[i],-3]
y = x[pos:pos+list_z[i],-2]
z = x[pos:pos+list_z[i],-1]
pos += list_z[i]
list_wy = np.zeros((len(set(w)),len(set(y))), dtype = float , order ="C")
list_w = list(set(w))
list_y = list(set(y))
for j in range(len(w)):
pos_w = list_w.index(w[j])
pos_y = list_y.index(y[j])
list_wy[pos_w,pos_y] += 1
#print(pos_w)
#print(pos_y)
list_p = list_wy.flatten()
list_p = np.array([k for k in list_p if k>0]/sum(list_p))
ent_t = 0
for j in list_p:
ent_t += -j * math.log2(j)
#print(ent_t)
ent += p_z[i]* ent_t
return ent
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.entropy_conditional(X,Z)
##print(a)
b=drv.entropy_conditional(Y,Z)
c = entropy(X, Y, Z)
p=a+b-c
print(p)
0.15834454415751043
Based on the definitions of conditional entropy, calculating in bits (i.e. base 2) I obtain H(X|Z)=0.784159, H(Y|Z)=0.325011, H(X,Y|Z) = 0.950826. Based on the definition of conditional mutual information you provide above, I obtain I(X;Y|Z)=H(X|Z)+H(Y|Z)-H(X,Y|Z)= 0.158344. Noting that pyitlib uses base 2 by default, drv.information_mutual_conditional(X,Y,Z) appears to be computing the correct result.
Note that your use of drv.entropy_conditional(X,Y,Z) in your first example to compute conditional entropy is incorrect, you can however use drv.entropy_conditional(XY,Z), where XY is a 1D array representing the joint observations about X and Y, for example XY = [2*xy[0] + xy[1] for xy in zip(X,Y)].