Python Matrix Multiplication - Append into empty list - python-3.x

How do I generate random matrices and multiply them efficiently?
This is what I've done:
mat1 = []
for i in range(0, order):
    num1 = random.sample(range(1, 10), order)
    print(num1)
    mat1.append(num1)
print()
print("Result of Matrix Multiplication.")
for p in range(len(mat1)):
    for q in range(len(mat2[0])):
        for r in range(len(mat2)):
            res_matrix[p][q] += mat1[p][r] * mat2[r][q]
for res in res_matrix:
    print(res)

You can use a list comprehension to create res_matrix filled with zeros:
res_matrix = [[0 for i in range(order)] for j in range(order)]
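Putting the pieces together, here is a minimal sketch of the pure-Python version (assuming order is already defined and mat2 is generated the same way as mat1):
import random

order = 3  # assumed for the sketch; use your own order

# generate both matrices with random values from 1 to 9
mat1 = [random.sample(range(1, 10), order) for _ in range(order)]
mat2 = [random.sample(range(1, 10), order) for _ in range(order)]

# result matrix initialized with zeros
res_matrix = [[0 for i in range(order)] for j in range(order)]

for p in range(len(mat1)):
    for q in range(len(mat2[0])):
        for r in range(len(mat2)):
            res_matrix[p][q] += mat1[p][r] * mat2[r][q]

for res in res_matrix:
    print(res)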
Also, have you heard of numpy? It does this kind of computation (and many more) easily and very fast. This is what your code would become with numpy:
import numpy as np
print("Generate 1st Matrix")
mat1 = np.random.randint(1, 10, size=(order, order))
print(mat1)
print("Generate 2nd Matrix")
mat2 = np.random.randint(1, 10, size=(order, order))
print(mat2)
res_matrix = mat1.dot(mat2)
print("Result of Matrix Multiplication.")
print(res_matrix)
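Note that for 2-D arrays, mat1 @ mat2 is equivalent to mat1.dot(mat2) on Python 3.5+.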

Related

Function to Convert Square Matrix to Upper Hessenberg with Similarity Transformations

I am attempting to translate a MATLAB function to Python from Timothy Sauer,
Numerical Analysis Second Edition, page 546, Program 12.8. The original function
receives a square matrix and returns a matrix with the same eigenvalues but in
Upper Hessenberg form. The original function creates Householder reflectors to produce zeros in the
offdiagonals of the matrix and performs similarity transformations on the original matrix to
get it to upper hessenberg form.
My Python translation succeeds only in obtaining the eigenvalues for 3x3 matrices
but not for 4x4 matrices. Would anyone know the cause of the error? I pasted my code with success and failing cases below. Thank you.
import numpy as np
import math

norm = lambda v: math.sqrt(np.sum(v**2))

def upper_hessenberg(A):
    '''
    Translated from Timothy Sauer, Numerical Analysis Second Edition, page 546, Program 12.8
    Input:  Square Matrix, A
    Output: B, a Similar Matrix with Same Eigenvalues as A except in Upper Hessenberg form
            V, a matrix containing the reflectors used to produce zeros in the off diagonals
    '''
    rows, columns = A.shape
    B = A[:, :].astype(float)  # will store the similar matrix
    V = np.zeros(shape=(rows, columns), dtype=float)  # will store the reflectors
    for column in range(columns - 2):  # start from the 1st column, end at the third to last column
        row = column
        x = B[row+1:, column]  # decapitate the column
        reflection_of_x = np.zeros(len(x))  # first entry is the norm, followed by 0s
        if abs(norm(x)) <= np.finfo(float).eps:  # if there are already 0s in the off-diagonals, skip this column
            continue
        reflection_of_x[0] = norm(x)
        v = reflection_of_x - x  # v (the difference vector) represents the line connecting the original column to the reflection of the column (see Timothy Sauer Num Analysis 2nd Edition Figure 4.11 Householder reflector)
        v = v / norm(v)  # normalize to length of 1 (unit vector)
        V[:len(v), column] = v  # save the reflector in an upper triangular matrix called V
        # verify: x - 2*((x @ v) * v) should equal a vector with all zeros except the leading entry
        column_projections = np.outer(v, v @ B[row+1:, column:])  # project each col onto difference vector
        B[row+1:, column:] = B[row+1:, column:] - (2 * column_projections)
        row_projections = np.outer(v, B[row:, column + 1:] @ v).T  # project each row onto difference vector
        B[row:, column + 1:] = B[row:, column + 1:] - (2 * row_projections)
    return V, B
# Algorithm succeeds only with 3x3 matrices
eigvectors = np.array([
[1,3,2],
[4,5,6],
[7,8,9],
])
eigvalues = np.array([
[4,0,0],
[0,3,0],
[0,0,2]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 3x3 matrices, The function successfully produces these eigvals",np.linalg.eigvals(B))
#But with 4x4 matrices it fails
eigvectors = np.array([
[1,3,2,4],
[4,5,6,2],
[7,8,9,5],
[5,2,7,8]
])
eigvalues = np.array([
[4,0,0,0],
[0,3,0,0],
[0,0,2,0],
[0,0,0,1]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 4x4 matrices, The function fails to obtain correct eigvals",np.linalg.eigvals(B))
Your error is that you try to be too efficient. While the last rows are indeed increasingly reduced with leading zeros, this is not the case for the last columns. So in row_projections you need to remove the row: limiter and change it to B[:, column + 1:].
You are using the unstable variant of the "improved" Householder reflector. The older version would use the larger of x_refl - x and x_refl + x by setting reflection_of_x[0] = -np.sign(x[0])*norm(x) (or remove all minus signs there).
The stable variant of the improved reflector would use the binomial trick in the normalization of x_refl - x if this difference becomes too small.
x_refl - x = [ norm(x) - x[0], -x[1:] ]
           = [ norm(x[1:])^2 / (norm(x) + x[0]), -x[1:] ]

(x_refl - x) / norm(x_refl - x)
    = [ norm(x[1:]), -(norm(x) + x[0]) * (x[1:] / norm(x[1:])) ] / sqrt(2 * norm(x) * (norm(x) + x[0]))
While the parts may have wildly different scales, no catastrophic cancellation happens for x[0]>0.
See the discussion about the same algorithm from Golub/van Loan 4th ed. for further details, opinions, and the code from that book.
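Putting both fixes together, here is a minimal sketch of the corrected routine (my reconstruction of the two changes described above, not the book's exact code):
import numpy as np
import math

norm = lambda v: math.sqrt(np.sum(v**2))

def upper_hessenberg_fixed(A):
    rows, columns = A.shape
    B = A[:, :].astype(float)
    V = np.zeros(shape=(rows, columns), dtype=float)
    for column in range(columns - 2):
        row = column
        x = B[row+1:, column]
        if abs(norm(x)) <= np.finfo(float).eps:
            continue
        reflection_of_x = np.zeros(len(x))
        # older, stabler sign choice: reflect away from x[0] (fall back if x[0] == 0)
        reflection_of_x[0] = -np.sign(x[0]) * norm(x) if x[0] != 0 else norm(x)
        v = reflection_of_x - x
        v = v / norm(v)
        V[:len(v), column] = v
        column_projections = np.outer(v, v @ B[row+1:, column:])
        B[row+1:, column:] = B[row+1:, column:] - 2 * column_projections
        # update ALL rows of the trailing columns, not just rows from 'row' on
        row_projections = np.outer(v, B[:, column+1:] @ v).T
        B[:, column+1:] = B[:, column+1:] - 2 * row_projections
    return V, B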

Sort similarity matrix according to plot colors

I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a numpy ndarray, to group colors while maintaining their relative position (the diagonal yellow line) and the labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to other elements, i.e. have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose from the remaining elements the one which is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm

def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))]  # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i))  # b
    return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
    _ax.set_xticks(ti)
    _ax.set_yticks(ti)
    _ax.set_xticklabels(la)
    _ax.set_yticklabels(la)
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image). This method works, but it is not perfect (see second image).
Because I use the similarity between two elements as an approximation of their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements, one could calculate a second-order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in: if two rows / columns have similar colors, they should be close to each other. The algorithm to sort the matrix is the same as before:
def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise

def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image), but I guess there are cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and also linked the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second, more verbose explanation.
You have n elements and an n x n similarity matrix sm, where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is its similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore those elements by setting their values to -1.
You can use the function ticks to reorder the labels:
labels = np.array(labels) # make labels an numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
#scleronomic's solution is very elegant, but it also has one shortcoming: we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables in which some of them are weakly correlated:
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
    mask = belongs_to_cluster == ind
    # weaken the correlation for clusters with an even index
    if ind % 2 == 0:
        variables[mask] += latent[ind] * 0.1
    else:
        variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, variables within clusters that have an even index are only weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks we need to find blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same or at least similar. Therefore, I improved the solution with a new function sort_corr_mat:
def sort_corr_mat(corr_mat, clusters_guess):
    def _swap_rows(corr_mat, var1, var2):
        rs = corr_mat.copy()
        rs[var2, :], rs[var1, :] = corr_mat[var1, :], corr_mat[var2, :]
        cs = rs.copy()
        cs[:, var2], cs[:, var1] = rs[:, var1], rs[:, var2]
        return cs
    # analysis
    max_iter = 500
    best_score, current_score, best_count = -1e8, -1e8, 0
    num_minima_to_visit = 20
    best_corr = corr_mat
    best_ordering = np.arange(n_variables)
    for i in range(max_iter):
        for row1 in range(n_variables):
            for row2 in range(n_variables):
                if row1 == row2: continue
                option_ordering = best_ordering.copy()
                option_ordering[row1], option_ordering[row2] = best_ordering[row2], best_ordering[row1]
                option_corr = _swap_rows(best_corr, row1, row2)
                # score() is the block-scoring helper from the linked answer (see sketch below)
                option_score = score(option_corr, n_variables, clusters_guess)
                if option_score > best_score:
                    best_corr = option_corr
                    best_ordering = option_ordering
                    best_score = option_score
        if best_score > current_score:
            best_count += 1
            current_corr = best_corr
            current_ordering = best_ordering
            current_score = best_score
        if best_count >= num_minima_to_visit:
            return best_corr  # , best_ordering
    return best_corr  # , best_ordering
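The score helper is not reproduced here; a minimal sketch of one plausible block-scoring function (my assumption, not necessarily the author's exact code) rewards high absolute correlation inside clusters_guess equally sized diagonal blocks:
def score(corr_mat, n_variables, clusters_guess):
    # split the variables into clusters_guess contiguous blocks and
    # sum the absolute correlations inside the diagonal blocks
    block_edges = np.linspace(0, n_variables, clusters_guess + 1).astype(int)
    total = 0.0
    for lo, hi in zip(block_edges[:-1], block_edges[1:]):
        total += np.abs(corr_mat[lo:hi, lo:hi]).sum()
    return total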
With this function and the corr_mat constructed above, I compared the result obtained with my function (on the right) with that obtained with #scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, #scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.

Identify similar numbers from several lists

I have 3 lists:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
I want to calculate the average of the most similar numbers. In the example above, r[0], g[1] and b[1] are very similar (approximately 0.61...). How can I identify this kind of pattern?
Brute force using list comprehensions:
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
rg = [ (idx_r, idx_g,r,g) if abs(rr-gg) < 0.001 else None
for idx_r,rr in enumerate(r)
for idx_g, gg in enumerate(g)]
rb = [ (idx_r, idx_b,r,b) if abs(rr-bb) < 0.001 else None
for idx_r,rr in enumerate(r)
for idx_b, bb in enumerate(b)]
gb = [ (idx_g, idx_b,g,b) if abs(gg-bb) < 0.001 else None
for idx_g,gg in enumerate(g)
for idx_b, bb in enumerate(b)]
print(list(filter(None, rg + rb + gb)))
Output:
[(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.300675698437847, 0.612539072191236, 1.18046695352626]),
(0, 1, [0.611695403733703, 0.833193902333201, 1.09120811998494],
[0.00668849762984564, 0.611946522017357, 1.16778502636141]),
(1, 1, [0.300675698437847, 0.612539072191236, 1.18046695352626],
[0.00668849762984564, 0.611946522017357, 1.16778502636141])]
The output is a list of tuples: index into the first list, index into the second list, and the two lists themselves.
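To get the average the question actually asks for, one could look the matched values up by index from each tuple (a small sketch building on rg, rb and gb above):
matches = list(filter(None, rg + rb + gb))
for idx_a, idx_b, list_a, list_b in matches:
    # average the two values of the matched pair
    print(idx_a, idx_b, (list_a[idx_a] + list_b[idx_b]) / 2)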
You are looking to compute the distance between all sets of points. Best way to do this is scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
import numpy as np
r=[0.611695403733703, 0.833193902333201, 1.09120811998494]
g=[0.300675698437847, 0.612539072191236, 1.18046695352626]
b=[0.00668849762984564, 0.611946522017357, 1.16778502636141]
arr = np.array([r,g,b])
# need 2d set of points
arr_flat = arr.ravel()[:, np.newaxis]
# computes distance between every point, pairwise
dists = cdist(arr_flat, arr_flat)
# (1,2) is the same as (2,1), so only consider each pair once
# ie. use upper triangle
dists = np.triu(dists)
# set 0 values to inf so we don't consider them
dists[dists == 0] = np.inf
# get all pairs that are below this threshold level
thold = 0.01
coords = np.nonzero(dists < thold)
labels = 'rgb'
print(f'Pairs of points closer than {thold}:')
for i, j in zip(*coords):
print(labels[i//3] + f'[{i%3}]', labels[j//3] + f'[{j%3}]')
>>> Pairs of points closer than 0.01:
r[0] g[1]
r[0] b[1]
g[1] b[1]
# can easily count the number of points as
np.count_nonzero(dists<thold)
>>> 3
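Since the goal was the average of the most similar numbers, the matched values can be averaged directly (a small sketch reusing arr_flat and coords from above):
# average the two values of each close pair
for i, j in zip(*coords):
    print((arr_flat[i, 0] + arr_flat[j, 0]) / 2)

# or the average over all values that take part in a close pair
close_idx = np.unique(np.concatenate(coords))
print(arr_flat[close_idx, 0].mean())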

Some python3 behavior I am unable to understand

I have used the following code.
from collections import defaultdict
from random import randint, randrange, choice, shuffle

def random_array(low, high, step, size):
    lst = []
    while len(lst) < size:
        nexts = randrange(low, high, step)
        if nexts in lst: continue
        lst.append(nexts)
    return lst

def find_pair_from_two_list(a, b, val):
    b_dict = defaultdict(int)
    for i, v in enumerate(b): b_dict[v] = i
    for v in a:
        if (val - v) in b_dict:
            return v, val - v
    return -1, -1

arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
val1 = choice(arr1)
val2 = choice(arr2)
val = val1 + val2
print(find_pair_from_two_list(arr1, arr2, val))
However, if I change the size value in
arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
up to 99, it works instantly, but if I change either size value to 100 or more it just seems to hang.
I am curious to know why this is happening. I mean, it works well up to 99, but what causes it to hang for even 100?
Why is yours slow:
Using arr1 = random_array(1, 100, 1, 100), your method can take a lot of time to draw the last missing numbers, because you draw new random values over and over and discard them when they are already inside your result list:
while len(lst) < size:
    nexts = randrange(low, high, step)
    if nexts in lst: continue  # discards numbers already in the list
    lst.append(nexts)
return lst
With inputs like this you essentially draw "all" possible numbers until done, and the more values your result already contains, the longer it takes to draw another "fitting" one.
You can even produce endless loops if your range(low, high, step) has fewer total values than your size demands:
(1,100,5,100) # => only 20 in this range with this stepper -> endless loop
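A simple guard against that situation (my addition, not part of the original answer) is to check the size of the range first:
import random

def random_array_checked(low, high, step, size):
    pool = range(low, high, step)
    if size > len(pool):  # more unique values requested than the range can provide
        raise ValueError(f"range({low}, {high}, {step}) only has {len(pool)} values, cannot draw {size}")
    return random.sample(pool, size)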
Possible simplification (not optimal)
You could simplify and speed up the code like this:
import random
def random_array(low, high, step, size):
    poss = list(range(low, high, step))  # this does not contain duplicates
    random.shuffle(poss)                 # shuffle it
    return poss[:size]                   # return size (or all) elements from it

print(random_array(1, 100, 1, 10))
This code will still return if you specify a "wrong" combination, but the resulting list will then be shorter than whatever you specified as size.
Even better
jonsharpe's suggestion to use
random.sample(range(low,high,step),size)
like so:
def ra(low, high, step, size):
    return random.sample(range(low, high, step), size)
Performance test
Performance-wise, random.sample outperforms mine easily for big lists:
import random
import timeit

def random_array(low, high, step, size):
    poss = list(range(low, high, step))
    random.shuffle(poss)
    return poss[:size]

def ra(low, high, step, size):
    return random.sample(range(low, high, step), size)

if __name__ == '__main__':
    # draw 495 unique randoms from range(1, 1000000, 22), repeated 10000 times each
    print(timeit.timeit("ra(1,1000000,22,495)", setup="from __main__ import ra", number=10000))
    print(timeit.timeit("random_array(1,1000000,22,495)", setup="from __main__ import random_array", number=10000))
Output:
1.1825043768664596 # random.sample(...) of range(...)
92.12594874871951 # mine
The reason is probably that I create actual lists from the ranges, while random.sample uses the range objects and their iterators smartly...
Docs:
https://docs.python.org/3.1/library/random.html
https://docs.python.org/3/library/timeit.html

How can I compare two lists of numpy vectors?

I have two lists of numpy vectors and wish to determine whether they represent approximately the same points (but possibly in a different order).
I've found methods such as numpy.testing.assert_allclose but it doesn't allow for possibly different orders. I have also found unittest.TestCase.assertCountEqual but that doesn't work with numpy arrays!
What is my best approach?
import unittest
import numpy as np
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
np.testing.assert_allclose(first, second, atol=2)  # Fails because the orders are different
unittest.TestCase.assertCountEqual(None, first, second) # Fails because numpy comparisons evaluate element-wise; and because it doesn't allow a tolerance
A nice list iteration approach
In [1047]: res = []
In [1048]: for i in first:
...: for j in second:
...: diff = np.abs(i-j)
...: if np.all(diff<2):
...: res.append((i,j))
In [1049]: res
Out[1049]:
[(array([20, 40]), array([ 20.1, 40.5])),
(array([20, 60]), array([ 19.8, 59.7]))]
The length of res is the number of matches.
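If every point is supposed to have exactly one partner, a quick sanity check (assuming no point matches two different partners) is:
assert len(res) == len(first) == len(second)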
Or as list comprehension:
def match(i, j):
    diff = np.abs(i - j)
    return np.all(diff < 2)
In [1051]: [(i,j) for i in first for j in second if match(i,j)]
Out[1051]:
[(array([20, 40]), array([ 20.1, 40.5])),
(array([20, 60]), array([ 19.8, 59.7]))]
or with the existing array test:
[(i,j) for i in first for j in second if np.allclose(i,j, atol=2)]
Here you are :)
( idea based on
Euclidean distance between points in two different Numpy arrays, not within )
import numpy as np
import scipy.spatial

first = [np.array([20, 60]), np.array([20, 40])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]

def pointsProximityCheck(firstListOfPoints, secondListOfPoints, distanceTolerance):
    pointIndex = 0
    maxDistance = 0
    lstIndices = []
    for item in scipy.spatial.distance.cdist(firstListOfPoints, secondListOfPoints):
        currMinDist = min(item)
        if currMinDist > maxDistance:
            maxDistance = currMinDist
        if currMinDist < distanceTolerance:
            pass
        else:
            lstIndices.append(pointIndex)
            # print("point with pointIndex [", pointIndex, "] in the first list outside of Tolerance")
        pointIndex += 1
    return (maxDistance, lstIndices)

maxDistance, lstIndicesOfPointsOutOfTolerance = pointsProximityCheck(first, second, distanceTolerance=0.5)
print("maxDistance:", maxDistance, "indicesOfOutOfTolerancePoints", lstIndicesOfPointsOutOfTolerance)
gives this output with distanceTolerance=0.5:
maxDistance: 0.509901951359 indicesOfOutOfTolerancePoints [1]
but possibly in a different order
This is the key requirement. This problem can be treated as a classic problem in graph theory: finding a perfect matching in an unweighted bipartite graph. The Hungarian algorithm is a classic algorithm for solving it.
Here I implemented one.
import numpy as np

def is_matched(first, second):
    checked = np.empty((len(first),), dtype=bool)
    first_matching = [-1] * len(first)
    second_matching = [-1] * len(second)

    def find(i):
        for j, point in enumerate(second):
            if np.allclose(first[i], point, atol=2):
                if not checked[j]:
                    checked[j] = True
                    if second_matching[j] == -1 or find(second_matching[j]):
                        second_matching[j] = i
                        first_matching[i] = j
                        return True

    def get_max_matching():
        count = 0
        for i in range(len(first)):
            if first_matching[i] == -1:
                checked.fill(False)
                if find(i):
                    count += 1
        return count

    return len(first) == len(second) and get_max_matching() == len(first)
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 40.5])]
print(is_matched(first, second))
# True
first = [np.array([20, 40]), np.array([20, 60])]
second = [np.array([19.8, 59.7]), np.array([20.1, 43.5])]
print(is_matched(first, second))
# False
