I am attempting to translate a MATLAB function to Python from Timothy Sauer, Numerical Analysis, Second Edition, page 546, Program 12.8. The original function receives a square matrix and returns a similar matrix, with the same eigenvalues, in upper Hessenberg form. It builds Householder reflectors to produce zeros below the subdiagonal and applies them as similarity transformations to the original matrix to bring it to upper Hessenberg form.
My Python translation obtains the correct eigenvalues for 3x3 matrices but not for 4x4 matrices. Would anyone know the cause of the error? I pasted my code with the succeeding and failing cases below. Thank you.
import numpy as np
import math
norm = lambda v:math.sqrt(np.sum(v**2))
def upper_hessenberg(A):
'''
Translated from Timothy Sauer, Numerical Analysis Second Edition, page 546, Program 12.8
Input: Square Matrix, A
Output: B, a Similar Matrix with Same Eigenvalues as A except in Upper Hessenberg form
V, a matrix containing the reflectors used to produce zeros in the off diagonals
'''
rows, columns = A.shape
    B = A[:,:].astype(float) #will store the similar matrix
V = np.zeros(shape=(rows,columns),dtype=float) #will store the reflectors
for column in range(columns-2): #start from the 1st column end at the third to last column
row = column
x = B[row+1: ,column] #decapitate the column
reflection_of_x = np.zeros(len(x)) #first entry is the norm, followed by 0s
        if abs(norm(x)) <= np.finfo(float).eps: #if there are already 0s in the off-diagonals, skip this column
continue
reflection_of_x[0] = norm(x)
v = reflection_of_x - x # v, (the difference vector) represents the line connecting the original column to the reflection of the column (see Timothy Sauer Num Analysis 2nd Edition Figure 4.11 Householder reflector)
v = v/norm(v) #normalize to length of 1 (unit vector)
V[:len(v), column] = v #save the reflector in an upper triangular matrix called V
        #verify with x - 2*(x @ v * v), which should equal a vector of all zeros except the leading entry
        column_projections = np.outer(v, v @ B[row+1:, column:]) #project each column onto the difference vector
        B[row+1:, column:] = B[row+1:, column:] - (2 * column_projections)
        row_projections = np.outer(v, B[row:, column + 1:] @ v).T #project each row onto the difference vector
        B[row:, column + 1:] = B[row:, column + 1:] - (2 * row_projections)
return V, B
# Algorithm succeeds only with 3x3 matrices
eigvectors = np.array([
[1,3,2],
[4,5,6],
[7,8,9],
])
eigvalues = np.array([
[4,0,0],
[0,3,0],
[0,0,2]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 3x3 matrices, The function successfully produces these eigvals",np.linalg.eigvals(B))
#But with 4x4 matrices it fails
eigvectors = np.array([
[1,3,2,4],
[4,5,6,2],
[7,8,9,5],
[5,2,7,8]
])
eigvalues = np.array([
[4,0,0,0],
[0,3,0,0],
[0,0,2,0],
[0,0,0,1]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 4x4 matrices, The function fails to obtain correct eigvals",np.linalg.eigvals(B))
Your error is that you try to be too efficient. While the last rows are indeed increasingly reduced to leading zeros, this is not the case for the last columns. So in row_projections you need to remove the row: limiter and change B[row:, column + 1:] to B[:, column + 1:].
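For reference, a minimal sketch of that change; only these two lines differ from the code in the question (@ is Python's matrix-multiplication operator):

row_projections = np.outer(v, B[:, column + 1:] @ v).T #project every row, not only the rows from `row` down
B[:, column + 1:] = B[:, column + 1:] - (2 * row_projections)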
You are using the unstable variant of the "improved" Householder reflector. The older version would use the larger of x_refl - x and x_refl + x by setting reflection_of_x[0] = -np.sign(x[0])*norm(x) (or remove all minus signs there).
The stable variant of the improved reflector would use the binomial trick in the normalization of x_refl - x if this difference becomes too small.
x_refl - x = [ norm(x) - x[0], -x[1:] ]
           = [ norm(x[1:])^2 / (norm(x) + x[0]), -x[1:] ]

(x_refl - x) / norm(x_refl - x)
    = [ norm(x[1:]), -(norm(x) + x[0]) * x[1:]/norm(x[1:]) ] / sqrt(2*norm(x)*(norm(x) + x[0]))
While the parts may have wildly different scales, no catastrophic cancellation happens for x[0]>0.
See the discussion of the same algorithm in Golub/van Loan, 4th ed., for further details and opinions, and the code from that book.
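For illustration, a small sketch of the sign-choice variant described above (my own helper, not code from the book or the question):

import numpy as np

def householder_unit_vector(x):
    # Reflect x onto the e1 axis; choosing the reflection on the opposite side of
    # x[0] keeps reflection_of_x - x large, so the normalization stays stable.
    reflection_of_x = np.zeros_like(x, dtype=float)
    reflection_of_x[0] = -np.copysign(np.linalg.norm(x), x[0])
    v = reflection_of_x - x
    return v / np.linalg.norm(v)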
I am trying to put together a numerical simulation (specifically, of beta cell dynamics) based on Bertram et al. 2007 (https://www.sciencedirect.com/science/article/pii/S0006349507709621). The model itself works fine, but it is very slow, since the simulation step must be around 0.1 ms and Python is not the fastest language around. It takes approximately 12 real seconds for every simulated second with only 15 coupled beta cells in the system. In the end I will need around 1000 beta cells to simulate an entire islet of Langerhans, so you can see why I need to speed things up.
Each beta cell is implemented as a class instance which inherits from the CellParameters and ModelParameters class.
@jitclass(spec)
class BetaCell:
def __init__(self, cell_num: int, neighbours: list, G: float):
        ##sets the initial conditions (23 parameters - floats and lists).
def w_ijkl(self, ii, jj, kk, ll, f6p):
###calculates and returns a specific parameter
def run_model_step(self, Ge: float):
###runs one time step (dt=0.1 ms) for the cell.
###has to calculate/update around 55 parameters
class ModelParameters:
###Contains all model parameters
###time step, the intensity of glucose stimulation, the start of stimulation etc.
###also contains when to save a time step for later visualization
    @staticmethod
def external_glucose(time):
###calculates and returns the current level of external glucose
###uses a simple equation
class CellParameters:
    ###Contains approx. 70 parameters (floats) that the model needs for execution.
###Some of these parameters are changed (once) after initialization
###to introduce some cell heterogeneity
The simulation looks like this:
some data is imported with cell parameters (locations, coupling, coupling weights).
each cell is initialized with its cell number (0, 1, 2, 3...), neighbours and starting glucose
Cells are stored into a list named "cells".
if required, heterogeneity is introduced into cellular parameters
each step of the simulation is executed
Simulation step execution:
def run_step(cell):
cell.run_model_step(glc)
if __name__ == '__main__':
for step, current_time in enumerate(time):
###time array is pre-calculated based on provided end_time and simulation step (dt)
glc = ModelParameters.external_glucose(current_time)
        cells = calculate_gj_coupling(cells) #calculates gap-junction coupling between connected cells
cells = list(map(run_step, cells))
The above for-loop is repeated until the end of the simulation is reached. Of course this is a slow process, taking around 10-12 seconds for each simulated second (10,000 loop iterations at 0.1 ms per step).
I really need to speed things up, preferably around 10-fold or more would be great.
So far I have tried to use the Pool class from the multiprocessing module.
I created a pool: pool = Pool(processes=NUMBER_OF_WORKERS)
I used the pool's map function to run a simulation step for each cell:
pool = Pool(processes=NUMBER_OF_WORKERS)
.
.
.
for step, current_time in enumerate(time):
###time array is pre-calculated based on provided end_time and simulation step (dt)
glc = ModelParameters.external_glucose(current_time)
    cells = calculate_gj_coupling(cells) #calculates gap-junction coupling between connected cells
cells = pool.map(run_step, cells)
pool.terminate()
The rest is the same as before, because the slow part is the calculation of individual time steps for every beta cell.
The problem with the above solution is that it makes things worse. I am guessing that the shuffling of the class instances around in memory for each process is the culprit, because the same solution worked wonders for a simplified version of the problem (below):
import time
from random import randint
from multiprocessing import Pool

def task_function(dummy_object):
dummy_object.sum_ab()
return dummy_object
class DummyObject:
def __init__(self, a, b):
self.a = a
self.b = b
self.ab = 0.0
def sum_ab(self):
time.sleep(2) #simulates long running task
self.ab += self.a + self.b
if __name__ == '__main__':
pool = Pool(processes=NUMBER_OF_WORKERS)
    cells = [DummyObject(randint(1,20), randint(1,20)) for i in range(NUMBER_OF_CELLS)]
for i in range(NUMBER_OF_STEPS):
pool.map(task_function, cells)
pool.terminate()
The above simple example speeds things up quite a bit. If sequential execution is implemented (the standard way), the "simulation" takes 400 seconds with NUMBER_OF_CELLS=200 for one iteration of the for-loop (each cell takes 2 seconds * 200 = 400 s). If I implement the above solution, one iteration of the for-loop takes only 8 seconds with NUMBER_OF_CELLS=200 and NUMBER_OF_WORKERS=60. But these DummyObjects are of course very small and simple, so the shuffling around in memory goes quickly.
Any ideas to implement some version of the above dummy solution would be greatly appreciated.
EDIT 16. 2. 2023
Thanks to Fanchen Bao I have found the remaining bottleneck in my code. It is the coupling function that calculates the coupling currents between connected cells.
The coupling function looks like this:
@jit(nopython=True)
def calculate_gj_coupling(cells, cells_neighbours):
for i, cell in enumerate(cells):
ca_current = 0.0
voltage_current = 0.0
g6p_current = 0.0
adp_current = 0.0
for neighbour, weight in cells_neighbours[i]:
voltage_current += (cell.Cgjv*weight)*(cells[neighbour].V-cell.V)
ca_current += (cell.Cgjca*weight)*(cells[neighbour].C-cell.C)
g6p_current += (cell.Cgjg6p*weight)*(0.3*cells[neighbour].G6P-0.3*cell.G6P)
adp_current += (cell.Cgjadp*weight)*(cells[neighbour].ADPm - cell.ADPm)
cell.couplingV = voltage_current
cell.couplingCa = ca_current
cell.couplingG6P = g6p_current
cell.couplingADP = adp_current
return cells
It is basically a nested for-loop because each connection between two cells is weighted (weight parameter).
What would be a more pythonic (and faster) way of writing this up? Keep in mind that this function runs in every simulation step.
EDIT 18. 2. 2023
I rewrote the BetaCell class. It now contains all cell parameters (instead of inheriting from the CellParameters class), and all necessary model parameters are provided at initialization (dt, save_step). This allowed me to add the Numba jitclass decorator with the corresponding specification. It threw an error before, because there appears to be a problem with inheritance during compilation, I guess. I also use the Numba List() class instead of the Python built-in list.
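For readers who want to go the same route, here is a minimal jitclass skeleton; the field names, the spec and the dynamics are purely illustrative placeholders, not the real model:

from numba import float64, int64
from numba.experimental import jitclass
from numba.typed import List

spec = [
    ('cell_num', int64),
    ('dt', float64),
    ('V', float64),   # placeholder state variable
]

@jitclass(spec)
class TinyBetaCell:
    def __init__(self, cell_num, dt):
        self.cell_num = cell_num
        self.dt = dt
        self.V = -60.0

    def run_model_step(self, Ge):
        # placeholder dynamics; the real model updates ~55 quantities here
        self.V += self.dt * 0.01 * (Ge - self.V)

cells = List()
for i in range(15):
    cells.append(TinyBetaCell(i, 0.1))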
This is not strictly a solution, but a suggestion on how the coupling function could be optimized.
Caveat
Without source data, the pseudocode below is not tested. The implementation may or may not function correctly, but the core idea should be sound.
Core Idea
Look at this code snippet in particular
for neighbour, weight in cells_neighbours[i]:
voltage_current += (cell.Cgjv*weight)*(cells[neighbour].V-cell.V)
If we write neighboring cells' weight as w1, w2, ..., wm, neighboring cells' voltage as V1, V2, ..., Vm, then
voltage_current
= cell.Cgjv * w1 * (V1 - cell.V) + cell.Cgjv * w2 * (V2 - cell.V) + ... + cell.Cgjv * wm * (Vm - cell.V)
= cell.Cgjv * (w1 * V1 + w2 * V2 + ... + wm * Vm - cell.V * (w1 + w2 + ... + wm))
= cell.Cgjv * (np.dot(W, V) - cell.V * np.sum(W))
Here W and V refers to vectors [w1, w2, ..., wm] and [V1, V2, ..., Vm] (represented in Python as numpy arrays). We can leverage numpy's vectorization to speed up the inner loop.
The same logic applies to the other three computations. This optimization requires that all V, C, G6P and ADPm values be stored in their own numpy arrays outside the cell objects. We are basically stripping down the OOP to favor speed. It might make the code base a bit harder to maintain, but it will surely give a performance boost.
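As a quick sanity check of that identity (toy numbers, not from the model), the explicit loop and the dot-product form agree:

import numpy as np

Cgjv, cell_V = 0.5, -60.0              # made-up coupling constant and voltage
W = np.array([0.2, 0.5, 0.3])          # weights of three neighbours
V = np.array([-55.0, -62.0, -58.0])    # voltages of the same neighbours

loop_result = sum(Cgjv * w * (v - cell_V) for w, v in zip(W, V))
vectorized = Cgjv * (np.dot(W, V) - cell_V * np.sum(W))
assert np.isclose(loop_result, vectorized)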
Optimized Pseudocode
import numpy as np
from typing import Dict, List
def calculate_gj_coupling(
cells: List[Cell],
cells_neighbours_indices: np.ndarray,
cells_neighbours_weights: np.ndarray,
cells_attrb: Dict[str, np.ndarray],
) -> None:
"""Calculate GJ coupling.
    Notice that we don't have to return anything, because the updates to cells
    are done in place.
:param cells: a list of cells.
:type cells: List[Cell]
:param cells_neighbours_indices: index of the outer list is the index to
each cell in cells. Value of the inner list is the index of the
neighboring cell.
e.g. given cells_neighbours_indices = [
[1, 2, 3],
[0, 3],
[0, 3],
[0, 1, 2],
]
We say
cells[0] is neighbouring with cells[1], cells[2], and cells[3].
cells[1] is neighbouring cells[0], cells[3]
cells[2] is neighbouring cells[0], cells[3]
cells[3] is neighbouring cells[0], cells[1], cells[2]
:type cells_neighbours_indices: np.ndarray
:param cells_neighbours_weights: index of the outer list is the index to
each cell in cells. Value of the inner list is the weight of the
neighboring cell.
e.g. given cells_neighbours_weights = [
[1, 2, 3],
[4, 5],
[6, 7],
[8, 9, 10],
]
We say
cells[0]'s three neighbors have weights [1, 2, 3]
cells[1]'s two neighbors have weights [4, 5]
        cells[2]'s two neighbors have weights [6, 7]
cells[3]'s three neighbors have weights [8, 9, 10]
:type cells_neighbours_weights: np.ndarray
    :param cells_attrb: a central place to store the attributes of all cells.
It has the following shape
{
'V': np.array([v0, v1, ..., vn]), # all cells' V
'C': np.array([c0, c1, ..., cn]), # all cells' C
'G6P': np.array([g0, g1, ..., gn]), # all cells' G6P
'ADPm': np.array([a0, a1, ..., an]), # all cells' ADPm
}
"""
for i, cell in enumerate(cells):
idx = cells_neighbours_indices[i]
ws = cells_neighbours_weights[i]
sum_w = np.sum(ws)
cell.couplingV = cell.Cgjv * (np.dot(ws, cells_attrb['V'][idx]) - cell.V * sum_w)
cell.couplingCa = cell.Cgjca * (np.dot(ws, cells_attrb['C'][idx]) - cell.C * sum_w)
cell.couplingG6P = cell.Cgjg6p * 0.3 * (np.dot(ws, cells_attrb['G6P'][idx]) - cell.G6P * sum_w)
cell.couplingADP = cell.Cgjadp * (np.dot(ws, cells_attrb['ADPm'][idx]) - cell.ADPm * sum_w)
Can We Do Even Better?
Notice that the optimized pseudocode above only deals with the inner loop. The outer loop itself does not do much besides iterating over the cells, so is it possible to vectorize the outer loop as well? Let's take cell.couplingV as an example. What we want calculate_gj_coupling to accomplish is the following:
couplingV0 = Cgjv0 * (np.dot(nei_W0, nei_V0) - V0 * np.sum(W0))
couplingV1 = Cgjv1 * (np.dot(nei_W1, nei_V1) - V1 * np.sum(W1))
.
.
.
couplingVn = Cgjvn * (np.dot(nei_Wn, nei_Vn) - Vn * np.sum(Wn))
where couplingV0, couplingV1, ..., couplingVn are the couplingV values of each cell, Cgjv0, Cgjv1, ..., Cgjvn are the Cgjv values of each cell, and V0, V1, ..., Vn are V values of each cell.
nei_W0, nei_W1, ..., nei_Wn are a list of vectors, each vector being a list of weights of a cell's neighboring cells. Similarly, nei_V0, nei_V1, ..., nei_Vn are a list of vectors, each vector being a list of V values of a cell's neighboring cells.
We can rewrite that as

couplingV = Cgjv * ([nei_W0 . nei_V0, nei_W1 . nei_V1, ..., nei_Wn . nei_Vn] - V * [sum(nei_W0), sum(nei_W1), ..., sum(nei_Wn)])

where . is the dot product of vectors, * is element-wise multiplication, and couplingV, Cgjv and V here denote the vectors of all cells' values.
This equation tells us that if we use a vector to represent all cells' couplingV, we are able to vectorize the entire process of calculate_gj_coupling.
One more thing that needs addressing is the handling of nei_W and nei_V. We can turn them into matrices W_mat and V_mat. For example, W_mat[i][j] represents the weight of cells[j] when it neighbors cells[i]; if they do not neighbor each other, set W_mat[i][j] to zero.
Below is another piece of pseudocode to implement the full vectorization idea. Note that we strip down OOP even more. Also, W_mat, V_mat, C_mat, G6P_mat, and ADPm_mat each contains repeated data and is sparse. We are essentially sacrificing space for better time performance.
import numpy as np
from typing import Dict
def calculate_gj_coupling(
W_mat: np.ndarray,
V_mat: np.ndarray,
C_mat: np.ndarray,
G6P_mat: np.ndarray,
ADPm_mat: np.ndarray,
cells_attrb: Dict[str, np.ndarray],
) -> Dict[str, np.ndarray]:
"""_summary_
:param W_mat: W_mat is a matrix of ALL weights of cells neighboring all
other cells. e.g. W_mat[i][j] is the weight of cells[j] when it is
neighboring cells[i]. If cells[i] and cells[j] do not neighbor,
set W_mat[i][j] = 0. It has shape (n, n), where n is the total number
of cells.
:type W_mat: np.ndarray
:param V_mat: V_mat is a matrix of ALL Vs of cells neighboring all other
cells. e.g. V_mat[i][j] is the V of cells[j] when it is neighboring
cells[i]. If cells[i] and cells[j] do not neighbor, set V_mat[i][j] = 0
It has shape (n, n), where n is the total number of cells.
:type V_mat: np.ndarray
:param C_mat: C_mat is a matrix of ALL Cs of cells neighboring all other
cells. e.g. C_mat[i][j] is the C of cells[j] when it is neighboring
cells[i]. If cells[i] and cells[j] do not neighbor, set C_mat[i][j] = 0
It has shape (n, n), where n is the total number of cells.
:type C_mat: np.ndarray
:param G6P_mat: G6P_mat is a matrix of ALL G6Ps of cells neighboring all
other cells. e.g. G6P_mat[i][j] is the G6P of cells[j] when it is
neighboring cells[i]. If cells[i] and cells[j] do not neighbor, set
G6P_mat[i][j] = 0. It has shape (n, n), where n is the total number of
cells.
:type G6P_mat: np.ndarray
:param ADPm_mat: ADPm_mat is a matrix of ALL ADPms of cells neighboring all
other cells. e.g. ADPm_mat[i][j] is the ADPm of cells[j] when it is
neighboring cells[i]. If cells[i] and cells[j] do not neighbor, set
ADPm_mat[i][j] = 0. It has shape (n, n), where n is the total number of
cells.
:type ADPm_mat: np.ndarray
    :param cells_attrb: a central place to store the attributes of all cells.
It has the following shape
{
'V': np.array([v0, v1, ..., vn]), # all cells' V
'C': np.array([c0, c1, ..., cn]), # all cells' C
'G6P': np.array([g0, g1, ..., gn]), # all cells' G6P
'ADPm': np.array([a0, a1, ..., an]), # all cells' ADPm
'Cgjv': np.array([cv0, cv1, ..., cvn]), # all cells' Cgjv
'Cgjca': np.array([cca0, cca1, ..., ccan]), # all cells' Cgjca
'Cgjg6p': np.array([cg6p0, cg6p1, ..., cg6pn]), # all cells' Cgjg6p
'Cgjadp': np.array([cadp0, cadp1, ..., cadpn]), # all cells' Cgjadp
}
:return: a dictionary of the following shape
{
'couplingV': np.ndarray,
'couplingCa': np.ndarray,
'couplingG6P': np.ndarray,
'couplingADP': np.ndarray,
}
Each array has length n, recording the coupling values of each cell.
:rtype: Dict[str, np.ndarray]
"""
sum_w = np.sum(W_mat, axis=1) # a vector of weight sums for each cell
    # `*` is element-wise multiplication, `@` is matrix multiplication;
    # np.diag(W_mat @ V_mat.T)[i] is sum_j W_mat[i, j] * V_mat[i, j], i.e. the
    # weighted sum of cell i's neighbours' values
    return {
        'couplingV': cells_attrb['Cgjv'] * (np.diag(W_mat @ V_mat.T) - cells_attrb['V'] * sum_w),
        'couplingCa': cells_attrb['Cgjca'] * (np.diag(W_mat @ C_mat.T) - cells_attrb['C'] * sum_w),
        'couplingG6P': cells_attrb['Cgjg6p'] * 0.3 * (np.diag(W_mat @ G6P_mat.T) - cells_attrb['G6P'] * sum_w),
        'couplingADP': cells_attrb['Cgjadp'] * (np.diag(W_mat @ ADPm_mat.T) - cells_attrb['ADPm'] * sum_w),
}
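As a usage note, the weight and attribute matrices can be rebuilt from the neighbour index/weight lists whenever the attributes change; a rough sketch (the helper name and argument layout are my assumptions, chosen to match the docstrings above):

import numpy as np

def build_neighbour_matrices(n, neighbour_indices, neighbour_weights, values):
    # values is the full attribute array for all cells, e.g. cells_attrb['V']
    # W_mat[i, j] = weight of cells[j] as a neighbour of cells[i] (0 if not neighbours)
    # val_mat[i, j] = attribute value of cells[j] as a neighbour of cells[i]
    W_mat = np.zeros((n, n))
    val_mat = np.zeros((n, n))
    for i in range(n):
        idx = neighbour_indices[i]
        W_mat[i, idx] = neighbour_weights[i]
        val_mat[i, idx] = values[idx]
    return W_mat, val_mat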
I have loaded a huge image as a NumPy array of dimensions H x W x 3. I want to split this single image into a 15 x 15 grid and transform it into a 225 x H/15 x W/15 x 3 NumPy array, where the ordering happens either row-wise or column-wise. Note that H and W are perfect multiples of 15.
I know that this can be done using two for loops as shown below,
count = -1
for row in range(15):
    for col in range(15):
        count += 1
        h1, h2 = row * (H // 15), (row + 1) * (H // 15)
        w1, w2 = col * (W // 15), (col + 1) * (W // 15)
        subimage[count, :, :, :] = img[h1:h2, w1:w2, :]
but this takes time (I have to repeat this process for 100,000 images which are very huge).
Is there a faster NumPy code to re-organize a single image into 225 sub-images as illustrated above?
It looks like most of the time is spent copying the hugeimage array values into the subimages array. The only solution I've found to speed up your process is to return the subimages as a list of subarray references instead of a numpy array. This speeds up the subimage creation a lot but has two drawbacks:
You'll need to adapt the following code to the new format.
The elements of the list are references into hugeimage, so modifying subimageslist2[i] will also alter the hugeimage array values.
Here is a small script that compares your version and the list version:
import numpy as np
import time
# Preparation of testdata
R, C = 15, 15
H, W, D = 400*R, 400*C, 3
hugeimage = np.random.randint(0,255,(H,W,D))
# For loop version
t_start = time.time()
subimages = np.zeros((R*C,H//R,W//C,D),dtype='int')
count = -1
for row in range(R):
for col in range(C):
count+=1
h1, h2, w1, w2 = row*(H//R), (row+1)*(H//R), col*(W//C), (col+1)*(W//C)
subimages[count,:,:,:] = hugeimage[h1:h2, w1:w2, :]
print(f'Timer 1: {time.time()-t_start}s')
# For loop list (no copy)
t_start = time.time()
subimageslist2 = []
for row in range(R):
for col in range(C):
h1, h2, w1, w2 = row*(H//R), (row+1)*(H//R), col*(W//C), (col+1)*(W//C)
subimageslist2.append(hugeimage[h1:h2, w1:w2, :])
print(f'Timer 2: {time.time()-t_start}s')
subimages2 = np.array(subimageslist2)
print(f'Timer 2 bis: {time.time()-t_start}s')
print('Results 1&2 are equal' if np.linalg.norm(subimages-subimages2)==0 else 'Results 1&2 differ')
Output:
% python3 script.py
Timer 1: 0.38389086723327637s
Timer 2: 0.0003371238708496094s
Timer 2 bis: 0.3779451847076416s
Results 1&2 are equal
As you can see, adapting your code to work with the list subimageslist2 speeds up this portion of code. You can then run subimages2 = np.array(subimageslist2) to transform the list of subarray references to a numpy array but this will perform a copy and you'll lose the performance improvement (Timer 2 bis).
I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a NumPy ndarray, so that similar colors are grouped together while the diagonal (the yellow line) keeps its position, and the labels are reordered as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(os.path.join(path, f), encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to the other elements, i.e. have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose, from the remaining elements, the one which is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
sm = np.random.random((n, n))
sm = (sm + sm.T) / 2
sm[range(n), range(n)] = 1
return sm
def argsort_sim_mat(sm):
idx = [np.argmax(np.sum(sm, axis=1))] # a
for i in range(1, len(sm)):
sm_i = sm[idx[-1]].copy()
sm_i[idx] = -1
idx.append(np.argmax(sm_i)) # b
return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
_ax.set_xticks(ti)
_ax.set_yticks(ti)
_ax.set_xticklabels(la)
_ax.set_yticklabels(la)
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see the first image). The method works, but is not perfect (see the second image).
Because I use the similarity between two elements as an approximation of their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements, one could calculate a second-order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in: if two rows / columns have similar colors, they should be close to each other. The algorithm to sort the matrix is the same as before.
def add_cluster(sm, c=3):
idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
for ic in idx_cluster:
cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise
def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/len(sm))
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image)
but I guess there exist cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and did also link the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second more verbose explanation.
You have n elements and a n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is its similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore those elements by setting their values to -1.
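As a tiny worked example (toy numbers), here is the procedure traced on a 3 x 3 similarity matrix:

import numpy as np

sm = np.array([[1.0, 0.2, 0.9],
               [0.2, 1.0, 0.4],
               [0.9, 0.4, 1.0]])

idx = [int(np.argmax(np.sum(sm, axis=1)))]   # row sums are [2.1, 1.6, 2.3] -> start with element 2
for _ in range(1, len(sm)):
    row = sm[idx[-1]].copy()
    row[idx] = -1                            # ignore elements that are already placed
    idx.append(int(np.argmax(row)))
print(idx)                                   # [2, 0, 1]: 2 and 0 are the most similar pair (0.9)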
You can use the function ticks to reorder the labels:
labels = np.array(labels) # make labels an numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
@scleronomic's solution is very elegant, but it also has one shortcoming: we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables in which some of them are weakly correlated:
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
mask = belongs_to_cluster == ind
# weakening the correlation
if ind % 2 == 0:variables[mask] += latent[ind]*0.1
variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, the variables within clusters that have an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks we need to find blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat:
def sort_corr_mat(corr_mat,clusters_guess):
def _swap_rows(corr_mat, var1, var2):
rs = corr_mat.copy()
rs[var2, :],rs[var1, :]= corr_mat[var1, :],corr_mat[var2, :]
cs = rs.copy()
cs[:, var2],cs[:, var1] = rs[:, var1],rs[:, var2]
return cs
# analysis
max_iter = 500
best_score,current_score,best_count = -1e8,-1e8,0
num_minimua_to_visit = 20
best_corr = corr_mat
best_ordering = np.arange(n_variables)
for i in range(max_iter):
for row1 in range(n_variables):
for row2 in range(n_variables):
if row1 == row2: continue
option_ordering = best_ordering.copy()
option_ordering[row1],option_ordering[row2] = best_ordering[row2],best_ordering[row1]
option_corr = _swap_rows(best_corr,row1,row2)
option_score = score(option_corr,n_variables,clusters_guess)
if option_score > best_score:
best_corr = option_corr
best_ordering = option_ordering
best_score = option_score
if best_score > current_score:
best_count += 1
current_corr = best_corr
current_ordering = best_ordering
current_score = best_score
if best_count >= num_minimua_to_visit:
return best_corr#,best_ordering
return best_corr#,best_ordering
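The score function called inside sort_corr_mat is not shown in the post; a plausible minimal version, assuming it simply rewards strong correlations inside clusters_guess equally sized diagonal blocks, could look like this:

import numpy as np

def score(corr_mat, n_variables, clusters_guess):
    # sum of absolute correlations inside the hypothesised diagonal blocks;
    # a larger value means a stronger block structure along the diagonal
    block = n_variables // clusters_guess
    total = 0.0
    for start in range(0, n_variables, block):
        stop = min(start + block, n_variables)
        total += np.abs(corr_mat[start:stop, start:stop]).sum()
    return total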
With this function and the corr_mat constructed above, I compared the result obtained with my function (on the right) with that obtained with @scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, @scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.
I have a model which is defined as:
m(x,z) = C1*x^2*sin(z)+C2*x^3*cos(z)
I have multiple data sets for different z (z=1, z=2, z=3), each of which gives me m(x,z) as a function of x.
The parameters C1 and C2 have to be the same for all z values.
So I have to fit my model to the three data sets simultaneously otherwise I will have different values of C1 and C2 for different values of z.
Is this possible to do with scipy.optimize?
I can do it for just one value of z, but can't figure out how to do it for all z's.
For one z I just write this:
def my_function(x,C1,C1):
z=1
return C1*x**2*np.sin(z)+ C2*x**3*np.cos(z)
data = 'some/path/for/data/z=1'
x= data[:,0]
y= data[:,1]
from lmfit import Model
gmodel = Model(my_function)
result = gmodel.fit(y, x=x, C1=1.1)
print(result.fit_report())
How can I do it for multiple set of datas (i.e different z values?)
So what you want to do is a multi-dimensional fit (2-D in your case) to your data; that way, for the entire data set, you get a single set of C parameters that best describes it. I think the best way to do this is using scipy.optimize.curve_fit().
So your code would look something like this:
import scipy.optimize as optimize
import numpy as np
def my_function(xz, *par):
""" Here xz is a 2D array, so in the form [x, z] using your variables, and *par is an array of arguments (C1, C2) in your case """
x = xz[:,0]
z = xz[:,1]
return par[0] * x**2 * np.sin(z) + par[1] * x**3 * np.cos(z)
# generate fake data. You will presumable have this already
x = np.linspace(0, 10, 100)
z = np.linspace(0, 3, 100)
xx, zz = np.meshgrid(x, z)
xz = np.array([xx.flatten(), zz.flatten()]).T
fakeDataCoefficients = [4, 6.5]
fakeData = my_function(xz, *fakeDataCoefficients) + np.random.uniform(-0.5, 0.5, xx.size)
# Fit the fake data and return the set of coefficients that jointly fit the x and z
# points (and will hopefully be the same as the fakeDataCoefficients
popt, _ = optimize.curve_fit(my_function, xz, fakeData, p0=fakeDataCoefficients)
# Print the results
print(popt)
When I do this fit I recover (to good precision) the fakeDataCoefficients I used to generate the data, so the fit works well.
So the conclusion is that you don't do 3 fits independently, setting the value of z each time, but instead you do a 2D fit which takes the values of x and z simultaneously to find the best coefficients.
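In your case the data come as separate sets per z value, so you would stack them yourself before calling curve_fit. A sketch reusing my_function and optimize from above (the file names and the two-column [x, m] layout are assumptions):

import numpy as np

datasets = {1.0: 'data_z1.txt', 2.0: 'data_z2.txt', 3.0: 'data_z3.txt'}

xz_list, y_list = [], []
for z_value, fname in datasets.items():
    d = np.loadtxt(fname)                    # columns assumed to be [x, m(x, z)]
    xz_list.append(np.column_stack([d[:, 0], np.full(len(d), z_value)]))
    y_list.append(d[:, 1])

xz = np.vstack(xz_list)                      # shape (N, 2): rows of [x, z]
y = np.concatenate(y_list)
popt, _ = optimize.curve_fit(my_function, xz, y, p0=[1.0, 1.0])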
Your code is incomplete and has a few syntax errors.
But I think that you want to build a model that concatenates the models for the different data sets, and then fit the concatenated data to that model. Within the context of lmfit (disclosure: author and maintainer), I often find it easier to use minimize() and an objective function for multiple data set fits rather than the Model class. Perhaps start with something like this:
import lmfit
import numpy as np
# define the model function for each dataset
def my_function(x, c1, c2, z=1):
    return c1*x**2*np.sin(z) + c2*x**3*np.cos(z)
# Then write an objective function like this
def f2min(params, x, data2d, zlist):
ndata, npts = data2d.shape
residual = 0.0*data2d[:]
for i in range(ndata):
c1 = params['c1_%d' % (i+1)].value
c2 = params['c2_%d' % (i+1)].value
        residual[i,:] = data2d[i,:] - my_function(x, c1, c2, z=zlist[i])
return residual.flatten()
# now build that `data2d`, `zlist` and build the `Parameters`
data2d = []
zlist = []
x = None
for fname in dataset_names:
d = np.loadtxt(fname) # or however you read / generate data
if x is None: x = d[:, 0]
data2d.append(d[:, 1])
zlist.append(z_for_dataset(fname)) # or however ...
data2d = np.array(data2d) # turn list into nd array
ndata, npts = data2d.shape
params = lmfit.Parameters()
for i in range(ndata):
params.add('c1_%d' % (i+1), value=1.0) # give a better starting value!
params.add('c2_%d' % (i+1), value=1.0) # give a better starting value!
# now you're ready to do the fit and print out the results:
result = lmfit.minimize(f2min, params, args=(x, data2d, zlist))
print(lmfit.fit_report(result))
That code is really a sketch and is all untested, but it will hopefully give you a good starting foundation.
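One additional note, not part of the sketch above: since you need C1 and C2 to be identical for all z, you could tie the per-dataset parameters together with lmfit's algebraic constraints instead of fitting them independently, for example:

params = lmfit.Parameters()
params.add('c1_1', value=1.0)
params.add('c2_1', value=1.0)
for i in range(1, ndata):
    params.add('c1_%d' % (i+1), expr='c1_1')  # force c1 to be shared across data sets
    params.add('c2_%d' % (i+1), expr='c2_1')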
This is a financial engineering problem for asset allocation. There are four asset classes: stock, fixed income, a CTA strategy and a relative value strategy. Their returns and covariance matrix are given. For the result, the optimization is expected to allocate more weight to the fixed income asset and less weight to stock, rather than just returning the initial weights.
The covariance matrix (a 4*4 matrix) is as follows (C in the code below):
sigma = [ [0.019828564,0.002498922,0.003100164,0.001272493],[0.002498922,0.005589884,0.000511829,0.000184773],[0.003100164,0.000511829,0.001559972,0.00019131],[0.001272493,0.000184773,0.00019131,0.0001306]]
sigma_p = np.matrix(sigma)
where 0, 1, 2, 3 are 'stock_idx', 'CTA_idx', 'RelativeValue_idx' and 'bond_idx' respectively.
I am trying to find their optimal weights using the 'risk parity' method, which in the end means solving the equation shown in this image: https://i.imgur.com/9nxx7xU.png
I used scipy.optimize in Python with the method "SLSQP", which is the only method that can apply bounds and constraints during the solve. However, the optimization did not work and always returned the initial guess, no matter how the initial guess was chosen. The code is as follows:
import numpy as np
import scipy.optimize

def calculate_portfolio_var(W,C):
# function that calculates portfolio risk
sigma_p = np.sqrt(np.dot(np.dot(W.T,C),W))
return sigma_p
def calculate_risk_contribution(W,C):
MRC = np.dot(C,W)# Marginal Risk
RC = np.multiply(W,MRC)# Total Risk
return RC
def solve_weight(C,N): #C is the covariance matrix, and given as sigma_p before
def risk_budget_objective(W,C,N):
W = np.matrix(W).T
sig_p = calculate_portfolio_var(W,C) # portfolio sigma
total_RC = calculate_risk_contribution(W,C)
risk_target = sig_p / N
# sum of squared error
J = sum(np.square(total_RC / sig_p - risk_target))
print("SSE",J[0,0])
return J[0,0]
def total_weight_constraint(x):
return np.sum(x)-1.0
def long_only_constraint(x):
return
w0 = [0.1, 0.2, 0.3, 0.4]
w0 = np.matrix(w0).T
print('w0',w0,w0.shape)
b_ = [(0., 1.) for i in range(N)]
c_ = ({'type': 'eq', 'fun': lambda W: np.sum(W) - 1.})
optimized = scipy.optimize.minimize(risk_budget_objective, w0, (C,N), method='SLSQP', constraints=c_, bounds=b_)
if not optimized.success: raise BaseException(optimized.message)
w_rb = np.asmatrix(optimized.x)
return w_rb
It seems to be a numerical precision issue, as the value of the cost function is pretty small. There are two ways to solve this issue: either multiply the cost function by some scalar so that it returns a bigger value, for example J = sum(np.square(total_RC / sig_p - risk_target))*100, or set the tolerance for convergence to a smaller value (the default is 1e-6):
optimized = minimize(risk_budget_objective, w0, (C,N), method='SLSQP', constraints=c_, bounds=b_ , options ={'ftol':1e-8})
The code works as expected after making the changes. Following is the output
matrix([[0.04780104, 0.12432431, 0.19918203, 0.62869262]])
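As a quick check (using the covariance matrix from the question), these weights do give roughly equal risk contributions:

import numpy as np

sigma = np.array([[0.019828564, 0.002498922, 0.003100164, 0.001272493],
                  [0.002498922, 0.005589884, 0.000511829, 0.000184773],
                  [0.003100164, 0.000511829, 0.001559972, 0.00019131],
                  [0.001272493, 0.000184773, 0.00019131, 0.0001306]])
w = np.array([0.04780104, 0.12432431, 0.19918203, 0.62869262])

risk_contributions = w * (sigma @ w)                   # each asset's contribution to portfolio variance
print(risk_contributions / risk_contributions.sum())   # close to [0.25, 0.25, 0.25, 0.25]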