Cython: make prange parallelization thread-safe

Cython: make prange parallelization thread-safe - multithreading

Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
double[:] Z = np.zeros(nbins,dtype=np.float64)
int i,j,b
with nogil, parallel(num_threads=num_threads):
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z[b] += Xij*Yij
return np.asarray(Z)
mock data and bins
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU # 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB

Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options of making this atomic or critical. Implementation issues with Cython aside, you would in any case have bad performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memory slices cannot be assigned as private variables. A two dimensional (num_threads, nbins) array works, but for some reason generates very complicated inefficient array index code. This works but is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little bit extra performance by avoiding padding the private parts of the array to 64 byte, which is a typical cache line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
# pad local data to 64 byte avoid false sharing of cache-lines
int nbins_padded = (((nbins - 1) // 8) + 1) * 8
double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
double[:] Z = np.zeros(nbins)
int i,j,b, bb, tid
with nogil, parallel(num_threads=num_threads):
tid = openmp.omp_get_thread_num()
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z_local[tid * nbins_padded + b] += Xij*Yij
for tid in range(num_threads):
for bb in range(nbins):
Z[bb] += Z_local[tid * nbins_padded + bb]
return np.asarray(Z)
This performs quite well on my 4 core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.

You are right that the access to Z is under a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:] Z = np.zeros((num_threads, nbins), dtype=np.float64) and perform a sum along axis 0 after the prange loop.
return np.sum(Z, axis=0)
Cython code can have a with gil statement in a parallel region but it is only documented for error handling. You could have a look at the general C code to see whether that would trigger an atomic OpenMP operation but I doubt it.

Related

Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippets of code which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
membership[i] = np.argmin(distances)
The running time here should be O(NKD) where D is the dimension of the data points, so naturally I expect when D increases or decreases, the running time would change proportionally as well. To my surprise, I see very little time being changed when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per #Warren's suggestion, I modified the code to use np.linalg.norm with axis parameter directly; the performance is following:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.

This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times and each call to this function internally does a loop calling the target Python function 250 times (ie. it is not vectorized), and so np.linalg.norm. In the end, np.linalg.norm is called, 20_000 * 250 = 5000000 times. The thing is each call to a Numpy function takes typically about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, functions calls, conversion, etc.
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The later is often more efficient as it avoid creating expensive big temporary arrays.
Here is a much faster implementation:
import numpy as np
import numba as nb
#nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
assert centroids.shape[1] == D and membership.shape[0] == n
distances = np.empty(K, np.float64)
for i in range(n):
for j in range(K):
distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
membership[i] = np.argmin(distances)
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
compute(points, centroids, membership)
In fact, while this code is much faster, it still have a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. In fact, each allocations takes only few hundred of nanoseconds, but the number of loop iteration is huge. One solution is simply to compute the norm manually:
from math import sqrt
#nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
assert centroids.shape[1] == D and membership.shape[0] == n
distances = np.empty(K, np.float64)
for i in range(n):
for j in range(K):
s = 0.0
for k in range(D):
tmp = centroids[j,k] - points[i,k]
s += tmp * tmp
distances[j] = sqrt(s)
membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D since the Numpy overhead are the main bottleneck in this case and the implementation can almost completely remove such overheads (thanks to the JIT compilation).

It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which applies on the K rows of centroids-points[i] (btw, there is another one here, with some broadcasting. But we don't need to count all of them for big-O consideration)
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important to optimized, and that's good, because it is the only one that is vectorized here.
But that means that for small enough value of D, what we really see is more some constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer for loops drive most of the cost, which, then, looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not as bad. But almost so. It is still calling several times some python code. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD)
Edit:
Here is what I get when I increase D (with smaller n, so that it remains computable in realistic time)
You see that it looks really linear (affine, to be accurate, since it doesn't pass through 0, which is the reason why it doesn't look very linear to you; and which is explained by my previous comment: most of the inner cost inside the for/along_axis double loop is mainly constant overhead of those loops, when D is small. The "proportional to D" part begins to show when the overhead become negligible)

Two level multiprocessing in Python

Let's consider a function whose elaboration time depends on one of its parameter, size, and depending on its value the running time can take from hours to (few) days. In a script I launch multiple instances of this function in parallel to use all the cores and save time.
Let's say that the total elaboration time is limited by the longest running time of function instance with biggest size, and that I can also parallelize some parts of the function: this would probably not be beneficial if all the function instances are running, but it could be beneficial when only one remains (or if I have tons of cores). How would you do this in Python, that is, organize a two-level multiprocessing (one at the script level, the other inside the function)? Or would you just parallelize the function and launch multiple scripts with different configurations?
I provide a MWE, but I understand the answer on the (absolute) fastest execution is probably problem dependent. Here the function consists of nested loops, but you can adapt it as long as it verifies the preceding assumptions, or change the parameters. I am not interested in this particular MWE, but on how to set an inner parallelization.
import numpy as np
import time
import timeit
import multiprocessing as mp
import copy
def function_wrapper(matrix):
"""Puts to zero the multiples of 3."""
side = matrix.shape[0]
# consider to parallelize the following
# NOTE that I am not interested in parallelizing this particular example
# (you can change this part as long as by parallelizing it you better use the cores)
# with list comprehensions or built-in functions, but on how to setup a second level of multiprocessing
for x in range(side):
for y in range(side):
for z in range(side):
matrix[x, y, z] = timeit.timeit("numpy.linalg.eig(numpy.random.randint(0, 10, (L, L)))", setup='import numpy; L='+str(matrix[x, y, z]), number=10)
return matrix
num_cores = mp.cpu_count()
if __name__ == "__main__":
matrices_number = 20 # depending on the values of matrices_number and max_side the
max_side = 10 # parallelization setup is more or less important
matrices = [np.random.randint(1, 101, (side, side, side))
for side in np.random.randint(2, max_side, matrices_number)]
# sorts matrices by decreasing shape
args_generator = sorted([(m,) for m in matrices], key=lambda x: x[0].shape[0], reverse=True)
iterations = 10 # iterations to reduce caching effects
start = time.time()
for k in range(iterations):
results = [function_wrapper(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"loop: elapsed={elapsed} sec.")
start = time.time()
for k in range(iterations):
with mp.Pool(num_cores) as pool:
results = pool.starmap_async(function_wrapper, copy.deepcopy(args_generator)).get()
elapsed = time.time() - start
print(f"mp.Pool: elapsed={elapsed} sec.")

You're not taking advantage of the built-in functions in the numpy library. You are iterating over the array instead of broadcasting your logic across the whole matrix at once. When you take advantage of the built-in numpy functions, you take advantage of the underlying code written in C. The wrapper function should be the following.
def broadcaster(matrix):
return np.where(matrix % 3 != 0, matrix, 0)
start = time.time()
for k in range(iterations):
results = [broadcaster(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"Broadcast: elapsed={elapsed} sec.")
When I add that snippet to your code I get the following.
loop: elapsed=15.675757884979248 sec.
mp.Pool: elapsed=14.439897060394287 sec.
Broadcast: elapsed=0.6325647830963135 sec.
As you can see, in terms of performance, it is not even close.

multithreaded iteration over numpy array indices

I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np
nx = ny = nz = 100
array = np.zeros((nx, ny, nz))
def fun(val, k):
# Do something with the indices
return val + (k[0] * k[1] * k[2])
with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
for x in it:
x[...] = fun(x, it.multi_index)
Note, that fun might do something more sophisticated, which takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be threadsafe (Only the value and index of the current cell are required). But finding a method to iterate over all cells and have the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split by the x-axis into chunks and passed to a Pool.
However, the solution is not universally applicable and I wonder if there is a more general solution for this problem (which could also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.

An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which would mean that we only have to loop over all the indices (the size) of the array to get the index.
Now there are two great ideas for multiprocessing:
Create a (thread safe) shared memory of the array and splitting the indices across the different processes.
Only update the array in a main thread, but provide a copy of the required data to the processes and let them return the value that has to be updated.
Both solutions will work for any np.ndarray, but have different advantages. Creating a shared memory doesn't create copies, but can have a large insertion penalty if it has to wait on other processes (the computational time, is small compared to the write time.)
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
__slots__ = ('arr', 'shm', 'name', 'shared',)
def __init__(self, arr: np.ndarray = None):
if arr is not None:
self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
self.name = self.shm.name
np.copyto(self.arr, arr)
def __getattr__(self, item):
if hasattr(self.arr, item):
return getattr(self.arr, item)
raise AttributeError(f"{self.__class__.__name__}, doesn't have attribute {item!r}")
def __str__(self):
return str(self.arr)
#classmethod
def from_name(cls, name, shape, dtype):
memory = cls(arr=None)
memory.shm = shared_memory.SharedMemory(name)
memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
memory.name = name
return memory
#property
def dtype(self):
return self.arr.dtype
#property
def shape(self):
return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
data = np.array(np.zeros((5,) * 2))
mem_primary = SharedNumpy(arr=data)
mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)
assert mem_primary.name == mem_second.name, "Different memory names"
assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
mem_primary.arr[2] = 5
assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The input of this function is a dim the number of numpy axis, and the size, which are the number of elements per axis.
def distributed(size, dim):
memory = SharedNumpy(arr=np.zeros((size,) * dim))
split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))
settings = dict(
memory=itertools.repeat(memory.name),
shape=itertools.repeat(memory.arr.shape),
dtype=itertools.repeat(memory.arr.dtype),
start=np.arange(mp.cpu_count()),
num=itertools.repeat(split_size)
)
with mp.Pool(mp.cpu_count()) as pool:
pool.starmap(fun, zip(*settings.values()))
print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need to have a start index and the chunk size per process.
The start and num tell the chunks of indices that have to be converted per process, by applying range(start * num, (start + 1) * num).
Testing
For the testing I am using different input sizes and dimensions. Since the data increases with the formula sizes ^ dimensions, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, and already start taking quit a bit of time.)
Code
fun
def fun(name, shape, dtype, start, num):
memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
# Do something with the indices
indices = np.unravel_index([idx], shape)
memory.arr[indices] += np.product(indices)
memory.shm.close() # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
for size in [5, 10, 15]:
for dim in [1, 2, 3]:
memory = distributed(size, dim)
print(memory)
memory.shm.unlink()
For the OP's code, I used his code with a small addition that I allow the array to have different sizes and dimensions, in any case I use:
def sequential(size, dim):
array = np.zeros((size,) * dim)
...
And looking at the output array of both codes, will result in the same outcomes.
Plots
The code for the graphs have been taken from the reply in:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
With the minor alteration that labels was changed to codes in
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Where the 1d, 2d and 3d reference the dimensions and the input is the size.
Sequentially (OP code):
Distributed (this code):
Results
This method works on an arbitrary sized numpy array, and is able to perform an operation on the indices of the array. It provides you with full access of the whole numpy array, so it can also be used to perform different kind of statistical analysis, which do not change the array.
From the timings it can be seen that for small data shapes the distributed version has no to little advantages, because of the extra complexity of creating the processes. However for larger amount of data it starts to become more effective.
I only timed it on short delays in the computational time (simple fun), but on more complex calculations, it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axis, these numpy functions might help to vectorize your solutions instead of using multiprocessing:
np.apply_over_axes
np.apply_along_axis

Having thread-local arrays in cython so that I can resize them?

I have an interval-treeish algorithm I would like to run in parallel for many queries using threads. Problem is that then each thread would need its own array, since I cannot know in advance how many hits there will be.
There are other questions like this, and the solution suggested is always to have an array of size (K, t) where K is output length and t is number of threads. This does not work for me as K might be different for each thread and each thread might need to resize the array to fit all the results it gets.
Pseudocode:
for i in prange(len(starts)):
qs, qe, qx = starts[i], ends[i], index[i]
results = t.search(qs, qe)
if len(results) + nfound < len(output):
# add result to output
else:
# resize array
# then add results

An usual pattern is that every thread gets its own container, which is a trade-off between speed/complexity and memory-overhead:
there is no need to lock for access to this container, because only one thread accesses it.
there is much less overhead compared to "own container for every task (i.e. every i-value)".
After the parallel section, the data must be either collected in a final container in a post processing step (which also could happen in parallel) or the subsequent algorithms should be able to handle a collection of containers.
Here is an example using c++-vector (which already has memory management and increasing size built-in):
%%cython -+ -c=/openmp --link-args=/openmp
from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp
def calc_in_parallel(N):
cdef int i,k,tid
cdef int n = N
cdef vector[vector[int]] vecs
# every thread gets its own container
vecs.resize(openmp.omp_get_max_threads())
for i in prange(n, nogil=True):
tid = threadid()
for k in range(i):
# use container of the thread
vecs[tid].push_back(k) # dummy for calculation
return vecs
Using omp_get_max_threads() for the number of threads will overestimate the real number of threads in many cases. It is probably more robust to set the number of threads explicitly in prange, i.e.
...
NUM_THREADS = 2
vecs.resize(NUM_THREADS)
for i in prange(n, nogil=True, num_threads = NUM_THREADS):
...
A similar approach can be applied using pure C, but more boiler plate code (memory management) will be needed in this case.

Multithreaded sparse matrix multiplication in Matlab

I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). N is large, as is M; on the order of several thousands. I am running Matlab 2013a.
Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically.
This appears NOT to be the case if either of the matrices are sparse (see e.g. this StackOverflow discussion - with no answer for the intended question - and this largely unanswered MathWorks thread).
This is a rather unhappy surprise for me.
We can verify that multithreading has no effects for sparse matrix operations by the following code:
clc; clear all;
N = 5000; % set matrix sizes
M = 3000;
A = randn(N,M); % create dense random matrices
B = sprand(N,N,0.015); % create sparse random matrix
Bf = full(B); %create a dense form of the otherwise sparse matrix B
for i=1:3 % test for 1, 2, and 4 threads
m(i) = 2^(i-1);
maxNumCompThreads(m(i)); % set the thread count available to Matlab
tic % starts timer
y = B*A;
walltime(i) = toc; % wall clock time
speedup(i) = walltime(1)/walltime(i);
end
% display number of threads vs. speed up relative to just a single thread
[m',speedup']
This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:
threads speedup
1.0000 1.0000
2.0000 0.9950
4.0000 1.0155
If, on the other hand, I replace B by its dense form, refered to as Bf above, I get significant speedup:
threads speedup
1.0000 1.0000
2.0000 1.8894
4.0000 3.4841
(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)
So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form?
I found one old suggestion involving .mex files at MathWorks, but it seems the links are dead and not very well documented/no feedback? Any alternatives?
It seems to be a rather severe restriction of implicit parallelism functionality, since sparse matrices are abound in computationally heavy problems, and hyperthreaded functionality highly desirable in these cases.

MATLAB already uses SuiteSparse by Tim Davis for many of its operation on sparse matrices (for example see here), but neither of which I believe are multithreaded.
Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even you use a multithreaded library, I doubt you will see huge benefits in terms of performance, at least not comparable to those specialized in dense matrices...
After all that the design of sparse matrices have different goals in mind than regular dense matrices, where efficient memory storage is often more important.
I did a quick search online, and found a few implementations out there:
sparse BLAS, spBLAS, PSBLAS. For instance, Intel MKL and AMD ACML do have some support for sparse matrices
cuSPARSE, CUSP, VexCL, ViennaCL, etc.. that run on the GPU.

I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.
#include <omp.h>
#include "mex.h"
#include "matrix.h"
#define ll long long
void omp_smm(double* A, double*B, double* C, ll m, ll p, ll n, ll* irs, ll* jcs)
{
for (ll j=0; j<p; ++j)
{
ll istart = jcs[j];
ll iend = jcs[j+1];
#pragma omp parallel for
for (ll ii=istart; ii<iend; ++ii)
{
ll i = irs[ii];
double aa = A[ii];
for (ll k=0; k<n; ++k)
{
C[i+k*m] += B[j+k*p]*aa;
}
}
}
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *A, *B, *C; /* pointers to input & output matrices*/
size_t m,n,p; /* matrix dimensions */
A = mxGetPr(prhs[0]); /* first sparse matrix */
B = mxGetPr(prhs[1]); /* second full matrix */
mwIndex * irs = mxGetIr(prhs[0]);
mwIndex * jcs = mxGetJc(prhs[0]);
m = mxGetM(prhs[0]);
p = mxGetN(prhs[0]);
n = mxGetN(prhs[1]);
/* create output matrix C */
plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
C = mxGetPr(plhs[0]);
omp_smm(A,B,C, m, p, n, (ll*)irs, (ll*)jcs);
}

On matlab central the same question was asked, and this answer was given:
I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc...
Which basically means, that you are out of luck.
However I can think of some tricks to achieve faster computations:
If you need to do several multiplications: do multiple multiplications at once and process them in parallel?
If you just want to do one multiplication: Cut the matrix into pieces (for example top half and bottom half), do the calculations of the parts in parallel and combine the results afterwards
Probably these solutions will not turn out to be as fast as properly implemented multithreading, but hopefully you can still get a speedup.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string