In my Python code, I have to diagonalize hundreds of arrays, each of size roughly 1000×1000. Each array is independent, however, so it seems I should be able to accelerate this process with parallel programming. A minimal (pseudo-code) example would be something of the form:
import numpy as np

arr_list = [a0, a1, a2, ..., a199]  # each array is symmetric, shape (1000, 1000)
for idx, arr in enumerate(arr_list):
    evals, evecs = np.linalg.eigh(arr)  # eigendecomposition of a symmetric matrix
    arr_list[idx] = evals
I'm not very familiar with CUDA, Numba, CuPy or multiprocessing, but some quick research suggests that CuPy is mainly used to accelerate basic operations such as addition, multiplication and diagonalization, and only shows a significant speedup over NumPy when the array size is much larger than 1000. Multiprocessing, in contrast, uses the multiple cores (6-8) of a CPU, but NumPy's diagonalization already appears to be multi-threaded through its BLAS/LAPACK backend (correct me if I'm wrong), so it may not yield a large decrease in time.
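For concreteness, the thread-based variant I would benchmark might look like the sketch below (small stand-in matrices; NumPy's LAPACK-backed eigh generally releases the GIL during the heavy computation, so a thread pool can overlap independent calls — whether it actually beats the internal BLAS threading needs measuring):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def diagonalize(arr):
    # eigh releases the GIL while LAPACK works, so threads can run in parallel
    evals, _ = np.linalg.eigh(arr)
    return evals

# Small symmetric stand-ins for the real 1000x1000 matrices
rng = np.random.default_rng(1)
mats = [(m + m.T) / 2 for m in (rng.standard_normal((60, 60)) for _ in range(8))]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(diagonalize, mats))
```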
I'm not very familiar with parallel programming, so I am wondering whether someone with more experience could give a few pointers on such a problem, or maybe a direction to research.
EDIT. Unfortunately, I'm on Windows so jax doesn't seem to work.
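One NumPy-only avenue worth noting: np.linalg.eigh broadcasts over leading dimensions, so the whole list can be stacked into a single (N, 1000, 1000) array and diagonalized in one call, leaving threading to the BLAS/LAPACK layer. A down-scaled sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the real matrices; sizes kept small for illustration
arr_list = [rng.standard_normal((50, 50)) for _ in range(8)]
arr_list = [(a + a.T) / 2 for a in arr_list]  # eigh expects symmetric input

# eigh broadcasts over leading dimensions: stacking the matrices into one
# (N, M, M) array diagonalizes all of them in a single call
stacked = np.stack(arr_list)            # shape (8, 50, 50)
evals, evecs = np.linalg.eigh(stacked)  # evals: (8, 50), evecs: (8, 50, 50)
```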
Related
I need to process a high-dimensional dataset with clustering in the Orange3 app, and too much time is spent calculating the distance matrix between the objects. If I could use a graphics card for this task, it would take much less time to complete. Does anyone know of, let's say, a workaround to do this?
No. Orange uses numpy arrays and computes distances on the CPU. Short of reimplementing the routine for calculating distances (which in itself is rather short and simple), there's nothing you can do about it.
Orange will start using Dask in the not-too-distant future, but until then, try reducing your data set: you may not need all dimensions and/or objects for your clustering.
I need to repeat a scientific simulation based on random sampling N times, simply:
results = [mysimulation() for i in range(N)]
Since every simulation requires minutes, I'd like to parallelize them in order to reduce the execution time. Some weeks ago I successfully handled some simpler cases, for which I wrote my code in C using OpenMP and functions like rand_r() to avoid overlapping seeds. How can I obtain a similar effect in Python?
I tried reading more about Python 3 multithreading/parallelization, but found no results concerning random generation; likewise, the numpy.random documentation does not suggest anything in this direction (as far as I could find).
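For what it's worth, the modern numpy.random API addresses exactly this seed-overlap concern: SeedSequence.spawn derives statistically independent child streams, one per simulation, which can then be handed out to worker processes. A sketch with a stand-in mysimulation (the real one would go in its place, assuming it can accept a Generator):

```python
import numpy as np

# Stand-in for the real mysimulation(); assumes it accepts a Generator
def mysimulation(rng):
    return rng.standard_normal(1000).mean()

N = 8
# spawn() gives N statistically independent child seeds -- the NumPy
# analogue of the per-thread rand_r() trick used with OpenMP in C
child_seeds = np.random.SeedSequence(12345).spawn(N)
streams = [np.random.default_rng(s) for s in child_seeds]

# The loop is now embarrassingly parallel: each (stream, call) pair is
# independent and could be dispatched to a multiprocessing.Pool worker
results = [mysimulation(rng) for rng in streams]
```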
There is this divide-and-conquer matrix multiplication code that takes quite a lot of time when the input matrix size approaches 4k. Multi-threading can be considered as a way to reduce the running time. But should we create a Thread object and pass it a task, or just implement a Runnable class? Both approaches seem to work, but once we create more than a certain number of threads, the running time gets worse.
Could someone please explain why this is so, with an implementation in Java or Python?
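On the slowdown past a certain thread count: once there are more runnable threads than cores, each extra thread only adds creation and context-switch overhead without adding parallelism (oversubscription). The usual remedy, in Java or Python alike, is a fixed-size pool rather than one thread per subproblem. A Python sketch of the idea, with NumPy standing in for the per-block multiply (block sizes are illustrative):

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def multiply_block(pair):
    a, b = pair
    return a @ b  # NumPy's matmul releases the GIL, so threads can overlap

rng = np.random.default_rng(0)
tasks = [(rng.standard_normal((64, 64)), rng.standard_normal((64, 64)))
         for _ in range(16)]

# Cap the pool at the core count instead of spawning one thread per task
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    products = list(pool.map(multiply_block, tasks))
```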
I have been writing code to multiply matrices in parallel using POSIX threads and I have been seeing great speedup when operating on large matrices; however, as I shrink the size of the matrices the naive sequential O(n^3) matrix multiplication algorithm begins to overtake the performance of the parallel implementation.
Is this normal or does it indicate a poor quality algorithm? Is it simply me noticing the extra overhead of creating and handling threads and that past a certain point that extra time dominates the computation?
Note that this is for homework, so I won't be posting my code as I don't want to breach my University's Academic Integrity Policies.
It is not possible to give an exact answer without seeing the code (or at least a detailed description of the algorithm), but in general it is normal for simple algorithms to perform better on small inputs because of a smaller constant factor. Moreover, thread creation and context switches are not free, so it can take longer to create a thread than to perform some simple computations. If your algorithm works much faster than the naive one on large inputs, there is no reason to worry.
I am running a partial SVD of a large (120k × 600k) sparse matrix (density 0.1) on a 3.5 GHz/3.9 GHz (6 cores / 12 threads) server with 128 GB of RAM, using SVDLIBC.
Is it possible to speed up the process a little bit using multithreading so as to take full advantage of my server configuration?
I have no experience with multithreading, so I am asking for friendly advice and/or pointers to manuals and tutorials.
[EDIT] I am open to alternatives too (matlab/octave, r, etc.)
In Matlab, for sparse matrices, you have svds. This implementation benefits from multithreaded computation (1)
See irlba: fast partial SVD by implicitly-restarted Lanczos bidiagonalization in R. It calculates only the first user-specified number of dimensions. I had good experience with it in the past, but I used it on a commercial build of R compiled to take advantage of multi-threading, so I can't vouch for how much of the speed improvement was due to multi-threading.
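Since alternatives are on the table, Python is another option: scipy.sparse.linalg.svds computes the same kind of partial SVD (only the k largest singular triplets) and inherits whatever multithreading the installed BLAS/LAPACK provides. A down-scaled sketch (the real matrix would be 120k × 600k):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Small stand-in with the same density (0.1) as the matrix in the question
A = sp.random(200, 600, density=0.1, format="csr", random_state=0)

# Partial SVD: only the k largest singular triplets are computed
u, s, vt = svds(A, k=10)
```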