Multithreading in divide and conquer matrix multiplication

I have divide and conquer matrix multiplication code that is taking quite a lot of time when the input matrix size is around 4k. Multithreading looks like a way to reduce the running time. But should we create a Thread object and pass it a task, or just implement the Runnable interface? Both approaches seem to work, but once we create more than a certain number of threads the running time actually gets worse.
Can someone please explain why this is so, with an implementation in Java or Python?
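
A minimal Python sketch of one likely explanation (my illustration, not code from the question): whether the work is wrapped in a Thread subclass or a Runnable-style callable matters far less than how many threads run at once. Once the thread count exceeds the core count, extra threads only add scheduling and memory overhead, which is why timings get worse past that point. The sketch below uses simple row-blocking rather than the recursive divide and conquer, with a pool capped at the core count; numpy's BLAS-backed matrix product releases the GIL, so the threads really do run in parallel. The same principle applies in Java via a fixed-size ExecutorService rather than raw Thread objects.

import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def multiply_blocked(a, b, workers=None):
    # Cap the pool at the core count: more threads than cores only adds
    # context-switch overhead without adding compute capacity.
    workers = workers or os.cpu_count()
    n = a.shape[0]
    out = np.empty((n, b.shape[1]), dtype=a.dtype)

    def task(lo, hi):
        # numpy's BLAS-backed matmul releases the GIL, so row blocks
        # genuinely run in parallel across threads.
        out[lo:hi] = a[lo:hi] @ b

    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for f in [pool.submit(task, lo, hi) for lo, hi in bounds]:
            f.result()  # re-raise any worker exception
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((4096, 4096))
    b = rng.standard_normal((4096, 4096))
    c = multiply_blocked(a, b)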

Related

Is there a simple way to use Orange3 with an Nvidia GPU?

I need to cluster a high-dimensional dataset with the Orange3 app, and too much time is spent calculating the distance matrix between the objects. If I could use a graphics card for this task, it would take much less time. Does anyone know of a workaround to do this?
No. Orange uses numpy arrays and computes distances on CPU. Short of reimplementing the routine for calculation of distances (which in itself is rather short and simple), there's nothing you can do about it.
Orange will start using Dask in the not-too-distant future, but until then try reducing your data set. You may not need all dimensions and/or objects for your clustering.
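
For what it's worth, the "reimplementing the routine" route the answer mentions is fairly short with CuPy (my sketch, assuming CuPy and an NVIDIA card; Orange itself does not support this):

import cupy as cp

def pairwise_euclidean(x):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated on the GPU
    # as one matrix product plus broadcasting.
    sq = cp.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return cp.sqrt(cp.maximum(d2, 0.0))  # clamp tiny negatives from rounding

x = cp.random.standard_normal((5000, 200), dtype=cp.float32)
d = pairwise_euclidean(x)

The distances would still have to be copied back to the host (cp.asnumpy(d)) and fed into the rest of the pipeline manually.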

Accelerating diagonalization with GPU and multiprocessing

In my Python code, I have to diagonalize hundreds of arrays, each of size ~1000x1000. However, each array is independent, so it seems I could accelerate this process using parallel programming. A minimal (pseudo-code) example would be something of the form

import numpy as np

arr_list = [a0, a1, a2, ..., a199]  # each array has shape (1000, 1000)
for idx, arr in enumerate(arr_list):
    evals, evecs = np.linalg.eigh(arr)
    arr_list[idx] = evals
I'm not very familiar with CUDA, Numba, CuPy or multiprocessing, but some quick research seems to tell me that CuPy is mainly used for accelerating basic operations such as addition, multiplication, diagonalization, etc., and only shows a significant speedup over numpy when the array size is much larger than 1000. Multiprocessing, in contrast, utilizes the multiple cores (6-8) on a CPU, but it seems that numpy diagonalization is already a multi-core process (correct me if I'm wrong), so it may not yield a further decrease in time.
I'm not very familiar with parallel programming, so I'm wondering if someone with more experience could give a few pointers on such a problem. Maybe a direction to research.
EDIT: Unfortunately, I'm on Windows, so jax doesn't seem to work.
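
A sketch of the multiprocessing route (my assumption of how it might look, not tested on the asker's data): since every eigh call is independent, a process pool sidesteps the GIL, and pinning each worker's BLAS to one thread avoids oversubscribing the cores numpy would otherwise grab itself. OMP_NUM_THREADS must be set before numpy is imported.

import os
os.environ["OMP_NUM_THREADS"] = "1"  # one BLAS thread per worker process

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def eigvals_of(arr):
    evals, evecs = np.linalg.eigh(arr)
    return evals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for the real arrays: random symmetric 1000x1000 matrices.
    mats = [m + m.T for m in (rng.standard_normal((1000, 1000)) for _ in range(8))]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(eigvals_of, mats))

On Windows the pool starts workers with spawn, so keeping the pool setup under the if __name__ == "__main__" guard is required, not optional.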

Multiplying small matrices in parallel

I have been writing code to multiply matrices in parallel using POSIX threads, and I have been seeing great speedup when operating on large matrices; however, as I shrink the size of the matrices, the naive sequential O(n^3) matrix multiplication algorithm begins to overtake the performance of the parallel implementation.
Is this normal or does it indicate a poor quality algorithm? Is it simply me noticing the extra overhead of creating and handling threads and that past a certain point that extra time dominates the computation?
Note that this is for homework, so I won't be posting my code as I don't want to breach my University's Academic Integrity Policies.
It is not possible to give an exact answer without seeing the code (or at least a detailed description of the algorithm), but in general it is normal for simple algorithms to perform better on small inputs because of a smaller constant factor. Moreover, thread creation and context switches are not free, so it can take longer to create a thread than to perform some simple computations. So if your algorithm works much faster than the naive one on large inputs, there is no reason to worry about it.
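
One quick way to see the constant factor the answer describes (a Python sketch of mine, not the asker's POSIX-threads code): time the fixed cost of spawning and joining a do-nothing thread against a naive triple-loop multiply. For small n the spawn cost dominates; past some crossover the O(n^3) work does.

import time
import threading

def time_thread_spawn(reps=1000):
    # Fixed per-thread overhead: create, start, and join a no-op thread.
    start = time.perf_counter()
    for _ in range(reps):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / reps

def time_naive_matmul(n):
    # The work a thread would save: one naive O(n^3) multiply.
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    start = time.perf_counter()
    [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return time.perf_counter() - start

if __name__ == "__main__":
    print("thread spawn+join:", time_thread_spawn())
    for n in (8, 32, 128):
        print(n, time_naive_matmul(n))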

Singular value decomposition (SVD) using multithreading

I am running the partial SVD of a large (120k x 600k) and sparse (0.1 fraction of non-zero values) matrix on a 3.5 GHz/3.9 GHz (6 cores / 12 threads) server with 128 GB of RAM, using SVDLIBC.
Is it possible to speed up the process a little bit using multithreading, so as to take full advantage of my server configuration?
I have no experience of multithreading, so I am asking for friendly advice and/or pointers to manuals/tutorials.
[EDIT] I am open to alternatives too (Matlab/Octave, R, etc.)
In Matlab, for sparse matrices, you have svds. This implementation benefits from multithreaded computation. (1)
See irlba: fast partial SVD by implicitly-restarted Lanczos bidiagonalization, in R. It calculates just the first user-specified number of dimensions. I had a good experience with it in the past. But I was using a commercial build of R that was compiled to take advantage of multithreading, so I can't vouch for how much of the speed improvement was due to the multithreading.
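
Since the asker is open to alternatives, one more option (my suggestion, not from the answers above): SciPy's ARPACK-backed scipy.sparse.linalg.svds computes a partial SVD of a sparse matrix directly, and the dense operations underneath go through BLAS, which can be multithreaded. A minimal sketch on a smaller stand-in matrix:

import scipy.sparse as sp
from scipy.sparse.linalg import svds

# A random sparse matrix as a stand-in for the 120k x 600k data
# (shrunk here so the sketch runs quickly).
a = sp.random(12000, 60000, density=0.001, format="csr", random_state=0)

# Partial SVD: only the k largest singular triplets are computed.
u, s, vt = svds(a, k=50)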

Why are GPU threads in CUDA and OpenCL allocated in a grid?

I'm just learning OpenCL, and I'm at the point of trying to launch a kernel. Why is it that GPU threads are managed in a grid?
I'm going to read more about this in detail, but a simple explanation would be nice. Is it always like this when working with GPGPUs?
This is a common approach, which is used in CUDA, OpenCL and, I think, ATI Stream.
The idea behind the grid is to provide a simple, but flexible, mapping between the data being processed and the threads doing the data processing. In the simple version of the GPGPU execution model, one GPU thread is "allocated" for each output element in a 1D, 2D or 3D grid of data. To process this output element, the thread will read one (or more) elements from the corresponding location or adjacent locations in the input data grid(s). By organizing the threads in a grid, it's easier for the threads to figure out which input data elements to read and where to store the output data elements.
This contrasts with the common multi-core, CPU threading model where one thread is allocated per CPU core and each thread processes many input and output elements (e.g. 1/4 of the data in a quad-core system).
The simple answer is that GPUs are designed to process images and textures that are 2D grids of pixels. When you render a triangle in DirectX or OpenGL, the hardware rasterizes it into a grid of pixels.
I will invoke the classic analogy of putting a square peg in a round hole. Well, in this case the GPU is a very square hole and not as well rounded as GP (general purpose) would suggest.
The above explanations put forward the ideas of 2D textures, etc. The architecture of the GPU is such that all processing is done in streams with the pipeline being identical in each stream, so the data being processed need to be segmented like that.
One reason why this is a nice API is that typically you are working with an algorithm that has several nested loops. If you have one, two or three loops then a grid of one, two or three dimensions maps nicely to the problem, giving you a thread for the value of each index.
So values that you need in your kernel (index values) are naturally expressed in the API.
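
To make the index mapping concrete, here is a minimal Numba CUDA sketch (my illustration; it assumes an NVIDIA GPU with Numba installed, and scaling a 2D array is just a stand-in for any per-element kernel): each thread reads its own (row, col) straight from the grid and handles exactly one output element, so the nested loops disappear into the launch configuration.

import numpy as np
from numba import cuda

@cuda.jit
def scale2d(inp, out, factor):
    # Each thread's coordinates come directly from the 2D grid layout.
    row, col = cuda.grid(2)
    if row < out.shape[0] and col < out.shape[1]:  # guard partial blocks
        out[row, col] = inp[row, col] * factor

if __name__ == "__main__":
    a = np.ones((64, 64), dtype=np.float32)
    b = np.empty_like(a)
    threads_per_block = (16, 16)
    blocks_per_grid = (a.shape[0] // 16, a.shape[1] // 16)
    scale2d[blocks_per_grid, threads_per_block](a, b, 2.0)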
