Multiplying small matrices in parallel - multithreading

I have been writing code to multiply matrices in parallel using POSIX threads and I have been seeing great speedup when operating on large matrices; however, as I shrink the size of the matrices the naive sequential O(n^3) matrix multiplication algorithm begins to overtake the performance of the parallel implementation.
Is this normal or does it indicate a poor quality algorithm? Is it simply me noticing the extra overhead of creating and handling threads and that past a certain point that extra time dominates the computation?
Note that this is for homework, so I won't be posting my code as I don't want to breach my University's Academic Integrity Policies.

It is not possible to give an exact answer without seeing the code(or a detailed description of an algorithm, at least), but in general it is normal for simple algorithms to perform better on small inputs because of a smaller constant factor. Moreover, thread creation/context switches are not free so it can take longer to create a thread then to perform some simple computations. So if your algorithm works much faster than a naive one on large inputs, there should be no reasons to worry about it.

Related

How does the Needleman Wunsch algorithm compare to brute force?

I'm wondering how you can quantify the results of the Needleman-Wunsch algorithm (typically used for aligning nucleotide/protein sequences).
Consider some fixed scoring scheme and two sequences of varying length S1 and S2. Say we calculate every possible alignment of S1 and S2 by brute force, and the highest scoring alignment has a score x. And of course, this has considerably higher complexity than the Needleman-Wunsch approach.
When using the Needleman-Wunsch algorithm to find a sequence alignment, say that it has a score y.
Consider r to be the score generated via Needleman-Wunsch for two random sequences R1 and R2.
How does x compare to y? Is y always greater than r for two sequences of known homology?
In general, I do understand that we use the Needleman-Wunsch algorithm to significantly speed up sequence alignment (vs a brute-force approach), but don't understand the cost in accuracy (if any) that comes with it. I had a go at reading the original paper (Needleman & Wunsch, 1970) but am still left with this question.
Needlman-Wunsch always produces an optimal answer - it's much faster than brute force and doesn't sacrifice accuracy in the process. The key insight it uses is that it's not actually necessary to generate all possible alignments, since most of them contain bad sub-alignments and couldn't possibly be optimal. The Needleman-Wunsch algorithm works by instead slowly building up optimal alignments for fragments of the original strands and then slowly growing those smaller alignments into larger alignments using the guarantee that any optimal alignment must contain an optimal alignment for a slightly smaller case.
I think your question boils down to whether dynamic programming finds the optimal solution ie, garantees that y >= x. For a discussion on this I would refer to people who are likely smarter than me:
https://cs.stackexchange.com/questions/23599/how-is-dynamic-programming-different-from-brute-force
Basically, it says that dynamic programming will likely produce optimal result ie, same as brute force, but only for particular problems that satisfy the Bellman principle of optimality.
According to Wikipedia page for Needleman-Wunsch, the problem does satisfy Bellman principle of optimality:
https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Specifically:
The Needleman–Wunsch algorithm is still widely used for optimal global
alignment, particularly when the quality of the global alignment is of
the utmost importance. However, the algorithm is expensive with
respect to time and space, proportional to the product of the length
of two sequences and hence is not suitable for long sequences.
There is also mention of optimality elsewhere in the same Wikipedia page.

Multithreading in divide and conquer matrix multiplication

There is this code of divide and conquer matrix multiplication which is taking quite a lot of time for the operations, when the input size for the matrix is nearly 4k. Multi-threading can be considered as a solution for reducing the running time. But should we create Thread as a object and pass or just implement a Runnable class? both the perspective seem to be working but when we create a thread more than a particular number the running time seems to be more worse.
Please some explain why is it so with an implementation of the same in java or python?

Singular value decomposition (SVD) using multithreading

I am running the partial SVD of a large (120k x 600k) and sparse (0.1 of non-zero values) matrix on a 3,5GHz/3,9GHz (6 cores / 12 threads) server with 128GB of RAM using SVDLIBC.
Is it possible to speed up the process a little bit using multithreading so as to take full advantage of my server configuration?
I have no experience of multithreading; therefore I am asking for friendly advices and/or pointer to manuals/tutorials.
[EDIT] I am open to alternatives too (matlab/octave, r, etc.)
In Matlab, for sparse matrices, you have svds. This implementation benefits from multithreaded computation (1)
See irlba: Fast partial SVD by implicitly-restarted Lanczos bidiagonalization in R. It just calculates the first user-specified no. of dimensions. Had good experience with it in past. But, then I used on commercial version of R which was complied to take advantage of multi-threading so can't vouch for speed-improvement due because of multi-threading.

Why is naive string search algorithm faster?

I'm testing string search algorithms from this site: EXACT STRING MATCHING ALGORITHMS. Christian Charras, Thierry Lecroq. Test text is a random sequence of DNA bases (ACGT) of 1 GByte size. Test patterns are a list of random sequences of random size (1kB max). Test system is a AMD Phenom II x4 955 at 3.2 GHz, 4 GB of RAM and Windows 7 64 bits. Code witten in C and compiled with MinGW with -O3 flag.
Naive search algorithm takes 4 seconds for short patterns to 8 seconds for 1kB patterns. Deterministic finite state machine takes 2 seconds for short patterns to 4 seconds for 1kB patterns. Boyer-Moore algorithm takes 4 seconds for very short patters, about 1/2 second for short pattherns and 2 seconds for 1kB patterns. The remaining algorithm performance is worst than naive search algorithm.
How can be naive search algorithm search algorithm faster than most other algorithms?
How can a deterministic finite state machine implemented with a transition table (O(n) execution time always) be 2 to 8 times slower than Boyer-Moore algorithm? Yes, BM best case is O(n/m), but his average case is O(n) and worst case is O(nm).
There is no perfect string matching algorithm which is best for all circumstances.
Boyer-Moore (and Horspool, Sunday etc.) work by creating jump tables ('How far can I move the search pointer when the characters do not match? The more distinct letters in the strings, the better the positive impact. You can imagine, that a string with only 4 distinct letters creates a jump table with a maximum of 3 shifts per mismatch. Whereas searching an english word with case sensitive may result in a jumptable with (A-Z + a-z + punctiation) max. approx 55 shifts per mismatch.
On the other hand, there is a negative impact on both preparation (i.e. calculating the jump tables) and looping itself. So these algorithms perform poor on short strings (preparation creates an overhead) and strings with only a few distict letters (as mentioned before)
The naive search algorithm is very compact and there are very little operations inside the loop, so loop runs fast. As there is no overhead it performs better when searching short strings.
The (compared to the naive search) quite complex loop operations of a BM algorithm take much longer per loop run. This (partly) compensates for the positive performance impact of the jump tables.
So although you are using long strings, the small alphabet (=small jump tables) makes BM perform poorly. A KMP has less overhead in the loop (the jump table is smaller in general, but is similar to the BM with small alphabets) and so the KMP performs so well.
Theoretically good algorithms (lower time complexity) often have high bookkeeping costs that can overwhelm that of a naive algorithm for small problem sizes. Also implementation details matter. By optimizing an implementation you can sometimes improve runtime by factors of 2 or more.
The naive implementation actually has a linear expected running time (same as BM/KMP, etc) for random input data. I could not write a full proof here but it's accessible from Algorithms Design Techniques and Analysis.
Most exact matching algorithms are optimized version of the naive implementation to prevent being slowed down by certain patterns. For instance, suppose we are searching for:
aaaaaaaaaaaaaaaaaaaaaaaab
on a stream of:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
It fails at the b for lots of times. KMP/BM implementations are contrived to prevent repeatedly comparing the as. However, if the sequence is random by itself, such conditions are almost impossible to appear and the naive implementation is likely to work better due to its lower overhead in bookkeeping or possibly better spatial/temporal locality.
And, yeah, I'm not sure DNA sequences are random. Or alternatively are repetitions common in them. Anyway there's no way to examine this carefully without representative data.

Compare and Contrast Monte-Carlo Method and Evolutionary Algorithms

What's the relationship between the Monte-Carlo Method and Evolutionary Algorithms? On the face of it they seem to be unrelated simulation methods used to solve complex problems. Which kinds of problems is each best suited for? Can they solve the same set of problems? What is the relationship between the two (if there is one)?
"Monte Carlo" is, in my experience, a heavily overloaded term. People seem to use it for any technique that uses a random number generator (global optimization, scenario analysis (Google "Excel Monte Carlo simulation"), stochastic integration (the Pi calculation that everybody uses to demonstrate MC). I believe, because you mentioned evolutionary algorithms in your question, that you are talking about Monte Carlo techniques for mathematical optimization: You have a some sort of fitness function with several input parameters and you want to minimize (or maximize) that function.
If your function is well behaved (there is a single, global minimum that you will arrive at no matter which inputs you start with) then you are best off using a determinate minimization technique such as the conjugate gradient method. Many machine learning classification techniques involve finding parameters that minimize the least squares error for a hyperplane with respect to a training set. The function that is being minimized in this case is a smooth, well behaved, parabaloid in n-dimensional space. Calculate the gradient and roll downhill. Easy peasy.
If, however, your input parameters are discrete (or if your fitness function has discontinuties) then it is no longer possible to calculate gradients accurately. This can happen if your fitness function is calculated using tabular data for one or more variables (if variable X is less than 0.5 use this table else use that table). Alternatively, you may have a program that you got from NASA that is made up of 20 modules written by different teams that you run as a batch job. You supply it with input and it spits out a number (think black box). Depending on the input parameters that you start with you may end up in a false minimum. Global optimization techniques attempt to address these types of problems.
Evolutionary Algorithms form one class of global optimization techniques. Global optimization techniques typically involve some sort of "hill climbing" (accepting a configuration with a higher (worse) fitness function). This hill climbing typically involves some randomness/stochastic-ness/monte-carlo-ness. In general, these techniques are more likely to accept less optimal configurations early on and, as the optimization progresses, they are less likely to accept inferior configurations.
Evolutionary algorithms are loosely based on evolutionary analogies. Simulated annealing is based upon analogies to annealing in metals. Particle swarm techniques are also inspired by biological systems. In all cases you should compare results to a simple random (a.k.a. "monte carlo") sampling of configurations...this will often yield equivalent results.
My advice is to start off using a deterministic gradient-based technique since they generally require far fewer function evaluations than stochastic/monte-carlo techniques. When you hear hoof steps think horses not zebras. Run the optimization from several different starting points and, unless you are dealing with a particularly nasty problem, you should end up with roughly the same minimum. If not, then you might have zebras and should consider using a global optimization method.
well I think Monte Carlo methods is the general name for these methods which
use random numbers in order to solve optimization problems. In this ways,
even the evolutionary algorithms are a type of Monte Carlo methods if they
use random numbers (and in fact they do).
Other Monte Carlo methods are: metropolis, wang-landau, parallel tempering,etc
OTOH, Evolutionary methods use 'techniques' borrowed from nature such as
mutation, cross-over, etc.

Resources