MapReduce - Anything else besides word-counting?

I have been looking at MapReduce and reading through various papers about it and applications of it, but, to me, it seems that MapReduce is only suitable for a very narrow class of scenarios that ultimately result in word-counting.
If you look at the original paper, Google's engineers present "various" potential use cases, like "distributed grep", "distributed sort", "reverse web-link graph", "term-vector per host", etc.
But if you look closer, all those problems boil down to simply "counting words" - that is, counting the number of occurrences of something in a chunk of data, then aggregating/filtering and sorting that list of occurrences.
There are also some cases where MapReduce has been used for genetic algorithms or relational databases, but they don't use the "vanilla" MapReduce published by Google. Instead they introduce further steps along the Map-Reduce chain, like Map-Reduce-Merge, etc.
Do you know of any other (documented?) scenarios where "vanilla" MapReduce has been used to perform more than mere word-counting? (Maybe for ray-tracing, video-transcoding, cryptography, etc. - in short anything "computation heavy" that is parallelizable)
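For reference, here is a minimal single-process sketch (plain Python, my own illustration, no Hadoop involved) of the "vanilla" word-count pattern the question refers to: a map phase that emits (key, 1) pairs, a shuffle that groups values by key, and a reduce phase that sums per key.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input chunk.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework would between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    # Aggregate the per-key values; for word count this is a plain sum.
    return key, sum(values)

if __name__ == "__main__":
    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped))
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```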

Atbrox has been maintaining a list of MapReduce/Hadoop algorithms from academic papers. Here is the link. All of these could be applied for practical purposes.

MapReduce is good for problems that can be considered to be embarrassingly parallel. There are a lot of problems that MapReduce is very bad at, such as those that require lots of all-to-all communication between nodes. E.g., fast Fourier transforms and signal correlation.

There are projects using MapReduce for parallel computations in statistics. For instance, Revolution Analytics has started the RHadoop project for use with R. Hadoop is also used in computational biology and in other fields with large datasets that can be analyzed by many discrete jobs.

I am the author of one of the packages in RHadoop, and I wrote the several examples distributed with the source and used in the tutorial: logistic regression, linear least squares, matrix multiplication, etc. There is also a paper I would like to recommend, http://www.mendeley.com/research/sorting-searching-simulation-mapreduce-framework/,
which seems to strongly support the equivalence of MapReduce with classic parallel programming models such as PRAM and BSP. I often write MapReduce algorithms as ports from PRAM algorithms; see for instance blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html. So I think the scope of MapReduce is clearly more than "embarrassingly parallel", but not infinite. I have myself experienced some limitations, for instance in speeding up some MCMC simulations, though of course it could have been me not using the right approach. My rule of thumb is the following: if the problem can be solved in parallel in O(log(N)) time on O(N) processors, then it is a good candidate for MapReduce, with O(log(N)) jobs and constant time spent in each job. Other people, and the paper I mentioned, seem to focus more on the O(1)-jobs case. When you go beyond O(log(N)) time, the case for MR seems to get a little weaker, but some limitations may be inherent in the current implementation (high job overhead) rather than fundamental. It's a pretty fascinating time to be working on charting the MR territory.
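As an illustration of one of the examples mentioned above (linear least squares), here is a hedged sketch in plain Python/NumPy rather than the actual rmr/RHadoop code: each mapper emits the partial sums X'X and X'y for its chunk of the data, and the reducer adds them and solves the normal equations once. The chunking and function names are my own.

```python
import numpy as np

def map_chunk(X_chunk, y_chunk):
    # Each mapper emits the sufficient statistics for its partition of the data.
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_stats(stats):
    # The reducer sums the partial X'X and X'y matrices...
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    # ...and solves the (small, k-by-k) normal equations once.
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.1, size=10_000)

    # Pretend each slice lives on a different node.
    chunks = [(X[i:i + 2_000], y[i:i + 2_000]) for i in range(0, 10_000, 2_000)]
    beta_hat = reduce_stats([map_chunk(Xc, yc) for Xc, yc in chunks])
    print(beta_hat)  # close to [1.0, -2.0, 0.5]
```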

Related

MC-Stan on Spark?

I hope to use MC-Stan on Spark, but searching Google turns up no related pages.
I wonder whether this approach is even possible on Spark, so I would appreciate it if someone could let me know.
Moreover, I also wonder what the widely-used approach to running MCMC on Spark is. I hear Scala is widely used, but I need a language that has a decent MCMC library, such as MC-Stan.
Yes, it's certainly possible, but it requires a bit more work. Stan (and the popular MCMC tools that I know of) is not designed to be run in a distributed setting, via Spark or otherwise. In general, distributed MCMC is an area of active research. For a recent review, I'd recommend section 4 of Patterns of Scalable Bayesian Inference (PoFSBI). There are multiple ways you might want to split up a big MCMC computation, but I think one of the more straightforward ones is splitting up the data and running an off-the-shelf tool like Stan, with the same model, on each partition. Each model will produce a subposterior, and these can be reduce'd together to form the full posterior. PoFSBI discusses several ways of combining such subposteriors.
I've put together a very rough proof of concept using pyspark and pystan (Python is the common language with the most Stan and Spark support). It's a rough and limited implementation of the weighted-average consensus algorithm from PoFSBI, running on the tiny 8-schools dataset. I don't think this example would be very useful in practice, but it should give some idea of what is necessary to run Stan as a Spark program: partition the data, run Stan on each partition, and combine the subposteriors.
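Below is a hedged sketch in the same spirit as that proof of concept (my own simplified example, assuming PyStan 2.x and PySpark; the toy model and the consensus helper are illustrative, not the actual repository code): partition the data, fit the same Stan model on each partition, and combine the subposterior draws with a simplified weighted-average consensus step.

```python
import numpy as np
from pyspark import SparkContext

MODEL_CODE = """
data { int<lower=1> N; vector[N] y; }
parameters { real mu; }
model { mu ~ normal(0, 10); y ~ normal(mu, 1); }
"""

def fit_partition(y_part):
    # Each executor compiles and samples the same model on its shard of the data.
    # (Compiling per task is slow; a real job would cache the compiled model.)
    import pystan  # imported inside the task so it is available on the workers
    model = pystan.StanModel(model_code=MODEL_CODE)
    fit = model.sampling(data={"N": len(y_part), "y": list(y_part)},
                         iter=2000, chains=2)
    return fit.extract()["mu"]          # subposterior draws for mu

def consensus(subposteriors):
    # Simplified weighted-average consensus: weight each subposterior by its
    # inverse variance. (A faithful implementation would also fractionate the
    # prior across shards, as discussed in PoFSBI.)
    draws = np.column_stack(subposteriors)
    weights = 1.0 / draws.var(axis=0)
    return (draws * weights).sum(axis=1) / weights.sum()

if __name__ == "__main__":
    sc = SparkContext(appName="stan-consensus-sketch")
    y = np.random.normal(loc=3.0, scale=1.0, size=4000)
    shards = np.array_split(y, 4)
    subs = sc.parallelize(shards, numSlices=4).map(fit_partition).collect()
    combined = consensus(subs)
    print("posterior mean of mu ~", combined.mean())
    sc.stop()
```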

What statistical distribution is used to benchmark an algorithm?

I have benchmarked my algorithm by running it 1000 times. Now I have all the time values, and at this point it would be interesting to know the mean, standard deviation, and median. The problem is that I don't know which statistics are appropriate for estimating these parameters. I'm not sure about assuming a normal distribution.
Learn about statistics. There are lots of books, guides, papers and introductions out there (1, 2, 3, 4).
There are also lots of libraries which implement standard statistical methods:
Java Commons Math,
C++ Libs,
and there are certainly lots of others for the language you use...
And one last hint: for a quick (initial) result I often use Excel and its chart functions. It supports some statistical methods that you can play around with a bit to see in which direction to continue.
That really depends on the distribution your workload experiences, so there is no generic answer.
But there is a trick: go one step further and run several iterations, each consisting of N calls, and compute, say, the average time/throughput over each entire iteration. Then, for large N and consistent workload behaviour across the calls, the per-iteration scores are subject to the Central Limit Theorem, which makes them approximately normally distributed.
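A small NumPy illustration of that batching trick (my own example): even when individual call timings are heavily skewed, the per-iteration means of N calls become approximately normal for large N.

```python
import numpy as np

rng = np.random.default_rng(42)

# Raw per-call timings are often heavily skewed (here: lognormal, not normal).
raw_times = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# Batch the calls into iterations of N calls and score each iteration by its mean.
N = 1_000
iteration_means = raw_times.reshape(-1, N).mean(axis=1)

print("raw calls:  mean=%.3f  median=%.3f" % (raw_times.mean(), np.median(raw_times)))
print("iterations: mean=%.3f  std=%.4f  (approximately normal by the CLT)"
      % (iteration_means.mean(), iteration_means.std(ddof=1)))
```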

Difference between Threading and Map-Reduce processing?

One of my colleagues is arguing with me about introducing the map-reduce concept in our application (text processing). His opinion is that we should use threading concepts instead. We are both new to this map-reduce paradigm. I thought that using the map-reduce concept saves the developer the overhead of handling thread synchronisation, deadlocks, and shared data. Is there anything other than this in favour of map-reduce over threading?
You can find a related paper on this: Comparing Fork/Join and MapReduce.
The paper compares the performance, scalability and programmability of three parallel paradigms: fork/join, MapReduce, and a hybrid approach.
What they find is basically that Java fork/join has low startup latency and scales well for small inputs (<5 MB), but it cannot process larger inputs due to the size restrictions of shared-memory, single-node architectures. On the other hand, MapReduce has significant startup latency (tens of seconds), but scales well to much larger inputs (>100 MB) on a compute cluster.
Threading offers facilities to partition a task into several subtasks in a recursive-looking fashion: more tiers, the possibility of 'inter-fork' communication at each stage, and much more traditional programming. It does not extend (at least in the paper) beyond a single machine. Great for taking advantage of your eight-core.
M-R only does one big split, with the mapped splits not talking to each other at all, and then reduces everything together. A single tier, no inter-split communication until the reduce, and massively scalable. Great for taking advantage of your share of the cloud.
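A toy Python sketch of that shape (my own example, not from the paper): one partition of the input up front, map tasks that never talk to each other, and a single reduce at the end.

```python
from functools import reduce
from multiprocessing import Pool

def map_task(chunk):
    # Each split is processed completely independently; no talking to other splits.
    return sum(x * x for x in chunk)

def reduce_task(a, b):
    # Everything is combined only once, after all maps have finished.
    return a + b

if __name__ == "__main__":
    data = list(range(1_000_000))
    splits = [data[i::4] for i in range(4)]          # one big split, up front
    with Pool(processes=4) as pool:
        partials = pool.map(map_task, splits)        # map phase, no inter-split IPC
    total = reduce(reduce_task, partials)            # single reduce tier
    print(total)
```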
Map-reduce adds tons of overhead, but it can coordinate a large fleet of machines for an "embarrassingly parallel" use case. Threading is only worth it when you have multiple cores and only a single host, but there are many frameworks that add layers of abstraction above raw threads (e.g. Concurrent, Akka) and are generally easier to work with.

Examples of simple stats calculation with hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good examples of how to calculate mean and variance using Hadoop, and/or provide some sample code?
Thanks
Pig Latin has an associated library of reusable code called PiggyBank, which has numerous handy functions. Unfortunately it didn't have variance the last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to implement in a stable way over huge data sets, so take care!
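To make that stability caveat concrete, here is a hedged sketch in plain Python/NumPy (not PiggyBank code) of the usual remedy: each mapper emits (count, mean, M2) for its partition, and the reducer merges partitions pairwise with the Chan et al. update, which avoids the catastrophic cancellation of the naive sum-of-squares formula.

```python
import numpy as np

def map_partition(values):
    # Per-partition sufficient statistics: count, mean, and sum of squared deviations.
    values = np.asarray(values, dtype=float)
    return len(values), values.mean(), ((values - values.mean()) ** 2).sum()

def combine(a, b):
    # Chan et al. pairwise update: numerically stable merge of two partitions.
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(loc=1e9, scale=1.0, size=400_000)   # huge mean, tiny variance
    parts = np.array_split(data, 8)

    n, mean, m2 = map_partition(parts[0])
    for p in parts[1:]:
        n, mean, m2 = combine((n, mean, m2), map_partition(p))

    print("mean     =", mean)
    print("variance =", m2 / (n - 1))   # matches np.var(data, ddof=1)
```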
You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these github projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).

What's the opposite of "embarrassingly parallel"?

According to Wikipedia, an "embarrassingly parallel" problem is one for which little or no effort is required to separate the problem into a number of parallel tasks. Raytracing is often cited as an example because each ray can, in principle, be processed in parallel.
Obviously, some problems are much harder to parallelize. Some may even be impossible. I'm wondering what terms are used and what the standard examples are for these harder cases.
Can I propose "Annoyingly Sequential" as a possible name?
Inherently sequential.
Example: adding more women will not reduce the length of a pregnancy.
There's more than one opposite of an "embarrassingly parallel" problem.
Perfectly sequential
One opposite is a non-parallelizable problem, that is, a problem for which no speedup may be achieved by utilizing more than one processor. Several suggestions were already posted, but I'd propose yet another name: a perfectly sequential problem.
Examples: I/O-bound problems, "calculate f^1000000(x0)" type problems (iterated function application), calculating certain cryptographic hash functions.
Communication-intensive
Another opposite is a parallelizable problem which requires a lot of parallel communication (a communication-intensive problem). An implementation of such a problem will scale properly only on a supercomputer with high-bandwidth, low-latency interconnect. Contrast this with embarrassingly parallel problems, implementations of which run efficiently even on systems with very poor interconnect (e.g. farms).
A notable example of a communication-intensive problem: solving A x = b, where A is a large, dense matrix. As a matter of fact, an implementation of this problem is used to compile the TOP500 ranking. It's a good benchmark, as it emphasizes both the computational power of individual CPUs and the quality of the interconnect (due to the intensity of communication).
In more practical terms, any mathematical model which solves a system of partial differential equations on a regular grid using discrete time stepping (think: weather forecasting, in-silico crash tests) is parallelizable by domain decomposition. That means each CPU takes care of a part of the grid, and at the end of each time step the CPUs exchange their results on region boundaries with their "neighbour" CPUs. These exchanges make this class of problems communication-intensive.
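A toy NumPy illustration of that exchange pattern (my own example, a single process standing in for multiple ranks): a 1-D heat equation stepped explicitly under domain decomposition, where at every time step each "CPU" needs its neighbours' boundary cells before it can advance its own subdomain.

```python
import numpy as np

def step(subdomain, left_ghost, right_ghost, alpha=0.1):
    # Explicit finite-difference update; the ghost cells come from neighbour ranks.
    padded = np.concatenate(([left_ghost], subdomain, [right_ghost]))
    return subdomain + alpha * (padded[:-2] - 2 * subdomain + padded[2:])

if __name__ == "__main__":
    grid = np.zeros(400)
    grid[180:220] = 1.0                          # initial heat spike
    subdomains = np.array_split(grid, 4)         # one chunk per "CPU"

    for _ in range(100):                         # every time step requires an exchange
        # "Communication": each rank sends its edge cells to its neighbours.
        ghosts = [(subdomains[i - 1][-1] if i > 0 else 0.0,
                   subdomains[i + 1][0] if i < 3 else 0.0)
                  for i in range(4)]
        subdomains = [step(s, lg, rg) for s, (lg, rg) in zip(subdomains, ghosts)]

    print("total heat:", sum(s.sum() for s in subdomains))
```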
I'm having a hard time not posting this... because I know it doesn't add anything to the discussion... but for all the South Park fans out there:
"Super serial!"
"Stubbornly serial"?
The opposite of embarrassingly parallel is captured by Amdahl's Law, which says that some tasks cannot be parallelized, and that the minimum time a perfectly parallel task will require is dictated by the purely sequential portion of that task.
"standard examples" of sequential processes:
making a baby: "Crash programs fail because they are based on the theory that, with nine women pregnant, you can get a baby a month." -- attributed to Wernher von Braun
calculating pi, e, sqrt(2), and other irrational numbers to millions of digits: most algorithms are sequential
navigation: to get from point A to point Z, you must first go through some intermediate points B, C, D, etc.
Newton's method: you need each approximation in order to calculate the next, better approximation (see the sketch after this list)
challenge-response authentication
key strengthening
hash chain
Hashcash
P-complete (but that's not known for sure yet).
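As a tiny illustration of the Newton's method item above (my own example): each iterate depends on the previous one, so the steps cannot be distributed across workers.

```python
# Each iterate x_{k+1} needs x_k, so no two steps can run in parallel
# (here: Newton's method for sqrt(2), i.e. the roots of x^2 - 2).
def newton_sqrt(a, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = 0.5 * (x + a / x)   # next approximation built from the previous one
    return x

print(newton_sqrt(2.0))  # 1.41421356...
```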
I use "Humiliatingly Sequential"
Paul
If ever one should speculate what it would be like to have natural, incorrigibly sequential problems, try "blissfully sequential" to counter "embarrassingly parallel".
"Gladdengly Sequential"
It all has to do with data dependencies. Embarrassingly parallel problems are ones whose solution is made up of many independent parts. Problems with the opposite nature have massive data dependencies, with little to nothing that can be done in parallel. Degeneratively dependent?
The term I've heard most often is "tightly-coupled", in that each process must interact and communicate often in order to share intermediate data. Basically, each process depends on others to complete their computation.
For example, matrix processing often involves sharing boundary values at the edges of each array partition.
This is in contrast to embarrassingly parallel (or loosely-coupled) problems, where each part of the problem is completely self-contained and no (or very little) IPC is needed. Think master/worker parallelism.
Boastfully sequential.
I've always preferred 'sadly sequential', à la the partition step in quicksort.
"Completely serial?"
It shouldn't really surprise you that scientists think more about what can be done than what cannot be done. Especially in this case, where the alternative to parallelizing is doing everything as one normally would.
Completely non-parallelizable?
Pessimally parallelizable?
The opposite is "disconcertingly serial".
Taking into account that parallelism is the act of doing many jobs in the same time step t, the opposite could be time-sequential problems.
An example of an inherently sequential problem, common in CAD packages and some kinds of engineering analysis: tree traversal with data dependencies between nodes.
Imagine traversing a graph and adding up the weights of the nodes. You just can't parallelise it.
CAD software represents parts as a tree, and to render the object you have to traverse that tree.
For this reason, CAD workstations use fewer, faster cores rather than many cores.
Thanks for reading.
You could, of course; however, I think both 'names' are a non-issue.
From a functional programming perspective, you could say that the 'annoyingly sequential' part is the smallest more or less independent part of an algorithm,
while leaving an 'embarrassingly parallel' part unexploited by a parallel approach is simply bad coding practice.
Thus I don't see the point in giving these things a name, since good coding practice is always to break up your solution into independent pieces, even if you don't take advantage of parallelism at that moment.
