What are the common use cases for which Spark MLlib should not be used - apache-spark

I am interested in knowing the use cases in which Spark MLlib shouldn't be used.

As a rule of thumb, you should reconsider your choice when:
You need an exact solution or a well-defined error bound. Spark MLlib typically uses heuristics, additionally adjusted for the Spark architecture. Some give very good results in general; others may require complex tuning.
You have thin data / a low number of dimensions (up to a few thousand), or the data fits in the memory of a single node (easily 256GB - 512GB today). An optimized machine learning / linear algebra library usually performs much better than Spark under these conditions.
You want to collect detailed diagnostic information during the training process. MLlib algorithms are usually black boxes.
The model is to be used outside Spark. Export options are fairly limited.
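To illustrate the "exact solution" point: when the data fits in one node's memory, a plain numpy least-squares solve gives an exact answer with a well-defined residual and no iterative tuning. A minimal sketch on synthetic data (all numbers illustrative):

```python
import numpy as np

# Synthetic regression data that easily fits in one node's memory.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.01 * rng.normal(size=10_000)

# Exact least-squares solution via lstsq -- a direct linear-algebra
# solve with a well-characterized residual, no iterative heuristics.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, true_w, atol=1e-3))  # coefficients recovered
```

Spark MLlib's iterative solvers can reach comparable answers, but convergence behavior and stopping criteria are what you are giving up relative to a direct single-node solve.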

Related

Is there a way to use fastText's word representation process in parallel?

I am new to fastText, a library for efficient learning of word representations and sentence classification. I am trying to generate word vectors for a huge data set, but with a single process it takes a significantly long time.
So let me put my questions clearly:
Are there any options which I can use to speed up a single fastText process?
Is there any way to generate word vectors in parallel fastText processes?
Are there any other implementations or workarounds available which can solve the problem? I read that a caffe2 implementation is available, but I am unable to find it.
Thanks
I understand your question to mean that you would like to distribute fastText and do parallel training.
As mentioned in Issue #144
... a future feature we might consider implementing. For now it's not on our list of priorities, but it might very well soon.
Apart from the Word2Vec Spark implementation also mentioned there, I am not aware of any other implementations.
The original fastText release by Facebook includes a command-line option, thread (default 12), which controls the number of worker threads that will do parallel training (on a single machine). If you have more CPU cores, and haven't yet tried increasing it, try that.
The gensim implementation (as gensim.models.fasttext.FastText) includes an initialization parameter, workers, which controls the number of worker threads. If you haven't yet tried increasing it, up to the number of cores, it may help. However, due to extra multithreading bottlenecks in its Python implementation, if you have a lot of cores (especially 16+), you might find maximum throughput with fewer workers than cores – often something in the 4-12 range. (You have to experiment & watch the achieved rates via logging to find the optimal value, and all cores won't be maxed.)
You'll only get significant multithreading in gensim if your installation is able to make use of its Cython-optimized routines. If you watch the logging when you install gensim via pip or similar, there should be a clear error if this fails. Or, if you are watching logs/output when loading/using gensim classes, there will usually be a warning if the slower non-optimized versions are being used.
Finally, often in the ways people use gensim, the bottleneck can be in their corpus iterator or IO rather than the parallelism. To minimize this slowdown:
Check how fast your corpus can iterate over all examples separately from passing it to the gensim class.
Avoid doing any database-selects or complicated/regex preprocessing/tokenization in the iterator – do it once, and save the easy-to-read-as-tokens resulting corpus somewhere.
If the corpus is coming from a network volume, test if streaming it from a local volume helps. If coming from a spinning HD, try an SSD.
If the corpus can be made to fit in RAM, perhaps on a special-purpose giant-RAM machine, try that.
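The steps above can be sanity-checked with a quick timing harness. The sketch below (synthetic corpus, names illustrative) compares tokenizing inside the iterator on every pass against iterating a pre-tokenized corpus, which is the pattern the second bullet recommends:

```python
import re
import time

# Hypothetical raw corpus; in practice this would stream from disk.
raw_docs = ["The quick brown fox jumps over the lazy dog."] * 5000

# Slow pattern: regex tokenization inside the iterator,
# repeated on every training epoch.
def slow_corpus():
    for doc in raw_docs:
        yield re.findall(r"[a-z]+", doc.lower())

# Fast pattern: tokenize once up front, then iterate plain lists.
pre_tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in raw_docs]

def timed(corpus):
    start = time.perf_counter()
    n = sum(1 for _ in corpus)
    return n, time.perf_counter() - start

n_slow, t_slow = timed(slow_corpus())
n_fast, t_fast = timed(iter(pre_tokenized))
print(f"in-iterator: {t_slow:.4f}s, pre-tokenized: {t_fast:.4f}s for {n_slow} docs")
```

If the bare iteration is already slow, no amount of worker tuning in gensim will help; fix the iterator first.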

How can I know if Apache Spark is the right tool?

Just wondering, is there somewhere a list of questions to ask ourselves in order to know whether Spark is the right tool or not?
Once again I spent part of the week implementing a POC with Apache Spark in order to compare the performance against pure Python code, and I was baffled when I saw the 1/100 ratio (in favor of Python).
I know that Spark is a "big data" tool and everyone keeps saying "Spark is the right tool to process TB/PB of data", but I think that is not the only thing to take into account.
In brief, my question is: when given small data as input, how can I know if the computation will be demanding enough that Spark can actually improve things?
I'm not sure if there is such a list, but if there was, the first question would probably be
Does your data fit on a single machine?
And if the answer is 'Yes', you do not need Spark.
Spark is designed to process data so large that it cannot be handled by a single machine, as an alternative to Hadoop, in a fault-tolerant manner.
There are many overheads associated with operating in a distributed manner, such as fault tolerance and network communication, which cause the apparent slowdown compared to traditional tools on a single machine.
Just because Spark can be used as a parallel processing framework on a small dataset does not mean that it should be used that way. You will get faster results and less complexity by using, say, Python, and parallelizing your processing using threads.
Spark excels when you have to process a dataset that does not fit onto a single machine, when the processing is complex and time-consuming, and when the probability of encountering an infrastructure issue is high enough that a failure would otherwise mean starting again from scratch.
Comparing Spark to native Python is like comparing a locomotive to a bicycle. A bicycle is fast and agile, until you need to transport several tonnes of steel from one end of the country to the other: then - not so fun.
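As a minimal sketch of that alternative, Python's standard concurrent.futures can parallelize small-data processing on one machine with none of Spark's scheduler, shuffle, or serialization overhead (the task and sizes here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import math

def task(n):
    # Stand-in for a moderately heavy computation on a small input.
    # For truly CPU-bound pure-Python work, ProcessPoolExecutor
    # would sidestep the GIL; the API is identical.
    return sum(math.sqrt(i) for i in range(n))

inputs = [100_000] * 8

# Local parallelism on one machine: no cluster, no shuffles,
# no data serialization to remote executors.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, inputs))

print(len(results))
```

For small inputs, the entire overhead here is a few thread handoffs, versus the job/stage/task machinery Spark spins up even for a trivial computation.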

MC-Stan on Spark?

I hope to use MC-Stan on Spark, but it seems there is no related page searched by Google.
I wonder if this approach is even possible on Spark, therefore I would appreciate if someone let me know.
Moreover, I also wonder what the widely used approach is to running MCMC on Spark. I heard Scala is widely used, but I need a language that has a decent MCMC library such as MC-Stan.
Yes it's certainly possible but requires a bit more work. Stan (and popular MCMC tools that I know of) are not designed to be run in a distributed setting, via Spark or otherwise. In general, distributed MCMC is an area of active research. For a recent review, I'd recommend section 4 of Patterns of Scalable Bayesian Inference (PoFSBI). There are multiple possible ways you might want to split up a big MCMC computation but I think one of the more straightforward ways would be splitting up the data and running an off-the-shelf tool like Stan, with the same model, on each partition. Each model will produce a subposterior which can be reduce'd together to form a posterior. PoFSBI discusses several ways of combining such subposteriors.
I've put together a very rough proof of concept using pyspark and pystan (python is the common language with the most Stan and Spark support). It's a rough and limited implementation of the weighted-average consensus algorithm in PoFSBI, running on the tiny 8-schools dataset. I don't think this example would be practically very useful but it should provide some idea of what might be necessary to run Stan as a Spark program: partition data, run stan on each partition, combine the subposteriors.
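To give a flavor of the combine step without requiring pystan, here is a hedged sketch of the weighted-average (consensus) rule from PoFSBI: it fakes each partition's Stan output with Gaussian subposterior draws for a scalar parameter and combines them with inverse-variance weights (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend each Spark partition ran Stan on its data shard and
# returned posterior draws for a scalar parameter. Here we fake
# three Gaussian subposteriors centered at slightly different means.
subposterior_draws = [rng.normal(loc=mu, scale=1.0, size=2_000)
                      for mu in (0.8, 1.0, 1.2)]

# Consensus weighted average: weight each subposterior's draws
# by its precision (inverse variance), then sum draw-by-draw.
precisions = np.array([1.0 / np.var(d) for d in subposterior_draws])
weights = precisions / precisions.sum()
combined = sum(w * d for w, d in zip(weights, subposterior_draws))

print(round(float(combined.mean()), 2))  # near the pooled center
```

In a real Spark program the per-partition sampling would happen inside mapPartitions and the draw arrays would be reduced on the driver; this sketch only shows the arithmetic of the combination.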

Spark on a single node: speed improvement

Is there any sense to use Spark (in particular, MLlib) on a single node (besides the goal of learning this technology)?
Is there any improvement in speed?
Are you comparing this to using a non-Spark machine learning system?
It really depends what the capabilities are of the other library you might use.
If, for example, you've got all your training data stored in Parquet files, then Spark makes it very easy to read in those files and work with them, whether that's on 1 machine or 100.

MapReduce - Anything else except word-counting?

I have been looking at MapReduce and reading through various papers about it and applications of it, but, to me, it seems that MapReduce is only suitable for a very narrow class of scenarios that ultimately result in word-counting.
If you look at the original paper Google's employees provide "various" potential use cases, like "distributed grep", "distributed sort", "reverse web-link graph", "term-vector per host", etc.
But if you look closer, all those problems boil down to simply "counting words" - that is, counting the number of occurrences of something in a chunk of data, then aggregating/filtering and sorting that list of occurrences.
There are also some cases where MapReduce has been used for genetic algorithms or relational databases, but they don't use the "vanilla" MapReduce published by Google. Instead, they introduce further steps along the Map-Reduce chain, like Map-Reduce-Merge, etc.
Do you know of any other (documented?) scenarios where "vanilla" MapReduce has been used to perform more than mere word-counting? (Maybe for ray-tracing, video-transcoding, cryptography, etc. - in short anything "computation heavy" that is parallelizable)
Atbrox has been maintaining a list of MapReduce/Hadoop algorithms in academic papers. Here is the link. All of these could be applied for practical purposes.
MapReduce is good for problems that can be considered to be embarrassingly parallel. There are a lot of problems that MapReduce is very bad at, such as those that require lots of all-to-all communication between nodes. E.g., fast Fourier transforms and signal correlation.
There are projects using MapReduce for parallel computations in statistics. For instance, Revolution Analytics has started an RHadoop project for use with R. Hadoop is also used in computational biology and in other fields with large datasets that can be analyzed by many discrete jobs.
I am the author of one of the packages in RHadoop, and I wrote several examples distributed with the source and used in the tutorial: logistic regression, linear least squares, matrix multiplication, etc. There is also a paper I would like to recommend: http://www.mendeley.com/research/sorting-searching-simulation-mapreduce-framework/
That paper seems to strongly support the equivalence of MapReduce with classic parallel programming models such as PRAM and BSP. I often write MapReduce algorithms as ports from PRAM algorithms; see for instance blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html. So I think the scope of MapReduce is clearly more than "embarrassingly parallel", but not infinite. I have myself experienced some limitations, for instance in speeding up some MCMC simulations; of course it could have been me not using the right approach.
My rule of thumb is the following: if the problem can be solved in parallel in O(log(N)) time on O(N) processors, then it is a good candidate for MapReduce, with O(log(N)) jobs and constant time spent in each job. Other people, and the paper I mentioned, seem to focus more on the O(1)-jobs case. When you go beyond O(log(N)) time, the case for MR seems to get a little weaker, but some limitations may be inherent in the current implementation (high job overhead) rather than fundamental. It's a pretty fascinating time to be working on charting the MR territory.
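As one concrete instance of vanilla MapReduce beyond word counting, linear least squares fits the pattern: each mapper emits the sufficient statistics (X_i^T X_i, X_i^T y_i) for its data chunk, a single associative reduce sums them, and the driver solves the normal equations. A sketch using plain Python map/reduce over numpy chunks (names and sizes illustrative):

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
X = rng.normal(size=(6_000, 5))
beta = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
y = X @ beta + 0.01 * rng.normal(size=6_000)

# Split the data into chunks, as a distributed filesystem would.
chunks = [(X[i:i + 1_000], y[i:i + 1_000]) for i in range(0, 6_000, 1_000)]

def mapper(chunk):
    # Emit this chunk's sufficient statistics: (X^T X, X^T y).
    Xc, yc = chunk
    return Xc.T @ Xc, Xc.T @ yc

def reducer(a, b):
    # Sums are associative and commutative, so this is a valid reduce.
    return a[0] + b[0], a[1] + b[1]

XtX, Xty = reduce(reducer, map(mapper, chunks))
beta_hat = np.linalg.solve(XtX, Xty)  # normal equations on the driver
print(np.allclose(beta_hat, beta, atol=1e-3))
```

The map output per chunk is a small fixed-size matrix regardless of chunk size, which is exactly the shape of problem where a single MapReduce job does real numerical work, not just counting.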