Spark on a single node: speed improvement - apache-spark

Is there any sense to use Spark (in particular, MLlib) on a single node (besides the goal of learning this technology)?
Is there any improvement in speed?

Are you comparing this to using a non-Spark machine learning system?
It really depends on the capabilities of the other library you might use.
If, for example, you've got all your training data stored in Parquet files, then Spark makes it very easy to read in those files and work with them, whether that's on 1 machine or 100.
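For instance, a minimal PySpark sketch of that workflow might look like the following; the local[*] master, the Parquet path, and the "features"/"label" column names are all illustrative assumptions, and the same code would run unchanged against a cluster master instead.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    # local[*] runs Spark on a single machine, using all available cores
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("single-node-mllib")
             .getOrCreate())

    # Assumed path; training data with a "features" vector column and a "label" column
    train = spark.read.parquet("/data/train.parquet")

    # Fit an MLlib model; only the master URL changes between 1 machine and 100
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = lr.fit(train)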

Related

Binary file conversion in distributed manner - Spark, Flume or any other option?

We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type, to be exact). We want to convert these binary files to a different format, say XML or JSON, and write them to a different location. I was wondering what the best architectural design would be to handle this kind of problem. I know we could use a Spark cluster for CSV, JSON, or Parquet kinds of files, but I'm not sure whether we could use it for binary file processing; alternatively, we could use Apache Flume to move files from one place to another and even use an interceptor to convert the contents.
Ideally, we should be able to switch the ASN.1 decoder whenever performance demands it, without changing the underlying distributed-processing framework (e.g. to use a C++-based, Python-based, or Java-based decoder library).
In terms of scalability, reliability and future-proofing your solution, I'd look at Apache NiFi rather than Flume. You can start by developing your own ASN.1 processor, or try using the patch that's already available but not yet part of a released version.
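That said, if you do want to try the Spark route for the conversion itself, a rough sketch could look like the one below; the decode_asn1() function is a hypothetical stand-in for whichever C++/Python/Java decoder you plug in, and the input/output paths are illustrative.

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("asn1-to-json").getOrCreate()
    sc = spark.sparkContext

    def decode_asn1(raw_bytes):
        # Placeholder decoder: swap in any real ASN.1 library here
        return {"length": len(raw_bytes)}

    # binaryFiles yields (path, content-as-bytes) pairs, one per input file
    records = sc.binaryFiles("hdfs:///incoming/asn1/")   # assumed input location
    json_lines = records.map(lambda kv: json.dumps(decode_asn1(kv[1])))
    json_lines.saveAsTextFile("hdfs:///converted/json/")  # assumed output location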

Design application using Apache Spark

This is a bit of an architectural question. I need to design an application using Spark and Scala as the primary tools, and I want to minimise manual intervention as much as possible.
I will receive a zip containing multiple files with different structures as input at a regular interval, say daily. I need to process it using Spark and, after the transformations, move the data to a back-end database.
I want to understand the best way to design this application.
What would be the best way to process the zip?
Can Spark Streaming be considered an option, given the frequency of the files?
What other options should I take into consideration?
Any guidance would be really appreciated.
It's a broad question; there are batch options and stream options, and I'm not sure of your exact requirements. You could start your research here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
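As a rough illustration of the streaming option, here is a minimal Structured Streaming sketch using the file source the link above describes. It assumes the daily zip has already been unpacked into a landing directory of JSON files (Spark does not read zip archives directly), and the schema, paths and JDBC target are made-up examples.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType, IntegerType

    spark = SparkSession.builder.appName("daily-file-ingest").getOrCreate()

    # Assumed schema for one of the file types inside the zip
    schema = (StructType()
              .add("id", IntegerType())
              .add("payload", StringType()))

    # File source: picks up new files as they land in the directory
    stream = spark.readStream.schema(schema).json("/landing/unzipped/")

    def write_to_db(batch_df, batch_id):
        # foreachBatch lets you reuse the batch JDBC writer per micro-batch
        (batch_df.write
            .format("jdbc")
            .option("url", "jdbc:postgresql://dbhost/appdb")  # assumed target
            .option("dbtable", "ingested_records")
            .option("user", "etl")
            .option("password", "***")
            .mode("append")
            .save())

    query = stream.writeStream.foreachBatch(write_to_db).start()
    query.awaitTermination()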

How can I know if Apache Spark is the right tool?

Just wondering, is there a list somewhere of questions to ask ourselves in order to know whether Spark is the right tool or not?
Once again I spent part of the week implementing a POC with Apache Spark in order to compare the performance against pure python code and I was baffled when I saw the 1/100 ratio (in favor of python).
I know that Spark is a "big data" tool and everyone keeps saying "Spark is the right tool to process TB/PB of data" but I think that is not the only thing to take into account.
In brief, my question is: when given small data as input, how can I know whether the computation will be heavy enough that Spark can actually improve things?
I'm not sure if there is such a list, but if there was, the first question would probably be
Does your data fit on a single machine?
And if the answer is 'Yes', you do not need Spark.
Spark is designed to process amounts of data that cannot be handled by a single machine, in a fault-tolerant manner, as an alternative to Hadoop.
Operating in a distributed manner carries a lot of overhead, such as fault tolerance and network communication, which causes the apparent slow-down compared to traditional tools on a single machine.
Just because Spark can be used as a parallel processing framework on a small dataset does not mean it should be. You will get faster results with less complexity by using, say, plain Python and parallelizing your processing with threads or processes.
Spark excels when you have to process a dataset that does not fit onto a single machine, when the processing is complex and time-consuming, and when the probability of encountering an infrastructure issue is high enough that a failure would mean starting again from scratch.
Comparing Spark to native Python is like comparing a locomotive to a bicycle. A bicycle is fast and agile, until you need to transport several tonnes of steel from one end of the country to the other: then - not so fun.
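To make the "plain Python" alternative concrete, here is a minimal sketch using a process pool (processes rather than threads, since the GIL limits CPU-bound threading in Python); process_record() is a hypothetical stand-in for your actual per-record work.

    from concurrent.futures import ProcessPoolExecutor

    def process_record(x):
        # Stand-in for real CPU-bound per-record work
        return x * x

    records = range(1_000_000)

    # Parallelize across local cores without any cluster overhead
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_record, records, chunksize=10_000))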

What are the common use cases for which Spark MLlib should not be used

I am interested in knowing the use cases in which Spark MLlib shouldn't be used.
As a rule of thumb, you should reconsider your choice when:
You need an exact solution or a well-defined error bound. Spark MLlib typically uses heuristics, additionally adjusted for the Spark architecture. Some give very good results in general; others may require complex tuning.
You have thin data / a low number of dimensions (up to a few thousand), or your data fits in the memory of a single node (easily 256 GB to 512 GB today). An optimized machine learning / linear algebra library usually performs much better than Spark under these conditions.
You want to collect detailed diagnostic information during the training process. MLlib algorithms are usually black boxes.
The model is to be used outside Spark. Export options are fairly limited.

MC-Stan on Spark?

I hope to use MC-Stan on Spark, but it seems there is no related page to be found on Google.
I wonder whether this approach is even possible on Spark at all, so I would appreciate it if someone could let me know.
Moreover, I also wonder what the widely used approach to MCMC on Spark is. I hear Scala is widely used, but I need a language that has a decent MCMC library, such as MC-Stan.
Yes, it's certainly possible, but it requires a bit more work. Stan (and the popular MCMC tools I know of) is not designed to run in a distributed setting, via Spark or otherwise. In general, distributed MCMC is an area of active research. For a recent review, I'd recommend section 4 of Patterns of Scalable Bayesian Inference (PoFSBI). There are multiple ways you might want to split up a big MCMC computation, but I think one of the more straightforward ways would be splitting up the data and running an off-the-shelf tool like Stan, with the same model, on each partition. Each model will produce a subposterior, which can be reduce'd together to form a posterior. PoFSBI discusses several ways of combining such subposteriors.
I've put together a very rough proof of concept using pyspark and pystan (Python is the common language with the most Stan and Spark support). It's a rough and limited implementation of the weighted-average consensus algorithm in PoFSBI, running on the tiny 8-schools dataset. I don't think this example would be very useful in practice, but it should give some idea of what might be necessary to run Stan as a Spark program: partition the data, run Stan on each partition, and combine the subposteriors.
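To give a flavour of that partition-and-combine pattern, here is a stripped-down sketch along the same lines, using pyspark with PyStan 2's StanModel/sampling API. The toy normal-mean model, the synthetic data and the naive unweighted averaging of subposterior draws are all illustrative simplifications, not the weighted consensus algorithm from the paper.

    import numpy as np
    from pyspark.sql import SparkSession

    # Toy Stan model: estimate the mean of normally distributed data
    STAN_CODE = """
    data { int<lower=0> N; vector[N] y; }
    parameters { real mu; }
    model { y ~ normal(mu, 1); }
    """

    def fit_partition(rows):
        import pystan  # imported on the executors
        y = np.array(list(rows), dtype=float)
        if y.size == 0:
            return []
        sm = pystan.StanModel(model_code=STAN_CODE)
        fit = sm.sampling(data={"N": int(y.size), "y": y}, iter=1000, chains=2)
        return [fit.extract()["mu"]]  # subposterior draws for this partition

    spark = SparkSession.builder.appName("stan-on-spark").getOrCreate()
    data = spark.sparkContext.parallelize(
        np.random.normal(3.0, 1.0, 10000).tolist(), numSlices=4)

    # One Stan run per partition, then a naive combination of the subposteriors
    subposteriors = data.mapPartitions(fit_partition).collect()
    combined_mu = np.mean([draws.mean() for draws in subposteriors])
    print(combined_mu)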
