Can machine learning algorithms provided by Spark MLlib, such as naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change the code? Could you provide an example of running them in parallel? I am not sure how parallelism (map) works in MLlib, since each step seems to require the entire training dataset. Does the computation run in parallel on subsets of the training data?
Thanks
These algorithms, as provided by Spark MLlib, do run in parallel automatically. They expect an RDD as input; an RDD is a resilient distributed dataset, spread across a cluster of machines.
Here is an example of using a decision tree for a classification problem.
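A minimal PySpark sketch of that kind of example might look like the following; the LIBSVM path is a placeholder and the parameters are illustrative only:

    from pyspark import SparkContext
    from pyspark.mllib.tree import DecisionTree
    from pyspark.mllib.util import MLUtils

    sc = SparkContext.getOrCreate()

    # Load a LabeledPoint RDD from a LIBSVM file (placeholder path); the RDD is
    # partitioned across the cluster, so training is automatically distributed.
    data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    trainingData, testData = data.randomSplit([0.7, 0.3])

    model = DecisionTree.trainClassifier(
        trainingData,
        numClasses=2,
        categoricalFeaturesInfo={},
        impurity="gini",
        maxDepth=5,
    )

    # Score the held-out split and compute the test error.
    predictions = model.predict(testData.map(lambda lp: lp.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() \
        / float(testData.count())
    print("Test Error = " + str(testErr))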
I highly recommend exploring the Spark MLlib documentation in depth. It has extensive documentation and examples of how to code these algorithms, including generating training and test datasets, scoring, cross-validation, and so on.
These algorithms run in parallel by running computations on the worker nodes' subset of the data, and then sharing the results of those computations across worker nodes and with the master node. The master node collects the results of individual computations and aggregates them as necessary to make decisions based on the entire dataset. Computation heavy activities are mostly executed on the worker nodes.
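To illustrate that pattern (this is not MLlib's actual internals), here is a sketch in which each worker computes a partial gradient over its partition of a hypothetical dataset and the partial results are combined before reaching the driver:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hypothetical (label, feature-vector) pairs, distributed across the cluster.
    points = sc.parallelize([(1.0, np.array([0.5, 1.2])),
                             (0.0, np.array([1.5, 0.3]))] * 1000)

    w = np.zeros(2)  # current model weights, shipped to workers via the closure

    def pointGradient(point):
        label, features = point
        # Squared-loss gradient contribution of a single data point.
        return (float(features.dot(w)) - label) * features

    # map() runs on the workers against their subset of the data;
    # treeReduce() combines the partial sums hierarchically, and only the
    # final aggregate is collected by the driver.
    gradient = points.map(pointGradient).treeReduce(lambda a, b: a + b)
    print(gradient)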
I know Spark has in-memory capabilities that are very useful for iterative jobs, but what if my requirement is traditional batch-processing ETL? Does Spark provide any benefit there? Any pointers on this would help a lot.
How does Spark help me when there is no iterative work and it is just a batch process?
Is there any scenario where MapReduce would perform better than Spark?
Assuming you know MapReduce, consider the following:
Writing a word count in MR when you need to list the top N words: far more work over multiple steps in MR versus 7 or 8 lines in Spark (see the sketch after this list).
For dimension processing à la the dimensional model, things are a lot easier to do in Spark.
Spark Structured Streaming use cases...
Certain tasks with extremely high volumes of data may well be better in MR if you cannot acquire enough hardware or cloud compute resources, i.e. writing to disk and processing per functional step.
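For instance, a top-N word count in Spark really does fit in a handful of lines; the input path and N below are placeholders:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    N = 10

    top_n = (sc.textFile("hdfs:///data/corpus/*.txt")        # placeholder path
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .takeOrdered(N, key=lambda wc: -wc[1]))        # top N by count
    print(top_n)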
From the paper "GraphX: Graph Processing in a Distributed Dataflow Framework" (Gonzalez et al. 2014) I learned that GraphX modified Spark shuffle:
Memory-based Shuffle: Spark’s default shuffle implementation materializes the temporary data to disk. We modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout.
(The paper does not explain anything more on this point.)
It seems that this change aims at optimizing shuffles in the context of highly iterative graph processing algorithms.
How does this "Memory-based shuffle" works exactly, how it differs from the Spark Core's one and what are the pros and cons: why it is well suited for graphx use cases and not for other Spark jobs ?
I failed to understand the big picture directly from GraphX/Spark sources, and I also struggled to find the information out there.
Apart from an ideal answer, comments with links to sources are welcome too.
I failed to understand the big picture directly from GraphX/Spark sources
Because it was never included in the mainstream distribution.
Back when the first GraphX version was developed, Spark used a hash-based shuffle, which was rather inefficient. It was one of the main bottlenecks in Spark jobs, and there was significant research into developing alternative shuffle strategies.
Since GraphX algorithms are iterative and join-based, improving shuffle speed was an obvious path.
Since then, a pluggable shuffle manager has been introduced, as well as a new sort-based shuffle, which finally turned out to be fast enough to make both the hash-based shuffle and the ongoing work on a generic memory-based shuffle obsolete.
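For reference, in the Spark 1.x releases the shuffle implementation was selectable through configuration; the setting below is historical and has since been removed, so treat it only as an illustration of the pluggable shuffle manager:

    from pyspark import SparkConf, SparkContext

    # Historical Spark 1.x setting: "hash" was the original default and
    # "sort" became the default from 1.2 onwards. The setting was removed
    # once sort-based shuffle was the only implementation left.
    conf = (SparkConf()
            .setAppName("ShuffleManagerExample")
            .set("spark.shuffle.manager", "sort"))
    sc = SparkContext(conf=conf)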
Is it possible to run multiple calculations in parallel using Spark?
Example cases that could benefit from that:
running column-wise tasks for a large number of columns. Applying StringIndexer to 10K columns can benefit from running each calculation on a single worker and having as many workers as possible each working on a single column.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
    spark.read.load(in_path).select('column').distinct().write.save(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .NET's Parallel.ForEach(), but for a cluster environment rather than the simpler multi-threading case.
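For the second case, a rough sketch of what I have in mind would be to drive many small, independent Spark jobs from driver-side threads (long_ds_list and the paths are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ManySmallJobs").getOrCreate()

    # Placeholder list of (input, output) paths.
    long_ds_list = [("in/a", "out/a"), ("in/b", "out/b")]

    def distinctColumn(paths):
        in_path, out_path = paths
        (spark.read.load(in_path)
              .select("column")
              .distinct()
              .write.mode("overwrite")
              .save(out_path))

    # Each call triggers its own Spark job; with enough executor slots the
    # scheduler can run several of them concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(distinctColumn, long_ds_list))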
I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it would take ages, but it would require quite a lot of effort, since it is below the developer-facing APIs of Spark SQL, Spark MLlib, Structured Streaming and the like.
You'd have to use the Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for a large number of columns. Applying StringIndexer to 10K columns can benefit from running each calculation on a single worker and having as many workers as possible each working on a single column.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).
I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a Spark DataFrame of approximately 50 GB, read in from LIBSVM format, and I'm running on a dynamically allocated YARN cluster with 10 GB of executor memory. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go, but the typical stage runtime seems to be about 5 seconds. Is there anything I can look at to improve performance of these fits?
As with any Spark job, there are a few things you can look at to improve your training time (an illustrative configuration follows the list below).
spark.driver.memory: keep an eye on your driver memory. Some algorithms ship data back to the driver (in order to reduce computing time), so it can be a source of improvement, or at least a point of failure to watch.
spark.executor.memory: set it to the maximum the job needs, but also as low as you can get away with, so you can fit more executors on each node (machine) in the cluster; with more workers, you have more compute power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate the data, try different values for this parameter so you can execute more tasks per executor.
spark.executor.cores: keep it at 5 or below and you're good; above that, you will probably increase the time an executor spends juggling the tasks inside it.
cache/persist: try to persist your data before huge transformations. If you're afraid your executors can't hold it all in memory, use StorageLevel.MEMORY_AND_DISK so you can use both.
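For illustration only, here is roughly how those knobs fit together in PySpark; the values are examples to experiment with, not recommendations, and note that driver memory generally has to be set at submit time rather than in code:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("TrainingTuningSketch")
             # spark.driver.memory usually needs to be passed to spark-submit
             # or spark-defaults.conf; shown here only to list the knob.
             .config("spark.driver.memory", "8g")
             .config("spark.executor.memory", "10g")         # as small as the job allows
             .config("spark.executor.cores", "4")            # keep at or below 5
             .config("spark.sql.shuffle.partitions", "400")  # experiment with this value
             .getOrCreate())

    df = spark.read.format("libsvm").load("data/train.libsvm")  # placeholder path
    df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if memory is tight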
Important: all of this is based purely on my own experience training algorithms with Spark ML on datasets of 1-5 TB with 30-50 features. I've researched ways to improve my own jobs, but I'm not qualified to be a source of truth for your problem. Learn more about your data and watch the logs of your executors for further improvements.
Apache Spark provides data-parallel implementations of machine learning algorithms. It has also started to support task parallelization of machine learning algorithms, e.g. cross-validated parameter tuning using Spark's scikit-learn integration package: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html
My question is, what would be the recommended approach(es) for combining these two modes of parallelism in Spark:
Execute the ML algorithm on distributed data, and
Execute multiple instances of the algorithm (using different tuning parameters) in the same framework?
Keep in mind that task parallelism in this context still involves a 'gather' stage where the results (e.g. cross-validated errors) from each task must be combined (e.g. minimized), so the tasks are not completely independent.
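To make the question concrete, here is a sketch of the combination I have in mind, using Spark ML's CrossValidator; the parallelism setting (available only in newer Spark releases) and all paths and parameters are illustrative, and whether this is the recommended approach is exactly what I am asking:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CombinedParallelism").getOrCreate()
    train = spark.read.format("libsvm").load("data/train.libsvm")  # placeholder path

    lr = LogisticRegression(maxIter=50)
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    # Each individual fit is data-parallel over the DataFrame, while up to
    # 'parallelism' parameter settings are evaluated concurrently; the final
    # model selection is the 'gather' step that combines the fold metrics.
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3,
                        parallelism=4)
    cvModel = cv.fit(train)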