Apache Spark provides data-parallel implementations of machine learning algorithms. It has also started to support task parallelization of machine learning algorithms, e.g. for cross-validated parameter tuning, using Spark's scikit-learn integration package: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html
My question is, what would be the recommended approach(es) for combining these two modes of parallelism in Spark:
Execute the ML algorithm on distributed data, and
Execute multiple instances of the algorithm (with different tuning parameters) within the same framework?
Keep in mind that task parallelism in this context still involves a 'gather' stage where results (e.g., cross-validated errors) from each task must be combined (e.g., minimized), so the tasks are not completely independent.
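For concreteness, here is a rough sketch (placeholder names, e.g. training_df) of the kind of combination I have in mind; in newer Spark versions, ML Pipelines' CrossValidator exposes a parallelism parameter that fits several parameter settings concurrently while each individual fit stays data-parallel, then gathers the cross-validated metrics to pick the best model:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .build())

    cv = CrossValidator(
        estimator=lr,                   # each fit is data-parallel over the cluster
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=3,
        parallelism=4)                  # evaluate up to 4 parameter settings at a time

    # model = cv.fit(training_df)       # training_df: a placeholder DataFrame with
                                        # 'features' (Vector) and 'label' columns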
Related
I would like to confirm some aspects that, from reading blogs, Databricks sources, and the experts Holden, Warren et al., still seem poorly explained. I also note some unanswered questions on the net regarding this topic.
The following then:
My understanding is that for RDDs the DAG Scheduler creates the Stages in a straightforward manner.
My understanding, based on reading elsewhere to date, is that for DFs and DSs we:
Use Catalyst instead of the DAG Scheduler. But I am not convinced.
Have Tungsten.
As the DAG applies to DFs and DSs as well (obviously), I am left with one question, just to be sure:
Is it Catalyst that creates the Stages as well?
This may seem a silly question, but I noted a question on Disable Spark Catalyst Optimizer here on SO. That would imply no.
In addition, as the Spark paradigm is Stage based (shuffle boundaries), it seems to me that deciding Stages is not a Catalyst thing.
Therefore my conclusion is that the DAG Scheduler is still used for Stages with DF's and DS's, but I am looking for confirmation.
Moreover, this picture implies that there is still a DAG Scheduler.
This picture from the Databricks 2019 summit seems in contrast to the statement found on a blog:
An important element helping Dataset to perform better is Catalyst Optimizer (CO), an internal query optimizer. It "translates" transformations used to build the Dataset to physical plan of execution. Thus, it's similar to DAG scheduler used to create physical plan of execution of RDD. ...
I see many unanswered questions on SO about DAGs with DFs etc., and a lot of the material is out of date as it is RDD-related. So I asked around a few of my connections with Spark knowledge and noted that they could not provide a suitable answer.
Let me try to clarify these terms for you.
Spark Scheduler is responsible for scheduling tasks for execution. It manages where jobs will be scheduled, whether they will run in parallel, etc. The Spark Scheduler works together with the Block Manager and the Cluster Backend to efficiently utilize cluster resources for high performance of various workloads. The DAGScheduler is a part of this.
Catalyst is the optimizer component of Spark. It performs query optimizations and creates multiple candidate execution plans, out of which the most optimized one is selected; the chosen physical plan is ultimately executed in terms of RDDs.
Tungsten is the umbrella project that was focused on improving the CPU and memory utilization of Spark applications.
DAGScheduler is responsible for the generation of stages and their scheduling. It breaks each RDD graph at shuffle boundaries, based on whether the dependencies are "narrow" or shuffle dependencies. It also determines where each task should be executed based on the current cache status.
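A quick way to see this division of labour is to look at what Catalyst emits for a query that contains a shuffle; a minimal sketch:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

    # A query with a shuffle: the groupBy forces an Exchange in the physical plan.
    df = spark.range(1_000_000).groupBy((F.col("id") % 10).alias("k")).count()

    # Catalyst's output: the parsed, analyzed and optimized logical plans plus the
    # selected physical plan. The Exchange nodes printed here are the shuffle
    # boundaries at which the DAGScheduler later cuts the RDD graph into stages.
    df.explain(True)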
Well, I searched a bit more and found a 'definitive' source: the Spark Summit 2019 slides from David Vrba. They are about Spark SQL and show the DAG Scheduler. So we can conclude that Catalyst does not decide anything about Stages. Or is this wrong, and are the above answer and the statement below correct? My take is that the picture implies otherwise, so no.
The statement I read elsewhere on Catalyst: ...
An important element helping Dataset to perform better is Catalyst Optimizer (CO), an internal query optimizer. It "translates" transformations used to build the Dataset to physical plan of execution. Thus, it's similar to DAG scheduler used to create physical plan of execution of RDD. ...
was a little misleading. That said, checking elsewhere to be sure revealed no clear statements until this slide.
Is there some rule of thumb or best practice regarding the number of stages in a Spark job?
When do you consider breaking a job into smaller pieces?
I found smaller jobs easier to analyze and optimize, but on the other hand loading/extracting data between each job comes with a cost.
There is no hard rule about the optimal number of Stages for a Spark App.
The functionality you implement is what dictates the number of Stages.
Certain aspects result in Stages due to the Spark Architecture - which makes sense.
Catalyst & Tungsten optimize and fuse code, but they cannot obviate "shuffle boundaries", each of which means a new Stage. That is also not their task; the DAG Scheduler (under the hood for Dataframes) does that.
You can .cache things to reduce re-computation for subsequent Actions in a Spark App, but that has a certain cost as well.
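A tiny sketch of that trade-off (toy numbers):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cache-cost").getOrCreate()

    df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)

    df.cache()                                      # keep computed rows around...
    print(df.count())                               # first Action materializes the cache
    print(df.filter(F.col("bucket") == 7).count())  # later Actions reuse it
    df.unpersist()                                  # ...at the cost of executor memory/disk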
You can use operations that reduce "shuffling", e.g. reduceByKey for legacy RDDs.
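For example, reduceByKey shuffles far less data than the equivalent groupByKey; a minimal RDD sketch (toy data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-reduction").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)] * 1000)

    # groupByKey ships every single value across the network before summing.
    grouped_sums = pairs.groupByKey().mapValues(sum)

    # reduceByKey combines values map-side first, so far less data crosses the
    # shuffle boundary; the number of Stages is the same, but the shuffle is cheaper.
    reduced_sums = pairs.reduceByKey(lambda a, b: a + b)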
For Dataframes and Datasets, Spark will generate more optimal Execution Plans (in general) and indeed some extra Stages (e.g. for computing pivot values when using pivot).
You partially answer your own question with the aspect of writing and loading, but bucketBy can help with such an approach. However, I am not sure why complexity would be higher with a larger Spark App, unless you mean using intermediate tables with fewer JOINs and UNIONs in the smaller pieces. The number of Stages is then only a consequence, not so much a deciding factor.
Is Spark using MapReduce internally? (its own map-reduce)
The first time I heard somebody tell me "Spark uses map-reduce", I was confused; I had always learned that Spark was the great adversary of Hadoop MapReduce.
After checking on Google I only found a website that gives a very short explanation of this: https://dzone.com/articles/how-does-spark-use-mapreduce
But the rest of the Internet is about Spark vs. MapReduce.
Then somebody explained to me that when Spark creates an RDD, the data is split into different datasets, and that if you are using, for example, Spark SQL with a query that should not be a map-reduce, like:
select student
from Table_students
where name = "Enrique"
then internally Spark is still doing a map-reduce to retrieve the data (from the different datasets).
Is that true?
If I'm using Spark MLlib for machine learning: I have always heard that machine learning is not compatible with map-reduce, because it needs many iterations and map-reduce uses batch processing.
In Spark MLlib, is Spark internally using map-reduce too?
Spark features an advanced Directed Acyclic Graph (DAG) execution engine that also supports iterative data flows. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages, Map and Reduce, the DAGs created by Spark can contain any number of stages. The DAG is a strict generalization of the MapReduce model.
This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.
So you can write map-reduce-style programs in Spark, but internally it actually uses a DAG.
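A small sketch of that generalization: the classic map and reduce phases become just two transformations in a longer lineage, and everything still runs as one job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-vs-mapreduce").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["a b a", "b c", "a c c"])

    # The classic "map" and "reduce" phases, expressed as two transformations...
    counts = lines.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # ...followed by more work in the same job: Spark simply adds stages to the DAG
    # instead of forcing a second MapReduce job.
    top = counts.map(lambda kv: (kv[1], kv[0])).sortByKey(ascending=False)

    print(top.toDebugString().decode())  # the lineage, with its shuffle boundaries
    print(top.take(2))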
Reference:
Directed Acyclic Graph DAG in Apache Spark
What is Directed Acyclic Graph in Apache Spark?
What are the Apache Spark concepts around its DAG execution engine, and its overall architecture?
How-to: Translate from MapReduce to Apache Spark
Is it possible to run multiple calculations in parallel using Spark?
Example cases that could benefit from that:
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from running the calculation for each column on a single worker, with as many workers as possible working on separate columns.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
    # format is illustrative; spark.read/.write need a concrete source such as parquet
    spark.read.parquet(in_path).select('column').distinct().write.parquet(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .Net Parallel.ForEach(), but for a cluster environment rather than simpler multi-threading case.
I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it'd take ages, but it would require quite a lot of effort, since this is below the developer-facing APIs of Spark SQL, Spark MLlib, Structured Streaming and such.
You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from running the calculation for each column on a single worker, with as many workers as possible working on separate columns.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).
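Just to make the scheduling aspect of the second case concrete, here is a rough driver-side sketch (assuming Parquet data and the long_ds_list of (input, output) path pairs from the question). It parallelizes job submission from the driver, which Spark's scheduler supports, rather than adding the optimization inside Spark SQL itself:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-small-jobs").getOrCreate()

    def dedupe(in_path, out_path):
        # One small, self-contained Spark job per (input, output) pair.
        (spark.read.parquet(in_path)
              .select("column").distinct()
              .write.mode("overwrite").parquet(out_path))

    # Spark runs jobs submitted from separate driver threads concurrently, so a
    # plain thread pool keeps several of these small jobs in flight at once.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(dedupe, i, o) for i, o in long_ds_list]
        for f in futures:
            f.result()  # re-raise any failure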
Can the machine learning algorithms provided by "spark mllib", like naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change the code? Kindly provide an example of running them in parallel. I am not sure how parallelism (map) works in MLlib, as each step seems to require the entire training data set. Does the computation run in parallel on subsets of the training data?
Thanks
These algorithms as provided by Spark MLLib do run in parallel automatically. They expect an RDD as input. An RDD is a resilient distributed dataset, spread across a cluster of computers.
Here is an example of using a Decision Tree for a classification problem.
I highly recommend exploring in depth the link provided above. The page has extensive documentation and examples of how to code these algorithms, including generating training and testing datasets, scoring, cross validation, etc.
These algorithms run in parallel by running computations on the worker nodes' subset of the data, and then sharing the results of those computations across worker nodes and with the master node. The master node collects the results of individual computations and aggregates them as necessary to make decisions based on the entire dataset. Computation heavy activities are mostly executed on the worker nodes.
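For instance, a minimal sketch with the RDD-based API (toy data; in practice the RDD would be loaded from distributed storage and spread across many partitions):

    from pyspark.sql import SparkSession
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    spark = SparkSession.builder.appName("mllib-decision-tree").getOrCreate()
    sc = spark.sparkContext

    # Toy training set; each LabeledPoint is (label, feature vector).
    data = sc.parallelize([
        LabeledPoint(0.0, [1.2, 3.4]),
        LabeledPoint(1.0, [0.1, 0.2]),
        LabeledPoint(0.0, [2.2, 1.1]),
        LabeledPoint(1.0, [0.3, 0.4]),
    ])

    # No extra code is needed for parallelism: training launches distributed jobs
    # that compute per-partition statistics on the workers, which are then
    # aggregated on the driver.
    model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={})
    print(model.toDebugString())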