Apache Spark Correlation only runs on driver

I am new to Spark. I have learned that transformations happen on the workers and actions on the driver, but that intermediate aggregation can also happen on the workers (if the operation is commutative and associative), which is what gives the actual parallelism.
I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
How can I find out which part of the correlation computation happens on the driver and which part on the executors?
Update 1: The setup I'm using to run the correlation is a cluster consisting of multiple VMs.
See the images from the Spark web UI here: Distributed cross correlation matrix computation
Update 2
I set up my cluster in standalone mode: a 3-node cluster with 1 master/driver (an actual workstation) and 2 VM slaves/executors.
I submit the job from the master node like this:
./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py
My correlation sample file is correlations_example.py:
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
I always get a sequential event timeline.
Doesn't that mean, based on the timeline of events, that it is not happening in parallel? Am I doing something wrong with the job submission, or is the correlation computation in Spark not parallel?
Update 3:
I even tried adding another executor, but I still get the same sequential treeAggregate.
I set up the Spark cluster as described here:
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/

Your statement is not entirely accurate. The container (executor) for the driver is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode, e.g. client or cluster (on YARN). The actions are executed by the workers and the results are sent back to the driver (e.g. collect).
This has been answered already. See the link below for more details.
When does an action not run on the driver in Apache Spark?
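For the timeline question specifically: Statistics.corr on an RDD of rows is computed with a treeAggregate over the partitions, so the parallelism visible in the web UI is bounded by the number of partitions of the input RDD; with only a handful of partitions the stage can look essentially sequential. Below is a minimal Scala sketch of the same computation with an explicit partition count (the value 16 and the synthetic data are just illustrative assumptions): the per-partition contributions run on the executors, and only the small aggregated matrix comes back to the driver.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object CorrPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("corr-partitions-sketch"))

    val n = 1000000
    // 16 partitions => up to 16 treeAggregate tasks can run at the same time.
    val rows = sc.parallelize(0 until n, numSlices = 16)
      .map(i => Vectors.dense(i.toDouble, (i + n).toDouble, (i + 2 * n).toDouble))

    // Per-partition contributions are computed on the executors; only the
    // small aggregated correlation matrix is returned to the driver.
    println(Statistics.corr(rows, "pearson"))

    sc.stop()
  }
}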

Related

How Apache Spark collects and coordinates the results from executors

I am posting this question to learn how Apache Spark collects and coordinates the results from executors.
Suppose I'm running a job with 3 executors. My DataFrame is partitioned and running across these 3 executors.
Now, when I execute a count() or collect() action on the DataFrame, how will Spark coordinate the results from these 3 executors?
val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")
prods.count(); // How does Spark collect data from the three executors? Who coordinates the results from the different executors and gives them to the driver?
When you do spark-submit you specify the master; a client program (the driver) starts running on YARN if yarn is specified as the master, or locally if local is specified. https://spark.apache.org/docs/latest/submitting-applications.html
Since you added the yarn tag to the question, I am assuming you mean a YARN URL, so YARN launches the client program (driver) on one of the nodes of the cluster and registers and assigns workers (executors) to the driver so that tasks can be executed on each node. Each transformation/action runs in parallel on the worker nodes (executors). Once each node completes its work, it returns its results to the driver program.
OK, which part is not clear?
Let me make it generic: the client/driver program launches and asks the master (local, standalone master, or YARN, i.e. the cluster manager) for resources to perform its tasks, so the driver can be allocated workers. The cluster manager in turn allocates workers, launches executors on the worker nodes, and tells the client program which workers it can use to do the job. The data is then divided across the worker nodes and the tasks/transformations run in parallel. Once collect() or count() is called (I assume this is the final part of the job), each executor returns its result back to the driver.
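To make the coordination concrete, here is a rough Scala sketch of what count() amounts to, using the same CSV read as in the question: each executor computes a count for its own partitions, and the driver only receives and sums those small partial results. The explicit mapPartitions version is purely illustrative; Spark's own count() does the equivalent internally.
import org.apache.spark.sql.SparkSession

object CountCoordinationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("count-coordination-sketch").getOrCreate()

    val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")

    // What count() does conceptually: a per-partition count on the executors,
    // then a small sum on the driver.
    val partialCounts = prods.rdd
      .mapPartitions(iter => Iterator(iter.size.toLong)) // runs on the executors
      .collect()                                         // a tiny array comes back to the driver
    println(s"total rows = ${partialCounts.sum}")

    println(s"count() = ${prods.count()}")               // same result, handled by Spark internally
    spark.stop()
  }
}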

Schedule each Apache Spark Stage to run on a specific Worker Node

Suppose I am running a simple Wordcount application on Spark (actually Spark Streaming) with 2 worker nodes. By default, each task (from any stage) is scheduled to any available resource based on a scheduling algorithm. However, I want to change the default scheduling to fix each stage to a specific worker node.
Here is what I am trying to achieve -
Worker Node 'A' should only process the first stage (e.g. the 'map' stage), so all the data that comes in must first go to worker 'A',
and Worker Node 'B' should only process the second stage (e.g. the 'reduce' stage). Effectively, the results of Worker A are processed by Worker B.
My first question is - Is this sort of customisation possible on Spark or Spark Streaming by tuning the parameters or choosing a correct config option? (I don't think it is, but can someone confirm this?)
My second question is - Can I achieve this by making some change to the Spark scheduler code? I am OK with hardcoding the IPs of the workers if necessary. Any hints or pointers on this specific problem, or on understanding the Spark scheduler code in more detail, would be helpful.
I understand that this change defeats the efficiency goals of Spark to some extent but I am only looking to experiment with different setups for a project.
Thanks!
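As far as I know there is no configuration option that hard-pins a whole stage to a chosen worker. The closest public lever is a locality preference, which the scheduler treats as a hint rather than a constraint, and it only influences where the first (map-side) stage prefers to run; the reduce stage can still be placed on any executor. A hedged Scala sketch using sc.makeRDD with preferred locations (the hostnames are hypothetical):
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preferred-locations"))

    // Hypothetical hostnames; replace with the actual worker hostnames.
    val hostA = "worker-a"

    // Each element carries a list of preferred hosts; Spark's scheduler treats
    // this as a locality hint, not a hard constraint.
    val lines = sc.makeRDD(Seq(
      ("some input line", Seq(hostA)),
      ("another input line", Seq(hostA))
    ))

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)   // the shuffle (reduce) stage can still run on any executor

    counts.collect().foreach(println)
    sc.stop()
  }
}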

Spark Resource Allocation

We are evaluating Apache Spark (pySpark) as a framework for our machine learning pipeline.
At a high level, it consists of two steps:
A pre-processing step (as we are working with audio data, a sub-step is, for example, computation of the power spectrum), which is more optimized for running on CPU nodes.
A training step, where the model gets built, which is rather optimized for GPU nodes.
We would like to distribute the work in such a way that the first step (data pre-processing) runs on a CPU cluster and the second step (model training) runs on a GPU cluster, without having to manually intervene between steps 1 and 2.
Questions:
Is Spark the right place to organize the handling of different clusters, or would it have to be done somewhere else (e.g. at the Mesos level)?
If Spark is the right place, how do we organize it with Spark so that the first step runs on a CPU cluster and the second step runs on a GPU cluster?
My initial idea was to create multiple SparkContexts, but this seems to be discouraged, e.g. here: How to create multiple SparkContexts in a console
Thank you very much for your help.

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run KMeans on EC2 using Spark's MLlib KMeans. As I was reading through the tutorial, I found the following code snippet at
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? Because there seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do the nodes communicate the partial results of each iteration? Is this handled inside the kmeans.train code, or does Spark core take care of it automatically?
Spark divides the data into many partitions. For example, if you read a file from HDFS, the partitions correspond to the partitioning of the data in HDFS (typically one per block). You can manually specify the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes the data is partitioned by, for example, a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. Creating the partitions of the data is hidden in the RDD.getPartitions methods.
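For instance, here is a minimal Scala sketch of inspecting and changing the partitioning (the HDFS path is just a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))

    val lines = sc.textFile("hdfs:///data/input.txt")   // partitions follow the HDFS blocks
    println(s"initial partitions: ${lines.getNumPartitions}")

    val repartitioned = lines.repartition(8)            // force 8 partitions (full shuffle)
    println(s"after repartition: ${repartitioned.getNumPartitions}")

    sc.stop()
  }
}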
Resource scheduling depends on the cluster manager. One could write a very long post about them ;) I think that in this question the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data - launching the closures on each partition.
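A small Scala sketch of that point: the function passed to a transformation is serialized, shipped to the executors together with the DAG, and executed once per partition on whichever executor holds that partition (the hostname lookup is only there to make that visible):
import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

object ClosureSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-sketch"))
    val factor = 3   // captured by the closure and serialized along with it

    val perPartition = sc.parallelize(1 to 100, numSlices = 10)
      .map(_ * factor)                                   // runs on the executors
      .mapPartitionsWithIndex { (idx, iter) =>
        // Executed on whichever executor holds this partition.
        Iterator((idx, InetAddress.getLocalHost.getHostName, iter.size))
      }
      .collect()                                         // action: results return to the driver

    perPartition.foreach(println)
    sc.stop()
  }
}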
Currently, after each iteration the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is working on the possibility of creating multiple jobs at one time, so the data won't be sent back to the driver: the DAG is created once and, e.g., a job with 10 iterations is scheduled, with the shuffle between them limited or removed entirely. Currently a DAG is created in each iteration and a job is scheduled to the executors.
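To illustrate the current per-iteration behaviour, here is a toy 1-D k-means loop in Scala. It is deliberately not MLlib's implementation, just the same pattern: every iteration is its own job, the executors compute per-cluster partial sums, and the driver collects them and updates the centers before the next job is scheduled.
import org.apache.spark.{SparkConf, SparkContext}

object IterativeJobsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-jobs-sketch"))
    val points = sc.parallelize(Seq(1.0, 1.2, 0.8, 9.0, 9.5, 10.1), numSlices = 3).cache()

    var centers = Array(0.0, 5.0)
    for (_ <- 1 to 5) {
      val current = centers   // plain val so the closure captures a stable copy
      val contribs = points
        .map { p =>
          val closest = current.indices.minBy(i => math.abs(p - current(i)))
          (closest, (p, 1L))                             // computed on the executors
        }
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .collectAsMap()                                  // small partial sums return to the driver

      // The driver updates the centers; the next loop iteration is a new job.
      centers = centers.indices.map { i =>
        contribs.get(i).map { case (sum, count) => sum / count }.getOrElse(centers(i))
      }.toArray
    }

    println("final centers: " + centers.mkString(", "))
    sc.stop()
  }
}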
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.

Measure runtime of algorithm on a spark cluster

How do I measure the runtime of an algorithm in Spark, especially on a cluster? I am interested in measuring the time from when the Spark job is submitted to the cluster to when the submitted job has completed.
If it is important, I am mainly interested in machine learning algorithms using dataframes.
In my experience, a reasonable approach is to measure the time from the submission of the job to its completion, on the driver. This is achieved by surrounding the Spark action with timestamps:
val myRdd = sc.textFile("hdfs://foo/bar/..")
val startt = System.currentTimeMillis
val cnt = myRdd.count() // Or any other "action" such as take(), save(), etc
val elapsed = System.currentTimeMillis - startt
Notice that the initial sc.textFile() is lazy - i.e. it does not cause the Spark driver to submit the job to the cluster - therefore it is not really important whether you include it in the timing or not.
A consideration for the results: the approach above is susceptible to variance due to existing load on the Spark scheduler and cluster. A more precise approach would have the Spark job write System.currentTimeMillis inside its closure (executed on the worker nodes) to an Accumulator at the beginning of its processing. This would remove the scheduling latency from the calculation.
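A rough Scala sketch of that accumulator idea (the accumulator name and the use of mapPartitions are my own illustrative choices, and collectionAccumulator requires Spark 2.x): each task records the wall-clock time at which it starts processing its partition, and the driver measures from the earliest of those timestamps instead of from submission.
import scala.collection.JavaConverters._
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorTimingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-timing-sketch"))
    // Collects one start timestamp per task.
    val taskStarts = sc.collectionAccumulator[Long]("taskStartTimes")

    val myRdd = sc.textFile("hdfs://foo/bar/..")   // same placeholder path as above
    val cnt = myRdd.mapPartitions { iter =>
      taskStarts.add(System.currentTimeMillis)     // recorded on the executor
      iter
    }.count()

    val end = System.currentTimeMillis
    val firstTaskStart = taskStarts.value.asScala.min
    println(s"count = $cnt, elapsed excluding scheduling delay = ${end - firstTaskStart} ms")

    sc.stop()
  }
}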
To calculate the runtime of an algorithm, follow this procedure:
1. Establish a single/multi-node cluster.
2. Make a folder and save your algorithm in that folder (e.g. myalgo.scala/java/python).
3. Build it using sbt (you can follow this link to build your program: https://www.youtube.com/watch?v=1BeTWT8ADfE).
4. Run this command: $SPARK_HOME/bin/spark-submit --class "class name" --master "spark master URL" "target jar file path" "arguments if any"
For example: spark-submit --class "GroupByTest" --master spark://BD:7077 /home/negi/sparksample/target/scala-2.11/spark-sample_2.11-1.0.jar
After this, refresh your web UI (e.g. localhost:8080) and you will find all the information there about your executed program, including the run-time.
