Measure runtime of algorithm on a spark cluster - apache-spark

How do I measure the runtime of an algorithm in spark, especially on a cluster? I am interested in measuring the time from when the spark job is submitted to the cluster to when the submitted job has completed.
If it is important, I am mainly interested in machine learning algorithms using dataframes.

In my experience, a reasonable approach is to measure the time from job submission to job completion on the driver. You do this by surrounding the Spark action with timestamps:
val myRdd = sc.textFile("hdfs://foo/bar/..")
val startt = System.currentTimeMillis
val cnt = myRdd.count() // Or any other "action" such as take(), save(), etc
val elapsed = System.currentTimeMillis - startt
Notice that the initial sc.textFile() is lazy, i.e. it does not cause the Spark driver to submit a job to the cluster. Therefore it does not really matter whether you include it in the timing or not.
A consideration for the results: the approach above is susceptible to variance due to existing load on the Spark scheduler and cluster. A more precise approach would have the Spark job write System.currentTimeMillis inside its closure (executed on the worker nodes) to an Accumulator at the beginning of its processing. This would remove the scheduling latency from the calculation.
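The driver-side timing above can be wrapped in a small helper that repeats the measurement, which helps a little with the variance mentioned. This is a minimal Python sketch, not part of the original answer: `time_action` and its parameters are names made up for illustration, and the lambda stands in for a real Spark action such as `lambda: my_rdd.count()`.

```python
import statistics
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times and time each run.

    Returns the last result and the list of wall-clock durations in seconds.
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()  # with Spark this would be an action, e.g. rdd.count()
        durations.append(time.perf_counter() - start)
    return result, durations


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. lambda: my_rdd.count() (hypothetical RDD).
    count, times = time_action(lambda: sum(1 for _ in range(100_000)))
    print(count, "median seconds:", statistics.median(times))
```

Reporting the median rather than a single run makes an occasional slow run caused by cluster load less visible in the result.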

To calculate the runtime of an algorithm, follow this procedure:
1. Establish a single-node or multi-node cluster.
2. Make a folder and save your algorithm in that folder (e.g. myalgo.scala/java/python).
3. Build it using sbt (you can follow this link to build your program: https://www.youtube.com/watch?v=1BeTWT8ADfE).
4. Run this command: $SPARK_HOME/bin/spark-submit --class "class name" --master "spark master URL" "target jar file path" "arguments if any"
For example: spark-submit --class "GroupByTest" --master spark://BD:7077 /home/negi/sparksample/target/scala-2.11/spark-sample_2.11-1.0.jar
After this, refresh your web UI (e.g. localhost:8080) and you will find all the information about your executed program there, including its runtime.
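If you only need the wall-clock time from submission to completion, you can also time the spark-submit process itself from a script. This is a minimal sketch, assuming spark-submit runs in client mode, where the driver lives inside the spark-submit process and the command returns when the job finishes. `timed_run` is a made-up helper name, and the command in the comment reuses the example above.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run a command, wait for it to exit, and return (exit code, seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; for Spark you would pass e.g.
    # ["spark-submit", "--class", "GroupByTest", "--master", "spark://BD:7077",
    #  "/home/negi/sparksample/target/scala-2.11/spark-sample_2.11-1.0.jar"]
    code, seconds = timed_run([sys.executable, "-c", "print('job done')"])
    print("exit code", code, "took", round(seconds, 3), "s")
```

Note that a measurement taken this way also includes JVM startup and the time spent waiting for resources, not only the algorithm itself.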

Related

What is the difference between submitting spark job to spark-submit and to hadoop directly?

I have noticed that in my project there are 2 ways of running spark jobs.
The first way is to submit the job with the spark-submit script:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
The second way is to package the Java file into a jar and run it via hadoop, with the Spark code inside MainClassName:
hadoop jar JarFile.jar MainClassName
What is the difference between these two ways?
Which prerequisites do I need in order to use either?
As you stated, the second way of running a Spark job, packaging a Java file that contains Spark classes and running it with hadoop jar, essentially wraps your Spark job within a Hadoop job. This has disadvantages: mainly, your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, and you may also hit compatibility problems between the versions of the different frameworks. In that case the developer must be careful about the setup the job will run on, since it spans two different platforms. It may nonetheless seem simpler to Hadoop users who have a better grasp of Java and the Map/Reduce/Driver layout than of Spark's more pre-tuned nature and of Scala, which is convenient but has a fairly steep learning curve.
The first way of submitting a job is the most "standard" one (judging by the majority of usage that can be seen online, so take this with a grain of salt). It runs the job almost entirely within Spark (except, of course, if you store the job's output in HDFS or read its input from there). This way you depend mostly on Spark alone and keep Hadoop's peculiarities (such as its YARN resource management) away from your job. It can also be significantly faster in execution time, since it is the most direct approach.

Worker Nodes not being used in GCE

While running my Spark jobs on google-cloud-dataproc, I notice that only the master node is being utilized and the CPU utilization of all the worker nodes is nearly zero percent (0.8 percent or so). I have used both the GUI and the console to run the code. Do you know of any specific reason that could be causing this, and how I can fully utilize the worker nodes?
I submit the jobs in the following manner:
gcloud dataproc jobs submit spark --properties spark.executor.cores=10 --cluster cluster-663c --class ComputeMST --jars gs://kslc/ComputeMST.jar --files gs://kslc/SIFT_full.txt -- SIFT_full.txt gs://kslc/SIFT_full.txt 5.0 12
while (true) {
    level_counter++;
    if (level_counter > (number_of_levels - 1)) break;
    System.out.println("LEVEL = " + level_counter);
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_logn1 = distributed_msts_logn.mapToPair(new next_level());
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_next_level = distributed_msts_logn1.reduceByKey(new union_eps_nets());
    den = den / 2;
    distributed_msts_logn = distributed_msts_next_level.mapValues(new unit_step_logn(den, level_counter));
}
JavaRDD<epsNet> epsNetsRDDlogn = distributed_msts_logn.values();
List<epsNet> epsNetslogn = epsNetsRDDlogn.collect();
Above is the code I am trying to run.
You are calling collect() in your driver program. What are you trying to achieve? A collect() will definitely hammer the master node's resources, since the driver collects all the results there. Generally you want to ingest data into Spark (using read, or parallelize on the SparkContext), do the in-memory map-reduce work (transformations), and then take the data out of the Spark world (for example, by writing Parquet to HDFS) before doing any collect-style processing.
Also, check in the Spark UI that you actually got all the executors you asked for, with the requested cores and memory.
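One way to check the executors without clicking through the UI is Spark's monitoring REST API, which serves executor information as JSON under /api/v1/applications/<app-id>/executors on the driver UI port. The sketch below only summarizes such a list. The field names (id, isActive, totalCores) are what I understand the API to return and should be checked against your Spark version; `summarize_executors` is a made-up name and the sample data is invented.

```python
def summarize_executors(executors):
    """Count active non-driver executors and their total cores.

    `executors` is a list of dicts shaped like the entries of the REST API's
    executors endpoint (assumed fields: "id", "isActive", "totalCores").
    """
    workers = [
        e for e in executors
        if e.get("id") != "driver" and e.get("isActive", False)
    ]
    return len(workers), sum(e.get("totalCores", 0) for e in workers)


if __name__ == "__main__":
    # Invented sample payload, for illustration only.
    sample = [
        {"id": "driver", "isActive": True, "totalCores": 0},
        {"id": "1", "isActive": True, "totalCores": 10},
        {"id": "2", "isActive": True, "totalCores": 10},
        {"id": "3", "isActive": False, "totalCores": 10},
    ]
    n, cores = summarize_executors(sample)
    print(n, "active executors with", cores, "cores in total")
```

If the numbers are lower than what you requested with spark.executor.cores and the cluster size, the job is not getting the resources you expect.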

SnappyData - snappy-job - cannot run jar file

I'm trying to run a jar file from the SnappyData CLI.
I just want to create a SparkSession and a SnappyData session at the beginning.
package io.test

import org.apache.spark.sql.{SnappySession, SparkSession}

object snappyTest {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession
      .builder
      .appName("SparkApp")
      .master("local")
      .getOrCreate
    val snappy = new SnappySession(spark.sparkContext)
  }
}
From sbt file:
name := "SnappyPoc"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"
When I debug the code in the IDE, it works fine, but when I create a jar file and try to run it directly on Snappy I get the message:
"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",
I have Spark Standalone 2.1.1 and SnappyData 1.0.0.
I added the dependencies to the Spark instance.
Could you help me? Thanks in advance.
The difference between "embedded" mode and "smart connector" mode needs to be explained first.
Normally, when you run a job using spark-submit, it spawns a set of new executor JVMs, as per the configuration, to run the code. In the embedded mode of SnappyData, however, the nodes hosting the data also host long-running Spark executors themselves. This is done to minimize data movement (i.e. move the execution rather than the data). For that mode you can submit a job (using snappy-job.sh), which will run the code on those pre-existing executors. Alternative routes for embedded execution include JDBC/ODBC. This also means that you cannot (yet) use spark-submit to run embedded jobs, because that would spawn its own JVMs.
The "smart connector" mode is the normal way in which other Spark connectors work, but like all of them it has the disadvantage of having to pull the required data into the executor JVMs and will thus be slower than embedded mode. To configure it, one has to set the "snappydata.connection" property to point to the thrift server running on the SnappyData cluster's locator. It is useful in the many cases where users want to expand the execution capacity of the cluster (e.g. if the cluster's embedded execution is saturated on CPU all the time), or for existing Spark distributions/deployments. Needless to say, spark-submit works just fine in connector mode. What is "smart" about this mode is: a) if the physical nodes hosting the data and running the executors are the same, partitions will be routed to those executors as much as possible to minimize network usage; b) it will use the optimized SnappyData plans for table scans, hash aggregation, and hash joins.
For this specific question, the answer is: runSnappyJob receives the SnappySession object as an argument, and that object should be used rather than creating a new one. The rest of the body that uses the SnappySession will be exactly the same. Likewise, for working with the base SparkContext, it might be easier to implement SparkJob; the code will be similar, except that the SparkContext is provided as a function argument and should be used. The reason is the one explained above: embedded mode already has a running SparkContext, which needs to be used for jobs.
I think the methods isValidJob and runSnappyJob were missing.
When I added them to the code it works, but does anyone know what the relation is between the body of the runSnappyJob method and the main method?
Should it be the same in both?

Apache Spark Correlation only runs on driver

I am new to Spark and have learned that transformations happen on the workers and actions on the driver, but that the intermediate steps of an action can also happen on the workers (if the operation is commutative and associative), which gives the actual parallelism.
I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
How can I find out which part of the correlation computation happens on the driver and which on the executors?
Update 1: The setup I'm using to run the correlation is a cluster consisting of multiple VMs.
Look here for the images from the Spark web UI: Distributed cross correlation matrix computation
Update 2:
I set up my cluster in standalone mode. It is a 3-node cluster: 1 master/driver (an actual machine, a workstation) and 2 VM slaves/executors.
I submit the job like this
./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py
from the master node.
My correlation sample file is correlations_example.py:
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
I always get a sequential timeline of events.
Doesn't that mean the computation is not happening in parallel? Am I doing something wrong with the job submission, or is the correlation computation in Spark not parallel?
Update 3:
I even tried adding another executor; I still get the same sequential treeAggregate.
I set the spark cluster as mentioned here:
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/
Your statement is not entirely accurate. The container [executor] for the driver is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode (client or cluster). The actions are executed by the workers and the results are sent back to the driver (e.g. collect).
This has been answered already. See the question below for more details:
When does an action not run on the driver in Apache Spark?
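To illustrate how an aggregation can be split between executors and the driver, here is a simplified pure-Python sketch of a Pearson correlation computed from per-partition partial sums. It is not Spark's actual implementation (the RowMatrix code linked in the question is the authority), and all function names are made up. The per-partition step and the merge are the parts that could run on executors, because the merge is commutative and associative; only the final formula needs to run on the driver.

```python
import math
from functools import reduce


def partial_sums(partition):
    """Executor-side step: sufficient statistics for one partition of (x, y) pairs."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Commutative, associative combine step; can be applied in any order."""
    return tuple(u + v for u, v in zip(a, b))


def pearson(partitions):
    """Driver-side step: merge the partial sums and apply the Pearson formula."""
    n, sx, sy, sxx, syy, sxy = reduce(merge, map(partial_sums, partitions))
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    data = [(x, 2 * x + 1) for x in range(10)]
    # Split the data into three "partitions" by hand.
    print(pearson([data[:3], data[3:7], data[7:]]))
```

Because `merge` gives the same result regardless of grouping, partial results can be combined pairwise on the workers before anything reaches the driver, which is the idea behind a tree-style aggregation.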

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do any computation?
How do the nodes communicate the partial results of each iteration? Is this dealt with inside the kmeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions should correspond to the partitioning of the data in HDFS. You can manually specify the number of partitions by calling repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by e.g. a HashPartitioner, which looks at the hash of each record's key.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. The creation of partitions is hidden in the RDD.getPartitions methods.
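The idea behind hash partitioning can be sketched in a few lines of plain Python. This illustrates the principle (partition index = hash of the key modulo the number of partitions) and is not Spark's HashPartitioner code; `hash_partition` is a made-up name, and integer keys are used because Python's hash of a small non-negative integer is the integer itself.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Group (key, value) records into partitions by hash(key) % num_partitions."""
    partitions = defaultdict(list)
    for key, value in records:
        # Python's % always yields a non-negative result for a positive modulus.
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)


if __name__ == "__main__":
    records = [(k, f"value-{k}") for k in range(6)]
    for index, part in sorted(hash_partition(records, 3).items()):
        print(index, part)
```

All records with the same key land in the same partition, which is what lets per-key operations combine their values without looking at other partitions.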
Resource scheduling depends on the cluster manager. One could write a very long post about them ;) I think that for this question the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG of the computation; its tasks are sent to the executors, and the executors run the closures on each partition.
Currently, after each iteration the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is making it possible to schedule multiple jobs at one time, so the data won't be sent to the driver. It will create the DAG once and schedule e.g. a job with 10 iterations. The shuffle between them will be limited or will not exist at all. Currently the DAG is created in each iteration and the job is scheduled to the executors.
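The per-iteration round trip described above can be illustrated with a toy one-dimensional k-means in plain Python. This is a simplification for illustration, not MLlib's implementation, and all names are made up. Each "partition" produces partial sums per cluster (the executor-side work), the partial results are merged, and the driver computes the new centers before starting the next iteration.

```python
def assign_and_sum(partition, centers):
    """Executor-side step: per-cluster (sum, count) for one partition of points."""
    partial = {}
    for p in partition:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        total, count = partial.get(nearest, (0.0, 0))
        partial[nearest] = (total + p, count + 1)
    return partial


def kmeans_1d(partitions, centers, iterations=5):
    """Driver loop: one 'job' per iteration, merging partial sums each time."""
    centers = list(centers)
    for _ in range(iterations):
        merged = {}
        for partial in (assign_and_sum(part, centers) for part in partitions):
            for cluster, (total, count) in partial.items():
                t, c = merged.get(cluster, (0.0, 0))
                merged[cluster] = (t + total, c + count)
        # Clusters that received no points keep their previous center.
        centers = [
            merged[i][0] / merged[i][1] if i in merged else centers[i]
            for i in range(len(centers))
        ]
    return centers


if __name__ == "__main__":
    parts = [[1.0, 2.0], [9.0, 10.0], [1.5, 9.5]]
    print(kmeans_1d(parts, centers=[0.0, 5.0]))
```

Only the small per-cluster sums and counts travel back in each iteration, not the points themselves, and the updated centers are what the next iteration's closures need.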
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.

Resources