How Apache Spark collects and coordinates the results from executors - apache-spark

I'm posting this question to learn how Apache Spark collects and coordinates the results from executors.
Suppose I'm running a job with 3 executors. My DataFrame is partitioned and processed across these 3 executors.
When I execute a count() or collect() action on the DataFrame, how does Spark coordinate the results from these 3 executors?
val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")
prods.count() // How does Spark collect the data from the three executors? Who coordinates the results from the different executors and hands them to the driver?

When you run spark-submit you specify a master, and a client program (the driver) starts running on YARN if yarn is given as the master, or locally if local is given. https://spark.apache.org/docs/latest/submitting-applications.html
Since you added the yarn tag to the question, I assume you mean YARN as the master. In that case YARN launches the client program (the driver) on one of the cluster nodes (in cluster deploy mode; in client mode the driver runs on the machine that submitted the job), then registers and assigns workers (executors) to the driver so that tasks can be executed on each node. Each transformation/action runs in parallel on the worker nodes (executors). Once each node completes its tasks, it returns its results to the driver program.

OK, which part is unclear?
Let me make it generic. The client/driver program starts and asks the master (local, a standalone master, or YARN, i.e. the cluster manager) for resources to perform its tasks. The cluster manager in turn allocates workers, launches executors on the worker nodes, and tells the client program which workers it can use for the job. The data is then divided among the worker nodes, and tasks/transformations run on them in parallel. Once collect() or count() is called (I assume this is the final part of the job), each executor returns its result to the driver.
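The flow described above can be modelled in plain Python, without Spark. This is only a minimal sketch under assumptions, loosely following how an RDD-style count behaves, and not Spark's actual implementation: run_count, run_collect, count_partition, and the thread pool standing in for three executors are all hypothetical names. The point it illustrates is that for count() each task sends back a single number and the driver adds the partial results, while for collect() every row travels to the driver.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    # Runs "on an executor": only a single integer travels back to the driver.
    return sum(1 for _ in partition)

def run_count(partitions, num_executors=3):
    # The "driver" schedules one task per partition and merges the partial results.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_partition, partitions))
    return sum(partial_counts)

def run_collect(partitions, num_executors=3):
    # collect() differs: every row of every partition is shipped to the driver.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        per_partition_rows = list(executors.map(list, partitions))
    return [row for rows in per_partition_rows for row in rows]

# Made-up rows split across 3 partitions (an assumption for illustration).
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
total = run_count(partitions)
rows = run_collect(partitions)
```

In this model no separate coordinator sits between the executors and the driver: the driver itself hands out the tasks and combines whatever each task returns.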

Related

Running Spark job on a single node

I'm running a simple groupBy on 350 GB of data. Since I'm running this on a single node (I'm on an HPC cluster), I requested 400 GB of memory and am running the Spark job with spark.driver.memory set to 350 GB.
Since it's running on a single node, the driver node acts as both master and slave. The job currently takes more than 6 hours to complete. All it does is a simple groupBy operation followed by writing the result out as Parquet:
val data = spark.read.parquet("path_to_folder/*")
val grouped = data.groupBy("i","j").agg(sum("count").alias("count"))
grouped.write.parquet("output_folder_path")
Is there a way to make this process faster? Specifically, is there a way to force the driver node to start multiple slaves, even though the driver node is acting as both master and slave (0 workers), so that the grouping is more efficient?

First stage of action in Spark ran by only one executor

I have a Spark program running with YARN as master, in client mode, with 3 executors.
By reading data from Elasticsearch through a connector, I'm able to load it into a DataFrame.
The DataFrame is repartitioned into three partitions using df = df.repartition(3).
Whenever I run an action such as count() or show(), the first stage has only one task and is run by a single executor. From this thread, Why spark count action has executed in three stages, I understood that this stage is about reading the file.
Is this behavior expected for this stage? Shouldn't I be able to run this stage in parallel across all the allocated executors?
It depends on the replication of your data.
If your data is replicated on more data nodes, you potentially have more executors that can read it.

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from Kafka that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from Kafka with an interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only (no reduce transformations)
Write to HBase using check-and-put
If executors and partitions are mapped 1-1, will every executor perform the above steps and write to HBase independently, or will data be shuffled between multiple executors, with operations happening between the driver and the executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks can be executed. The driver's job is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform above steps and write to HBase independently
By the way, the answer does not depend on which Spark version is in use. It has always been like this (and I don't see any reason why it would or even should change).
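The "independent per executor" behaviour can be pictured with a small plain-Python model. This is a sketch under assumptions rather than Spark Streaming code: validate, transform, process_partition, the record layout, and the in-memory lists standing in for HBase are all hypothetical. It shows that a map-only pipeline touches each partition in isolation, so nothing needs to be shuffled between tasks.

```python
def validate(record):
    # Hypothetical validation rule: keep records with a non-empty payload.
    return bool(record.get("payload"))

def transform(record):
    # Stand-in for a map-only rule transformation (one record in, one record out).
    return {**record, "payload": record["payload"].upper()}

def process_partition(records, sink):
    # Everything here runs inside one task; no other partition is consulted.
    for record in records:
        if validate(record):
            sink.append(transform(record))  # stand-in for a write to the external store
    return len(records)

# Made-up batch: two Kafka partitions, each becoming one task.
kafka_partitions = {
    0: [{"key": "k1", "payload": "a"}, {"key": "k2", "payload": ""}],
    1: [{"key": "k3", "payload": "b"}],
}
sinks = {pid: [] for pid in kafka_partitions}
processed = {
    pid: process_partition(records, sinks[pid])
    for pid, records in kafka_partitions.items()
}
```

One caveat to the 1-1 picture: tasks are assigned to whichever executor has a free slot, so a given partition is not guaranteed to land on the same executor in every batch.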

Worker Nodes not being used in GCE

While running my Spark jobs on google-cloud-dataproc, I notice that only the master node is being utilized and the CPU utilization of all the worker nodes is nearly zero percent (0.8 percent or so). I have used both the GUI and the console to run the code. Do you know of any specific reason that could be causing this, and how to make full use of the worker nodes?
I submit the jobs in the following manner:
gcloud dataproc jobs submit spark --properties spark.executor.cores=10 --cluster cluster-663c --class ComputeMST --jars gs://kslc/ComputeMST.jar --files gs://kslc/SIFT_full.txt -- SIFT_full.txt gs://kslc/SIFT_full.txt 5.0 12
while (true) {
    level_counter++;
    if (level_counter > (number_of_levels - 1)) break;
    System.out.println("LEVEL = " + level_counter);
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_logn1 = distributed_msts_logn.mapToPair(new next_level());
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_next_level = distributed_msts_logn1.reduceByKey(new union_eps_nets());
    den = den/2;
    distributed_msts_logn = distributed_msts_next_level.mapValues(new unit_step_logn(den, level_counter));
}
JavaRDD<epsNet> epsNetsRDDlogn = distributed_msts_logn.values();
List<epsNet> epsNetslogn = epsNetsRDDlogn.collect();
Above is the code I am trying to run.
You are doing a collect() in your driver program. What are you trying to achieve? A collect() will definitely hammer your master node's resources, since the driver collects the results there. Generally you want to ingest data into Spark (using read, or parallelize on the Spark context), do in-memory map-reduce (transformations), and then take the data out of the Spark world (for example, by writing Parquet to HDFS) before doing any collect-related work.
Also, check in the Spark UI that you have all the executors you asked for, with the requested cores and memory.
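To make the "take data out of the Spark world" advice concrete, here is a plain-Python model of writing results from the tasks instead of collecting them. It is a sketch under assumptions: write_partition, the part-file naming, and the temporary directory standing in for HDFS are hypothetical, and no Spark API is involved. Each task writes its own partition and returns only a small number, so the driver never holds the full result.

```python
import os
import tempfile

def write_partition(index, records, out_dir):
    # Runs "on an executor": the partition's records go straight to storage.
    path = os.path.join(out_dir, f"part-{index:05d}.txt")
    with open(path, "w") as handle:
        for record in records:
            handle.write(f"{record}\n")
    return len(records)  # only this small number is returned to the driver

# Made-up data split into three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]
out_dir = tempfile.mkdtemp()  # stand-in for a distributed file system path
written = [write_partition(i, p, out_dir) for i, p in enumerate(partitions)]
files = sorted(os.listdir(out_dir))
```

With collect(), by contrast, all six records would be materialized in the driver's memory, which is what loads the master node when the result is large.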

Apache Spark Correlation only runs on driver

I am new to Spark and have learned that transformations happen on the workers and actions on the driver, but that intermediate steps of an action can also happen on the workers (if the operation is commutative and associative), which is what gives the actual parallelism.
I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
How can I find out which part of the correlation is computed on the driver and which on the executors?
Update 1: The setup I'm talking about for running the correlation is a cluster consisting of multiple VMs.
Look here for the images from the Spark web UI: Distributed cross correlation matrix computation
Update 2
I set up my cluster in standalone mode as a 3-node cluster: 1 master/driver (an actual machine, a workstation) and 2 VM slaves/executors.
I submit the job like this
./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py
from the master node.
My correlation sample file is correlations_example.py:
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
I always get a sequential timeline.
Doesn't that mean it is not happening in parallel, based on the timeline of events? Am I doing something wrong with the job submission, or is the correlation computation in Spark not parallel?
Update 3:
I even tried adding another executor; I still get the same sequential treeAggregate.
I set up the Spark cluster as described here:
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/
Your statement is not entirely accurate. The driver's container is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode (client or cluster). Actions are executed by the workers, and the results are sent back to the driver (e.g. collect).
This has been answered already. See link below for more details.
When does an action not run on the driver in Apache Spark?
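The split between worker-side and driver-side work in a treeAggregate-style action can be sketched in plain Python. This is a simplified model under assumptions, not Spark's algorithm: tree_aggregate and its fan_in parameter are hypothetical, and the real implementation chooses its merge rounds differently. Per-partition folding and the intermediate merges correspond to work done in tasks; only the last few partial results are combined on the driver.

```python
def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    assert fan_in >= 2, "fan_in below 2 would never shrink the list of partials"
    # Step 1 ("on executors"): fold each partition into one partial result.
    partials = []
    for partition in partitions:
        acc = zero
        for item in partition:
            acc = seq_op(acc, item)
        partials.append(acc)
    # Step 2 ("on executors"): merge partials in groups until few remain.
    while len(partials) > fan_in:
        merged = []
        for start in range(0, len(partials), fan_in):
            group = partials[start:start + fan_in]
            acc = group[0]
            for other in group[1:]:
                acc = comb_op(acc, other)
            merged.append(acc)
        partials = merged
    # Step 3 ("on the driver"): combine the handful of remaining partials.
    result = zero
    for partial in partials:
        result = comb_op(result, partial)
    return result

# Made-up data: compute (count, sum) over five partitions.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0], [6.0], [7.0, 8.0]]
count, total = tree_aggregate(
    partitions,
    zero=(0, 0.0),
    seq_op=lambda acc, x: (acc[0] + 1, acc[1] + x),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
```

The merge rounds depend on each other, so they show up one after another in a timeline even though the tasks inside each round can run in parallel.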
