Spark local mode, worker threads and tasks - multithreading

I use Spark locally with local[2] in the master and node conf.
On the Spark GUI I do see 2 cores for the worker threads.
In my application code I have added Thread.currentThread.getName() inside a foreach action, and rather than seeing only 2 thread names I see names ranging from Thread[Executor task launch worker for task 27,5,main] up to Thread[Executor task launch worker for task 302,5,main]. Why are there so many threads under the hood, and what exactly is this notion of a task?
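For reference, a minimal sketch that reproduces this (the object and app names are made up for illustration): local[2] gives a single JVM acting as both driver and executor with 2 task slots, and Spark renames each of those 2 pool threads to include the ID of the task it is currently running, so the names count upward far past 2 even though only 2 tasks ever run at once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ThreadNameDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one JVM acting as driver + executor, with 2 task slots
    val conf = new SparkConf().setMaster("local[2]").setAppName("thread-name-demo")
    val sc = new SparkContext(conf)

    // 100 partitions => 100 tasks; a task is the unit of work Spark
    // schedules, one per partition per stage. At most 2 run concurrently
    // here, each on one of the 2 executor threads.
    sc.parallelize(1 to 1000, 100)
      .foreach(_ => println(Thread.currentThread))

    sc.stop()
  }
}
```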

Related

spark executors running multiple copies of the same application

In the Spark documentation it says:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
From the phrase
... executor side (tasks from different applications run in different JVMs)
Does this mean that:
1. If you're running multiple copies (multiple spark-submits) of the same application in a cluster that has many executors, is it possible for an executor to run tasks that belong to different spark-submits in parallel?
2. If the above is possible, can singleton objects that are shared between tasks of an executor cause data collisions between different copies (different spark-submits) of the same application?
Each executor is a separate JVM process and is used by only one application, so there is no need to worry about data collisions between different spark-submits.
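To illustrate the scoping (a hedged sketch; the TaskCounter object and its field are hypothetical): a Scala object is a singleton per JVM, i.e. per executor process, so tasks of the same application running in that executor share it, while a second spark-submit gets its own executor JVMs and therefore a fresh copy.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.{SparkConf, SparkContext}

// One instance per JVM, i.e. per executor process. Shared by the tasks
// of one application in that executor; never shared across applications,
// because different applications never share an executor JVM.
object TaskCounter {
  val processed = new AtomicLong(0)
}

object SingletonDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("singleton-demo"))

    sc.parallelize(1 to 10, 4).foreach { _ =>
      // Thread-safe state is still needed: tasks of the same app
      // run concurrently inside one executor.
      TaskCounter.processed.incrementAndGet()
    }
    println(s"processed = ${TaskCounter.processed.get()}")

    sc.stop()
  }
}
```

Note that in local mode the driver and executor share one JVM, so the final println sees the updates; on a real cluster the driver's copy of the singleton would stay at 0.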

Can the task slots in one executor be shared by tasks from different Spark applications?

I have read some literature about Spark task scheduling, and found some papers mentioning that an Executor is monopolized by only one application at any given moment.
So I am wondering whether the task slots in one executor can be shared by different Spark applications at the same time?

Multiple spark streaming contexts on one worker

I have a single-node cluster with 2 CPUs, where I want to run 2 Spark Streaming jobs.
I also want to use submit mode "cluster". I am using the standalone cluster manager.
When I submit one application, I see that the driver consumes 1 core and the worker 1 core.
Does that mean there are no cores available for the other streaming job? Can 2 streaming jobs reuse executors?
This is totally confusing me, and I don't find it clearly explained in the documentation.
Srdjan
Does it mean that there are no cores available for other streaming job?
If you have a single worker with 2 CPUs and you're deploying in cluster mode, then you'll have no available cores, as the worker has to spend a dedicated core on the driver process running on your worker machine.
Can 2 streaming jobs reuse executors?
No, each job needs to allocate dedicated resources given by the cluster manager. If one job is running with all available resources, the next scheduled job will be in WAITING state until the first completes. You can see it in the Spark UI.
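If you do want two streaming jobs to coexist on a small cluster, one option (a sketch; the app name and source are made up, and the master URL would come from spark-submit) is to cap each application's share with spark.cores.max so the first job cannot grab every core. Note that receiver-based sources need an extra core per receiver, which is why this sketch uses a receiverless queue stream:

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CappedStreamingJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("streaming-job-1")
      // Standalone mode: cap this app at 1 core so a second job
      // is not left WAITING for resources.
      .set("spark.cores.max", "1")

    val ssc = new StreamingContext(conf, Seconds(5))

    // Receiverless input: socket or Kafka-receiver sources would each
    // occupy a core just to receive, starving processing on 1 core.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(1 to 100))
    ssc.queueStream(queue).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```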

Role of the Executors on the Spark master machine

In a Spark standalone cluster, does the Master node run tasks as well? I wasn't sure whether Executor processes are spun up on the Master node to do work alongside the Worker nodes.
Thanks!
Executors are only started on nodes that run at least one worker daemon, i.e., no executor is started on a node that does not serve as a Worker.
However, where to start the Master and Workers is entirely your decision; there is no limitation preventing the Master and a Worker from co-locating on the same node.
To start a worker daemon on the same machine as your master, you can either edit the conf/slaves file to add the master's IP and use start-all.sh at start time, or start a worker on the master node at any time with start-slave.sh, supplying the Spark master URL --master spark://master-host:7077.
Update (based on Daniel Darabos's suggestion):
In the Application Detail UI's Executors tab you will also find a row whose Executor ID is <driver>. The driver it denotes is the process where your job is scheduled and monitored: it runs the main program you submitted to the Spark cluster, slicing your transformations and actions on RDDs into stages, scheduling the stages as TaskSets, and arranging for executors to run the tasks.
This <driver> is started on the node where you call spark-submit in client mode, or on one of the worker nodes in cluster mode.

What is the relationship between workers, worker instances, and executors?

In Spark Standalone mode, there are master and worker nodes.
Here are few questions:
Do 2 worker instances mean one worker node with 2 worker processes?
Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Is there a flow chart explaining how Spark works at runtime, such as a word count?
Extending the other great answers, I would like to describe this with a few images.
In Spark Standalone mode, there are master nodes and worker nodes.
We can represent both the master and the workers (each worker can have multiple executors if CPU and memory are available) in one picture for standalone mode.
If you are curious about how Spark works with YARN? check this post Spark on YARN
1. Do two worker instances mean one worker node with two worker processes?
In general, a worker instance is called a slave, as it is a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:
1 Node = 1 Worker process
2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Yes, a worker node can hold multiple executors (processes) if it has sufficient CPU, memory, and storage.
Check the Worker node in the given image.
BTW, the number of executors on a worker node at a given point in time entirely depends on the workload on the cluster and the capability of the node to run that many executors.
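To make that concrete, a hedged configuration sketch (standalone mode, Spark 1.4+; the app name is made up): with spark.cores.max set to 4 and spark.executor.cores set to 2, the master launches two 2-core executors for the application, and both can land on the same worker if that node has enough free cores and memory.

```scala
import org.apache.spark.SparkConf

object ExecutorSizingDemo {
  // Sketch of the knobs that determine how many executors an app gets
  // in standalone mode; pass this conf to a SparkContext submitted
  // against a master such as spark://master-host:7077.
  def buildConf(): SparkConf = new SparkConf()
    .setAppName("executor-sizing-demo")
    .set("spark.cores.max", "4")        // total cores for this app
    .set("spark.executor.cores", "2")   // cores per executor => 2 executors
    .set("spark.executor.memory", "2g") // memory per executor
}
```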
3. Is there a flow chart explaining how Spark works at runtime?
If we look at the execution from Spark's perspective over any resource manager, consider a program that joins two RDDs, does some reduce operation, and then a filter.
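As a concrete stand-in for that flow (a minimal sketch with made-up data), the program below joins two pair RDDs, reduces by key, and then filters; each shuffle (the join and the reduceByKey) starts a new stage, and each stage runs as one task per partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinReduceFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("join-reduce-filter"))

    val orders = sc.parallelize(Seq(("u1", 10.0), ("u2", 5.0), ("u1", 7.5)))
    val users  = sc.parallelize(Seq(("u1", "alice"), ("u2", "bob")))

    val result = orders
      .join(users)                                // shuffle => stage boundary
      .map { case (_, (amount, name)) => (name, amount) }
      .reduceByKey(_ + _)                         // shuffle => stage boundary
      .filter { case (_, total) => total > 10.0 } // narrow, same stage

    result.collect().foreach(println)             // action: triggers the job

    sc.stop()
  }
}
```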
HTH
I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes.
Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more than one worker per machine. So two worker nodes typically means two machines, each a Spark worker.
Workers hold many executors, for many applications. One application has executors on many workers.
Your third question is not clear.
I know this is an old question and Sean's answer was excellent. My writeup is about SPARK_WORKER_INSTANCES, mentioned in MrQuestion's comment. If you use Mesos or YARN as your cluster manager, you are able to run multiple executors on the same machine with one worker, so there is really no need to run multiple workers per machine. However, if you use the standalone cluster manager, it currently still allows only one executor per worker process on each physical machine. Thus, if you have a super large machine and would like to run multiple executors on it, you have to start more than 1 worker process. That's what SPARK_WORKER_INSTANCES in spark-env.sh is for. The default value is 1. If you do use this setting, make sure you set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
This standalone cluster manager limitation should go away soon. According to SPARK-1706, this issue will be fixed and released in Spark 1.4.
As Lan was saying, the use of multiple worker instances is only relevant in standalone mode. There are two reasons why you would want multiple instances: (1) garbage collector pauses can hurt throughput for large JVMs, and (2) heap sizes over 32 GB can't use CompressedOops.
Read more about how to set up multiple worker instances.
