How many JVM processes are used for a Spark worker?

Each Spark executor runs in its own JVM process, which means that on each worker (slave) there will be multiple JVMs running. Is it safe to say that each worker runs as many JVMs as there are executors assigned to that machine, plus at least one more JVM (because Spark needs at least one more JVM for the BlockManager on each worker)? In other words, does the BlockManager on each worker run in a separate JVM process?

Which cluster manager are you using?
Spark uses cluster managers like Kubernetes, Mesos, or YARN for resource allocation. Where each JVM runs is decided by the cluster manager; Spark, acting as a client, requests resources from it.
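As an illustration, here is a minimal Scala sketch (application name and numbers are made up for the example) of how the cluster manager is selected through the master URL and how many executor JVMs are requested from it:

import org.apache.spark.sql.SparkSession

// Minimal sketch: the master URL picks the cluster manager ("yarn",
// "k8s://https://<api-server>:6443", "mesos://host:5050", or
// "spark://host:7077" for standalone). The chosen manager then decides
// where the requested executor JVMs actually run.
val spark = SparkSession.builder()
  .appName("jvm-layout-demo")                 // hypothetical application name
  .master("yarn")                             // usually passed via spark-submit --master
  .config("spark.executor.instances", "4")    // request 4 executor JVMs
  .config("spark.executor.cores", "2")        // cores per executor JVM
  .config("spark.executor.memory", "2g")      // heap per executor JVM
  .getOrCreate()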

Related

Is a broadcast variable in Spark moved to each executor or each NodeManager in YARN?

In YARN, how is a broadcast variable distributed across nodes? I am confused whether it goes to each executor under a NodeManager, or whether there is only one copy per NodeManager in the cluster.
Please let me know. Thanks in advance.
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors.
To understand broadcast variable behavior, it helps to understand how the Spark life cycle works:
1. The Spark driver is started by YARN.
2. It creates a DAG for the job.
3. The job consists of mapping and reducing tasks.
4. These mapping and reducing tasks are run in an executor (i.e. a separate JVM process with its own thread pool).
5. Each executor gets its own copy of the broadcast variable at initialization time. The broadcast variable is distributed by Spark, not by YARN.
Step 3 is repeated depending on the volume of data (or continuously for a Spark Streaming job), but there is always a single copy of the broadcast variable per executor, and it stays with the executor until the executor goes down.
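For reference, here is a minimal Scala sketch of a broadcast variable (the lookup map and values are made up for the example): it is shipped once per executor JVM and read by every task on that executor via .value.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-demo").getOrCreate()  // hypothetical app name
val sc = spark.sparkContext

// A small read-only lookup table, shipped once to each executor JVM.
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

val codes = sc.parallelize(Seq("DE", "FR", "DE"))
// Every task running on an executor reads that executor's local copy via .value.
val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
named.collect().foreach(println)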
If by node manager you are referring to Spark worker nodes, then I don't think the worker process itself needs the broadcast variable, so it likely does not keep any copy of it.
The node manager (i.e. the Spark worker) is responsible for:
- allocating the required resources
- registering executor to the driver
- maintaining the Driver - Executor communication

Difference between executor and container in Spark

I am trying to clearly understand how memory allocation happens in a YARN-managed cluster. I understand that there is a set of executors (each executor having its own JVM) and that one executor can have one or more vcores during execution.
I am trying to tie this understanding to the YARN configuration, where things are segregated into containers. Each container is actually a mix of some vcores and a fraction of heap memory.
Can someone confirm whether one executor gets one container, or whether one executor can span more than one container? I read some documentation from Cloudera on YARN memory management, and it appears to say that a container has an executor allocated to it.
Cloudera Memory Management
A Spark Executor runs within a YARN Container, not across Containers.
A YARN Container is provided by the YARN ResourceManager on demand - at the start of the Spark Application or via YARN Dynamic Resource Allocation.
A YARN Container can have only one Spark Executor, but one or more Cores can be assigned to that Executor.
Each Spark Executor (and, in cluster mode, the Driver) runs as part of its own YARN Container.
Executors run on a given Worker.
Moreover, everything is in the context of an Application and as such an Application has Executors on many Workers.
When running Spark on YARN, each Spark executor runs as a YARN container.
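A hedged sketch of what this looks like in configuration terms (the numbers are just examples): each executor JVM maps to one YARN container, and the container YARN grants is sized roughly as executor memory plus memory overhead.

import org.apache.spark.sql.SparkSession

// Sketch, assuming Spark on YARN: each executor below occupies one YARN container.
// The container size requested from YARN is roughly executor.memory + memoryOverhead.
val spark = SparkSession.builder()
  .appName("yarn-container-demo")                    // hypothetical application name
  .master("yarn")
  .config("spark.executor.instances", "3")           // 3 containers, one executor JVM each
  .config("spark.executor.cores", "2")               // vcores per container
  .config("spark.executor.memory", "4g")             // executor heap inside the container
  .config("spark.executor.memoryOverhead", "512m")   // off-heap overhead added to the container request
  .getOrCreate()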

Does submitting two applications on the same Apache Spark cluster spawn driver processes in the same JVM or in different JVMs?

I have a question. If I submit two applications to the same Spark cluster in yarn-client mode, will it spawn the driver processes for both applications in the same JVM, or will a separate JVM be spawned for each driver process on the client host?
Also, if I submit two applications to the same Spark cluster in yarn-cluster mode, will it create two different application master processes, one per application, each taking care of its application in a different JVM? Or is only one application master process created for all applications submitted to the cluster, with each driver process running under this one master process in a single JVM?
In client mode, each application uses a separate driver process running in its own local JVM. Each application also has its own remote application master, responsible for requesting resources.
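A small Scala sketch to illustrate (application name and configuration are hypothetical): in yarn-client mode the driver lives in the local JVM that launched the application, so each submission reports a different process.

import java.lang.management.ManagementFactory
import org.apache.spark.sql.SparkSession

// Sketch, assuming yarn-client mode: the driver runs inside this local JVM on the
// client host. Submitting a second application starts a second local JVM with its
// own driver, and YARN creates a separate application master for each application.
val spark = SparkSession.builder()
  .appName("client-mode-driver-demo")             // hypothetical application name
  .master("yarn")
  .config("spark.submit.deployMode", "client")    // usually passed via spark-submit --deploy-mode
  .getOrCreate()

// Prints something like "12345@client-host" - the PID and host of the driver JVM.
println(s"Driver JVM: ${ManagementFactory.getRuntimeMXBean.getName}")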

Why does a Spark executor need to connect with the Worker?

When I kick off a Spark job, I find the Executor startup command line looks like the following:
bin/java -cp /opt/conf/:/opt/jars/* -Xmx1024M -Dspark.driver.port=56559
org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url spark://CoarseGrainedScheduler@10.1.140.2:56559
--executor-id 1 --hostname 10.1.140.5 --cores 2
--app-id app-20161221132517-0000
--worker-url spark://Worker@10.1.140.5:56451
In the above command we can see the line --worker-url spark://Worker@10.1.140.5:56451, and that's what I'm curious about: why does the Executor need to communicate with the Worker? In my mind the executor only needs to talk with other executors and the Driver.
Executors are part of worker nodes.
Application : User program built on Spark. Consists of a driver program and executors on the cluster.
Worker node : Any node that can run application code in the cluster
Executor : A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Source
The executor's fate is tied to the worker's fate. If the worker is abnormally terminated, executors have to be able to detect this fact and stop themselves. Without this mechanism one could end up with "ghost" executors.

Is it possible to run multiple Spark applications on a Mesos cluster?

I have a Mesos cluster with 1 master and 3 slaves (with 2 cores and 4 GB RAM each) that already has a Spark application up and running. I wanted to run another application on the same cluster, as the CPU and memory utilization isn't high. Regardless, when I try to run the new application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new application is not getting any CPU, as the old one occupies all 6 cores.
I have tried enabling dynamic allocation and making the Spark app fine-grained, and I have assigned numerous combinations of executor cores and number of executors. What am I missing here? Is it possible to run multiple Spark frameworks on a Mesos cluster at all?
You can try setting spark.cores.max to limit the number of cores used by each Spark application (i.e. per driver), which will free up resources for the other framework.
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling
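For example, a hedged Scala sketch (the Mesos master URL, application name, and numbers are made up): capping the first application at 2 cores leaves the remaining 4 cores free for a second framework to receive offers.

import org.apache.spark.sql.SparkSession

// Sketch, assuming coarse-grained Mesos mode: spark.cores.max caps the total
// cores this application will take across the cluster, leaving the rest for
// other frameworks.
val spark = SparkSession.builder()
  .appName("capped-app")                       // hypothetical application name
  .master("mesos://mesos-master:5050")         // hypothetical Mesos master URL
  .config("spark.cores.max", "2")              // take at most 2 cores in total
  .config("spark.executor.memory", "1g")
  .getOrCreate()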
