Connecting to remote Spark Cluster [duplicate] - apache-spark

I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN, and each piece of the architecture listed below runs in Docker.
I have the following configuration:
master on machine 1 (port 7077 exposed)
worker on machine 1
driver on machine 2
I use a test application that opens a file and counts its lines.
The application works when the file is replicated on all workers and I use SparkContext.readText().
But when the file is only present on one worker and I'm using SparkContext.parallelize() to access it on the workers, I get the following output:
INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores
which repeats over and over without the application ever actually computing anything.
This works when I put the driver on the same PC as the worker, so I guess there is some kind of connection that needs to be allowed between the two across the network. Are you aware of a way to do that (which ports to open, which addresses to add to /etc/hosts, ...)?

TL;DR Make sure that spark.driver.host:spark.driver.port can be accessed from each node in the cluster.
In general you have to ensure that all nodes (both executors and the master) can reach the driver.
In cluster mode, where the driver runs on one of the worker nodes, this is satisfied by default, as long as no ports are closed to these connections (see below).
In client mode, the machine on which the driver has been started has to be accessible from the cluster. This means that spark.driver.host has to resolve to an address reachable from the cluster nodes.
In both cases keep in mind that by default the driver runs on a random port. It is possible to use a fixed one by setting spark.driver.port, though obviously this doesn't work well if you want to submit multiple applications from the same host at the same time.
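For illustration only, a minimal sketch of how these two settings could be fixed when building the context; the address and port are placeholders for whatever is reachable from the cluster, not values taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

// 192.168.1.50 stands in for the driver machine's LAN-visible address,
// 40000 for a port that is open in its firewall (both hypothetical).
val conf = new SparkConf()
  .setAppName("line-count-test")
  .setMaster("spark://machine1:7077")
  .set("spark.driver.host", "192.168.1.50")  // address the workers can reach
  .set("spark.driver.port", "40000")         // fixed port instead of a random one
val sc = new SparkContext(conf)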
Furthermore:
when the file is only present on the worker
won't work. All inputs have to be accessible from the driver as well as from each executor node.
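To illustrate the difference (paths and data below are made up): parallelize distributes data that already lives in the driver program, while file-based inputs have to sit at a path that the driver and every executor can read (a shared mount, HDFS, etc.). This sketch assumes an existing SparkContext named sc:

// Data created in the driver: parallelize ships it to the executors itself.
val fromDriver = sc.parallelize(1 to 100)
println(fromDriver.count())

// File-based input: /shared/data.txt is a hypothetical path that must be
// readable from the driver and from every worker node.
val fromFile = sc.textFile("/shared/data.txt")
println(fromFile.count())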

Related

Spark communication between executors on different nodes within a cluster

TLDR: How do executors communicate with each other? Is passwordless ssh between workers (i.e., machines hosting executors) necessary for executor-executor communication?
While setting up a Spark cluster, I only enabled passwordless ssh between the driver node and the worker nodes on which the executors run. I am surprised that executors running on different nodes in the cluster are able to communicate (e.g. for a shuffle operation) without me having configured anything explicitly. I am following the book 'Spark - The Definitive Guide' (1st edition), and Fig 15-6 there
shows red arrows between the driver and the executors, but not between executors. Am I missing something in my understanding? I believe executors should communicate among themselves, but the figure does not show it, nor did I ever configure inter-executor communication, and yet everything works fine.
I see a related question here but did not fully see how executors communicate between themselves without me setting up passwordless ssh. Thank you.

Spark standalone connection driver to worker


Installing Spark on four machines

I want to run my Spark tasks on four Amazon EC2 instances whose IPs I all know.
I want one machine to be the master and the other three to run worker nodes. Can someone help me with how I should configure Spark for this task? Should it be standalone? I know how to set the master node using
setMaster("SPARK://masterIP:7070");
but how do I define the worker nodes and assign them to the above master node?
If you are configuring your Spark cluster manually, you can start a standalone master server by executing:
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
ADDING WORKERS :
Now you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh spark://HOST:PORT
Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
For more info you can check the Spark documentation on starting a cluster manually.
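Tying this back to the setMaster call in the question: the master URL has to use the lowercase spark:// scheme and the port the master actually reports (7077 by default for standalone, not 7070). A hedged sketch, with masterIP standing in for the EC2 master instance's address:

import org.apache.spark.{SparkConf, SparkContext}

// masterIP is a placeholder for the master instance's address or hostname.
val conf = new SparkConf()
  .setAppName("my-job")
  .setMaster("spark://masterIP:7077")  // lowercase scheme, standalone default port 7077
val sc = new SparkContext(conf)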
EDIT
TO RUN WORKERS FROM MASTER
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. Note that the master machine accesses each of the worker machines via ssh (there should be passwordless ssh between the master and worker machines).
After configuring the conf/slaves file you should run two scripts:
sbin/start-master.sh - starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - starts a slave instance on each machine specified in the conf/slaves file.
For more info check Cluster Launch Scripts
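As a minimal sketch of what that could look like (the three worker hostnames are made up):

# conf/slaves - one worker hostname per line (hypothetical names)
worker1.example.internal
worker2.example.internal
worker3.example.internal

# then, on the master machine:
sbin/start-master.sh
sbin/start-slaves.sh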

In Spark's client mode, the driver needs network access to remote executors?

When using Spark in client mode (e.g. yarn-client), does the local machine that runs the driver communicate directly with the cluster worker nodes that run the remote executors?
If yes, does it mean the machine (that runs the driver) needs to have network access to the worker nodes? So the master node requests resources from the cluster and returns the IP addresses/ports of the worker nodes to the driver, so the driver can initiate the communication with the worker nodes?
If not, how does client mode actually work?
If yes, does it mean that client mode won't work if the cluster is configured in a way that the worker nodes are not visible outside the cluster, and one will have to use cluster mode?
Thanks!
The driver connects to the Spark master, requests a context, and then the Spark master passes the workers the details of the driver so they can communicate with it and get instructions on what to do.
This means that the driver node must be available on the network to the workers, and its IP must be one that's visible to them (i.e. if the driver is behind NAT while the workers are in a different network, it won't work, and you'll see errors on the workers saying that they fail to connect to the driver).
When you run Spark in client mode, the driver process runs locally.
In cluster mode, it runs remotely inside an ApplicationMaster.
In other words, you will need all the nodes to see each other. The Spark driver definitely needs to communicate with all the worker nodes. If this is a problem, try to use yarn-cluster mode; then the driver will run inside your cluster on one of the nodes.
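For illustration, the switch between the two modes is just the deploy mode given to spark-submit; the class and jar names below are placeholders:

# client mode (yarn-client): the driver runs on the submitting machine,
# which must therefore be reachable from the worker nodes
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# cluster mode (yarn-cluster): the driver runs inside the cluster, in the ApplicationMaster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar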

How can I run an Apache Spark shell remotely?

I have a Spark cluster setup with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM which would connect to the master, and allow me to execute simple Scala code. So, here is the command I run on my local VM:
bin/spark-shell --master spark://spark01:7077
The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3 - one for each worker). If I peek at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2 / 2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code:
sc.parallelize(1 to 100).count
Unfortunately, the command doesn't work. The shell will just print the same warning endlessly:
INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Following my research into the issue, I have confirmed that the master URL I am using is identical to the one on the web UI. I can ping and ssh both ways (cluster to local VM, and vice-versa). Moreover, I have played with the executor-memory parameter (both increasing and decreasing the memory) to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.
TL;DR Is it possible to run an Apache Spark shell remotely (and inherently submit applications remotely)? If so, what am I missing?
EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:
ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...
Spark is installed in a different directory on my local VM than on the cluster. The path the worker is attempting to find is the one on my local VM. Is there a way for me to specify this path? Or must they be identical everywhere?
For the moment, I adjusted my directories to circumvent this error. Now, my Spark Shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers have the same error:
ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark02:7078] -> [akka.tcp://sparkExecutor@spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor@spark02:53633]]
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark02:53633]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$annon2: Connection refused: spark02/192.168.64.2:53633
As suspected, I am running into network issues. What should I look at now?
I solved this problem on my Spark client and Spark cluster.
Check your network: client A and the cluster must be able to ping each other. Then add two configuration lines to spark-env.sh on client A.
First:
export SPARK_MASTER_IP=172.100.102.156
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
Second:
Test your Spark shell against the cluster!
This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):
actual resource shortage
broken communication between master and workers
broken communication between master/workers and driver
The easiest way to exclude the first two possibilities is to run a test with a Spark shell running directly on the master. If this works, communication within the cluster itself is fine and the problem is caused by the communication to the driver host. To further analyze the problem it helps to look into the worker logs, which contain entries like
16/08/14 09:21:52 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java"
...
"--driver-url" "spark://CoarseGrainedScheduler@192.168.1.228:37752"
...
and test whether the worker can establish a connection to the driver's IP/port. Apart from general firewall / port forwarding issues, it might be possible that the driver is binding to the wrong network interface. In this case you can export SPARK_LOCAL_IP on the driver before starting the Spark shell in order to bind to a different interface.
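For example, a hedged sketch using the addresses from the log excerpt above (adjust to your own setup): check from a worker that the driver's port is reachable, and if the driver picked the wrong interface, pin it before starting the shell:

# from a worker node: can we reach the driver's scheduler port?
nc -vz 192.168.1.228 37752

# on the driver host: bind to the interface the workers can reach, then start the shell
export SPARK_LOCAL_IP=192.168.1.228
bin/spark-shell --master spark://spark01:7077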
Some additional references:
Knowledge base entry on network connectivity issues.
Github discussion on improving the documentation of Initial job has not accepted any resources.
