spark-jobserver: Worker does not connect back to the driver - apache-spark

I set up a small Spark environment on two machines. One runs a master and a worker, and the other runs a worker only. I can use this cluster from the Spark shell like this:
spark-shell --master spark://mymaster.example.internal:7077
I can run computations in there that get distributed to the nodes correctly, so everything runs fine.
However, I am having trouble when using the spark-jobserver.
My first attempt was to start the Docker container (with the environment variable SPARK_MASTER pointing to the correct master URL). When the job started, the worker it was pushed to complained that it couldn't connect back to 172.18.x.y:nnnn. That made sense, because this was the internal IP address of the Docker container the jobserver ran in.
So I ran the jobserver container again with --network host so that it attached itself to the host network. However, starting the job led to a Connection refused again, this time saying it couldn't connect to 172.30.10.10:nnnn. 172.30.10.10 is the IP address of the host I want to run the jobserver on, and it IS reachable from both worker and master nodes (the Spark instances run in Docker containers too, but they are also attached to the host network).
Digging deeper, I tried to start a Docker container which just has a JVM and Spark inside, ran it with --network host too and launched a Spark job from inside. This worked.
What might I be missing?

It turned out that I had not started the shuffle service. I had configured my custom jobserver container to use dynamic allocation, and this requires the external shuffle service to be running.
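For anyone hitting the same thing, here is a minimal sketch of a sanity check for that combination. It assumes the configuration is available as a plain dict of property names to string values; the function name is made up for illustration. The property names spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are the standard Spark ones. The check is deliberately simplified and only covers the setup described above.

```python
def missing_for_dynamic_allocation(conf):
    """Return the settings dynamic allocation needs but `conf` lacks.

    `conf` is a plain dict of Spark property names to string values
    (an assumption of this sketch, not a Spark API).
    """
    def enabled(key):
        return conf.get(key, "false").strip().lower() == "true"

    missing = []
    # Dynamic allocation as used in the answer above relies on the
    # external shuffle service being enabled.
    if enabled("spark.dynamicAllocation.enabled") and not enabled(
        "spark.shuffle.service.enabled"
    ):
        missing.append("spark.shuffle.service.enabled=true")
    return missing
```

In standalone mode the workers themselves also have to be started with spark.shuffle.service.enabled set to true, so the check on the application side is only half of the picture.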

Related

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change the SPARK_LOCAL_IP to the host ip but if I do so, SparkUI is unable to start, blocking the process a step before.
Thanks in advance for any ideas or advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts a SparkContext and serves as a SparkDriver instance for the yarn-client deployment mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate in both directions with the YARN ApplicationMaster and all Spark worker machines in the cluster.
This implies that you need a number of ports opened and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
Another approach can be running Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
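To illustrate the port-forwarding approach, here is a rough sketch that pins a few driver-side ports and derives the matching docker run and Spark flags from one table, so the two cannot drift apart. The port numbers and function names are hypothetical. spark.driver.port and spark.ui.port come from the question; spark.blockManager.port is an assumption about which other port needs pinning, and a real setup may need more than these three.

```python
# Hypothetical fixed ports; pick values that are free on your host.
PINNED_PORTS = {
    "spark.driver.port": 40000,
    "spark.blockManager.port": 40001,
    "spark.ui.port": 4040,
}


def docker_publish_flags(ports):
    """Build `-p HOST:CONTAINER` flags, mapping each port to itself."""
    flags = []
    for port in sorted(set(ports.values())):
        flags += ["-p", f"{port}:{port}"]
    return flags


def spark_conf_flags(ports):
    """Build `--conf key=value` flags that pin the same ports in Spark."""
    flags = []
    for key in sorted(ports):
        flags += ["--conf", f"{key}={ports[key]}"]
    return flags
```

The first list would go on the docker run command line and the second into the interpreter's Spark settings (or a spark-submit call).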

Addressing issues with Apache Spark application run in Client mode from Docker container

I'm trying to connect to Standalone Apache Spark cluster from a dockerized Apache Spark application using Client mode.
Driver gives the Spark Master and the Workers its address. When run inside a docker container it will use some_docker_container_ip. The docker address is not visible from outside so an application won't work.
Spark has spark.driver.host property. This property is passed to Master and Workers. My initial instinct was to pass host machine address in there so the cluster would address visible machine instead.
Unfortunately, spark.driver.host is also used by the Driver to set up its own server. Passing the host machine address there causes server startup errors, because a process inside a Docker container cannot bind to the host machine's address.
It seems like a lose-lose situation. I can use neither the host machine address nor the Docker container address.
Ideally I would like to have two properties. The spark.driver.host-to-bind-to used to set up the driver server and the spark.driver.host-for-master which would be used by Master and Workers. Unfortunately it seems like I'm stuck with one property only.
Another approach would be to use --net=host when running a docker container. This approach has many disadvantages (e.g. other docker containers cannot get linked to a container with the --net=host on and must be exposed outside of the docker network) and I would like to avoid it.
Is there any way I could solve the driver-addressing problem without exposing the docker containers?
This problem is fixed in https://github.com/apache/spark/pull/15120
The fix will be part of the Apache Spark 2.1 release.
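As far as I can tell, from Spark 2.1 on the two roles the question asks for are covered by two separate properties: spark.driver.host is the address advertised to the Master and Workers, and spark.driver.bindAddress is the address the driver actually binds to. A minimal sketch, assuming the driver can bind to all interfaces inside the container (the function name and the 0.0.0.0 default are illustrative choices, not something the answer specifies):

```python
def driver_network_conf(advertised_host, bind_address="0.0.0.0"):
    """Split "address to advertise" from "address to bind to".

    advertised_host: address the cluster uses to reach the driver,
                     e.g. the Docker host's IP.
    bind_address:    address the driver binds to inside the container
                     (assumed default: all interfaces).
    """
    return {
        "spark.driver.host": advertised_host,
        "spark.driver.bindAddress": bind_address,
    }
```

The driver ports still have to be published from the container to the host for this to work without --net=host.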

Spark: How to specify the IP for the driver program to run

I am having trouble configuring a specific Spark node as the driver in my cluster. I have a standalone-mode cluster. Every time the master restarts, I see that one of the nodes in the cluster is picked at random to run the driver program. Because of this, I am forced to deploy my JAR on all the nodes in my cluster.
If I could specify the IP on which the driver program runs, I would need to deploy the JAR on only one node.
I would appreciate any help.
If you want to run from a particular node you can use:
--deploy-mode client
With this option the driver program will always run on the machine from which you run spark-submit.
For more information:
http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
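As a small illustration, this sketch assembles such a spark-submit call as an argument list. The flags --master, --deploy-mode and --class are the standard spark-submit ones; the function name, master URL, JAR path and class name used with it are placeholders.

```python
def client_mode_submit_args(master_url, app_jar, main_class):
    """Build a spark-submit argument list that keeps the driver local.

    With --deploy-mode client the driver runs on the machine where this
    command is executed, so the JAR only needs to exist on that machine.
    """
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "client",
        "--class", main_class,
        app_jar,
    ]
```

The resulting list can be passed to subprocess.run on the node that should host the driver.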

Installing Spark on four machines

I want to run my Spark tasks using four Amazon EC2 instances whose IPs I all know.
I want to have one machine as the master, and the other three could run worker nodes. Can someone help me with how I should configure Spark for this task? Should it be standalone? I know how to set the master node using
setMaster("spark://masterIP:7070");
but how to define worker nodes and assign them to the above master node?
If you are configuring your Spark cluster manually, you can start a standalone master server by executing:
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
ADDING WORKERS :
Now you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh <master-spark-URL>
Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
For more info, you can check the "Starting a Cluster Manually" section on the Spark website.
EDIT
TO RUN WORKERS FROM MASTER
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. Note that the master machine accesses each of the worker machines via SSH (there should be passwordless SSH between the master and worker machines).
After configuring the conf/slaves file, you should run two scripts:
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
For more info check Cluster Launch Scripts
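Since the question mentions that all worker IPs are known, the conf/slaves file can be generated rather than typed by hand. A minimal sketch follows; the function name is hypothetical, and the only format rule it relies on is the one stated above, one hostname per line.

```python
def render_slaves_file(hosts):
    """Return conf/slaves content: one worker hostname or IP per line.

    Blank entries and duplicates are dropped; input order is preserved.
    """
    seen = []
    for host in hosts:
        host = host.strip()
        if host and host not in seen:
            seen.append(host)
    return "".join(f"{host}\n" for host in seen)
```

Writing the returned string to conf/slaves in the Spark directory on the master is then enough for sbin/start-slaves.sh to pick the workers up.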

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a spark job running in client mode needs to connect to the nodes directly and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
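A small sketch of how those two variables could be prepared before launching the driver. The function name and the example IP are made up; the only names taken from the answer are LIBPROCESS_IP and SPARK_LOCAL_IP.

```python
import ipaddress
import os


def driver_env(reachable_ip, base_env=None):
    """Return an environment with both variables set to `reachable_ip`.

    `reachable_ip` must be an address the cluster nodes can connect back
    to (for example the VPN address of this host). Raises ValueError if
    the string is not a valid IP address.
    """
    ipaddress.ip_address(reachable_ip)  # validation only
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_IP"] = reachable_ip
    env["SPARK_LOCAL_IP"] = reachable_ip
    return env
```

The returned dict can be passed as env= to subprocess.run when starting the PySpark job, which avoids changing the environment of the calling shell.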

Resources