How to expose Spark Driver behind dockerized Apache Zeppelin? - apache-spark

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
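For reference, a minimal sketch of how that configuration might be wired together. The image name my-zeppelin-spark and the concrete numbers 40001 and 4040 are placeholders standing in for PORT1 and PORT2, and the settings could equally go into the Spark interpreter properties instead of zeppelin-env.sh:

# publish the Zeppelin UI, the driver port (PORT1) and the Spark UI port (PORT2)
docker run -d -p 8080:8080 -p 40001:40001 -p 4040:4040 my-zeppelin-spark

# conf/zeppelin-env.sh inside the container
export SPARK_LOCAL_IP=172.17.0.2
export SPARK_SUBMIT_OPTIONS="--conf spark.driver.host=172.17.0.2 --conf spark.driver.port=40001 --conf spark.ui.port=4040"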
I have the feeling I should set SPARK_LOCAL_IP to the host IP instead, but if I do so, the Spark UI is unable to start, which blocks the process one step earlier.
Thanks in advance for any ideas or advice!

Good question! First of all, as you know, Apache Zeppelin runs its interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts the SparkContext and acts as the Spark driver for the yarn-client deploy mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth with the YARN ApplicationMaster and with all the worker machines of the cluster.
This implies that you have to open a number of ports and forward them manually between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us seven ports to get everything working.
Another approach is to run Docker networking in host mode (though this apparently does not work on OS X, due to a recent bug).
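A rough sketch of that host-mode alternative, assuming a Linux Docker host (the image name is again a placeholder); since the container shares the host's network stack, no -p mappings are needed and the driver can advertise the host machine's own address:

# share the host network stack with the container
docker run -d --net=host my-zeppelin-spark

# the driver can then bind to and advertise a host address
export SPARK_LOCAL_IP=<host_machine_ip>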

Related

spark-jobserver: Worker does not connect back to the driver

I set up a small Spark environment on two machines. One runs a master and a worker, and the other one runs a worker only. I can use this cluster via the Spark shell, like:
spark-shell --master spark://mymaster.example.internal:7077
I can run computations in there that get distributed to the nodes correctly, so everything runs fine.
However, I am having trouble when using the spark-jobserver.
My first try was to start the Docker container (with the environment variable SPARK_MASTER pointing to the correct master URL). When the job was started, the worker it was pushed to complained that it couldn't connect back to 172.18.x.y:nnnn. That was no surprise, since this is the internal IP address of the Docker container the jobserver ran in.
So, I ran the jobserver container again with --network host so it attached itself to the host network. However, starting the job led to a Connection refused again, this time saying it couldn't connect to 172.30.10.10:nnnn. 172.30.10.10 is the IP address of the host I want to run the jobserver on and it IS reachable from both worker and master nodes (The Spark instances run in Docker containers too, but they are also attached to the host network).
Digging deeper, I tried to start a Docker container which just has a JVM and Spark inside, ran it with --network host too and launched a Spark job from inside. This worked.
What might I be missing?
It turned out that I had missed starting the shuffle service. I had configured my custom jobserver container to use dynamic allocation, and that requires the external shuffle service to be running.
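A minimal sketch of the two missing pieces, assuming a standard Spark standalone setup (the property names are stock Spark settings; where exactly they go may differ in a custom jobserver image):

# spark-defaults.conf (or the jobserver's context configuration)
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true

# start the external shuffle service on every worker node
$SPARK_HOME/sbin/start-shuffle-service.sh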

Addressing issues with Apache Spark application run in Client mode from Docker container

I'm trying to connect to Standalone Apache Spark cluster from a dockerized Apache Spark application using Client mode.
The driver gives the Spark Master and the Workers its address. When run inside a Docker container, it will use some_docker_container_ip. That address is not visible from the outside, so the application won't work.
Spark has a spark.driver.host property, which is passed to the Master and the Workers. My initial instinct was to pass the host machine's address there, so the cluster would address the visible machine instead.
Unfortunately, spark.driver.host is also used by the driver to set up its own server. Passing the host machine's address there causes server startup errors, because a Docker container cannot bind ports on the host machine's address.
It seems like a lose-lose situation: I can use neither the host machine's address nor the Docker container's address.
Ideally, I would like to have two properties: a spark.driver.host-to-bind-to, used to set up the driver's server, and a spark.driver.host-for-master, which would be given to the Master and the Workers. Unfortunately, it seems I'm stuck with only one property.
Another approach would be to use --net=host when running the Docker container. This approach has several disadvantages (e.g. other Docker containers cannot be linked to a container running with --net=host and must be exposed outside of the Docker network), and I would like to avoid it.
Is there any way I could solve the driver-addressing problem without exposing the docker containers?
This problem is fixed in https://github.com/apache/spark/pull/15120.
It will be part of the Apache Spark 2.1 release.
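If I remember correctly, that change adds a separate spark.driver.bindAddress setting, so from Spark 2.1 onwards the split described above becomes possible, roughly like this (the addresses and port are placeholders):

spark-submit <other options> \
  --conf spark.driver.host=<host_machine_ip> \
  --conf spark.driver.bindAddress=<container_ip> \
  --conf spark.driver.port=<published_port>

# spark.driver.host        is the address advertised to the Master and Workers
# spark.driver.bindAddress is the address the driver actually binds to inside the container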

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is what is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver runs on the machine where you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set both LIBPROCESS_IP and SPARK_LOCAL_IP to an IP of the host machine that the cluster nodes can use to connect back to the driver.
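A short sketch of how that might look before submitting, where <routable_ip> is a host address the cluster nodes can reach (for example the VPN address) and my_job.py stands for your PySpark script:

export LIBPROCESS_IP=<routable_ip>
export SPARK_LOCAL_IP=<routable_ip>
spark-submit --master mesos://<mesos_master>:5050 my_job.py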

Spark Mesos Dispatcher

My team is deploying a new Big Data architecture on Amazon Cloud. We have Mesos up and running Spark jobs.
We are submitting Spark jobs (i.e. jars) from a bastion host inside the same cluster. When doing so, however, the bastion host runs the driver program, which is called client mode (if I understood correctly).
We would like to try the cluster mode, but we don't understand where to start the dispatcher process.
The documentation says to start it in the cluster, but I'm confused, since our masters don't have Spark installed and we use ZooKeeper for master election. Starting it on a slave node is not a viable option, since a slave can fail and we don't want to expose a slave IP or public DNS to the bastion host.
Is it correct to start the dispatcher on the bastion host?
Thank you very much
The documentation is not very detailed. However, we are quite happy with what we discovered:
According to the documentation, cluster mode is not supported for Mesos clusters (nor for Python applications).
However, we started the dispatcher using --master mesos://zk://...
For submitting applications, you need the following:
spark-submit --deploy-mode cluster <other options> --master mesos://<dispatcher_ip>:7077 <ClassName> <jar>
If you run this command from a bastion machine, it won't work, because the Mesos master will look for the submittable jar in the same path as on the bastion. We ended up exposing the file as a downloadable URL.
Hope this helps
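For completeness, a sketch of starting the dispatcher itself on the bastion with the script shipped in the standard Spark distribution (the ZooKeeper URL is a placeholder):

# launches the MesosClusterDispatcher; it accepts cluster-mode submissions on port 7077 by default
$SPARK_HOME/sbin/start-mesos-dispatcher.sh --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos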
I haven't used cluster mode on Mesos, and the cluster-mode description is not very detailed. There isn't even a --help option on the script, like there should be, IMHO. However, if you don't pass the --master argument, it errors out with a help message, and it turns out there is a --zk option for specifying the ZooKeeper URL.
What might work is to launch this script on the bastion itself with the appropriate --master and --zk options. Would that work for you?
You could use a Docker image with Spark and your application.jar inside, instead of uploading the jar to S3. I haven't tried it yet, but I think it should work. The environment variable is SPARK_DIST_CLASSPATH in spark-env.sh. I use a Spark distribution compiled without Hadoop, together with Apache Hadoop 2.7.1:
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath):/opt/hadoop/share/hadoop/tools/lib/*:/opt/application.jar

Cannot run two spark shells on a remote machine

I have Datastax Enterprise 4.6 installed on a cluster - and it's working fine.
Then I have another instance of DSE 4.6 installed on a separate machine (let's call it the "build_node"; it is located in the same data center as the cluster), and I would like to use it to submit jobs to the master on the cluster.
As a starting point, I run a Spark shell from the build_node against the remote master (the one on the cluster) and execute a trivial command such as sc.parallelize(1 to 100).count, and it works fine.
With that Spark shell still open, I start another Spark shell in a second terminal: it complains that Service 'SparkUI' could not bind on port 4040. Attempting port 4041. (reasonable), but it nonetheless seems to connect to the remote Spark master. However, when I execute the very same trivial command as above, after a while it starts issuing warning messages such as:
15/01/12 12:37:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
which, as is, does not make sense to me, given that I have not changed any memory options and the very same command succeeded in the first shell. It keeps printing that message over and over, and after a couple of minutes I killed the process (the same command in the first shell returned in a few seconds).
Can anyone explain what's going on, and possibly suggest how (if it is possible at all) to run multiple Spark shells from a machine connected to the same remote master?
Of course, any suggestion on how to debug this (logs, parameters to set, etc.) would also be a great help.
FYI, I run the shell with the command: dse spark --master spark://<master ip>:<master port>
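One plausible explanation, not confirmed in this thread: in standalone mode an application claims all available worker cores by default, so the first shell holds every core and the second shell's jobs starve with exactly that warning. Capping the cores each shell may take, for example via spark.cores.max (the file path below is the stock Spark location and may differ under DSE), would let several shells share the cluster:

# conf/spark-defaults.conf on the build_node
spark.cores.max  2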
