Spark Mesos Dispatcher - apache-spark

My team is deploying a new Big Data architecture on Amazon Cloud. We have Mesos up and running Spark jobs.
We are submitting Spark jobs (i.e., jars) from a bastion host inside the same cluster. In this setup, however, the bastion host runs the driver program, which is called client mode (if I understood correctly).
We would like to try the cluster mode, but we don't understand where to start the dispatcher process.
The documentation says to start it in the cluster, but I'm confused since our masters don't have Spark installed and we use Zookeeper for master election. Starting it on a slave node is not a viable option, since a slave can fail and we don't want to expose a slave IP or public DNS to the bastion host.
Is it correct to start the dispatcher on the bastion host?
Thank you very much

Documentation is not very detailed.
However, we are quite happy with what we discovered:
according to the documentation, cluster mode is not supported for Mesos clusters (and for Python applications).
However, we started the dispatcher using --master mesos://zk://...
For submitting applications, you need the following:
spark-submit --deploy-mode cluster <other options> --master mesos://<dispatcher_ip>:7077 <ClassName> <jar>
If you run this command from a bastion machine, it won't work, because the Mesos master will look for the submitted jar at the same path it had on the bastion. We ended up exposing the file as a downloadable URL.
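To make the point above concrete, here is a minimal sketch in Python that assembles such a command and refuses a jar path that only exists on the submitting machine. The host name, class name and jar URL are hypothetical placeholders, and the scheme check is a deliberately conservative assumption (a file: path that really exists on every node would also be fine).

```python
import shlex
from urllib.parse import urlparse


def mesos_cluster_submit(dispatcher_host, main_class, jar_url, port=7077):
    """Build a spark-submit argv for Mesos cluster mode via the dispatcher."""
    scheme = urlparse(jar_url).scheme
    # A bare path or file: URL usually exists only on the bastion,
    # which is the failure described above. Conservative check.
    if scheme in ("", "file"):
        raise ValueError(f"jar must be fetchable from the cluster: {jar_url}")
    return [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", f"mesos://{dispatcher_host}:{port}",
        "--class", main_class,
        jar_url,
    ]


# Hypothetical values, for illustration only.
cmd = mesos_cluster_submit(
    "dispatcher.internal", "com.example.Main", "http://files.internal/app.jar"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as-is; printing it with shlex.join just shows the equivalent shell line.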
Hope this helps

I haven't used cluster mode in Mesos and the cluster mode description is not very detailed. There isn't even a --help option on the script, like there should be, IMHO. However, if you don't pass the --master argument, it errors out with a help message and it turns out there is a --zk option for specifying the Zookeeper URL.
What might work is to launch this script on the bastion itself with the appropriate --master and --zk options. Would that work for you?

You could use a Docker image with Spark and your application.jar instead of uploading the jar to S3. I haven't tried it yet, but I think it should work. The environment variable is SPARK_DIST_CLASSPATH in spark-env.sh. I use a Spark distribution compiled without Hadoop, together with Apache Hadoop 2.7.1:
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath):/opt/hadoop/share/hadoop/tools/lib/*:/opt/application.jar

Related

How are spark jobs submitted in cluster mode?

I know there are ten Google pages' worth of information on this, but all of them tell me to just put --master yarn in the spark-submit command. But, in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running Dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is the same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.
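As a rough sketch of the jobs API route, the snippet below builds a gcloud dataproc jobs submit spark command from Python, with the deploy-mode property from the paragraph above. The cluster name, region, class and jar URI are hypothetical, and it assumes the gcloud CLI is installed and authenticated wherever the command is eventually run.

```python
import shlex


def dataproc_submit_cmd(cluster, region, main_class, jar_uri, app_args=()):
    """Build a gcloud command that submits a Spark job through the Dataproc jobs API."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar_uri}",
        # Ask Spark to run the driver in cluster mode instead of client mode.
        "--properties=spark.submit.deployMode=cluster",
        "--",  # everything after this is passed to the application
        *app_args,
    ]


# Hypothetical names, for illustration only.
dataproc_cmd = dataproc_submit_cmd(
    "my-cluster", "us-central1", "com.example.Main", "gs://my-bucket/app.jar", ["1000"]
)
print(shlex.join(dataproc_cmd))
```

No ssh tunnel or local Hadoop configuration is involved here; the laptop only talks to the Dataproc API.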

Is there a way to submit spark job on different server running master

We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to use it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling a Spark job on Airflow, or an option to run the jobs on a different server that hosts the master.
An answer to this would be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and the YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using SSH protocol via paramiko library) like spark-submit. The benefit of this approach is you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
I personally prefer SSHOperator :)
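For option (3), the REST call itself is small. Here is a minimal sketch using only the Python standard library that prepares (but does not send) a POST to Livy's /batches endpoint. The Livy URL, jar path and class name are hypothetical, and it assumes the jar is somewhere the cluster can read, such as HDFS.

```python
import json
from urllib.request import Request


def livy_batch_request(livy_url, jar_path, main_class, app_args=()):
    """Prepare a POST /batches request that asks Livy to run a Spark application."""
    payload = {
        "file": jar_path,          # must be readable by the cluster, not the Airflow host
        "className": main_class,
        "args": list(app_args),
    }
    return Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and application, for illustration only.
req = livy_batch_request(
    "http://livy.internal:8998", "hdfs:///apps/app.jar", "com.example.Main", ["1000"]
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of urllib.request.urlopen(req); in Airflow the same endpoint and JSON body could be handed to an HTTP operator instead.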

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark jobs using SSHOperator and running the spark-submit command, which saves you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task I perform is running Spark jobs; a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

Specify spark driver for spark-submit

I'm submitting a spark job from a shell script that has a bunch of env vars and parameters to pass to spark. Strangely, the driver host is not one of these parameters (there are driver cores and memory however). So if I have 3 machines in the cluster, a driver will be chosen randomly. I don't want this behaviour since 1) the jar I'm submitting is only on one of the machines and 2) the driver machine should often be smaller than the other machines which is not the case if it's random choice.
So far, I have found no way to specify this parameter on the command line to spark-submit. I've tried --conf SPARK_DRIVER_HOST="172.30.1.123", --conf spark.driver.host="172.30.1.123" and many other things, but nothing has any effect. I'm using Spark 2.1.0. Thanks.
I assume you are running on a YARN cluster. In brief, YARN uses containers to launch and run tasks, and the ResourceManager decides where to run each container based on the availability of resources. In Spark's case, the driver and executors are also launched as containers, each with its own JVM. The driver is dedicated to splitting tasks among the executors and collecting the results from them. If the node from which you launch your application is part of the cluster, it will also be used as a shared resource for launching the driver or executors.
From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
When running the cluster in standalone or in Mesos the driver host (this is the master) can be launched with:
--master <master-url> #e.g. spark://23.195.26.187:7077
When using YARN it works a little differently. Here the parameter is yarn:
--master yarn
With yarn, the ResourceManager is specified in the Hadoop configuration. For how to do this, see this interesting guide: https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/ . Basically, hdfs-site.xml for HDFS and yarn-site.xml for YARN.

Does spark-submit automatically upload the jar to the cluster?

I'm trying to submit a Spark app from local machine Terminal to my Cluster.
I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
When I provide the path to application jar which is in my local machine, would spark-submit automatically upload it to my Cluster?
I'm using
bin/spark-submit \
  --class com.my.application.XApp \
  --master yarn-cluster --executor-memory 100m \
  --num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar \
  1000
and getting error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
Advanced Dependency Management When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But it seems like it does not!
I see you are quoting the spark-submit page from Spark Docs but I would spend a lot more time on the Running Spark on YARN page. Bottom-line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR, if there's a need to add other JARs there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
As background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode, with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.
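A quick way to reason about "globally visible" is to look at the URI scheme of the jar before submitting. The helper below is a simplified sketch based only on the documentation excerpts quoted in this thread; the function name, category labels and the example HDFS and local: paths are made up for illustration, and it does not check that anything actually exists.

```python
from urllib.parse import urlparse


def jar_location(uri):
    """Roughly classify where a jar URI is expected to be found."""
    scheme = urlparse(uri).scheme
    if scheme == "local":
        # Expected to already exist as a local file on each worker node.
        return "node-local"
    if scheme in ("", "file"):
        # A plain filesystem path: only valid where that path really exists.
        return "filesystem-path"
    # hdfs://, http://, etc.: fetched from a location the cluster can reach.
    return "shared"


print(jar_location("/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar"))
print(jar_location("hdfs:///apps/x-service.jar"))
print(jar_location("local:/opt/jars/x-service.jar"))
```

A "filesystem-path" result for a jar that lives only on your laptop is the case to double-check when the driver is going to run somewhere else.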
Try adding --jars option before your /path/to/jar/file
spark-submit --jars /tmp/test.jar
