Apache Spark: JAR file not shipped on spark-submit - apache-spark

Is it normal that Spark won't ship the JAR file (containing the spark application) automatically from master to slave? In earlier versions (and used on Amazon Webservices) it worked! Did this functionality change since version 1.2.2 or is the problem caused by clusters without public dns addresses??? Or is this "copy the jar automatically" function only working in an AWS cluster?
Here my submit call:
./spark-submit --class prototype.Test --master spark://192.168.178.128:7077 --deploy-mode cluster ~/test.jar
Info: the files listed by --jars parameter are "copied" to the workers.

That was my own fault! -> don't use parameter --deploy-mode for usage of a standard cluster, where the driver process is planned to run on the master node.
See Spark documentation: https://spark.apache.org/docs/latest/submitting-applications.html
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client) [...]
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
[...]

Related

Understanding Spark Submit Yarn Client vs Cluster mode [duplicate]

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:
A master machine, which also is where our application is run using spark-submit
2 identical worker machines
From the Spark Documentation, I read:
(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.
Additionally, when I start my application using start-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pro's and con's of using each one?
2) How to I choose which one my application is going to be running on, using spark-submit?
What are the practical differences between Spark Standalone client
deploy mode and cluster deploy mode? What are the pro's and con's of
using each one?
Let's try to look at the differences between client and cluster mode.
Client:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at it's disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of it's own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to reset it's execution.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader
Driver runs as a dedicated, standalone process inside the Worker.
Driver programs takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
Which one is better? Not sure, that's actually for you to experiment and decide. This is no better decision here, you gain things from the former and latter, it's up to you to see which one works better for your use-case.
How to I choose which one my application is going to be running on,
using spark-submit
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
/bin/spark-submit \
--class <main-class>
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Let's say you are going to perform a spark submit in EMR by doing SSH to the master node.
If you are providing the option --deploy-mode cluster, then following things will happen.
You won't be able to see the detailed logs in the terminal.
Since driver is not created in the Master itself, you won't be able to terminate the job from the terminal.
But in case of --deploy-mode client:
You will be able to see the detailed logs in the terminal.
You will be able to terminate the job from the terminal itself.
These are the basic things that I have noticed till now.
I'm also having the same scenario, here master node use a standalone ec2 cluster. In this setup client mode is appropriate. In this driver is launched directly with in the spark-submit process which acts as a client to the cluster. The Input & output of the application is attached to the console.Thus, this mode is especially suitable for applications that involve REPL.
Else if your application is submitted from a machine far from the worker machines then it is quite common to use cluster mode to minimize the network latency b/w driver & executor.

How to config a specific node as a driver node in spark-submit with cluster mode and standalone mode

I have to problem in spark-submit with cluster deploy mode and standalone mode:
How to specify a node as a driver node in spark cluster
in my case, the driver node was assigned dynamically by spark
How to distribute the app automatic from local
in my case, i must deploy the jar of app to every node,because i don't know which node will be the driver node .
PS : My submit command is :
spark-submit --master spark://master_ip:6066 --class appMainClass --deploy-mode cluster file:///tmp/spark_app/sparkrun
The --deploy-mode flag determines if the job will be submitted in cluster or client mode.
In cluster mode all the nodes will act as executors. One node will submit the JAR and then you can track the execution using web UI. That particular node will also act as an executor.
In client mode, the node where the spark-submit is invoked will act as the driver. Note that this node will not execute the DAG as this it is designated as a driver for your cluster. All the other nodes will be executors. Again, Web UI will help to see the execution of jobs and other useful information like RDD partitions, cached RDDs size etc.

How to submit spark Job from locally and connect to Cassandra cluster

Can any one please let me know how to submit spark Job from locally and connect to Cassandra cluster.
Currently I am submitting the Spark job after I login to Cassandra node through putty and submit the below dse-spark-submit Job command.
Command:
dse spark-submit --class ***** --total-executor-cores 6 --executor-memory 2G **/**/**.jar --config-file build/job.conf --args
With the above command, my spark Job able to connect to cluster and its executing, but sometimes facing issues.
So I want to submit spark job from my local machine. Can any one please guide me how to do this.
There are several things you could mean by "run my job locally"
Here are a few of my interpretations
Run the Spark Driver on a Local Machine but access a remote Cluster's resources
I would not recommend this for a few reasons, the biggest being that all of your job management will still be handled between your remote machine and the executors in the cluster. This would be equivalent of having a Hadoop Job Tracker running in a different cluster than the rest of the Hadoop distribution.
To accomplish this though you need to run a spark submit with a specific master uri. Additionally you would need to specify a Cassandra node via spark.cassandra.connection.host
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host aCassandraNode --flags jar
It is important that you keep the jar LAST. All arguments after the jar are interpreted as arguments for the application and not spark-submit parameters.
Run Spark Submit on a local Machine but have the Driver run in the Cluster (Cluster Mode)
Cluster mode means your local machine sends the jar and environment string over to the Spark Master. The Spark Master then chooses a worker to actually run the driver and the driver is started as a separate JVM by the worker. This is triggered using the --deploy-mode cluster flag. In addition to specifying the Master and Cassandra Connection Host.
dse spark-submit --master spark://sparkmasterip:7077 --deploy-mode cluster --conf spark.cassandra.connection.host aCassandraNode --flags jar
Run the Spark Driver in Local Mode
Finally there exists a Local mode for Spark which starts the entire Spark Framework in a single JVM. This is mainly used for testing. Local mode is activated by passing `--master local``
For more information check out the Spark Documentation on submitting applications
http://spark.apache.org/docs/latest/submitting-applications.html

Spark submit does automatically upload the jar to cluster?

I'm trying to submit a Spark app from local machine Terminal to my Cluster.
I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
When I provide the path to application jar which is in my local machine, would spark-submit automatically upload it to my Cluster?
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In Documentation ,http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
Advanced Dependency Management When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But seems like it does not !
I see you are quoting the spark-submit page from Spark Docs but I would spend a lot more time on the Running Spark on YARN page. Bottom-line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR, if there's a need to add other JARs there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background, in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes you do need to make sure every worker node has access to all the dependencies, Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.
Try adding --jars option before your /path/to/jar/file
spark-submit --jars /tmp/test.jar

What conditions should cluster deploy mode be used instead of client?

The doc https://spark.apache.org/docs/1.1.0/submitting-applications.html
describes deploy-mode as :
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Using this diagram fig1 as a guide (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) :
If I kick off a Spark job :
./bin/spark-submit \
--class com.driver \
--master spark://MY_MASTER:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/Driver.jar
Then the Driver Program will be MY_MASTER as specified in fig1 MY_MASTER
If instead I use --deploy-mode cluster then the Driver Program will be shared among the Worker Nodes ? If this is true then does this mean that the Driver Program box in fig1 can be dropped (as it is no longer utilized) as the SparkContext will also be shared among the worker nodes ?
What conditions should cluster be used instead of client ?
No, when deploy-mode is client, the Driver Program is not necessarily the master node. You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
On the contrary, when deploy-mode is cluster, then cluster manager (master node) is used to find a slave having enough available resources to execute the Driver Program. As a result, the Driver Program would run on one of the slave nodes. As its execution is delegated, you can not get the result from Driver Program, it must store its results in a file, database, etc.
Client mode
Want to get a job result (dynamic analysis)
Easier for developing/debugging
Control where your Driver Program is running
Always up application: expose your Spark job launcher as REST service or a Web UI
Cluster mode
Easier for resource allocation (let the master decide): Fire and forget
Monitor your Driver Program from Master Web UI like other workers
Stop at the end: one job is finished, allocated resources are freed
I think this may help you understand.In the document https://spark.apache.org/docs/latest/submitting-applications.html
It says " A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters or Python applications."
What about HADR?
In cluster mode, YARN restarts the driver without killing the executors.
In client mode, YARN automatically kills all executors if your driver is killed.

Resources