How to submit a Spark job locally and connect to a Cassandra cluster - apache-spark

Can anyone please let me know how to submit a Spark job from my local machine and connect to a Cassandra cluster?
Currently I submit the Spark job after logging in to a Cassandra node through PuTTY and running the dse spark-submit command below.
Command:
dse spark-submit --class ***** --total-executor-cores 6 --executor-memory 2G **/**/**.jar --config-file build/job.conf --args
With the above command, my Spark job is able to connect to the cluster and execute, but I sometimes run into issues.
So I want to submit the Spark job from my local machine. Can anyone please guide me on how to do this?

There are several things you could mean by "run my job locally".
Here are a few of my interpretations.
Run the Spark Driver on a Local Machine but access a remote Cluster's resources
I would not recommend this for a few reasons, the biggest being that all of your job management will still be handled between your remote machine and the executors in the cluster. This would be the equivalent of having a Hadoop JobTracker running in a different cluster than the rest of the Hadoop distribution.
To accomplish this, though, you need to run spark-submit with a specific master URI. Additionally, you need to specify a Cassandra node via spark.cassandra.connection.host:
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host=aCassandraNode --flags jar
It is important that you keep the jar LAST. All arguments after the jar are interpreted as arguments for the application and not spark-submit parameters.
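Putting that together with the original command, a full submission might look roughly like the following sketch (the class name and jar path are placeholders standing in for the masked values above; everything after the jar is passed to the application):
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host=aCassandraNode --class com.example.MyJob --total-executor-cores 6 --executor-memory 2G /path/to/my-job.jar --config-file build/job.conf --args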
Run Spark Submit on a local Machine but have the Driver run in the Cluster (Cluster Mode)
Cluster mode means your local machine sends the jar and environment string over to the Spark Master. The Spark Master then chooses a worker to actually run the driver, and the driver is started as a separate JVM by that worker. This is triggered using the --deploy-mode cluster flag, in addition to specifying the master and the Cassandra connection host:
dse spark-submit --master spark://sparkmasterip:7077 --deploy-mode cluster --conf spark.cassandra.connection.host=aCassandraNode --flags jar
Run the Spark Driver in Local Mode
Finally, there is a local mode for Spark which starts the entire Spark framework in a single JVM. This is mainly used for testing. Local mode is activated by passing --master local.
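For example, a quick local-mode test run might look like this (the class name and jar path are placeholders; local[*] uses all available cores):
spark-submit --master local[*] --class com.example.MyJob /path/to/my-job.jar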
For more information check out the Spark Documentation on submitting applications
http://spark.apache.org/docs/latest/submitting-applications.html

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 Google pages on this, but all of it tells me to just put --master yarn in the spark-submit command. But in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running Dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is the same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
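For example (assuming a cluster named my-cluster, whose master VM follows the usual my-cluster-m naming, and a placeholder class and jar):
# from your laptop, open a shell on the Dataproc master node
gcloud compute ssh my-cluster-m
# then, on the master node itself
spark-submit --master yarn --class com.example.MyApp my-app.jar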
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.
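A sketch of such a submission with the gcloud CLI (the cluster name, region, class and jar location are placeholders; the deploy-mode property is the one mentioned above):
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 --class=com.example.MyApp --jars=gs://my-bucket/my-app.jar --properties=spark.submit.deployMode=cluster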

How to configure a specific node as the driver node with spark-submit in cluster deploy mode and standalone mode

I have two problems with spark-submit in cluster deploy mode and standalone mode:
How to specify a node as the driver node in a Spark cluster?
In my case, the driver node is assigned dynamically by Spark.
How to distribute the application automatically from my local machine?
In my case, I must deploy the application jar to every node, because I don't know which node will be the driver node.
PS: My submit command is:
spark-submit --master spark://master_ip:6066 --class appMainClass --deploy-mode cluster file:///tmp/spark_app/sparkrun
The --deploy-mode flag determines if the job will be submitted in cluster or client mode.
In cluster mode, all the nodes will act as executors. One node will submit the JAR, and then you can track the execution using the web UI. That particular node will also act as an executor.
In client mode, the node where spark-submit is invoked will act as the driver. Note that this node will not execute the DAG, as it is designated as the driver for your cluster. All the other nodes will be executors. Again, the web UI will help you see the execution of jobs and other useful information like RDD partitions, cached RDD sizes, etc.
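For comparison with the command above, a client-mode submission would look something like the following sketch (assuming the standalone master also accepts submissions on the default 7077 port); here the node you run it from hosts the driver:
spark-submit --master spark://master_ip:7077 --deploy-mode client --class appMainClass file:///tmp/spark_app/sparkrun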

Specify spark driver for spark-submit

I'm submitting a Spark job from a shell script that has a bunch of env vars and parameters to pass to Spark. Strangely, the driver host is not one of these parameters (there are driver cores and memory, however). So if I have 3 machines in the cluster, a driver will be chosen randomly. I don't want this behaviour since 1) the jar I'm submitting is only on one of the machines and 2) the driver machine should often be smaller than the other machines, which is not guaranteed if the choice is random.
So far, I have found no way to specify this parameter on the command line to spark-submit. I've tried --conf SPARK_DRIVER_HOST="172.30.1.123", --conf spark.driver.host="172.30.1.123" and many other things, but nothing has any effect. I'm using Spark 2.1.0. Thanks.
I assume you are running with a YARN cluster. In brief, YARN uses containers to launch and execute tasks, and the ResourceManager decides where to run each container based on the availability of resources. In Spark's case, the driver and executors are also launched as containers with separate JVMs. The driver is dedicated to splitting tasks among executors and collecting the results from them. If the node from which you launch your application is included in the cluster, it will also be used as a shared resource for launching the driver/executors.
From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
When running the cluster in standalone mode or on Mesos, the driver host (this is the master) can be specified with:
--master <master-url> #e.g. spark://23.195.26.187:7077
When using YARN it works a little differently. Here the parameter is yarn:
--master yarn
The value yarn refers to the ResourceManager specified in Hadoop's configuration. For how to do this, see this interesting guide: https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/. Basically, HDFS is configured in hdfs-site.xml and YARN in yarn-site.xml.
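If the goal is to pin the driver to a particular machine, one option (just a sketch, with placeholder class and jar names) is to run spark-submit in client mode from that machine, since in client mode the driver runs inside the spark-submit process on the submitting host:
spark-submit --master yarn --deploy-mode client --class com.example.MyApp /path/to/my-app.jar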

Running spark job not shown in the UI

I have submitted my Spark job as mentioned here: bin/spark-submit --class DataSet BasicSparkJob-assembly-1.0.jar, without specifying the --master parameter or the spark.master parameter. Instead, the job gets submitted to my 3-node Spark cluster. But I was wondering where it submitted the job, because it is not showing any information under Running Applications.
If you do not set the master via --master or spark.master, Spark will run locally.
You can still view the progress of your job. By default, the UI will be available while your Spark job is running at http://localhost:4040.
When your job finishes, this UI is killed, and you cannot view the history of your application unless you have configured the Spark history server.
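A minimal history server setup might look like the following sketch (the HDFS log directory is just an example path):
# in conf/spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-logs
spark.history.fs.logDirectory hdfs:///spark-logs
# then start the history server; its UI is served on port 18080 by default
./sbin/start-history-server.sh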
It's likely that Spark is running your job in local mode on your development machine.

Apache Spark: JAR file not shipped on spark-submit

Is it normal that Spark won't ship the JAR file (containing the Spark application) automatically from master to slave? In earlier versions (used on Amazon Web Services) it worked! Did this functionality change since version 1.2.2, or is the problem caused by clusters without public DNS addresses? Or does this "copy the jar automatically" function only work in an AWS cluster?
Here is my submit call:
./spark-submit --class prototype.Test --master spark://192.168.178.128:7077 --deploy-mode cluster ~/test.jar
Info: the files listed via the --jars parameter are "copied" to the workers.
That was my own fault! -> Don't use the --deploy-mode parameter for a standard cluster setup where the driver process is supposed to run on the master node.
See Spark documentation: https://spark.apache.org/docs/latest/submitting-applications.html
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client) [...]
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
[...]
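In other words, for the setup above, leaving out --deploy-mode (so it defaults to client) keeps the driver in the spark-submit process on the submitting machine:
./spark-submit --class prototype.Test --master spark://192.168.178.128:7077 ~/test.jar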
