What is the command to call Spark2 from the shell?

I have two Spark services in my cluster: one named Spark (version 1.6) and the other named Spark2 (version 2.0). I am able to start the Spark 1.6 shell with the command below.
spark-shell --master yarn
But I am not able to connect to the Spark2 service, even after setting "export SPARK_MAJOR_VERSION=2".
Can someone help me with this?

I'm using a CDH cluster, and the following command works for me:
spark2-shell --queue <queue-name-if-any> --deploy-mode client

If I remember correctly, SPARK_MAJOR_VERSION only works with spark-submit.
You would need to find the Spark 2 installation directory to use its spark-shell.
It sounds like you are on an HDP cluster, so look under /usr/hdp.
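For example, on HDP the Spark 2 client typically lives under a path like the one below; the exact layout depends on your HDP version, so treat this as a sketch rather than a guaranteed path.
# list the Spark 2 client shipped with HDP (path may vary by version)
ls /usr/hdp/current/spark2-client/bin/
# start the Spark 2 shell directly from that installation
/usr/hdp/current/spark2-client/bin/spark-shell --master yarn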

Related

Spark Standalone: how to pass a local .jar file to the cluster

I have a cluster with two workers and one master.
To start the master and workers I use sbin/start-master.sh and sbin/start-slaves.sh on the master's machine. The master UI then shows the slaves as ALIVE (so everything is OK so far). The issue comes when I want to use spark-submit.
I execute this command on my local machine:
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar
But the following error pops up: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar
I have been doing some research on Stack Overflow and in Spark's documentation, and it seems I should specify the application-jar argument of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as indicated at https://spark.apache.org/docs/latest/submitting-applications.html).
My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, Spark Standalone cluster cannot read the files in local filesystem, but its solutions do not work for me.
Also, am I doing something wrong by initialising the cluster on my master's machine using sbin/start-master.sh but then running spark-submit from my local machine? I initialise the master from the master's terminal because I read to do so in Spark's documentation, but maybe this has something to do with the issue. From Spark's documentation:
Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Thank you very much
EDIT:
I have copied the .jar file to every worker and it works. But my point is to find out whether there is a better way, since this approach makes me copy the .jar to each worker every time I build a new jar. (This was one of the answers to the already linked question, Spark Standalone cluster cannot read the files in local filesystem.)
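One alternative that avoids copying the jar to every worker, in line with the "globally visible URL" note quoted from the documentation above, is to put the jar on HDFS (if HDFS is available in your setup) and pass the hdfs:// URL to spark-submit. A rough sketch, with placeholder host names and paths:
hdfs dfs -put /home/user/example.jar /user/<your-user>/jars/
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster hdfs://<namenode-host>:8020/user/<your-user>/jars/example.jar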
@meisan, your spark-submit command is missing two things:
1. your dependency jars, which should be added with the --jars flag
2. the file holding your driver code, i.e. the main function.
You have not specified whether you are using Scala or Python, but in a nutshell your command will look something like this.
For Python:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>
For Scala:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars> <application-jar>
Also, Spark takes care of sending the required files and jars to the executors when you use the documented flags.
If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.
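As a rough sketch of that environment-variable alternative (the paths are placeholders, and note that SPARK_CLASSPATH has long been deprecated in favor of --driver-class-path / spark.driver.extraClassPath, so prefer the flags where possible):
export SPARK_CLASSPATH=/path/to/dependency/jars/*
spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> <application-jar>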

Submit Spark job from local to EMR (ssh setup)

I am new to Spark. I want to submit a Spark job from my local machine to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
Here is the command:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: the SparkSession creation never finishes, and there is no error. It seems to be an access issue.
From the AWS document, we know that by giving yarn as the master, YARN uses the config files I copied from EMR (e.g. yarn-site.xml) to know where the master and slaves are.
As my EMR cluster is located in a VPC, which needs a special ssh config to access, how can I add this information to YARN so it can reach the remote cluster and submit the job?
I think the resolution proposed in the AWS link is more like: create your local Spark setup with all dependencies.
If you don't want to do a local Spark setup, I would suggest an easier way; you can use:
1. Livy: for this your EMR setup should have Livy installed. Check this, this, this and you should be able to infer from this
2. EMR ssh: this requires you to have aws-cli installed locally, the cluster id, and the pem file used while creating the EMR cluster. Check this (a fuller sketch follows the example below)
E.g. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (this prints the command output on the console, though)
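Putting option 2 together with the spark-submit command from the question, the flow might look like the sketch below; the cluster id, key file, and destination path are placeholders, and the aws emr put step assumes your aws-cli version provides that subcommand for copying the jar to the master node first.
aws emr put --cluster-id <your-cluster-id> --key-pair-file ~/.ssh/mykey.pem --src myjar.jar --dest /home/hadoop/myjar.jar
aws emr ssh --cluster-id <your-cluster-id> --key-pair-file ~/.ssh/mykey.pem --command 'spark-submit --class mymain --deploy-mode client --master yarn /home/hadoop/myjar.jar'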

execute Spark jobs, with Livy, using `--master yarn-cluster` without making systemwide changes

I'd like to execute a Spark job, via an HTTP call from outside the cluster using Livy, where the Spark jar already exists in HDFS.
I'm able to spark-submit the job from shell on the cluster nodes, e.g.:
spark-submit --class io.woolford.Main --master yarn-cluster hdfs://hadoop01:8020/path/to/spark-job.jar
Note that the --master yarn-cluster is necessary to access HDFS where the jar resides.
I'm also able to submit commands, via Livy, using curl. For example, this request:
curl -X POST --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main"}' -H "Content-Type: application/json" hadoop01:8998/batches
... executes the following command on the cluster:
spark-submit --class io.woolford.Main hdfs://hadoop01:8020/path/to/spark-job.jar
This is the same as the command that works, minus the --master yarn-cluster params. This was verified by tailing /var/log/livy/livy-livy-server.out.
So, I just need to modify the curl command to include --master yarn-cluster when it's executed by Livy. At first glance, it seems like this should be possible by adding arguments to the JSON dictionary. Unfortunately, these aren't passed through.
Does anyone know how to pass --master yarn-cluster to Livy so that jobs are executed on YARN without making systemwide changes?
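For reference, Livy's batch API also accepts a conf map of Spark properties in the request body; whether your Livy version actually passes spark.master through it unchanged is something to verify, but a sketch of such a request would be:
curl -X POST -H "Content-Type: application/json" --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main", "conf": {"spark.master": "yarn-cluster"}}' hadoop01:8998/batches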
I recently tried something similar to your question. I needed to send an HTTP request to Livy's API, while Livy was already installed on a cluster (YARN), and then I wanted Livy to start a Spark job.
My command to call Livy did not include --master yarn-cluster, but that seems to work for me. Maybe you can try to put your JAR file locally instead of in the cluster?
spark.master = yarn-cluster
Set it in the Spark conf; for me that file is /etc/spark2/conf/spark-defaults.conf.

Understanding spark --master

I have a simple Spark app that reads the master from a config file:
new SparkConf()
.setMaster(config.getString(SPARK_MASTER))
.setAppName(config.getString(SPARK_APPNAME))
What will happen when I run my app as follows?
spark-submit --class <main class> --master yarn <my jar>
Is my master going to be overwritten?
I prefer having the master provided in the standard way so I don't need to maintain it in my configuration, but then the question is: how can I run this job directly from IDEA? It isn't an argument of my application but a spark-submit argument.
Just for clarification, my desired end product should:
when run in a cluster using --master yarn, use that configuration
when run from IDEA, run with local[*]
Do not set the master in your code.
In production you could use the --master option of spark-submit, which tells Spark which master to use (yarn in your case). The value of spark.master in the spark-defaults.conf file will also do the job (priority goes to --master, then to the property in the configuration file).
In an IDE... well, I know that in Eclipse you can pass a VM argument in the Run Configuration, e.g. -Dspark.master=local[*] (https://stackoverflow.com/a/24481688/1314742).
In IDEA I think it is not too much different; you can check here how to add VM options.
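Concretely, the snippet from the question would then drop the setMaster call and rely on whatever master is supplied from the outside; a minimal sketch, keeping the config-based app name from the question:
// master now comes from spark-submit --master, spark-defaults.conf,
// or -Dspark.master=local[*] when launching from the IDE
new SparkConf()
  .setAppName(config.getString(SPARK_APPNAME))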

spark-submit in cluster mode: not able to transfer application jar to driver node

I have a 3-node cluster: nodes A, B, and C. Masters run on A and B, and slaves run on A, B, and C.
When I run spark-submit from node A using the command below:
/usr/local/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class com.test.SparkExample --deploy-mode cluster --supervise --master spark://master.com:7077 file:///home/spark/sparkstreaming-0.0.1-SNAPSHOT.jar
the driver gets launched on node B, and it tries to find the application jar on node B's local file system. Do we need to transfer the application jar to each master node manually? Is this a known bug, or am I missing something?
Kindly suggest.
Thanks
Yes, according to the official documentation (https://spark.apache.org/docs/latest/submitting-applications.html), it should be present on every node.
By the way, it is also possible to put the file in HDFS.
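For example, a rough sketch of the HDFS route, with a placeholder namenode host and target directory:
hdfs dfs -put /home/spark/sparkstreaming-0.0.1-SNAPSHOT.jar /apps/
/usr/local/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class com.test.SparkExample --deploy-mode cluster --supervise --master spark://master.com:7077 hdfs://<namenode-host>:8020/apps/sparkstreaming-0.0.1-SNAPSHOT.jar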
