Understanding spark --master - apache-spark

I have a simple Spark app that reads the master from a config file:
new SparkConf()
.setMaster(config.getString(SPARK_MASTER))
.setAppName(config.getString(SPARK_APPNAME))
What will happen when I run my app as follows:
spark-submit --class <main class> --master yarn <my jar>
Is my master going to be overwritten?
I prefer having the master provided in the standard way so I don't need to maintain it in my configuration, but then the question is: how can I run this job directly from IDEA? This isn't an application argument but a spark-submit argument.
Just for clarification my desired end product should:
when run in cluster using --master yarn, will use this configuration
when run from IDEA will run with local[*]

Do not set the master in your code.
In production you can use the --master option of spark-submit, which tells Spark which master to use (yarn in your case). The value of spark.master in the spark-defaults.conf file will also do the job (--master takes priority over the property in the configuration file).
In an IDE... well, I know that in Eclipse you can pass a VM argument in the Run Configuration, e.g. -Dspark.master=local[*] (https://stackoverflow.com/a/24481688/1314742).
In IDEA it is not much different: you can add the same VM option under Run Configuration.
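Putting that together, a sketch of both modes (the class name, jar name, and paths below are placeholders, not from the original post):

```shell
# In code, set only the app name and leave the master out, e.g.
#   new SparkConf().setAppName(config.getString(SPARK_APPNAME))

# Production: spark-submit supplies the master.
spark-submit --class com.example.Main --master yarn my-app.jar

# IDEA: add this under Run Configuration -> VM options; SparkConf picks up
# the spark.master system property, so the same code runs locally:
# -Dspark.master=local[*]
```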

Related

What is the command to call Spark2 from Shell

I have two Spark services in my cluster: one named Spark (version 1.6) and another named Spark2 (version 2.0). I am able to call Spark with the command below.
spark-shell --master yarn
But I am not able to connect to the Spark2 service, even after setting export SPARK_MAJOR_VERSION=2.
Can someone help me with this?
I'm using a CDH cluster, and the following command works for me.
spark2-shell --queue <queue-name-if-any> --deploy-mode client
If I remember correctly, SPARK_MAJOR_VERSION only works with spark-submit.
You would need to find the Spark2 installation directory to use its spark-shell.
It sounds like you are on an HDP cluster, so look under /usr/hdp.
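For example, on an HDP cluster the Spark2 client is usually symlinked under /usr/hdp/current; the exact layout depends on your HDP version, so treat these paths as a sketch:

```shell
# Locate the Spark2 client installation.
ls /usr/hdp/current/spark2-client/bin/

# Use the spark-shell from that directory to get a Spark 2.x shell on YARN.
/usr/hdp/current/spark2-client/bin/spark-shell --master yarn
```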

Spark Standalone how to pass local .jar file to cluster

I have a cluster with two workers and one master.
To start the master and workers I use sbin/start-master.sh and sbin/start-slaves.sh on the master's machine. The master UI then shows me that the slaves are ALIVE (so everything is OK so far). The issue comes when I want to use spark-submit.
I execute this command in my local machine:
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar
But the following error pops up: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar
I have been doing some research on Stack Overflow and in Spark's documentation, and it seems I should specify the application-jar argument of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as https://spark.apache.org/docs/latest/submitting-applications.html indicates).
My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, Spark Standalone cluster cannot read the files in local filesystem, but the solutions do not work for me.
Also, am I doing something wrong by initialising the cluster on my master's machine using sbin/start-master.sh but then running spark-submit from my local machine? I initialise the master from the master's terminal because I read to do so in Spark's documentation, but maybe this has something to do with the issue. From Spark's documentation:
Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Thank you very much
EDIT:
I have copied the .jar file to every worker and it works. But my point is to find out whether there is a better way, since this method makes me copy the .jar to each worker every time I create a new jar. (This was one of the answers to the question at the already posted link, Spark Standalone cluster cannot read the files in local filesystem.)
@meisan, your spark-submit command is missing two things:
your dependency jars, which should be added with the --jars flag
the file holding your driver code, i.e. the main function.
You have not specified whether you are using Scala or Python, but in a nutshell your command will look something like:
for python :
spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>
for scala:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars>
Also, Spark takes care of sending the required files and jars to the executors when you use the documented flags.
If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.
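As a concrete sketch for the Scala case (the class name and dependency paths are placeholders), a submission along these lines could look like:

```shell
spark-submit \
  --master spark://<master-ip>:7077 \
  --deploy-mode cluster \
  --class com.example.Main \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  /home/user/example.jar
```

Keep in mind that in cluster deploy mode the application jar still has to be reachable from the worker that ends up running the driver, which is why an hdfs:// URL (or copying the jar to every node) is needed on a standalone cluster.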

Oozie spark action Log4j configuration

I am working with Oozie, using a Spark action on a Hortonworks 2.5 cluster. I have configured this job in YARN client mode, with master=yarn and mode=client.
My log4j configuration is shown below.
log4j.appender.RollingAppender.File=/opt/appName/logs/appNameInfo.log
log4j.appender.debugFileAppender.File=/opt/appName/logs/appNameDebug.log
log4j.appender.errorFileAppender.File=/opt/appName/logs/appNameError.log
The expectation is that once we trigger the Oozie job, we should be able to see the application logs at the above locations as Info, Debug, and Error respectively.
Below is my spark-opts tag in my workflow.xml
<spark-opts>--driver-memory 4G --executor-memory 4G --num-executors 6 --executor-cores 3 --files /tmp/logs/appName/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties' --conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties'</spark-opts>
Once I trigger the Oozie coordinator, I am not able to see my application logs in /opt/appName/logs/ as configured in log4j.properties.
The same configuration works with a plain spark-submit when I run it from the node where /tmp/logs/appName/log4j.properties is available. The job is not able to write to the location configured in the log4j.properties file.
Should this log4j.properties file be in HDFS? If so, how do I provide it in spark-opts? Would it be an hdfs:// path?
Can someone please look into the issue?
Copy this log4j.properties into the oozie.sharelib.path (in HDFS), and Spark should then be able to copy it into the final YARN container.
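One way to avoid depending on a local path existing on every node is to put the properties file in HDFS, ship it with --files, and point the JVM options at the local copy that YARN places in each container's working directory. The HDFS path below is illustrative, not from the original post, and in client mode the driver side may still need a locally reachable copy:

```
<spark-opts>--files hdfs:///user/oozie/apps/appName/log4j.properties
  --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties</spark-opts>
```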

execute Spark jobs, with Livy, using `--master yarn-cluster` without making systemwide changes

I'd like to execute a Spark job, via an HTTP call from outside the cluster using Livy, where the Spark jar already exists in HDFS.
I'm able to spark-submit the job from shell on the cluster nodes, e.g.:
spark-submit --class io.woolford.Main --master yarn-cluster hdfs://hadoop01:8020/path/to/spark-job.jar
Note that the --master yarn-cluster is necessary to access HDFS where the jar resides.
I'm also able to submit commands, via Livy, using curl. For example, this request:
curl -X POST --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main"}' -H "Content-Type: application/json" hadoop01:8998/batches
... executes the following command on the cluster:
spark-submit --class io.woolford.Main hdfs://hadoop01:8020/path/to/spark-job.jar
This is the same as the command that works, minus the --master yarn-cluster parameter. This was verified by tailing /var/log/livy/livy-livy-server.out.
So, I just need to modify the curl command to include --master yarn-cluster when it's executed by Livy. At first glance, it seems like this should be possible by adding arguments to the JSON dictionary. Unfortunately, these aren't passed through.
Does anyone know how to pass --master yarn-cluster to Livy so that jobs are executed on YARN without making systemwide changes?
I recently tried something similar to your question: I needed to send an HTTP request to Livy's API, where Livy was already installed on a cluster (YARN), and then have Livy start a Spark job.
My command to call Livy did not include --master yarn-cluster, but that seems to work for me. Maybe you can try to put your JAR file locally instead of in the cluster?
Set spark.master = yarn-cluster in the Spark configuration; for me that is /etc/spark2/conf/spark-defaults.conf.
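Livy's batch API also accepts a conf map in the JSON body, so one option worth trying (assuming your Livy version supports the conf field) is to pass the master per-request instead of system-wide:

```shell
curl -X POST \
  -H "Content-Type: application/json" \
  --data '{"file": "/path/to/spark-job.jar",
           "className": "io.woolford.Main",
           "conf": {"spark.master": "yarn-cluster"}}' \
  hadoop01:8998/batches
```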

Spark Mesos Cluster Mode using Dispatcher

I have only a single machine and want to run spark jobs with mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out mesos first to check if it's able to utilize resources more efficiently (run multiple spark jobs at the same time without static partitioning). I have tried a number of ways but without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (two slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give other instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn one or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the --deploy-mode command-line argument of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or, if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario, and it works.
One thing I did differently: I used the IP address instead of "localhost" and "127.0.0.1".
So just try again, and check http://your_dispatcher:8081 (in a browser) to see if it exists.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151006164749-0001",
  "success" : true
}
When I got an error log like yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. The newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
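Following that suggestion, the earlier submit command would point at the REST port instead. The IP and jar name are taken from the answer above; 6066 is the default REST submission port, so adjust it if your dispatcher logs show a different one:

```shell
spark-submit --deploy-mode cluster \
  --master mesos://192.168.11.79:6066 \
  --class "SimpleApp" SimpleAppV2.jar
```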
