Spark Job running even after spark Master process is killed - apache-spark

We are working on a Spark cluster where Spark jobs are submitted successfully even after the Spark "Master" process is killed.
Here are the complete details of what we are doing.
Process details:
jps
19560 NameNode
18369 QuorumPeerMain
22414 Jps
20168 ResourceManager
22235 Master
and we submitted one Spark job to this Master using a command like
spark-1.6.1-bin-without-hadoop/bin/spark-submit --class com.test.test --master yarn-client --deploy-mode client test.jar -incomingHost hostIP
where hostIP is the correct IP address of the machine running the "Master" process.
After this we are able to see the job in the RM Web UI as well.
Now when we kill the "Master" process, we can see that the submitted job keeps running fine, which is expected here: we are using YARN mode, and that job will run without any issue.
But when we submit the same "spark-submit" command once again, pointing to the same Master IP which is currently down, we see one more job in the RM web UI (host:8088). This we are not able to understand, as the Spark "Master" is killed and the Spark UI (host:8080) also does not come up.
Please note that we are using "yarn-client" mode, as in the code below:
sparkProcess = new SparkLauncher()
    .......
    .setSparkHome(System.getenv("SPARK_HOME"))
    .setMaster("yarn-client")
    .setDeployMode("client")
Can someone please explain this behaviour? We did not find an answer after reading many blogs (http://spark.apache.org/docs/latest/running-on-yarn.html) and the official docs.
Thanks

Please check the cluster overview. As per your description, you are running your Spark application on a YARN cluster in client mode, with the driver placed on the instance where you launch the command. The Spark Master is related to Spark standalone cluster mode, in which case your launch command would be similar to
spark-submit --master spark://your-spark-master-address:port
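The distinction the answer draws can be sketched with the two submission styles side by side; this is a hedged illustration (class, jar, and host names are the question's own examples or placeholders):

```shell
# YARN mode: the standalone Master process is not involved at all. The driver
# talks to the YARN ResourceManager (found via HADOOP_CONF_DIR/YARN_CONF_DIR),
# which is why killing the "Master" JVM does not stop submissions.
spark-submit --master yarn --deploy-mode client --class com.test.test test.jar

# Standalone mode: the only mode that needs the Master process.
# 7077 is the default standalone Master port; adjust for your cluster.
spark-submit --master spark://spark-master-host:7077 --class com.test.test test.jar
```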

Related

How does a MasterNode fit into a Spark cluster?

I'm getting a little confused about how to set up my Spark configuration for workloads using YARN as the resource manager. I've got a small cluster spun up right now with 1 master node and 2 core nodes.
Do I include the master node when calculating the number of executors or no?
Do I leave out 1 core for every node to account for Yarn management?
Am I supposed to designate the master node for anything in particular in Spark configurations?
The master node shouldn't be taken into account when calculating the number of executors.
Each node is actually an EC2 instance with an operating system, so you have to leave 1 or more cores for system tasks and YARN agents.
The master node can be used to run the Spark driver. For this, start the job on the EMR cluster in client mode from the master node by adding the arguments --master yarn --deploy-mode client to the spark-submit command. Keep in mind the following:
Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster master node.
To do all the preparation work (copy libs, scripts, etc. to the master node) you can set up a separate step, and then run the spark-submit --master yarn --deploy-mode client command as the next step.
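As a rough sketch of those two steps (the jar name, class name, and host are hypothetical placeholders, not from the original):

```shell
# Hypothetical prep step: copy the application jar to the EMR master node,
# since client mode reads the application from the master's local file system.
scp my-app.jar hadoop@emr-master-node:/home/hadoop/

# Then, on the master node, submit in client mode so the driver runs there:
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp /home/hadoop/my-app.jar
```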

How to execute "spark-submit" against a remote spark master?

Suppose I've got a remote Spark cluster. I can log in to a remote Spark cluster host with ssh and run spark-submit, for example:
$SPARK_HOME/bin/spark-submit /usr/lib/spark2/examples/src/main/python/pi.py
Now I've installed Spark on my laptop, but I don't run it there.
I want to run $SPARK_HOME/bin/spark-submit on my laptop against the remote Spark cluster host. How can I do it?
Yes, you can provide the remote master URL in this command, e.g.
$SPARK_HOME/bin/spark-submit --master spark://url_to_master:7077 /usr/lib/spark2/examples/src/main/python/pi.py

Why does stopping Standalone Spark master fail with "no org.apache.spark.deploy.master.Master to stop"?

Stopping a standalone Spark master fails with the following message:
$ ./sbin/stop-master.sh
no org.apache.spark.deploy.master.Master to stop
Why? There is one Spark Standalone master up and running.
The Spark master was started under a different user, so the PID file
/tmp/Spark-ec2-user-org.apache.spark.deploy.master.Master-1.pid
was not accessible. I had to log in as the user who actually started the standalone cluster manager master.
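A quick way to check this situation before calling stop-master.sh, sketched as a small helper (the path is the one from the answer above; the function name is made up for illustration):

```shell
# Print the master PID if the given PID file is readable by the current user,
# otherwise explain why stop-master.sh would fail to find the master.
check_pid_file() {
  if [ -r "$1" ]; then
    echo "master pid: $(cat "$1")"
  else
    echo "pid file missing or not readable; re-run as the user who started the master"
  fi
}

check_pid_file /tmp/Spark-ec2-user-org.apache.spark.deploy.master.Master-1.pid
```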
In my case, I was able to open the master WebUI page in a browser, where it clearly stated that the Spark Master was running on port 7077.
However, while trying to stop it using stop-all.sh, I was facing no org.apache.spark.deploy.master.Master to stop. So I tried a different method: finding what process is running on port 7077 using the command below:
lsof -i :7077
The result was a java process with a PID of 112099.
I used the command below to kill that process:
kill 112099
After this, when I checked the WebUI, it had stopped working. The Spark Master was successfully killed.
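The find-and-kill steps above can be sketched as a tiny helper; the sample lsof line is illustrative (on a real machine, lsof -t -i :7077 prints just the PID directly):

```shell
# Extract the PID from an lsof output line (the PID is the second column).
pid_from_lsof_line() {
  echo "$1" | awk '{print $2}'
}

# Illustrative usage against a captured lsof line like the one in the answer:
sample_line='java    112099 ec2-user  269u  IPv6 0x0  0t0  TCP *:7077 (LISTEN)'
pid=$(pid_from_lsof_line "$sample_line")
echo "$pid"        # prints 112099
# kill "$pid"      # then kill that process (left commented out in this sketch)
```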

Not able to launch Spark cluster in Standalone mode with start-all.sh

I am new to Spark and I am trying to install Spark Standalone on a 3-node cluster. I have set up password-less SSH from the master to the other nodes.
I have tried the following config changes:
Updated the hostnames for the 2 worker nodes in the conf/slaves.sh file.
Created the spark-env.sh file and set SPARK_MASTER_IP to the master URL.
Also tried updating the spark.master value in the spark-defaults.conf file.
Snapshot of conf/slaves.sh
# A Spark Worker will be started on each of the machines listed below.
Spark-WorkerNode1.hadoop.com
Spark-WorkerNode2.hadoop.com
Snapshot of spark-defaults.conf
# Example:
spark.master spark://Spark-Master.hadoop.com:7077
But when I try to start the cluster by running start-all.sh on the master, it does not recognize the worker nodes and starts the cluster as local.
It does not give any error; the log files show Successfully started service 'sparkMaster' and Successfully started service 'sparkWorker' on the master.
I have tried running the start-master and start-slave scripts on the individual nodes and that seems to work fine. I can see 2 workers in the web UI. I am using Spark 1.6.0.
Can somebody please help me with what I am missing while trying to run start-all?
Snapshot of conf/slaves.sh
The file should be named slaves, without an extension.
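Sketching that fix, using the worker hostnames from the question (run from the Spark installation directory; start-all.sh reads exactly conf/slaves):

```shell
# Create conf/slaves (no .sh extension) with one worker hostname per line.
mkdir -p conf
cat > conf/slaves <<'EOF'
Spark-WorkerNode1.hadoop.com
Spark-WorkerNode2.hadoop.com
EOF

# A conf/slaves.sh file is ignored by the start scripts, so remove it
# to avoid confusion:
rm -f conf/slaves.sh
cat conf/slaves
```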

Spark Mesos Cluster Mode using Dispatcher

I have only a single machine and want to run Spark jobs with Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out Mesos first, to check whether it is able to utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways, but without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher:
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL:
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give other instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn 1 or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application on a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client you can just omit it, or if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario and it works.
One thing I did differently: I used the machine's IP address instead of "localhost" and "127.0.0.1".
So just try again, and check whether http://your_dispatcher:8081 exists (in a browser).
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like the below:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.5.0",
"submissionId" : "driver-20151006164749-0001",
"success" : true
}
When I got the same error log as yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
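A hedged sketch of that change, reusing the submit command from the earlier answer (the address and jar name are that answer's examples; whether your dispatcher actually exposes a REST endpoint on 6066 depends on your setup):

```shell
# Same submission as before, but pointing at port 6066, where newer Spark
# versions expose the REST submission server (see SPARK-5388).
spark-submit --deploy-mode cluster --master mesos://192.168.11.79:6066 \
  --class "SimpleApp" SimpleAppV2.jar
```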
