How does a MasterNode fit into a Spark cluster? - apache-spark

I'm getting a little confused with how to setup my Spark configuration for workloads using YARN as the resource manager. I've got a small cluster spun up right now with 1 master node and 2 core nodes.
Do I include the master node when calculating the number of executors or no?
Do I leave out 1 core for every node to account for Yarn management?
Am I supposed to designate the master node for anything in particular in Spark configurations?

Master node shouldn't be taken into account to calculate number of executors
Each node is actually EC2 instance with operating system so you have to leave 1 or more cores for system tasks and yarn agents
Master node can be used to run spark driver. For this start EMR cluster in client mode from master node by adding arguments --master yarn --deploy-mode client to spark-submit command. Keep in mind following:
Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster master node
To do all preparation work (copy libs, scripts etc to a master node) you can setup a separate step and then run spark-submit --master yarn --deploy-mode client command as next step.

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five node cluster on my network running HDFS, Spark, and managed by Yarn. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running example jars, the job is split up to the worker nodes and executes nicely.
I have a few questions on whether and how to run python/Pyspark files against this cluster.
If I have a python file with a PySpark calls elsewhere else, like on my local dev laptop or a docker container somewhere, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? Methods that I'm wondering about involve running spark-submit in the local/docker environment and but the file has SparkSession.builder.master() configured to the remote cluster.
Related, I see a configuration for --master in spark-submit, but the only yarn option is to pass "yarn" which seems to only queue locally? Is there a way to specify remote yarn?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000, or do I submit it to one of the Yarn ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well "YARN", not "remote Spark cluster". You set --master=yarn when running with spark-submit, and this will run against the configured yarn-site.xml in HADOOP_CONF_DIR environment variable. You can define this at the OS level, or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both options are supplied, one will get overridden.
To run fully "in the cluster", also set --deploy-mode=cluster
Is there a way to specify remote yarn?
As mentioned, this is configured from yarn-site.xml for providing resourcemanager location(s).
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - The YARN resource manager has its own RPC protocol, not hdfs:// ... You can use spark.read("hdfs://namenode:port/path") to read HDFS files, though. As mentioned, .master('yarn') or --master yarn are the only configs you need that are specific for Spark.
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to setup, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.

spark standalone running on docker cleanup not running

I'm running spark on standalone mode as a docker service where I have one master node and one spark worker. I followed the spark documentation instructions:
https://spark.apache.org/docs/latest/spark-standalone.html
to add the properties where the spark cluster cleans itself and I set those in my docker_entrypoint
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=900 -Dspark.worker.cleanup.appDataTtl=900
and verify that it was enables following the logs of the worker node service
My question is do we expect to get all directories located on SPARK_WORKER_DIR directory to be cleaned ? or does it only clean the application files
Because I still see some empty directories holding there

submit spark job from local to emr ssh setup

I am new to spark. I want to submit a spark job from local to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
here is the command as below:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: sparksession creation is not able to be finished with no error. Seems an access issue.
From the aws document, we know that by given the master with yarn, yarn uses the config files I copied from EMR to know where is the master and slaves (yarn-site.xml).
As my EMR cluster is located in a VPC, which need a special ssh config to access, how could I add this info to yarn so it can access to the remote cluster and submit the job?
I think the resolution proposed in aws link is more like - create your local spark setup with all dependencies.
If you don't want to do local spark setup, I would suggest easier way would be, you can use:
1. Livy: for this you emr setup should have livy installed. Check this, this, this and you should be able to infer from this
2. EMR ssh: this requires you to have aws-cli installed locally, cluster id and pem file used while creating emr cluster. Check this
Eg. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (This prints command output on console though)

Spark Mesos Cluster Mode using Dispatcher

I have only a single machine and want to run spark jobs with mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out mesos first to check if it's able to utilize resources more efficiently (run multiple spark jobs at the same time without static partitioning). I have tried a number of ways but without success. Here is what I did:
Build mesos and run both mesos master and slaves (2 slaves in same machines).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
The submit the app with dispatcher as master url.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesnt work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I got another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It work perfectly if I don't use dispatcher but using mesos master url directly: --master mesos://localhost:5050 (client mode). According to the documentation , cluster mode is not supported for Mesos clusters, but they give another instruction for cluster mode here. So it's kind of confusing? My question is:
How I can get it works?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn 1 or more mesos slave processes. Basically, I have a number of spark job and dont want to do static partitioning of resources. But when using mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos as time of writing does not support is launching the driver into the cluster, this is what the command line argument --deploy-mode of ./bin/spark-submitspecifies. Since the default value of --deploy-mode is client you can just omit it, or if you want to explicitly specify it, then use:
./bin/spark-submit --deploy-mode client ...
I use your scenario to try, it could be work.
One thing different , I use ip address to instead of "localhost" and "127.0.0.1"
So just try again and to check http://your_dispatcher:8081 (on browser) if exist.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If success, you can see as below
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.5.0",
"submissionId" : "driver-20151006164749-0001",
"success" : true
}
When I got your error log as yours, I reboot the machine and retry your step. It also work.
Try using the 6066 port instead of 7077. The newer versions of Spark prefer the REST api for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388

spark-submit in cluster mode. not able to transfer application jar to driver node

I have 3 node cluster node A,B,C. running master on A ,B and slave on A.B and C.
while i run spark-submit from node A using below command.
/usr/local/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class com.test.SparkExample --deploy-mode cluster --supervise --master spark://master.com:7077 file:///home/spark/sparkstreaming-0.0.1-SNAPSHOT.jar
the driver gets launched to node B and it tries to find application jar at local file system at node B. do we need to transfer application jar on each master node manually. is the the known bug ? or i am missing something.
kindly suggest
Thanks
Yes, according to the official documentation https://spark.apache.org/docs/latest/submitting-applications.html it should be present on every node.
By the way, it is possible to put file in hdfs.

Resources