I'm wondering why the client, using apache-spark/sbin/start-slave.sh <master's URL>, has to indicate the master's URL, since the master already indicates it, e.g. in apache-spark/sbin/start-master.sh --master spark://my-master:7077?
Is it because the client must wait for the master to receive the submit? If so, why must the master specify --master spark://... in its submit?
start-slave.sh <master's URL> starts a standalone Worker (formerly a slave) that the standalone Master available at <master's URL> uses to offer resources to Spark applications.
The Standalone Master manages workers, and it is the workers that register themselves with the master and offer their CPU and memory for resource offering.
From Starting a Cluster Manually:
You can start a standalone master server by executing:
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
Similarly, you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh <master-spark-URL>
since the master already indicates it in : apache-spark/sbin/start-master.sh --master spark://my-master:7077
You can specify the URL of the standalone Master, e.g. spark://my-master:7077, but that URL is not announced on the network, so there is no way for a worker to know it unless it is specified on the command line.
why the master must specify --master spark://.... in its submit
It does not. The Standalone Master and submit are different "tools": the former is a cluster manager for Spark applications, while the latter is used to submit Spark applications to a cluster manager for execution (which could be any of the three supported cluster managers: Spark Standalone, Apache Mesos and Hadoop YARN).
See Submitting Applications.
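For illustration of that separation (the jar path and main class below are just placeholders), the standalone Master is started on its own, while it is spark-submit that takes the --master URL:
./sbin/start-master.sh
./bin/spark-submit --master spark://my-master:7077 --class com.example.MyApp /path/to/my-app.jar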
Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
So this setup is working pretty well for everything except the setup of the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the thrift server pod). By working well I mean that from outside my cluster I can create Spark jobs and submit them to the master, and then, using the Web UI, I can see my test app run to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh it submits a job to the cluster with itself as the driver (I believe - which is correct behavior). When I look at the related Spark job via the Web UI, I see it has workers, but they repeatedly get launched and then exit shortly thereafter. When I look at the workers' stderr logs I see that every worker launches and tries to connect back to the thrift server pod at the spark.driver.port, which I believe is also correct behavior. The gotcha is that the connection fails with an unknown host exception: it uses the raw Kubernetes pod name (not a service name, and with no IP in the name) of the thrift server pod and says it can't find the thrift server that initiated the connection. Kubernetes DNS stores service names, and pod names only when prefaced with their private IP. In other words, the raw name of the pod (without an IP) is never registered with the DNS; that is not how Kubernetes works.
So my question. I am struggling to figure out why the spark worker pod is using a raw pod name to try to find the thrift server. It seems it should never do this and that it should be impossible to ever satisfy that request. I have wondered if there is some spark config setting that would tell the workers that the (thrift) driver it needs to be searching for is actually spark-thriftserver.my-namespace.svc. But I can't find anything having done much searching.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a worker log that fails. Notice the raw pod name of the thrift server for driver-url. If anyone has any clue what steps to take to fix this please let me know. I'll edit this post and share settings etc as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker#172.17.0.6:37369"
I would like to establish 2 connections to one Spark Thrift Server, one for development and one for QA. These two connections should be passed through 2 independent queues.
To achieve this, I set the below properties from beeline when connecting to the Thrift Server.
1) mapred.job.queue.name
2) spark.yarn.queue
Connection URL: jdbc:hive2://host:port?mapred.job.queue.name=queue_name
I then executed queries from beeline with the above URL. However, I was not able to verify that the query was executed with the right queue.
Please help.
Thanks,
Sravan
You can check the YARN ResourceManager UI; every application running on top of YARN is displayed there. You can also use the yarn application -list command to verify which queue the app is assigned to.
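For example, something like this should work (the application name is a placeholder, and the exact column layout may vary between Hadoop versions); the Queue column shows the assignment:
yarn application -list -appStates RUNNING | grep my_app_name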
Do you know why the error below happens in the Spark shell when I try to access the Spark UI at master:4040?
WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set
This happens if you start the Spark shell with YARN.
spark-shell --master yarn
In that case, YARN will start a proxy web application to increase the security of the overall system.
The URL of the proxy is displayed in the log while the Spark shell is starting.
Here is a sample from my log:
16/06/26 08:38:28 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> myhostname, PROXY_URI_BASES -> http://myhostname:8088/proxy/application_1466917030969_0003), /proxy/application_1466917030969_0003
You shouldn't access the standard Spark Web UI using port 4040 (or whatever you have configured).
Instead, I know these two options (I prefer the second one):
1) Scan the log for the proxy application URL and use that
2) Open the YARN Web UI http://localhost:8088/cluster and follow the link to the ApplicationMaster (column Tracking UI) of the running Spark application
This is also described briefly in the YARN and Spark documentation.
Spark Security documentation:
https://spark.apache.org/docs/latest/security.html
Yarn Web Application Proxy documentation:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
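For the first option, assuming you captured the shell output in a file (the file name here is just an illustration), you could pull the proxy URL out of the log like this:
grep PROXY_URI_BASES spark-shell.log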
spark-submit allows us to pass in Kerberos credentials via the --keytab and --principal options. If I try to add these via addSparkArg("--keytab", keytab), I get a '--keytab' does not expect a value error - I presume this is due to lack of support as of v1.6.0.
Is there another way by which I can submit my Spark job using this SparkLauncher class, with Kerberos credentials ? - I'm using Yarn with Secured HDFS.
--principal arg is described as "Principal to be used to login to KDC, while running on secure HDFS".
So it is specific to Hadoop integration. I'm not sure you are aware of that, because your post does not mention either Hadoop, YARN or HDFS.
Now, Spark properties that are Hadoop-specific are described on the manual page Running on YARN. Surprise! Some of these properties sound familiar, like spark.yarn.principal and spark.yarn.keytab
Bottom line: the --blahblah command-line arguments are just shortcuts to properties that you can otherwise set in your code, or in the "spark-defaults" conf file.
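As a sketch of that idea (the jar path, main class, principal and keytab below are made-up placeholders), you could set the equivalent properties directly on the SparkLauncher instead of passing the flags:
import org.apache.spark.launcher.SparkLauncher

// Sketch only: all paths and names below are hypothetical placeholders.
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")                 // your application jar
  .setMainClass("com.example.MyApp")                     // your main class
  .setMaster("yarn-cluster")
  .setConf("spark.yarn.principal", "user@EXAMPLE.COM")   // same effect as --principal
  .setConf("spark.yarn.keytab", "/path/to/user.keytab")  // same effect as --keytab
  .launch()

process.waitFor()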
Since Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:
You could use SparkLauncher.addSparkArg("--proxy-user", userName) to send in proxy user info.
You could use SparkLauncher.addSparkArg("--principal", kerbPrincipal) and SparkLauncher.addSparkArg("--keytab", kerbKeytab)
So, you can only use either (a) OR (b) but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
In other words, either the launched process triggers a Spark job on YARN as itself, using its Kerberos credentials, or it impersonates an end user to trigger the Spark job on a cluster without Kerberos. On YARN, in the former case the job is owned by the launching process itself, while in the latter case the job is owned by the proxied user.
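A rough sketch of option (b) for reference (the jar path, class, principal and keytab are placeholders I made up):
import org.apache.spark.launcher.SparkLauncher

// Sketch only: all names and paths below are hypothetical.
val launcher = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("yarn-cluster")
  .addSparkArg("--principal", "user@EXAMPLE.COM")
  .addSparkArg("--keytab", "/path/to/user.keytab")
// Option (a) would instead use .addSparkArg("--proxy-user", "someUser"),
// but per the pull request above you cannot combine it with (b).

launcher.launch().waitFor()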
What's the behavior when a partition is sent to a node and the node crashes right before executing a job? If a new node is introduced into the cluster, what's the entity that detects the addition of this new machine? Does the new machine get assigned the partition that didn't get processed?
The master considers a worker to have failed if it does not receive a heartbeat message from it for 60 seconds (controlled by spark.worker.timeout). In that case the partition is assigned to another worker (remember that a partitioned RDD can be reconstructed even if it is lost).
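If you need a different timeout, one way (a sketch; double-check the standalone documentation for your Spark version) is to set the property for the master via SPARK_MASTER_OPTS in conf/spark-env.sh:
export SPARK_MASTER_OPTS="-Dspark.worker.timeout=120"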
As for the question of whether a new node introduced into the cluster is detected: the spark-master will not detect a new node added to the cluster once the slaves are started. Before the application is submitted to the cluster, sbin/start-master.sh starts the master, and sbin/start-slaves.sh reads the conf/slaves file (which contains the IP addresses of all slaves) on the spark-master machine and starts a slave instance on each machine listed. The spark-master will not re-read this configuration file after being started, so it is not possible to add a new node this way once all slaves are started.
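As an illustration (the hostnames are made up), conf/slaves is just a list of worker hosts, one per line, that sbin/start-slaves.sh loops over:
# conf/slaves on the spark-master machine, one worker host per line
worker-1.example.com
worker-2.example.com
worker-3.example.com
After editing it, running sbin/start-slaves.sh from the master starts a slave instance on each listed host.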