How to assign Spark Thrift server connection to queue - apache-spark

I would like to establish two connections to one Spark Thrift Server, one each for development and QA. These two connections should go through two independent YARN queues.
To achieve this, I set the properties below from beeline when connecting to the Thrift Server.
1) mapred.job.queue.name
2) spark.yarn.queue
Connection URL: jdbc:hive2://host:port?mapred.job.queue.name=queue_name
I then executed queries from beeline using the above URL. However, I was not able to verify that the queries were executed in the right queue.
Please help.
Thanks,
Sravan

You can check the YARN ResourceManager UI: every application running on top of YARN is displayed there. You can also use the yarn application -list command to verify which queue the application is assigned to.
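For example (the application ID below is a placeholder):
# list running YARN apps; the output includes a Queue column
yarn application -list -appStates RUNNING
# or inspect a single application once you know its ID
yarn application -status application_1234567890123_0001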

Related

On Kubernetes my Spark worker pod is trying to access the Thrift pod by name

Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
This setup works pretty well for everything except the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the Thrift Server pod). By "working well" I mean that from outside the cluster I can create Spark jobs and submit them to the master, and then, using the Web UI, I can see my test app run to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh, it submits a job to the cluster with itself as the driver (I believe - which is correct behavior). When I look at the related Spark job via the Web UI, I see it has workers, and they repeatedly get hatched and then exit shortly thereafter. When I look at the workers' stderr logs, I see that every worker launches and tries to connect back to the thrift server pod at spark.driver.port. This is correct behavior, I believe. The gotcha is that the connection fails with an unknown host exception: the worker uses the raw Kubernetes pod name (not a service name, and with no IP in the name) of the thrift server pod when it reports that it can't find the thrift server that initiated the connection. Kubernetes DNS stores service names, and pod names only as prefixed with their private IP. In other words, the raw name of the pod (without an IP) is never registered with the DNS. That is not how Kubernetes works.
So, my question: I am struggling to figure out why the Spark worker pod is using a raw pod name to try to find the thrift server. It seems it should never do this, and that it should be impossible to ever satisfy that request. I have wondered if there is some Spark config setting that would tell the workers that the (thrift) driver they need to be searching for is actually spark-thriftserver.my-namespace.svc, but I can't find anything, having done much searching.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a failing worker's log. Notice the raw pod name of the thrift server in --driver-url. If anyone has any clue what steps to take to fix this, please let me know. I'll edit this post and share settings etc. as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker@172.17.0.6:37369"
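One thing worth trying (untested, and the service name below is the one guessed at in the question) is to make the driver advertise that Kubernetes service name instead of the raw pod hostname when launching the Thrift Server:
# Assumption: a Service named spark-thriftserver exists in my-namespace.
# spark.driver.host sets the hostname the driver advertises to executors;
# spark.driver.bindAddress lets the driver still bind locally inside the pod.
./sbin/start-thriftserver.sh \
  --conf spark.driver.host=spark-thriftserver.my-namespace.svc \
  --conf spark.driver.bindAddress=0.0.0.0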

Why does start-slave.sh require master URL?

I'm wondering why the client, using apache-spark/sbin/start-slave.sh <master's URL>, has to indicate the master's URL, since the master already indicates it in, e.g., apache-spark/sbin/start-master.sh --master spark://my-master:7077?
Is it because the client must wait for the master to receive the submit sent by the master? If yes: then why must the master specify --master spark://... in its submit?
start-slave.sh <master's URL> starts a standalone Worker (formerly a slave) that the standalone Master available at <master's URL> uses to offer resources to Spark applications.
The standalone Master manages workers, and it is the workers' responsibility to register themselves with a master and give CPU and memory for resource offering.
From Starting a Cluster Manually:
You can start a standalone master server by executing:
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
Similarly, you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh <master-spark-URL>
since the master already indicates it in : apache-spark/sbin/start-master.sh --master spark://my-master:7077
You can specify the URL of the standalone Master, which defaults to spark://my-master:7077, but that URL is not announced on the network, so no one can discover it (unless it is specified on the command line).
why the master must specify --master spark://.... in its submit
It does not. Standalone Master and submit are different "tools", i.e. the former is a cluster manager for Spark applications while the latter is to submit Spark applications to a cluster manager for execution (that could be on any of the three supported cluster managers: Spark Standalone, Apache Mesos and Hadoop YARN).
See Submitting Applications.
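A minimal end-to-end illustration of the two roles (the host name and application jar are placeholders):
# start the cluster manager (the Master) on one machine
./sbin/start-master.sh
# start a Worker on each machine, pointing it at the Master's URL
./sbin/start-slave.sh spark://my-master:7077
# separately, submit an application to that cluster manager
./bin/spark-submit --master spark://my-master:7077 my-app.jar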

Connecting to both master and slave in a replicated Redis cluster

I'm setting up a simple 1 master - N slaves Redis cluster (low write rate, high read count). How to set this up is well documented on the Redis website; however, there is no information (or I missed it) about how the clients (Node.js servers in my case) handle the cluster. Do my servers need to have two Redis connections open: one to the master (for writes) and one to a slave load-balancer (for reads)? Does the Redis driver handle this automatically and send reads to slaves and writes to the master?
The only approach I found was using the thunk-redis library. This library supports connecting to a Redis master-slave setup without having a cluster configured or using a sentinel.
You simply add multiple IP addresses to the client:
const redis = require('thunk-redis');
// connect to both the master and the slave (addresses from the question)
const client = redis.createClient(['127.0.0.1:6379', '127.0.0.1:6380'], {onlyMaster: false});
You don't need to connect to a particular instance; every instance in a Redis Cluster has information about the whole cluster. So even if you connect to just one master, your client can work with any instance in the cluster. If you try to update a key stored on a different master (other than the one you connected to), the Redis client takes care of it by following the redirection provided by the server.
To answer your second question: you can enable reads from a slave with the READONLY command.
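For example, in a redis-cli session against a cluster replica (the port and key are placeholders):
$ redis-cli -p 6380          # connect to a replica
127.0.0.1:6380> READONLY     # allow this connection to serve reads from the replica
OK
127.0.0.1:6380> GET some_key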

Passing in Kerberos keytab/principal via SparkLauncher

spark-submit allows us to pass in Kerberos credentials via the --keytab and --principal options. If I try to add these via addSparkArg("--keytab", keytab), it fails with a '--keytab' does not expect a value error - I presume this is due to lack of support as of v1.6.0.
Is there another way by which I can submit my Spark job using this SparkLauncher class, with Kerberos credentials? I'm using YARN with secured HDFS.
--principal arg is described as "Principal to be used to login to KDC, while running on secure HDFS".
So it is specific to Hadoop integration. I'm not sure you are aware of that, because your post does not mention either Hadoop, YARN or HDFS.
Now, Spark properties that are Hadoop-specific are described on the manual page Running on YARN. Surprise! Some of these properties sound familiar, like spark.yarn.principal and spark.yarn.keytab.
Bottom line: the --blahblah command-line arguments are just shortcuts to properties that you can otherwise set in your code, or in the "spark-defaults" conf file.
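For example, these two invocations should be equivalent (the principal and keytab path are placeholders):
# the command-line shortcuts
spark-submit --master yarn --principal user@EXAMPLE.COM --keytab /path/to/user.keytab app.jar
# the underlying properties set directly
spark-submit --master yarn \
  --conf spark.yarn.principal=user@EXAMPLE.COM \
  --conf spark.yarn.keytab=/path/to/user.keytab \
  app.jar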
Since Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:
You could use SparkLauncher.addSparkArg("--proxy-user", userName) to send in proxy user info.
You could use SparkLauncher.addSparkArg("--principal", kerbPrincipal) and SparkLauncher.addSparkArg("--keytab", kerbKeytab)
So, you can only use either (a) OR (b) but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
In other words, either the launched process triggers a Spark job on YARN as itself, using its Kerberos credentials, or the launched process impersonates an end user to trigger the Spark job on a cluster without Kerberos. On YARN, in the former case the job is owned by the launching process itself, while in the latter case the job is owned by the proxied user.
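In spark-submit terms, the two mutually exclusive modes look like this (user names and paths are placeholders):
# (a) impersonate an end user (no keytab involved)
spark-submit --master yarn --proxy-user alice app.jar
# (b) authenticate as yourself with Kerberos credentials
spark-submit --master yarn --principal svc@EXAMPLE.COM --keytab /path/to/svc.keytab app.jar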

Refresh metadata of a Cassandra cluster

I added nodes to a cluster which initially used the wrong network interface as listen_address. I fixed it by changing the listen_address to the correct IP. The cluster is running well with that configuration, but clients trying to connect to the cluster still receive the wrong IPs as metadata from the cluster. Is there any way to refresh the metadata of a cluster without decommissioning the nodes and setting up new ones again?
First of all, you may try to follow this advice: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_gossip_purge.html
You will need to restart the entire cluster on a rolling basis, one node at a time.
If this does not work, try this on each node:
USE system;
SELECT * FROM peers;
Then delete the bad records from the peers table and restart the node; then go to the next node and do it again.
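A sketch of that cleanup from the shell (the node name and stale IP are placeholders; double-check which rows are wrong before deleting):
# see which addresses clients are handed (run against each node)
cqlsh node1 -e "SELECT peer, rpc_address FROM system.peers;"
# delete the stale row for the old address, then restart this node
cqlsh node1 -e "DELETE FROM system.peers WHERE peer = '10.0.0.99';"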
