Spark on YARN via JDBC Thrift? - apache-spark

When executing queries via the Thrift interface, how do I tell it to run the queries over YARN?
I'm trying to get Spark's JDBC/ODBC Thrift interface to run Spark-SQL calls on YARN. This combination seems to be absent from the documentation. The Spark on YARN docs give a bunch of options, but doesn't describe which configuration file in which to put them so that the Thrift server will pick them up.
I see a few of the settings mentioned in spark-env.sh (cores, executer memory, etc), but I can't figure out where to tell it to use YARN in the first place.

In order to make the Thriftserver use YARN for execution, it is necessary to start the thriftserver with the "--master yarn" parameter. This parameter can be appended to sbin/start-thriftserver.sh. Appending it here passes it through to the spark-submit script, which starts the server with that executor.
There is no equivalent in a config file.

Related

Is there a way to submit spark job on different server running master

We have a requirement to schedule spark jobs, since we are familiar with apache-airflow we want to go ahead with it to create different workflows. I searched web but did not find a step by step guide to schedule spark job on airflow and option to run them on different server running master.
Answer to this will be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you have a spark-submit binary and YARN client config setup on our Airflow server. It invokes the spark-submit command with given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using SSH protocol via paramiko library) like spark-submit. The benefit of this approach is you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to have REST calls.
I personally prefer SSHOperator :)

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark Jobs using SSHOperator and running spark-submit command which would save you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task that I perform is running Spark jobs, a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers) and by default Spark is configured.
I have a pure python code that uses NLTK to build the dependency tree for each line from a text file. When I run this code on the master spark-submit run.py I get the same execution time when I run it using my machine.
How to make sure that the master is using the workers in order to reduce the execution time ?
You can check the spark UI. If its running on top of yarn, please open the yarn UI and click on your application id which will open the spark UI. Check under the executors tab it will have the node ip address also.
could you please share your spark submit config.
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do such thing, you need to add the --master parameter. For example, a valid command to execute a job in YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. Anyway, check this link for another parameter that spark-submit accept.
For a Dataproc cluster (Hadoop Google cluster) you have two options to check the job history including the ones that are running:
By command line from the master: yarn application -list, this option sometimes needs additional configuration. If you have troubles, this link will be useful.
By UI. Dataproc enables you to access the Spark Web UI, it improves monitoring tasks. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use socks proxy.
Hope the information above help you.

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster ?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.getBuilder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to Yarn happens before SparkConf.setMaster.
When you use --master yarn --deploy-mode cluster, Spark will run its main method in your local machine and upload the jar to run on Yarn. Yarn will allocate a container as the application master to run the Spark driver, a.k.a, your codes. SparkConf.setMaster("local") runs inside a Yarn container, and then it creates SparkContext running in the local mode, and doesn't use the Yarn cluster resources.
I recommend that not setting master in your codes. Just use the command line --master or the MASTER env to specify the Spark master.
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It even got improved in the recent versions of Spark where it adds a nice-looking application name (aka appName) so you don't have to set it (or even...please don't...hardcore it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of when you could have setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.

Spark pyspark vs spark-submit

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when i am running the commands though pyspark they also run on the "cluster", right? They do not run on the master node only, right?
There is no practical difference between these two. If not configured otherwise both will execute code in a local mode. If master is configured (either by --master command line parameter or spark.master configuration) corresponding cluster will be used to execute the program.
If you are using EMR , there are three things
using pyspark(or spark-shell)
using spark-submit without using --master and --deploy-mode
using spark-submit and using --master and --deploy-mode
although using all the above three will run the application in spark cluster, there is a difference how the driver program works.
in 1st and 2nd the driver will be in client mode whereas in 3rd the
driver will also be in the cluster.
in 1st and 2nd, you will have to wait untill one application complete
to run another, but in 3rd you can run multiple applications in
parallel.
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
..when i am running the commands though pyspark they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, driver still executes standard Python code, but the driver is a different machine to the one that you call spark-submit from.
Pyspark compare to Scala spark and Java Spark have extreme differences, for Python spark in only support YARN for scheduling the cluster.
If you are running python spark on a local machine, then you can use pyspark. If in the cluster, use the spark-submit.
If you have any dependencies in your python spark job, you need a zip file for submission.

Resources