How to setup yarn client in code? - apache-spark

I want to run my spark application on my hortonworks data platform. As in this setup I don't have a spark master standalone I want to run as a yarn client.
I am trying to create the SparkSession like this:
SparkSession
.builder()
.master("yarn-client")
.appName("my-app")
.getOrCreate())
I know I am missing some properties to let spark client where my yarn server is running but I can't seem to find those properties.
Currently the app just hangs init with no error or exception.
Any ideas what I am missing?

It looks like you're trying to run your app locally while your Hortonworks HDP is somewhere else.
Unlike Spark standalone and Mesos modes, in which the master’s address
is specified in the --master parameter, in YARN mode the
ResourceManager’s address is picked up from the Hadoop configuration.
So your app should be run from Hortonworks itself, which has all the Hadoop configuration in place.

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five node cluster on my network running HDFS, Spark, and managed by Yarn. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running example jars, the job is split up to the worker nodes and executes nicely.
I have a few questions on whether and how to run python/Pyspark files against this cluster.
If I have a python file with a PySpark calls elsewhere else, like on my local dev laptop or a docker container somewhere, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? Methods that I'm wondering about involve running spark-submit in the local/docker environment and but the file has SparkSession.builder.master() configured to the remote cluster.
Related, I see a configuration for --master in spark-submit, but the only yarn option is to pass "yarn" which seems to only queue locally? Is there a way to specify remote yarn?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000, or do I submit it to one of the Yarn ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well "YARN", not "remote Spark cluster". You set --master=yarn when running with spark-submit, and this will run against the configured yarn-site.xml in HADOOP_CONF_DIR environment variable. You can define this at the OS level, or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both options are supplied, one will get overridden.
To run fully "in the cluster", also set --deploy-mode=cluster
Is there a way to specify remote yarn?
As mentioned, this is configured from yarn-site.xml for providing resourcemanager location(s).
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - The YARN resource manager has its own RPC protocol, not hdfs:// ... You can use spark.read("hdfs://namenode:port/path") to read HDFS files, though. As mentioned, .master('yarn') or --master yarn are the only configs you need that are specific for Spark.
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to setup, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.

What is the command to call Spark2 from Shell

I have two services for Spark in my cluster, one is with name of Spark(1.6 version) and another one is Spark2(2.0 Version). I am able to call Spark with below command.
spark-shell --master yarn
But not able to connect Spark2 service even after set "export SPARK_MAJOR_VERSION=2"
Can some one help me on.
I'm using CDH cluster and following command works for me.
spark2-shell --queue <queue-name-if-any> --deploy-mode client
If I remember, SPARK_MAJOR_VERSION only works with spark-submit
You would need to find the spark2 installation directory to use the other spark-shell
Sounds like you are in an HDP cluster, so look under /usr/hdp

How to allow pyspark to run code on emr cluster

We use python with pyspark api in order to run simple code on spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we setup a spark cluster locally and with dockers.
We would now like to start an emr cluster and test the same code. And seems that pyspark can't connect to the spark cluster on emr
We opened ports 8080 and 7077 from our machine to the spark master
We are getting past the firewall and just seems that nothing is listening on port 7077 and we get connection refused.
We found this explaining how to serve a job using the cli but we need to run it directly from pyspark api on the driver.
What are we missing here?
How can one start an emr cluster and actually run pyspark code locally on python using this cluster?
edit: running this code from the master itself works
As opposed to what was suggested, when connecting to the master using ssh, and running python from the terminal, the very same code (with proper adjustments for the master ip, given it's the same machine) works. No issues no problems.
How does this make sense given the documentation that clearly states otherwise?
You try to run pyspark (which calls spark-submit) form a remote computer outside the spark cluster. This is technically possible but it is not the intended way of deploying applications. In yarn mode, it will make your computer participate in the spark protocol as a client. Thus it would require opening several ports and installing exactly the same spark jars as on spark aws emr.
Form the spark submit doc :
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)
A simple deploy strategy is
sync code to master node via rsync, scp or git
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop#${EMR_MASTER_IP}:${TARGET_DIR}
ssh to the master node
ssh -i ~/dataScienceKey.pem hadoop#${EMR_HOST}
run spark-submit on the master node
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)
There is a way to submit the job directly to spark emr without syncing. Spark EMR runs Apache Livy on port 8998 by default. It is a rest webservice which allows to submit jobs via a rest api. You can pass the same spark-submit parameters with a curl script from your machine. See doc
For interactive development we have also configured local running jupyter notebooks which automatically submit cell runs to livy. This is done via the spark-magic project
According to this Amazon Doc, you can't do that:
Common errors
Standalone mode
Amazon EMR doesn't support standalone mode for Spark. It's not
possible to submit a Spark application to a remote Amazon EMR cluster
with a command like this:
SparkConf conf = new SparkConf().setMaster("spark://master_url:7077”).setAppName("WordCount");
Instead, set up your local machine as explained earlier in this
article. Then, submit the application using the spark-submit command.
You can follow the above linked resource to configure your local machine in order to submit spark jobs to EMR Cluster. Or more simpler, use the ssh key you specified when you create your cluster to connect to the master node and submit spark jobs:
ssh -i ~/path/ssh_key hadoop#$<master_ip_address>

submit spark job from local to emr ssh setup

I am new to spark. I want to submit a spark job from local to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
here is the command as below:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: sparksession creation is not able to be finished with no error. Seems an access issue.
From the aws document, we know that by given the master with yarn, yarn uses the config files I copied from EMR to know where is the master and slaves (yarn-site.xml).
As my EMR cluster is located in a VPC, which need a special ssh config to access, how could I add this info to yarn so it can access to the remote cluster and submit the job?
I think the resolution proposed in aws link is more like - create your local spark setup with all dependencies.
If you don't want to do local spark setup, I would suggest easier way would be, you can use:
1. Livy: for this you emr setup should have livy installed. Check this, this, this and you should be able to infer from this
2. EMR ssh: this requires you to have aws-cli installed locally, cluster id and pem file used while creating emr cluster. Check this
Eg. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (This prints command output on console though)

Apache Zeppelin and Spark deploy mode options

I understand it directly relates to the MASTER environment variable in conf/zeppelin-env.sh and whether the value is spark://<master_ip>:7077 or yarn-client, but when is Apache Zeppelin running Spark in client mode and when in cluster mode?
Spark is supporting three cluster manager types of Standalone and Hadoop YARN and Apache Mesos
And Zeppelin is supporting 4types of master is relevant with Spark
but Unfortunately Zeppelin doesn't support yarn cluster mode.

Resources