Configuring spark-submit to a remote AWS EMR cluster - apache-spark

We are building an airflow server on an EC2 instance that communicates to an EMR cluster to run spark jobs. We are trying to submit a BashOperator DAG that runs a spark-submit command for a simple wordcount application. Here is our spark submit command below:
./spark-submit --deploy-mode client --verbose --master yarn wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/ ;
We're getting the following error: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
So far we've set HADOOP_CONF_DIR and YARN_CONF_DIR to /etc/hadoop/ in the .bashrc on our EC2 instance and have copied spark-env.sh from the EMR cluster to /etc/hadoop/ on the EC2 instance.
We aren't sure which files we are supposed to copy into the HADOOP_CONF_DIR/YARN_CONF_DIR directory on the EC2 instance so that spark-submit sends the job to the EMR cluster running Spark. Has anyone had experience configuring a server to send Spark commands to a remote server? We would appreciate the help!

I think the issue is that you are running spark-submit on the EC2 machine. I would suggest creating the EMR cluster with a corresponding step. Here is an example from the Airflow repo itself.
Or, if you prefer using BashOperator, you should use the AWS CLI, namely the aws emr command.
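As a rough sketch of the aws emr route, the helper below builds an `aws emr add-steps` command that a BashOperator could run. It assumes wordcount.py has been uploaded to S3 (the path s3://bucket/wordcount.py is hypothetical, as is the cluster id), and the function names are made up for illustration.

```python
import shlex


def build_emr_add_step_command(cluster_id, script_s3_uri, script_args, step_name="wordcount"):
    """Return the argv for `aws emr add-steps` that queues one Spark step.

    cluster_id and script_s3_uri are placeholders supplied by the caller;
    step_name should not contain spaces or commas in this simple sketch.
    """
    # With Type=Spark, EMR passes these arguments to spark-submit on the cluster.
    spark_args = ["--deploy-mode", "cluster", script_s3_uri, *script_args]
    step = (
        f"Type=Spark,Name={step_name},ActionOnFailure=CONTINUE,"
        f"Args=[{','.join(spark_args)}]"
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


def as_bash_command(argv):
    """Quote the argv so it can be used as a single bash_command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    argv = build_emr_add_step_command(
        "j-XXXXXXXXXXXXX",  # hypothetical cluster id
        "s3://bucket/wordcount.py",  # assumed upload location of the script
        ["s3://bucket/inputwordcount.txt", "s3://bucket/outputbucket/"],
    )
    print(as_bash_command(argv))
```

The printed string could be passed as the bash_command of a BashOperator; the step then runs on the cluster, so no Hadoop configuration is needed on the Airflow host.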

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five node cluster on my network running HDFS, Spark, and managed by Yarn. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running example jars, the job is split up to the worker nodes and executes nicely.
I have a few questions on whether and how to run python/Pyspark files against this cluster.
If I have a Python file with PySpark calls somewhere else, like on my local dev laptop or in a Docker container, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? The methods I'm wondering about involve running spark-submit in the local/Docker environment while the file has SparkSession.builder.master() configured to point at the remote cluster.
Related, I see a configuration for --master in spark-submit, but the only yarn option is to pass "yarn" which seems to only queue locally? Is there a way to specify remote yarn?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000, or do I submit it to one of the Yarn ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well "YARN", not "remote Spark cluster". You set --master=yarn when running with spark-submit, and this will run against the configured yarn-site.xml in HADOOP_CONF_DIR environment variable. You can define this at the OS level, or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both are supplied, the value set in code takes precedence over the --master flag.
To run fully "in the cluster", also set --deploy-mode=cluster
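A minimal sketch of that setup, launched from Python: it passes HADOOP_CONF_DIR explicitly in the child environment instead of relying on the shell profile. The function name and the paths in the usage example are hypothetical.

```python
import os


def spark_submit_invocation(app, conf_dir, deploy_mode="client", app_args=()):
    """Build the argv and environment for a spark-submit run against YARN.

    conf_dir is assumed to hold the client-side Hadoop configuration
    (including yarn-site.xml) copied from the cluster.
    """
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = conf_dir  # tells Spark where to find yarn-site.xml
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app, *app_args]
    return argv, env


if __name__ == "__main__":
    argv, env = spark_submit_invocation("my_job.py", "/etc/hadoop/conf", deploy_mode="cluster")
    print(argv)
    # To actually launch it (requires a local Spark installation):
    # import subprocess; subprocess.run(argv, env=env, check=True)
```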
Is there a way to specify remote yarn?
As mentioned, this is configured from yarn-site.xml for providing resourcemanager location(s).
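To see which ResourceManager a given configuration directory points at, something like the following can help. It is only a sketch: it assumes the standard Hadoop XML layout (property elements with name and value children), and the function name is made up.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def resourcemanager_settings(conf_dir):
    """Return the yarn.resourcemanager.* properties from conf_dir/yarn-site.xml."""
    root = ET.parse(str(Path(conf_dir) / "yarn-site.xml")).getroot()
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name.startswith("yarn.resourcemanager."):
            settings[name] = prop.findtext("value", default="").strip()
    return settings
```

Calling resourcemanager_settings(os.environ["HADOOP_CONF_DIR"]) and checking properties such as yarn.resourcemanager.hostname or yarn.resourcemanager.address shows where --master yarn will try to connect. An empty result suggests the directory does not contain the cluster's configuration.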
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - The YARN resource manager has its own RPC protocol, not hdfs:// ... You can use a reader such as spark.read.text("hdfs://namenode:port/path") to read HDFS files, though. As mentioned, .master('yarn') or --master yarn are the only configs you need that are specific to Spark.
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to set up, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.

How to allow pyspark to run code on emr cluster

We use python with pyspark api in order to run simple code on spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we set up a Spark cluster locally and with Docker.
We would now like to start an EMR cluster and test the same code, but it seems that pyspark can't connect to the Spark cluster on EMR.
We opened ports 8080 and 7077 from our machine to the Spark master.
We are getting past the firewall, but it seems that nothing is listening on port 7077, and we get connection refused.
We found this explaining how to serve a job using the cli but we need to run it directly from pyspark api on the driver.
What are we missing here?
How can one start an emr cluster and actually run pyspark code locally on python using this cluster?
edit: running this code from the master itself works
As opposed to what was suggested, when connecting to the master using ssh, and running python from the terminal, the very same code (with proper adjustments for the master ip, given it's the same machine) works. No issues no problems.
How does this make sense given the documentation that clearly states otherwise?
You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In yarn mode, it will make your computer participate in the Spark protocol as a client. Thus it would require opening several ports and installing exactly the same Spark jars as on the EMR cluster.
From the spark-submit doc:
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)
A simple deploy strategy is:
1. Sync the code to the master node via rsync, scp or git:
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
2. SSH to the master node:
ssh -i ~/dataScienceKey.pem hadoop@${EMR_MASTER_IP}
3. Run spark-submit on the master node:
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)
There is also a way to submit the job directly to EMR without syncing. When Apache Livy is installed on the cluster, it listens on port 8998 by default. It is a REST web service that lets you submit jobs via a REST API. You can pass the same spark-submit parameters with a curl script from your machine. See doc
For interactive development we have also configured locally running Jupyter notebooks which automatically submit cell runs to Livy. This is done via the sparkmagic project.
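The Livy route can be sketched in Python instead of curl. This is an illustration only: it assumes Livy is reachable at the given URL, that the job file lives somewhere the cluster can read (the S3 path in the usage note is hypothetical), and the helper names are made up. It uses Livy's POST /batches endpoint with the file, args and conf fields.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, args=(), conf=None):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {"file": app_file, "args": list(args)}
    if conf:
        payload["conf"] = dict(conf)  # Spark properties, e.g. spark.executor.memory
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request, timeout=30):
    """Send the request and return Livy's JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, submit_batch(build_livy_batch_request("http://<master_ip_address>:8998", "s3://bucket/my-job.py")) would queue the job, provided port 8998 is reachable from your machine.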
According to this Amazon Doc, you can't do that:
Common errors
Standalone mode
Amazon EMR doesn't support standalone mode for Spark. It's not
possible to submit a Spark application to a remote Amazon EMR cluster
with a command like this:
SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");
Instead, set up your local machine as explained earlier in this
article. Then, submit the application using the spark-submit command.
You can follow the resource linked above to configure your local machine to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created your cluster to connect to the master node and submit Spark jobs from there:
ssh -i ~/path/ssh_key hadoop@<master_ip_address>
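The "connection refused" symptom from the question fits this explanation: without a standalone master, nothing listens on port 7077. A small probe like the one below (a sketch, with a made-up function name) can confirm whether a port accepts connections before digging further into Spark configuration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

For example, port_is_open("<master_ip_address>", 7077) checks the standalone master port, and port_is_open("<master_ip_address>", 22) checks SSH access to the master node.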

submit spark job from local to emr ssh setup

I am new to spark. I want to submit a spark job from local to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
here is the command as below:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: SparkSession creation never finishes, yet no error is reported. It seems to be an access issue.
From the AWS document, we know that when the master is set to yarn, YARN uses the config files I copied from EMR (yarn-site.xml) to find the master and worker nodes.
My EMR cluster is located in a VPC, which needs a special SSH config to access. How can I add this info to YARN so that it can reach the remote cluster and submit the job?
I think the resolution proposed in aws link is more like - create your local spark setup with all dependencies.
If you don't want to do a local Spark setup, I would suggest one of these easier options:
1. Livy: for this your EMR setup should have Livy installed. Check this, this, this and you should be able to infer from this
2. EMR ssh: this requires you to have aws-cli installed locally, cluster id and pem file used while creating emr cluster. Check this
Eg. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (This prints command output on console though)

How to execute "spark-submit" against remote spark master?

Suppose I've got a remote spark cluster. I can log in a remote spark cluster host with ssh and run spark-submit with an example like that:
$SPARK_HOME/bin/spark-submit /usr/lib/spark2/examples/src/main/python/pi.py
Now I've installed spark on my laptop but I don't run it.
I want to run $SPARK_HOME/bin/spark-submit on my laptop against the remote spark cluster host. How can I do it ?
Yes you can provide the remote master url in this command, e.g.
$SPARK_HOME/bin/spark-submit --master spark://url_to_master:7077 /usr/lib/spark2/examples/src/main/python/pi.py

Spark submit from application running in Mesos DCOS cluster

I have a Mesos DCOS cluster running on AWS with Spark installed via the dcos package install spark command. I am able to successfully execute Spark jobs using the DCOS CLI: dcos spark run ...
Now I would like to execute Spark jobs from a Docker container running inside the Mesos cluster, but I'm not quite sure how to reach the running instance of spark. The idea would be to have a docker container execute the spark-submit command to submit a job to the Spark deployment instead of executing the same job from outside the cluster with the DCOS CLI.
Current documentation seems to be focused only on running Spark via the DCOS CLI - is there any way to reach the spark deployment from another application running inside the cluster?
The DCOS IoT demo tries something similar: https://github.com/amollenkopf/dcos-iot-demo
They run a Spark Docker container and spark-submit in a Marathon app. Check this Marathon descriptor: https://github.com/amollenkopf/dcos-iot-demo/blob/master/spatiotemporal-esri-analytics/rat01.json
