I have a Mesos DCOS cluster running on AWS with Spark installed via the dcos package install spark command. I am able to successfully execute Spark jobs using the DCOS CLI: dcos spark run ...
Now I would like to execute Spark jobs from a Docker container running inside the Mesos cluster, but I'm not quite sure how to reach the running instance of spark. The idea would be to have a docker container execute the spark-submit command to submit a job to the Spark deployment instead of executing the same job from outside the cluster with the DCOS CLI.
Current documentation seems to be focused only on running Spark via the DCOS CLI - is there any way to reach the spark deployment from another application running inside the cluster?
The DCOS IoT demo does something similar: https://github.com/amollenkopf/dcos-iot-demo
They run a Spark Docker image and call spark-submit from a Marathon app. Check this Marathon descriptor: https://github.com/amollenkopf/dcos-iot-demo/blob/master/spatiotemporal-esri-analytics/rat01.json
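To give a rough idea of the shape of such a descriptor (this is not a copy of the linked rat01.json), here is a minimal sketch that builds a Marathon app definition running spark-submit inside a Docker container. The app id, image name, job path and resource figures are placeholders I made up, and the mesos://leader.mesos:5050 master URL is an assumption about how the Mesos master is reachable from inside the cluster. Only the top-level fields (id, cmd, cpus, mem, instances, container) are standard Marathon app fields.

```python
import json


def marathon_spark_submit_app(app_id, image, master_url, app_path, cpus=1.0, mem=1024):
    """Build a Marathon app definition that runs spark-submit in a Docker container.

    Every value is supplied by the caller; nothing here is taken from the
    linked rat01.json descriptor.
    """
    # Assumes spark-submit is on the PATH inside the image.
    cmd = "spark-submit --master {} {}".format(master_url, app_path)
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


app = marathon_spark_submit_app(
    app_id="/spark-submit-job",              # hypothetical app id
    image="example/spark:latest",            # hypothetical image with Spark installed
    master_url="mesos://leader.mesos:5050",  # assumed Mesos master address inside the cluster
    app_path="/opt/jobs/my_job.py",          # hypothetical path inside the image
)
print(json.dumps(app, indent=2))
```

The printed JSON could be saved to a file and posted to Marathon like any other app definition.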
Any suggestions on which library or tool I should use to plot RAM, CPU and (optionally) GPU usage over time for a Spark app submitted through spark-submit to a Dockerized Spark cluster?
The Apache documentation suggests using memory_profiler with commands like:
python -m memory_profiler profile_memory.py
but after opening a remote shell on my master node:
docker exec -it spark-master bash
I can't launch my Spark apps locally, because I need to use the spark-submit command to submit them to the cluster.
Any suggestions? I launch the apps without YARN but in cluster mode through
/opt/spark/spark-submit --master spark://spark-master:7077 appname.py
I would also like to know whether I can use memory_profiler even though I need to use spark-submit.
We use Python with the PySpark API to run simple code on a Spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we set up a Spark cluster locally and with Docker.
We would now like to start an EMR cluster and test the same code, but it seems that PySpark can't connect to the Spark cluster on EMR.
We opened ports 8080 and 7077 from our machine to the Spark master.
We are getting past the firewall, but it seems that nothing is listening on port 7077, and we get connection refused.
We found this explaining how to submit a job using the CLI, but we need to run it directly from the PySpark API on the driver.
What are we missing here?
How can one start an EMR cluster and actually run PySpark code locally in Python using this cluster?
Edit: running this code from the master itself works.
Contrary to what was suggested, when we connect to the master over SSH and run Python from the terminal, the very same code (with the master IP adjusted, given it's the same machine) works without any issues.
How does this make sense, given that the documentation clearly states otherwise?
You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode, it makes your computer participate in the Spark protocol as a client. It would therefore require opening several ports and installing exactly the same Spark jars as on AWS EMR.
From the spark-submit documentation:
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)
A simple deployment strategy is:
Sync the code to the master node via rsync, scp or git
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
ssh to the master node
ssh -i ~/dataScienceKey.pem hadoop@${EMR_MASTER_IP}
run spark-submit on the master node
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)
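If you would rather script the sync-then-submit steps above than type them each time, a minimal sketch could look like this. It only builds the two argument lists (rsync, then ssh with spark-submit) and does not run anything. The function name and the 203.0.113.10 address are placeholders of mine, and it assumes spark-submit is on the PATH of the hadoop user on the master node.

```python
import os
import shlex


def deploy_commands(master_ip, key_path, target_dir, job_file, user="hadoop"):
    """Return (rsync_cmd, ssh_cmd) argument lists for the sync-then-submit workflow."""
    # Expand ~ ourselves: no shell will do it when the lists go to subprocess.
    key_path = os.path.expanduser(key_path)
    remote = "{}@{}".format(user, master_ip)
    rsync_cmd = [
        "rsync", "-avz", "-e", "ssh -i {}".format(key_path),
        # Create the target directory on the remote side before syncing.
        "--rsync-path=mkdir -p {} && rsync".format(target_dir),
        "--delete", "./", "{}:{}".format(remote, target_dir),
    ]
    # This string is interpreted by the remote shell, so quote its parts.
    remote_cmd = "cd {} && spark-submit --master yarn --deploy-mode cluster {}".format(
        shlex.quote(target_dir), shlex.quote(job_file)
    )
    ssh_cmd = ["ssh", "-i", key_path, remote, remote_cmd]
    return rsync_cmd, ssh_cmd


rsync_cmd, ssh_cmd = deploy_commands(
    master_ip="203.0.113.10",          # placeholder address, use your master node's IP
    key_path="~/dataScienceKey.pem",
    target_dir="spark_jobs",
    job_file="my-job.py",
)
print(rsync_cmd)
print(ssh_cmd)
```

Both lists can be passed to subprocess.run(..., check=True) one after the other from the project directory on the local machine.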
There is also a way to submit the job directly to Spark on EMR without syncing. When Livy is installed on the cluster, EMR runs Apache Livy on port 8998 by default. It is a REST web service that lets you submit jobs through a REST API. You can pass the same spark-submit parameters with a curl script from your machine. See the documentation.
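As a rough sketch of what a Livy submission looks like from Python rather than curl: the snippet below builds a POST request for Livy's /batches endpoint and prints it without sending it. The host name and the S3 path are placeholders of mine. Note that the file has to be somewhere the cluster can read (S3 or HDFS, for example), since Livy does not upload files from your machine.

```python
import json
import urllib.request


def livy_batch_request(livy_url, app_file, args=None, conf=None):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {"file": app_file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = livy_batch_request(
    "http://emr-master.example.com:8998",   # hypothetical master address
    "s3://my-bucket/jobs/my-job.py",        # hypothetical location readable by the cluster
    conf={"spark.submit.deployMode": "cluster"},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually submit the job: urllib.request.urlopen(req)
```

Port 8998 still has to be reachable from your machine, for instance through a security group rule or an SSH tunnel.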
For interactive development, we have also configured locally running Jupyter notebooks that automatically submit cell runs to Livy. This is done via the sparkmagic project.
According to this Amazon Doc, you can't do that:
Common errors
Standalone mode
Amazon EMR doesn't support standalone mode for Spark. It's not
possible to submit a Spark application to a remote Amazon EMR cluster
with a command like this:
SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");
Instead, set up your local machine as explained earlier in this
article. Then, submit the application using the spark-submit command.
You can follow the resource linked above to configure your local machine to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created the cluster to connect to the master node and submit Spark jobs from there:
ssh -i ~/path/ssh_key hadoop@<master_ip_address>
We are building an Airflow server on an EC2 instance that communicates with an EMR cluster to run Spark jobs. We are trying to submit a BashOperator DAG that runs a spark-submit command for a simple wordcount application. Here is our spark-submit command:
./spark-submit --deploy-mode client --verbose --master yarn wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/ ;
We're getting the following error: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
So far we have set HADOOP_CONF_DIR and YARN_CONF_DIR to /etc/hadoop/ in the .bashrc on our EC2 instance, and have copied spark-env.sh from the EMR cluster to /etc/hadoop/ on the EC2 instance.
We aren't sure which files we are supposed to copy into the HADOOP_CONF_DIR/YARN_CONF_DIR directory on the EC2 instance for spark-submit to send the job to the EMR cluster running Spark. If anyone has experience configuring a server to send Spark commands to a remote server, we would appreciate the help!
I think the issue is that you are running spark-submit on the EC2 machine. I would suggest creating the EMR cluster with a corresponding step. Here is an example from the Airflow repo itself.
Or, if you prefer using BashOperator, you should use the AWS CLI, namely the aws emr command.
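For the AWS CLI route, a minimal sketch of building an aws emr add-steps command with a Spark step might look like this. The cluster id is a placeholder, and it assumes wordcount.py has been uploaded to S3 (the s3://bucket/wordcount.py path is made up), since the step runs on the cluster rather than on the Airflow machine.

```python
import shlex


def emr_add_spark_step_command(cluster_id, step_name, spark_args):
    """Build the argv for `aws emr add-steps` with a single Spark step.

    Uses the CLI shorthand syntax; step_name and spark_args must not contain
    commas or spaces in this simple version.
    """
    step = "Type=Spark,Name={},ActionOnFailure=CONTINUE,Args=[{}]".format(
        step_name, ",".join(spark_args)
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


cmd = emr_add_spark_step_command(
    cluster_id="j-XXXXXXXXXXXXX",   # placeholder cluster id
    step_name="wordcount",
    spark_args=[
        "--deploy-mode", "cluster",
        "s3://bucket/wordcount.py",          # assumed S3 location of the script
        "s3://bucket/inputwordcount.txt",
        "s3://bucket/outputbucket/",
    ],
)
# A single shell string, e.g. for a BashOperator's bash_command.
bash_command = " ".join(shlex.quote(part) for part in cmd)
print(bash_command)
```

The Airflow machine then only needs the AWS CLI and credentials, not a local Spark or Hadoop configuration.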
I am new to Spark. I want to submit a Spark job from my local machine to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
Here is the command:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: SparkSession creation never finishes, and no error is reported. It seems to be an access issue.
From the AWS document, we know that when the master is set to yarn, YARN uses the config files I copied from EMR (yarn-site.xml) to find out where the master and slaves are.
As my EMR cluster is located in a VPC, which needs a special SSH config to access, how can I pass this information to YARN so that it can reach the remote cluster and submit the job?
I think the resolution proposed in the AWS link amounts to creating a local Spark setup with all dependencies.
If you don't want to set up Spark locally, I would suggest one of these easier options:
1. Livy: for this, your EMR setup should have Livy installed. Check this, this, this and you should be able to infer from this
2. EMR ssh: this requires the AWS CLI installed locally, the cluster ID, and the pem file used when creating the EMR cluster. Check this
E.g. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (this prints the command output on the console, though)
I have only a single machine and want to run Spark jobs in Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test Mesos first to check whether it can utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give separate instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn one or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower.
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or, if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
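To make the "omit it or spell it out" point concrete, here is a small sketch of a helper that builds the spark-submit argument list and only adds --deploy-mode when asked to. The helper name and app.jar are placeholders of mine; the mesos://localhost:5050 master URL is the one from the question.

```python
def spark_submit_args(master, app, deploy_mode=None, extra=()):
    """Build a spark-submit argv; deploy_mode=None leaves Spark's default (client)."""
    args = ["./bin/spark-submit", "--master", master]
    if deploy_mode is not None:
        args += ["--deploy-mode", deploy_mode]
    args += list(extra)
    args.append(app)
    return args


# Both invocations run the driver on the submitting machine (client mode).
implicit = spark_submit_args("mesos://localhost:5050", "app.jar")
explicit = spark_submit_args("mesos://localhost:5050", "app.jar", deploy_mode="client")
print(implicit)
print(explicit)
```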
I tried your scenario and it works.
One difference: I used the machine's IP address instead of "localhost" and "127.0.0.1".
So try again, and check in a browser whether http://your_dispatcher:8081 is reachable.
This is my spark-submit command:
$ spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see output like this:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.5.0",
"submissionId" : "driver-20151006164749-0001",
"success" : true
}
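If you are scripting the submission, the response above is plain JSON, so checking it is straightforward. A minimal sketch, assuming you have captured exactly this JSON text (the function name is mine):

```python
import json

response_text = """
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151006164749-0001",
  "success" : true
}
"""


def parse_submission(text):
    """Return (success, submission_id) from a CreateSubmissionResponse."""
    data = json.loads(text)
    if data.get("action") != "CreateSubmissionResponse":
        raise ValueError("unexpected response: {!r}".format(data.get("action")))
    return data.get("success") is True, data.get("submissionId")


ok, submission_id = parse_submission(response_text)
print(ok, submission_id)
```

The submission id is what you need later to look the driver up in the dispatcher UI.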
When I got the same error log as yours, I rebooted the machine and retried your steps. That also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
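Since the REST server is a plain HTTP endpoint, you can also query it yourself. The sketch below builds the status URL for a submitted driver. The /v1/submissions/status/<id> layout is my understanding of Spark's REST submission protocol and is not stated anywhere in this thread, so check it against your Spark version; the host and submission id are just the example values from the other answer.

```python
def submission_status_url(host, submission_id, port=6066):
    """Return the assumed REST status URL for a submitted driver.

    The port depends on what you submit to: 6066 is the port suggested in this
    answer, while the dispatcher in the question listens on 7077.
    """
    return "http://{}:{}/v1/submissions/status/{}".format(host, port, submission_id)


url = submission_status_url("192.168.11.79", "driver-20151006164749-0001")
print(url)
# To query it: urllib.request.urlopen(url).read() returns a JSON document.
```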