Is it possible to get output from my Spark App submitted in cluster mode? If so, how?
I'm running a simple Spark application using Python. The program just sets up a SparkContext and prints "This app ran successfully" to the screen. When I submit this app with the following:
spark-submit --deploy-mode client --master local[*] foo.py
it runs successfully and prints out the message.
However, when I run the same app with:
spark-submit --deploy-mode cluster --master yarn-cluster foo.py
it runs successfully, but I get no output.
While I've been using Spark for a few months now, I'm relatively new to submitting apps in cluster mode, so any help/documentation would be great!
You can save "This app ran successfully" to an external storage system, such as HDFS:
sc.parallelize(['This app ran successfully'], 1).saveAsTextFile(path='hdfs:///somewhere/you/want')
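For completeness, a minimal self-contained sketch of that approach (the HDFS path is only a placeholder, as in the snippet above):

# Sketch only: in cluster mode the driver's stdout is not printed back on the
# submitting machine, so persist the message somewhere external instead.
from pyspark import SparkContext

sc = SparkContext(appName="foo")
sc.parallelize(["This app ran successfully"], 1) \
    .saveAsTextFile("hdfs:///somewhere/you/want")  # placeholder output path
sc.stop()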
Related
I have a node where I have installed Spark in YARN mode. When I run an application with
sudo ./usr/bin/spark-submit --master yarn --deploy-mode client MySparkCode.py
it runs fine.
When I connect to the Spark history server at http://localhost:18089/ I can see my submitted application.
But when I go to the YARN resource manager web UI at http://localhost:8088/cluster/apps, my application is not showing up at all. Did I do something wrong? Shouldn't my application be shown there?
Try cluster mode:
sudo ./usr/bin/spark-submit --master yarn --deploy-mode cluster MySparkCode.py
I want to run my Spark application on my Hortonworks Data Platform. Since this setup has no standalone Spark master, I want to run as a YARN client.
I am trying to create the SparkSession like this:
SparkSession
    .builder()
    .master("yarn-client")
    .appName("my-app")
    .getOrCreate()
I know I am missing some properties to let the Spark client know where my YARN server is running, but I can't seem to find those properties.
Currently the app just hangs at initialization with no error or exception.
Any ideas what I am missing?
It looks like you're trying to run your app locally while your Hortonworks HDP is somewhere else.
Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration.
So your app should be run from Hortonworks itself, which has all the Hadoop configuration in place.
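As a rough PySpark sketch of that point (the configuration directory below is an assumption; use wherever your cluster's Hadoop client configuration actually lives), the master is simply "yarn" and Spark locates the ResourceManager through the Hadoop configuration in the environment:

# Sketch only: assumes HADOOP_CONF_DIR / YARN_CONF_DIR are exported in the shell
# and point at the cluster's Hadoop client configuration, e.g.
#   export HADOOP_CONF_DIR=/etc/hadoop/conf
#   export YARN_CONF_DIR=/etc/hadoop/conf
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .master("yarn")        # deploy mode defaults to client
         .appName("my-app")
         .getOrCreate())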
I have a simple Spark app that reads the master from a config file:
new SparkConf()
.setMaster(config.getString(SPARK_MASTER))
.setAppName(config.getString(SPARK_APPNAME))
What will happen when I run my app as follows:
spark-submit --class <main class> --master yarn <my jar>
Is my master going to be overwritten?
I prefer having the master provided in the standard way so I don't need to maintain it in my configuration, but then the question is: how can I run this job directly from IDEA? This isn't an application argument but a spark-submit argument.
Just for clarification my desired end product should:
when run in a cluster using --master yarn, it will use this configuration
when run from IDEA, it will run with local[*]
Do not set the master in your code.
In production you can use the --master option of spark-submit, which tells Spark which master to use (yarn in your case). The value of spark.master in the spark-defaults.conf file will also do the job (--master takes priority over the property in the configuration file).
In an IDE... well, I know that in Eclipse you can pass a VM argument in the Run Configuration, for example -Dspark.master=local[*] (https://stackoverflow.com/a/24481688/1314742).
In IDEA I think it is not too different; you can check here for how to add VM options.
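To make that precedence concrete, a small PySpark sketch (the app name is illustrative; the script is expected to be launched via spark-submit or with spark.master supplied externally):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app")   # note: no setMaster() in the code
# The master is then resolved from, in order of priority:
#   1. anything set explicitly on this SparkConf (none here)
#   2. the --master flag given to spark-submit
#   3. spark.master in conf/spark-defaults.conf
sc = SparkContext(conf=conf)
print(sc.master)   # shows whichever source won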
I have only a single machine and want to run Spark jobs in Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out Mesos first to check whether it can utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways, but without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I got another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give other instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn 1 or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario and it worked.
One thing I did differently: I used the machine's IP address instead of "localhost" and "127.0.0.1".
So just try again, and check http://your_dispatcher:8081 (in a browser) to see if it is up.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like the following:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.5.0",
"submissionId" : "driver-20151006164749-0001",
"success" : true
}
When I got the same error log as yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
I am having a little problem while running similar code in yarn-client mode as well as yarn-cluster mode. My code executes perfectly when I run it in client mode, but fails when made to run in yarn-cluster mode.
It throws a file-not-found exception, stating that the pyspark.zip file could not be found. Any insight into this would be helpful.
In yarn-cluster mode, the driver runs in the Application Master (inside a YARN container). In yarn-client mode, it runs in the client.
In yarn-cluster mode, the spark-shell is not supported.
Coming back to your problem: which version of Spark are you using? In versions below 1.4, running pyspark on YARN is limited to yarn-client mode (see SPARK-5162).