Spark on Mesos RpcEndpointNotFoundException - apache-spark

I cannot launch a spark job on Mesos, when it starts automatically gives this error:
"Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find
endpoint: spark://CoarseGrainedScheduler#10.32.8.178:59737"
Could be because of mismatch between versions? If I launch an example that is brought with the distribution works perfectly.
thanks

Works now. It was application fault that I did not put correctly a data input path.
In mesos you have two options to deploy, Cluster Mode or Client Mode. I chose cluster mode and I have a spark daemon (MesosClusterDispatcher) that is always listening to spark jobs, this is why I use mesos://spark-mesos-dispatcher.marathon.mesos:7077
Thanks Jacek!

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 google pages on this but, all of them tell me to just put --master yarn in the spark-submit command. But, in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.

Standalone Spark application in IntelliJ

I am trying to run a spark application (written in Scala) on a local server for debug. It seems that YARN is the default in the version of spark (2.2.1) that I have in the sbt build definitions, and according to an error I'm consistently getting, there is no spark/YARN server listening:
Client:920 - Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number
According to netstat indeed there is really no port 8032 on my local server, in listening state.
How would I typically accomplish running my spark application locally, in a way bypassing this problem? I only need the application to process a small amount of data for debug, and hence would like to be able to run locally, without reliance on specific SPARK/YARN installations and setups on the local server ― that would be an ideal debug setup.
Is that possible?
My sbt definitions already bring in all the necessary spark and spark.yarn jars. The problem also reproduces when running the same project in sbt, outside of IntelliJ.
You can add this property to VM options in your debug configurations instead of hardcoding inside the code
-Dspark.master=local[2]
You could submit spark application in local mode with .master("local[*]") if you have to test pipeline with miniscule data.
Full code:
val spark = SparkSession
.builder
.appName("myapp")
.master("local[*]")
.getOrCreate()
For spark-submit use --master local[*] as one of the arguments. Refer this: https://spark.apache.org/docs/latest/submitting-applications.html
Note: Do not hard code master in your codebase, always try to supply these variables from commandline. This makes application reusable for local/test/mesos/kubernetes/yarn/whatever.

Spark jobs not showing up in Hadoop UI in Google Cloud

I created a cluster in Google Cloud and submitted a spark job. Then I connected to the UI following these instructions: I created an ssh tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via ssh and run spark-shell, this "job" does show up in the hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.

How to set up Spark cluster on laptop?

I am completely new at Spark and try to run a tutorial example, which counts the number of lines containing 'a' and 'b' in a text file in the local file system.
I am running it with SparkContext with master = "local", i.e. Spark is running in the same JVM. Now I would like to try it in "cluster mode".
So I would like to run a Spark cluster of a cluster manager and two worker nodes locally on my Mac laptop. What is the easiest way to do that ?
Quoting the official documentation about Spark Standalone Mode:
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
In other words, you should start the standalone Master first (using ./sbin/start-master.sh) followed by starting one or more standalone Workers (using ./sbin/start-slave.sh).
Quoting the docs again:
Once you have started a worker, look at the master's web UI (http://localhost:8080 by default)
You're done. Congrats!
If you are looking to learn various ways to use SPARK I would suggest you to download the CLOUDERA quick start VM's which will give a simple cluster setup.
All you need to do is download the quick start VM and play around with the settings accordingly.
The quick start VM can be found here
Reference:Cloudera VM

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a spark job running in client mode needs to connect to the nodes directly and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.

Resources