Spark standalone cluster doesn't accept connections - Azure

I'm trying to run the simplest Spark standalone cluster on an Azure VM. I'm running a single master, with a single worker running on the same machine. I can access the Web UI perfectly, and can see that the worker is registered with the master.
But I can't connect to this cluster using spark-shell from my laptop. When I look in the logs, I see:
15/09/27 12:03:33 ERROR ErrorMonitor: dropping message [class akka.actor.ActorSelectionMessage]
for non-local recipient [Actor[akka.tcp://sparkMaster@40.113.XXX.YYY:7077/]]
arriving at [akka.tcp://sparkMaster@40.113.XXX.YYY:7077] inbound addresses
are [akka.tcp://sparkMaster@somehostname:7077]
akka.event.Logging$Error$NoCause$
Now I think the reason this is happening is that on Azure, every virtual machine sits behind a kind of firewall/load balancer. I'm trying to connect using the public IP that Azure reports (40.113.XXX.YYY), but Spark refuses to accept the connection because this is not the IP of any local interface.
Since this IP isn't assigned to the machine, I can't bind to it either.
How can I get Spark to accept these packets as well?
Thanks!

You can try setting SPARK_MASTER_IP in spark-env.sh to the IP address instead of the hostname.
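A minimal sketch of what that might look like (the address is a placeholder; Spark 2.x and later name the variable SPARK_MASTER_HOST instead of SPARK_MASTER_IP, and the master needs a restart after editing):
# conf/spark-env.sh on the master node
export SPARK_MASTER_IP=<ip-address-the-master-should-advertise>   # SPARK_MASTER_HOST on Spark 2.x+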

I had the same problem and was able to resolve it by checking the --ip parameter in the command line that runs the Spark master:
$ ps aux | grep spark
[bla bla...] org.apache.spark.deploy.master.Master --ip YOUR_CONFIGURED_IP [bla bla...]
Then I was able to connect to my cluster by using exactly the same string as YOUR_CONFIGURED_IP:
spark-shell --master spark://YOUR_CONFIGURED_IP:7077

Related

Unable to get metrics from PrometheusServlet on Databricks Spark 3.1.1

Trying to get Prometheus metrics with a Grafana dashboard working for Databricks clusters on AWS, but I cannot seem to get connections on the ports as required. I've tried a few different setups, but will focus on PrometheusServlet in this question as it seems like it should be the quickest path to glory.
PrometheusServlet - I put this in my metrics.properties file using an init script on each worker:
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF"
I also have "spark.ui.prometheus.enabled true" and "spark.executor.processTreeMetrics.enabled true" set in the Spark config options for the Databricks job.
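In the cluster's Spark config box those are typically entered as plain space-separated key/value lines, i.e. something like:
spark.ui.prometheus.enabled true
spark.executor.processTreeMetrics.enabled true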
I get 'connection refused' when trying to hit the worker URL at anything but port 8080. On port 8080 I get a weird binary response ("P%") when I try to connect via curl, and a bad SSL cert error when I try to connect via the browser. I've opened up the necessary ports on the security group associated with the Spark workers. Trying to add a worker in Grafana just results in a 'Bad Gateway' error.
Has anyone gotten the PrometheusServlet working on Databricks clusters? Is there another way I should be doing this? This is the blog I was following for reference, as the PrometheusServlet documentation is pretty hard to find: https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
I'm running Databricks 8.3 runtime, Spark 3.1.1.

How to debug a Spark job on Dataproc?

I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?
This tutorial assumes the following:
You know how to create GCP Dataproc clusters, either by API calls, cloud shell commands or Web UI
You know how to submit a Spark Job
You have permissions to launch jobs, create clusters and use Compute Engine instances
After some attempts, I've discovered how to debug a Dataproc Spark job running on a cluster from your local machine.
As you may know, you can submit a Spark job either by using the Web UI, sending a request to the Dataproc API, or using the gcloud dataproc jobs submit spark command. Whichever way, you start by adding the following key-value pair to the properties field of the SparkJob: spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT, where REMOTE_PORT is the port on the worker where the driver will be listening.
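As a rough sketch of how that submission might look from the CLI (the cluster name, region, jar, class, and port 9094 are placeholders here):
# Sketch only: my-cluster, us-central1, the jar, the class, and port 9094 are placeholder values.
# The leading ^;^ switches the --properties delimiter to ';' so the commas inside the jdwp
# agent string survive (see `gcloud topic escaping`).
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties='^;^spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=9094'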
Chances are your cluster is on a private network and you need to create an SSH tunnel to the REMOTE_PORT. If that's not the case, you're lucky and you just need to connect to the worker using its public IP and the specified REMOTE_PORT in your IDE.
Using IntelliJ, that means a Remote debug configuration pointing at worker-ip and the REMOTE_PORT (the original answer included a screenshot of the dialog), where worker-ip is the worker that is listening (I used 9094 as the port this time). After a few attempts, I realized it's always worker number 0, but you can connect to it and check whether there is a process listening on that port using netstat -tulnp | grep REMOTE_PORT
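For example, something like this (cluster name, project, zone, and the 9094 port are placeholders) runs that check on worker 0 without opening an interactive session:
# Check from worker 0 whether the driver JVM is listening on the debug port
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE \
  --command='netstat -tulnp | grep 9094'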
If for whatever reason your cluster does not have a public IP, you need to set up an SSH tunnel from your local machine to the worker. After specifying your ZONE and PROJECT, you create a tunnel to REMOTE_PORT:
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE -- -4 -N -L LOCAL_PORT:CLUSTER_NAME-w-0:REMOTE_PORT
And you set your debug configuration in your IDE pointing to host=localhost/127.0.0.1 and port=LOCAL_PORT.

How to connect Cassandra from gcloud cluster using python

We try to connect to the cluster from a Jupyter notebook using a bash command:
!gcloud compute --project "project_name" ssh --zone "us-central1-a" "cassandra-abc-m"
After that we try to connect using:
import cql
con= cql.connect(host="127.0.0.1",port=9160,keyspace="testKS")
cur=con.cursor()
result=cur.execute("select * from TestCF")
How do we connect the two?
Kindly help me with it.
As I understand the question, you are SSHing out to a Google Compute (GCP) instance (running Cassandra) and are then trying to run a Python script to connect to the local node. I see two problems in your cql.connect line.
First, Cassandra does not use port 9160 for CQL. CQL uses port 9042. I find this point confuses people so much, that I recommend not setting port= at all. The driver will use the default, which should work.
Secondly, if you deployed Cassandra to a GCP instance, then you probably changed listen_address and rpc_address. This means Cassandra cannot bind to 127.0.0.1. You need to use the value defined in the yaml's rpc_address (or broadcast_rpc_address) property.
$ grep rpc_address cassandra.yaml
rpc_address: 10.19.17.5
In my case, I need to specify 10.19.17.5 whether I want to connect locally or remotely.
tl;dr:
Don't specify the port.
Connect to your external-facing IP address, as 127.0.0.1 will never work.
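As a quick sanity check from the instance itself (assuming cqlsh is installed), pointing it at the same address and the CQL native port should connect:
# cqlsh takes the host and, optionally, the port as positional arguments
cqlsh 10.19.17.5 9042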

Could not bind on a random free port error while trying to connect to spark master

I have a Spark master running on Amazon EC2.
I tried to connect to it from another EC2 instance using pyspark as follows:
spark = SparkSession.builder.appName("MyApp") \
.master("spark_url_as_obtained_in_web_ui") \
.getOrCreate()
The following were the errors:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-04-04 20:03:04 WARN Utils:66 - Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
............
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
I tried all the solutions as described here but to no avail:
Connecting to a remote Spark master - Java / Scala
All masters are unresponsive ! ? Spark master is not responding with datastax architecture
Spark Standalone Cluster - Slave not connecting to Master
Spark master-machine:7077 not reachable
spark submit "Service 'Driver' could not bind on port" error
https://community.hortonworks.com/questions/8257/how-can-i-resolve-it.html
What could be going wrong??
Set spark.driver.bindAddress to your local IP like 127.0.0.1.
pyspark -c spark.driver.bindAddress=127.0.0.1
While creating the Spark session, set the configurations below:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName(str(name))
         .master("local[*]")
         .config("spark.driver.memory", "5g")
         .config("spark.driver.host", "10.51.4.110")         # machine IP
         .config("spark.driver.bindAddress", "10.51.4.110")  # machine IP
         .getOrCreate())
Sometimes, in addition to the bind address, we need to set the host address as well. In my case the system host address had changed to the system name and Spark showed a timeout error; after setting the host and bind address to the same value, it works fine.
* 10.51.4.110 is the local machine IP.
I had a similar issue on Windows 10 recently.
Resolved it by following the steps below:
Set the user environment variable SPARK_LOCAL_IP = 127.0.0.1
Restart the command line as an administrator
On macOS, with a standalone Spark cluster, setting SPARK_MASTER_HOST to 'localhost' instead of '127.0.0.1' solved the problem for me.
export SPARK_MASTER_HOST='localhost'

Issues in using spark cluster from outside LAN

I am trying to use a Spark cluster from outside the cluster itself.
The problem is that Spark binds to my local machine's private IP. It is able to connect to the master, but then the workers fail to connect to my machine (the driver) because of IP problems (they see my private IP, because Spark binds to my private IP).
I can see that from workers log:
"--driver-url" "spark://CoarseGrainedScheduler#PRIVATE_IP_MY_LAPTOP:34355"
Any help?
Try setting spark.driver.host (search for it in the Spark configuration documentation for more info) to your public IP; the workers will then use that address instead of the (automatically resolved) private IP.
Try setting spark.driver.bindAddress to 0.0.0.0 so that the driver program listens on all interfaces.
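As a sketch combining the two suggestions above (the master URL and the public IP are placeholders), both settings can be passed straight to spark-shell or spark-submit:
# Bind locally on all interfaces, but advertise the public IP so workers can reach the driver
spark-shell \
  --master spark://MASTER_HOST:7077 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=DRIVER_PUBLIC_IP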
