How to debug a Spark job on Dataproc? - apache-spark

I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?

This tutorial assumes the following:
You know how to create GCP Dataproc clusters, either by API calls, cloud shell commands or Web UI
You know how to submit a Spark Job
You have permissions to launch jobs, create clusters and use Compute Engine instances
After some attempts, I've discovered how to debug a Dataproc Spark job running on a cluster from your local machine.
As you may know, you can submit a Spark job by using the Web UI, by sending a request to the Dataproc API, or with the gcloud dataproc jobs submit spark command. Whichever way you choose, you start by adding the following key-value pair to the properties field of the SparkJob: spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT, where REMOTE_PORT is the port on the worker where the driver will be listening.
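With the gcloud CLI it could look roughly like this (a sketch: the cluster name, region, main class and jar location are placeholders, not from the original question; the ^#^ prefix changes gcloud's list delimiter so the commas inside the JDWP string are preserved):
gcloud dataproc jobs submit spark \
  --cluster=CLUSTER_NAME \
  --region=REGION \
  --class=com.example.MySparkJob \
  --jars=gs://YOUR_BUCKET/my-spark-job.jar \
  --properties='^#^spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT'
With suspend=y the driver blocks until a debugger attaches, which gives you time to set up the connection described below.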
Chances are your cluster is on a private network and you need to create an SSH tunnel to REMOTE_PORT. If that's not the case, you're lucky: you just need to connect from your IDE to the worker's public IP on the specified REMOTE_PORT.
Using IntelliJ it would be like this:
[screenshot: IntelliJ Remote JVM Debug run configuration, host=worker-ip, port=9094]
where worker-ip is the worker that is listening (I've used 9094 as the port this time). After a few attempts I realized it's always worker number 0, but you can connect to it and check whether there is a process listening on the port using netstat -tulnp | grep REMOTE_PORT
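If you want to double-check before attaching the debugger, something along these lines should work (assuming the default Dataproc worker naming CLUSTER_NAME-w-0 and that you have SSH access to it):
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE \
  --command="sudo netstat -tulnp | grep REMOTE_PORT"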
If for whatever reason your cluster does not have a public IP, you need to set up an SSH tunnel from your local machine to the worker. After specifying your ZONE and PROJECT, you create a tunnel to REMOTE_PORT:
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE -- -4 -N -L LOCAL_PORT:CLUSTER_NAME-w-0:REMOTE_PORT
Then set the debug configuration in your IDE to point to host=localhost (127.0.0.1) and port=LOCAL_PORT.
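Before attaching, you can quickly check that the tunnel is up (a tiny sketch, assuming netcat is available on your machine):
nc -vz localhost LOCAL_PORT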

Related

Spark specify ssh port for worker nodes

I'm trying to set up a local Spark cluster. When I add the IP addresses of the workers to spark/conf/workers it tries to ssh into them on the default port 22 when I run sbin/start-all.sh. I have my ssh ports set differently for security reasons. Is there an option I can use to configure spark to use alternate ports for ssh from master to workers, etc?
You should add the following option to /path/to/spark/conf/spark-env.sh:
# Change 2222 to whatever port you're using
SPARK_SSH_OPTS="-p 2222"
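For example (a sketch; the port and the Spark path are placeholders), you can append the option and restart the standalone daemons so it takes effect:
# Append the SSH options and restart the standalone daemons
echo 'SPARK_SSH_OPTS="-p 2222"' >> /path/to/spark/conf/spark-env.sh
/path/to/spark/sbin/stop-all.sh
/path/to/spark/sbin/start-all.sh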

Unable to get metrics from PrometheusServlet on Databricks Spark 3.1.1

Trying to get Prometheus metrics with a Grafana dashboard working for Databricks clusters on AWS, but I cannot seem to get connections on the ports as required. I've tried a few different setups, but will focus on PrometheusServlet in this question as it seems like it should be the quickest path to glory.
PrometheusServlet - I put this in my metrics.properties file using an init script on each worker:
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF"
I also have "spark.ui.prometheus.enabled true" and "spark.executor.processTreeMetrics.enabled true" set in the Spark config options for the Databricks job.
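For reference, on a vanilla (non-Databricks) Spark 3.x deployment those settings would expose endpoints roughly like the ones below; the ports and paths on Databricks may differ, so treat this as an assumption rather than a verified Databricks layout:
# Driver metrics from the PrometheusServlet sink (Spark UI port, 4040 by default)
curl http://DRIVER_HOST:4040/metrics/prometheus
# Executor metrics exposed when spark.ui.prometheus.enabled is true
curl http://DRIVER_HOST:4040/metrics/executors/prometheus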
I get a connection refused when trying to hit the worker URL on anything but port 8080. On port 8080 I get a weird binary response ("P%") when I try to connect via curl, and a bad SSL cert error when I try to connect via the browser. I've opened up the necessary ports on the security group associated with the Spark workers. Trying to add a worker in Grafana just results in a 'Bad Gateway' error.
Has anyone gotten the PrometheusServlet working on Databricks clusters? Is there another way I should be doing this? This is the blog I was following for reference, as the PrometheusServlet documentation is pretty hard to find: https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
I'm running Databricks 8.3 runtime, Spark 3.1.1.

Spark Worker - Change web ui host in standalone mode

When I view the master node's web ui, it shows all my current workers attached to the cluster.
https://spark.apache.org/docs/3.0.0-preview/web-ui.html
The issue that I am having though is that the IP address it uses for the worker nodes in the web ui is incorrect. Is there a way to change the worker's web ui host/ip that is used in the master's web ui?
Reading through the documentation, there appears to be "SPARK_WORKER_WEBUI_PORT" which sets the port for the worker but there doesn't seem to be a "SPARK_WORKER_WEBUI_HOST".
http://spark.apache.org/docs/latest/spark-standalone.html
To provide more context, I currently have a Spark cluster that is deployed in standalone mode. The Spark cluster (master and slaves) is entirely behind a router (NAT). The workers bind to the master using their internal IP addresses, and I set up port forwarding to route external traffic to each of the master and slaves. The issue is that, since my workers bind to the master using their internal IP addresses, the master node's web UI shows those internal addresses, which makes the worker nodes' web UIs inaccessible to everyone outside of my NAT. If there is a way to specifically set the IP address to use for each of my workers' web UIs, then this would resolve the problem. Thanks!
After more research, I determined that the environment variable I was looking for was: SPARK_PUBLIC_DNS
http://spark.apache.org/docs/latest/spark-standalone.html
This allowed me to set a different external host name for my workers.
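A minimal sketch of how it can be set (the hostname is just a placeholder) is to export it in conf/spark-env.sh on each worker before starting it:
# conf/spark-env.sh on the worker; the hostname is a hypothetical example
export SPARK_PUBLIC_DNS=worker1.example.com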

Bootstrap AKS agent nodes through terraform

I am currently using terraform to create a k8s cluster, which is working perfectly fine. Once the nodes are provisioned, I want to run a few bash commands on any one of the nodes. So far, null_resource seems like an option, since it is a cluster and we are unaware of the node names/IPs. However, I am unable to determine what the value in the connection block should be, since azurerm_kubernetes_cluster does not export the IP address of the load balancer or the VM names. The question mark below needs the correct value:
resource "null_resource" "cluster" {
triggers = { "${join(",", azurerm_kubernetes_cluster.k8s.id)}" }
connection = { type = ssh
user = <user>
password = <password>
host = <?>
host_key = <pub_key>
}
}
Any help is appreciated!
AKS does not expose its nodes to the Internet; you can only reach the nodes from inside the cluster. If you want to run a few bash commands on the nodes, you can SSH to them using a pod as a helper; see the documented steps for SSH node access.
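On more recent clusters, a quick way to get a shell on a node without exposing it publicly is kubectl debug (a sketch, assuming a reasonably recent kubectl; the node name and the debug image are placeholders):
kubectl get nodes
kubectl debug node/NODE_NAME -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
# inside the debug pod the node's filesystem is mounted under /host
chroot /host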
Also, you can add NAT rules for the nodes in the Load Balancer and then SSH to the nodes through the Load Balancer's public IP. But it's not a secure way, so I don't suggest it.
I would recommend just running a DaemonSet that performs the bash commands on the nodes (a sketch follows below), because any scale or update operation will bring up nodes that don't have the updated config you applied, or remove the ones that do.
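A minimal sketch of that idea (assumptions: privileged pods are allowed on the node pool, and the image and the echoed command are placeholders); it uses nsenter to run a command in the host's namespaces on every node:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-bootstrap
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-bootstrap
  template:
    metadata:
      labels:
        app: node-bootstrap
    spec:
      hostPID: true
      containers:
      - name: bootstrap
        image: ubuntu:22.04
        securityContext:
          privileged: true
        command:
        - nsenter
        - --target=1
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - sh
        - -c
        - "echo 'node bootstrap ran' > /tmp/bootstrap.log && sleep infinity"
EOF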
There was no straightforward solution for this one. A static IP was not the right way to do it, so I ended up writing a wrapper around terraform. I did not want to run my init scripts on every node that comes up, only on one of the nodes. So essentially, the wrapper first tells terraform to deploy only one node, which executes cloud-init. After that, it calls terraform again to scale up and bring up the rest of the desired number of instances. In the cloud-init script, I have a check using kubectl get nodes: if it reports more than one node, I simply skip the cloud-init commands.
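That guard could look roughly like this inside the cloud-init script (a sketch; the actual init commands are whatever needs to run exactly once):
# Run the one-time init only on the first node that joins the cluster
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
if [ "$NODE_COUNT" -gt 1 ]; then
  echo "cluster already has nodes, skipping one-time init"
  exit 0
fi
# ... one-time init commands go here ...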

Spark standalone cluster doesn't accept connections

I'm trying to run the simplest Spark standalone cluster on an Azure VM. I'm running a single master, with a single worker running on the same machine. I can access the Web UI perfectly, and can see that the worker is registered with the master.
But I can't connect to this cluster using spark-shell on my laptop. When I look in the logs, I see
15/09/27 12:03:33 ERROR ErrorMonitor: dropping message [class akka.actor.ActorSelectionMessage]
for non-local recipient [Actor[akka.tcp://sparkMaster@40.113.XXX.YYY:7077/]]
arriving at [akka.tcp://sparkMaster@40.113.XXX.YYY:7077] inbound addresses
are [akka.tcp://sparkMaster@somehostname:7077]
akka.event.Logging$Error$NoCause$
Now I think the reason why this is happening is that on Azure, every virtual machine sits behind a type of firewall/load balancer. I'm trying to connect using the Public IP that Azure tells me (40.113.XXX.YYY), but Spark refuses to accept connections because this is not the IP of an interface.
Since this IP is not assigned to the machine, I can't bind to it either.
How can I get Spark to accept these packets as well?
Thanks!
You can try setting SPARK_MASTER_IP in spark-env.sh to the IP address instead of the hostname.
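A minimal sketch of what that suggests (assuming an older standalone Spark where SPARK_MASTER_IP is still honored; newer releases use SPARK_MASTER_HOST instead):
# conf/spark-env.sh on the master
export SPARK_MASTER_IP=40.113.XXX.YYY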
I got the same problem and was able to resolve it by finding the configured --ip parameter of the command line that runs the Spark master:
$ ps aux | grep spark
[bla bla...] org.apache.spark.deploy.master.Master --ip YOUR_CONFIGURED_IP [bla bla...]
Then I was able to connect to my cluster by using exactly the same string as YOUR_CONFIGURED_IP:
spark-shell --master spark://YOUR_CONFIGURED_IP:7077
