Unable to get metrics from PrometheusServlet on Databricks Spark 3.1.1 - apache-spark

Trying to get Prometheus metrics with a Grafana dashboard working for Databricks clusters on AWS, but I cannot seem to get connections on the required ports. I've tried a few different setups, but will focus on PrometheusServlet in this question as it seems like it should be the quickest path to glory.
PrometheusServlet - I put this in my metrics.properties file using an init script on each worker:
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF"
I also have "spark.ui.prometheus.enabled true" and "spark.executor.processTreeMetrics.enabled true" in the spark config options for the Databricks job
I get a connection refused when trying to hit the worker URL on anything but port 8080. On port 8080 I get a weird binary response "P%" when I connect via curl, and a bad SSL cert error when I connect via the browser. I've opened up the necessary ports on the security group associated with the Spark workers. Trying to add a worker in Grafana just results in a 'Bad Gateway' error.
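For reference, if the servlet were active I'd expect a check along these lines to return plain-text metrics. This is only a sketch: port 4040 is the default Spark UI port on the driver, and on Databricks the driver UI is typically proxied, so both the host and the port here are assumptions.
# hypothetical check; <driver-ip>:4040 is an assumption, not a verified Databricks endpoint
curl http://<driver-ip>:4040/metrics/prometheus
# executor metrics are served by the driver UI when spark.ui.prometheus.enabled is true
curl http://<driver-ip>:4040/metrics/executors/prometheus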
Has anyone gotten the PrometheusServlet working on Databricks clusters? Is there another way I should be doing this? This is the blog I was following for reference, as the PrometheusServlet documentation is pretty hard to find: https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
I'm running Databricks 8.3 runtime, Spark 3.1.1.

Related

How to debug a Spark job on Dataproc?

I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?
This tutorial assumes the following:
You know how to create GCP Dataproc clusters, either by API calls, cloud shell commands or Web UI
You know how to submit a Spark Job
You have permissions to launch jobs, create clusters and use Compute Engine instances
After some attempts, I've figured out how to debug a Dataproc Spark job running on a cluster from your local machine.
As you may know, you can submit a Spark job either by using the Web UI, sending a request to the Dataproc API, or using the gcloud dataproc jobs submit spark command. Whichever way you choose, you start by adding the following key-value pair to the properties field in the SparkJob: spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT, where REMOTE_PORT is the port on the worker where the driver will be listening.
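For example, here is a sketch of submitting with gcloud (the cluster, region, class, and jar names below are placeholders; because the JDWP string contains commas, gcloud's alternative-delimiter escaping, ^;^, is used for --properties):
# placeholder names; only the extraJavaOptions property is the essential part
gcloud dataproc jobs submit spark \
  --cluster=my-cluster --region=us-central1 \
  --class=com.example.MyJob --jars=gs://my-bucket/my-job.jar \
  --properties='^;^spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT'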
Chances are your cluster is on a private network and you need to create an SSH tunnel to REMOTE_PORT. If that's not the case, you're lucky and you just need to connect to the worker in your IDE using its public IP and the specified REMOTE_PORT.
Using IntelliJ, it would look like this (screenshot of the remote debug configuration omitted),
where worker-ip is the worker that is listening (I used port 9094 this time). After a few attempts, I realized it's always worker number 0, but you can connect to it and check whether a process is listening on that port using netstat -tulnp | grep REMOTE_PORT
If for whatever reason your cluster does not have a public IP, you need to set up an SSH tunnel from your local machine to the worker. After specifying your ZONE and PROJECT, you create a tunnel to REMOTE_PORT:
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE -- -4 -N -L LOCAL_PORT:CLUSTER_NAME-w-0:REMOTE_PORT
Then you set the debug configuration in your IDE pointing to host=localhost/127.0.0.1 and port=LOCAL_PORT.

Could not bind on a random free port error while trying to connect to spark master

I have a Spark master running on Amazon EC2.
I tried to connect to it from another EC2 instance using pyspark, as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp") \
    .master("spark_url_as_obtained_in_web_ui") \
    .getOrCreate()
The following were the errors:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-04-04 20:03:04 WARN Utils:66 - Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
............
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
I tried all the solutions as described here but to no avail:
Connecting to a remote Spark master - Java / Scala
All masters are unresponsive ! ? Spark master is not responding with datastax architecture
Spark Standalone Cluster - Slave not connecting to Master
Spark master-machine:7077 not reachable
spark submit "Service 'Driver' could not bind on port" error
https://community.hortonworks.com/questions/8257/how-can-i-resolve-it.html
What could be going wrong??
Set spark.driver.bindAddress to your local IP like 127.0.0.1.
pyspark -c spark.driver.bindAddress=127.0.0.1
While creating the Spark session, set the configurations below:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName(str(name))   # name is the app name defined elsewhere
    .master("local[*]")
    .config("spark.driver.memory", "5g")
    .config("spark.driver.host", "10.51.4.110")         # machine IP
    .config("spark.driver.bindAddress", "10.51.4.110")  # machine IP
    .getOrCreate())
Sometimes, in addition to the bind address, we also need to set the host address. In my case the system host address had changed to the system name, and Spark showed a timeout error; after setting the host and bind address to the same value, it worked fine.
*10.51.4.110 - Local Machine IP
I had a similar issue on Windows 10 recently and resolved it with the steps below:
Set a user environment variable SPARK_LOCAL_IP = 127.0.0.1
Restart the command line as an admin
On macOS, with a standalone Spark cluster, setting SPARK_MASTER_HOST to 'localhost' instead of '127.0.0.1' solved the problem for me.
export SPARK_MASTER_HOST='localhost'
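A minimal sketch of how that might be applied when restarting a standalone master (assuming a standard install with SPARK_HOME set; the paths are assumptions about your setup):
# assumes SPARK_HOME points at the Spark install
export SPARK_MASTER_HOST='localhost'
$SPARK_HOME/sbin/stop-master.sh
$SPARK_HOME/sbin/start-master.sh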

Connect Squirrel to Azure HBase cluster

My objective is to access my HBase cluster on Azure with Squirrel and a Phoenix driver running on my local computer.
My HBase cluster on Azure is operational. I can see it in the Ambari dashboard and I can access it using SSH. I can start Phoenix with the sqlline.py command pointing to one of the ZooKeeper nodes. The !tables command returns four lines.
My HBase cluster is included in an Azure VNet. From my local computer (running Windows 10) I can connect to this VNet. I can ping the IP address (10.254.x.x) of the ZooKeeper node successfully, but pinging the FQDN of the ZooKeeper node results in an error message:
"Ping request could not find host zk1-.......ax.internal.cloudapp.net.
Please check the name and try again."
When I start Squirrel on my local computer with the URL pointing to the FQDN of the zookeeper node I get an error message:
"Unexpected Error occurred attempting to open an SQL connection". The
stack trace points to a java.util.concurrent.RuntimeException: "Unable
to establish connection"
When I start Squirrel on my local computer with the URL pointing to the IP address of the zookeeper node I get a different error:
"Unexpected Error occurred attempting to open an SQL connection". The
stack trace points to a java.util.concurrent.TimeoutException.
I suspect this has something to do with the Domain Name resolution problem as described here [https://superuser.com/questions/966832/windows-10-dns-resolution-via-vpn-connection-not-working]. I applied the resolution as described by LikeARock47 on Feb 23. This did not improve the situation however.
Does this indeed have to do with the Domain Name resolution issue or is the problem somewhere else?
Is there a better solution to the Domain Name resolution issue?
A JDBC connection from Squirrel on my local Windows 10 computer has successfully been established to the HBase cluster by using the ZooKeeper IP address, the port, and "/hbase-unsecure":
jdbc:phoenix:10.254.x.x:2181:/hbase-unsecure
I can manage my HBase cluster with a local Squirrel now!
I'd still be interested to find out how I can get the ZooKeeper FQDN resolved locally...
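One workaround to consider, purely as a sketch and not something verified on this cluster, is mapping the ZooKeeper FQDN to its private IP in the local hosts file so the JDBC URL can use the FQDN:
# added to C:\Windows\System32\drivers\etc\hosts (or /etc/hosts); use the full FQDN shown in Ambari
10.254.x.x   zk1-.......ax.internal.cloudapp.net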

Spark standalone cluster doesn't accept connections

I'm trying to run the simplest Spark standalone cluster on an Azure VM. I'm running a single master, with a single worker running on the same machine. I can access the Web UI perfectly, and can see that the worker is registered with the master.
But I can't connect to this cluster using spark-shell on my laptop. When I look in the logs, I see
15/09/27 12:03:33 ERROR ErrorMonitor: dropping message [class akka.actor.ActorSelectionMessage]
for non-local recipient [Actor[akka.tcp://sparkMaster@40.113.XXX.YYY:7077/]]
arriving at [akka.tcp://sparkMaster@40.113.XXX.YYY:7077] inbound addresses
are [akka.tcp://sparkMaster@somehostname:7077]
akka.event.Logging$Error$NoCause$
Now I think the reason why this is happening is that on Azure, every virtual machine sits behind a type of firewall/load balancer. I'm trying to connect using the Public IP that Azure tells me (40.113.XXX.YYY), but Spark refuses to accept connections because this is not the IP of an interface.
Since this IP is not of the machine, I can't bind to an interface either.
How can I get Spark to accept these packets as well?
Thanks!
You can try setting SPARK_MASTER_IP in spark-env.sh to the IP address instead of the hostname.
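A rough sketch of what that could look like (the file path and the IP are placeholders; the master can only bind to an address that exists on one of its interfaces, so the Azure public IP will not work here):
# in $SPARK_HOME/conf/spark-env.sh (placeholder path and IP)
export SPARK_MASTER_IP=10.0.0.4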
I had the same problem and was able to resolve it by looking up the configured --ip parameter on the command line that runs the Spark master:
$ ps aux | grep spark
[bla bla...] org.apache.spark.deploy.master.Master --ip YOUR_CONFIGURED_IP [bla bla...]
Then I was able to connect to my cluster by using exactly the same string as YOUR_CONFIGURED_IP:
spark-shell --master spark://YOUR_CONFIGURED_IP:7077

running cassandra on a mesos cluster

I'm trying to deploy Cassandra on a small (test) Mesos cluster. I have one master node (say 10.10.10.1) and three worker nodes: 10.10.10.2-4.
On the official Apache Mesos site there is a link to a Cassandra framework developed for Mesos (it is here: https://github.com/mesosphere/cassandra-mesos).
I'm following the tutorial they provide there. In step 3 they say I should edit the conf/mesos.yaml file, specifically that I should set mesos.master.url so that it points to the master node (on which I also have the conf file).
The first thing I tried was just to replace localhost with the master node IP, so I had
mesos.master.url: 'zk://10.10.10.1:2181/mesos'
but when I then started the deployment script (by running bin/cassandra-mesos as they say in point 5 I should) I get the following error:
2015-02-24 09:18:24,262:12041(0x7fad617fa700):ZOO_ERROR#handle_socket_error_msg#1697: Socket [10.10.10.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
It keeps retrying and displays the same error until I terminate it.
I tried removing 'zk' from the URL or replacing it with 'mesos', changing (or removing altogether) the port, and removing the 'mesos' word from the URL, but I keep getting the same error.
I also tried looking at how other frameworks do it (specifically spark, which I am hoping to deploy next) but didn't find anything helpful. Any ideas how to run it? Thanks!
The URL provided to mesos.master.url is passed directly to the underlying Mesos Native Java Library. The format listed in your example looks correct.
Next steps in debugging the connection issue would be to verify the IP address the ZooKeeper server has bound to. You can find out by running sudo netstat -ntplv | grep 2181 on the server that is running ZooKeeper.
I would expect to see something like the following:
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN 3957/java
Another possibility could be that ZooKeeper is binding specifically to localhost:
tcp 0 0 127.0.0.1:2181 0.0.0.0:* LISTEN 3957/java
If ZooKeeper has bound to localhost, a client will only be able to connect to it with the URL zk://127.0.0.1:2181/mesos
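If that turns out to be the case, one possible fix, offered as an assumption about your zoo.cfg rather than a confirmed diagnosis, is to make ZooKeeper listen on all interfaces via clientPortAddress:
# in zoo.cfg; 0.0.0.0 listens on all interfaces (removing clientPortAddress entirely has the same effect)
clientPort=2181
clientPortAddress=0.0.0.0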
A note about the future of the Cassandra Mesos Framework.
I am one of the developers working on rewriting the cassandra-mesos project to be more robust, stable and easier to run. The code in current master (6aa82acfac) is end-of-life and will be replaced within the next couple of weeks with the code that is in the rewrite branch.
If you would like to try out the latest build of the rewrite branch, a marathon.json for running the framework can be found here. After downloading the marathon.json, update the values for MESOS_ZK and CASSANDRA_ZK (and any resource values you want to change), then POST the JSON to Marathon at /v2/apps.
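A sketch of that POST with curl (the Marathon host and port are assumptions; 8080 is the common default):
# marathon-host:8080 is an assumption; adjust to your Marathon endpoint
curl -X POST -H "Content-Type: application/json" \
     --data @marathon.json \
     http://marathon-host:8080/v2/apps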
If you have one master and no zk, how about setting the mesos.master.url to 10.10.10.1:5050 (where 5050 is the default mesos-master port)?
If Ben is right and the Cassandra framework otherwise requires ZK for its own persistence/HA, then try disabling that feature if possible. Otherwise you may have to rip the ZK code out yourself and recompile, if you really want a ZK-free setup (consequently without any HA features).
