Is there a way to find out what the driver IP is on a Databricks cluster? The Ganglia UI shows all the nodes on the main page, and there doesn't seem to be a way to filter only for the driver.
You can go to the Spark cluster UI - Master tab within the cluster. The URL listed there contains the IP of the driver, and the workers' IPs are listed at the bottom.
Depending on your use case, it may be helpful to know that in an init script you can get the DB_DRIVER_IP from an environment variable.
https://docs.databricks.com/clusters/init-scripts.html#environment-variables
There are other environment variables set at runtime that can be accessed in a Scala notebook:
System.getenv.get("MASTER") // spark://10.255.128.6:7077
System.getenv.get("SPARK_LOCAL_IP") // 10.255.128.6
Related
I am completely new to Spark and am trying to run a tutorial example, which counts the number of lines containing 'a' and 'b' in a text file in the local file system.
I am running it with SparkContext with master = "local", i.e. Spark is running in the same JVM. Now I would like to try it in "cluster mode".
So I would like to run a Spark cluster consisting of a cluster manager and two worker nodes locally on my Mac laptop. What is the easiest way to do that?
Quoting the official documentation about Spark Standalone Mode:
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
In other words, you should start the standalone Master first (using ./sbin/start-master.sh) followed by starting one or more standalone Workers (using ./sbin/start-slave.sh).
Quoting the docs again:
Once you have started a worker, look at the master's web UI (http://localhost:8080 by default)
You're done. Congrats!
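From there, you can submit the tutorial job against the standalone master instead of "local". A minimal PySpark sketch, assuming the master runs on localhost with the default port 7077 and that a README.md exists in the working directory (both are assumptions):
import pyspark

conf = pyspark.SparkConf()
conf.setMaster("spark://localhost:7077")  # the <master-spark-URL> shown in the master's web UI
conf.setAppName("count-a-and-b")
sc = pyspark.SparkContext(conf=conf)

lines = sc.textFile("README.md")  # any local text file
print(lines.filter(lambda line: "a" in line).count())
print(lines.filter(lambda line: "b" in line).count())
sc.stop()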
If you are looking to learn various ways to use Spark, I would suggest downloading the Cloudera QuickStart VM, which gives you a simple cluster setup.
All you need to do is download the QuickStart VM and play around with the settings accordingly.
The QuickStart VM can be found here.
Reference: Cloudera VM
Is there a way to find a driver-to-application mapping in cluster mode?
I understand that, on submitting an application, the CreateSubmissionResponse returns the driver ID, which can be used to monitor or kill the driver program. I am trying to see if there is an alternate way of doing this without storing the driver ID.
I saw the driver UI at http://<driver>:4040, which gives the application information under the Environment section, but the Spark documentation mentions:
"If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc)."
This makes it difficult to map which driver runs on which port.
So, is there a way to get all driver IDs and their applications?
Environment: Spark standalone with Zookeeper as Cluster manager.
Any help is appreciated!
Thanks
If you have the Spark History Server up and running, the simplest way is to use its REST API (http://spark.apache.org/docs/latest/monitoring.html#rest-api). Check the /applications endpoint.
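For example, a small Python sketch that lists application IDs and names through that endpoint, assuming the History Server is reachable on its default port 18080 on localhost:
import requests

# /applications returns one JSON object per application known to the History Server
resp = requests.get("http://localhost:18080/api/v1/applications")
resp.raise_for_status()
for app in resp.json():
    print(app["id"], app["name"])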
I have an analytics node running, with the Spark SQL Thrift Server running on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How do I configure the DSE node to be able to run both?
The Spark SQL Thrift Server is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time (a configuration sketch follows the list):
1. Allocate only part of your resources to each application. This is done by setting spark.cores.max to a smaller value than the max resources in your cluster. See Spark Docs.
2. Dynamic allocation, which allows applications to change the amount of resources they use depending on how much work they are trying to do. See Spark Docs.
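A minimal PySpark configuration sketch covering both options; the values shown are placeholders, not recommendations:
import pyspark

conf = pyspark.SparkConf()

# Option 1: cap this application's share of the cluster cores
conf.set("spark.cores.max", "4")

# Option 2: let the application scale its executors up and down with its workload
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.shuffle.service.enabled", "true")  # the external shuffle service is required for dynamic allocation

sc = pyspark.SparkContext(conf=conf)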
I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.
I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on this image: jupyter/all-spark-notebook.
I also have a Spark cluster created with Google Cloud Dataproc.
The Spark master and the notebook are running in different VMs, but in the same region and zone.
My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).
What I found out there is about connecting a local browser over an SSH tunnel.
Has somebody already done this kind of setup?
Thank you in advance.
Dataproc runs Spark on YARN, so you need to set master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent configuration, you could bake these settings into a local file 'core-site.xml' as described here, place that file in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
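A sketch of that approach from Python; the path is an assumption, and the variable has to be set before the SparkContext (and its JVM) is created:
import os
os.environ["HADOOP_CONF_DIR"] = "/path/to/conf"  # directory containing your core-site.xml

import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
sc = pyspark.SparkContext(conf=conf)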
It's also worth noting that while being in the same zone is important for performance, it is being in the same network and allowing TCP traffic between internal IP addresses in that network that allows your VMs to communicate. If you are using the default network, then the default-allow-internal firewall rule should be sufficient.
Hope that helps.
I'm using the Cassandra CQL/JDBC driver I got from google code but it doesn't seem to let me provide a cluster name - is there a way?
I'm using cluster names to ensure I don't run commands against a live system; it has a different cluster name from my dev systems.
Edit: Just to clarify, I have two totally separate Cassandra clusters, one live and one for test. They have different cluster names to ensure that I don't accidentally run test code meant for the test cluster on the live cluster. Therefore any client I need to use must let me set a cluster name. Hector does this.
There is no built-in protection for checking cluster names in Cassandra clients. The cluster name is there to ensure that nodes from different clusters don't try to join together, not to ensure that clients connect to the right cluster. It would be possible to add this check to a client, though (since the cluster name is exposed to the client), but I'm not aware of any clients doing so.
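As an illustration, here is a small sketch of such a check using the DataStax Python driver (not the CQL/JDBC driver from the question); the contact point and expected name are assumptions:
from cassandra.cluster import Cluster

EXPECTED_NAME = "DevCluster"  # the cluster you intend to talk to

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# system.local exposes the cluster name to any connected client
row = session.execute("SELECT cluster_name FROM system.local").one()
if row.cluster_name != EXPECTED_NAME:
    raise RuntimeError("Connected to '%s', expected '%s'" % (row.cluster_name, EXPECTED_NAME))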
I'd strongly recommend firewalling off your different environments to avoid this kind of mistake. If that isn't possible, you should choose different ports to avoid confusion. Change this with the 'rpc_port' setting in cassandra.yaml.
You'd have to mirror the data on two different clusters. You can't access the same cluster with different names.
To rename your cluster (from the default 'Test Cluster'), edit the Cassandra configuration file found at location/of/cassandra/conf/cassandra.yaml. It's the top line; if you need more details, look at the DataStax configuration documentation and explanation.