I have a Spark driver which is connected to my Mesos master. The driver is listening on a particular port for resource offers from my Mesos master:
Received SUBSCRIBE call for framework 'Simple kafka application' at scheduler-901ab680-7098-4cb0-ab27-4b293285a2b6#xxx.xx.xx.xxx:57033
I would like to configure this port, as I will need to whitelist it on my machines.
I am not able to figure out which conf this would be. I have configured spark.driver.port and the broadcast port, but I am pretty sure these are not used in this scenario.
To use a custom port for communication with Mesos you need to set the LIBPROCESS_PORT environment variable to the port number that should be used. By default it is unset or set to 0, which causes a random port to be used.
export LIBPROCESS_PORT=<PORT>
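If you launch the driver with pyspark in Mesos client mode, the variable can also be set from the driver script itself before the session (and thus the driver JVM) is created. A minimal sketch; the Mesos master URL and port number are placeholders:

import os
from pyspark.sql import SparkSession

# libmesos reads LIBPROCESS_PORT from the driver's environment, so it must be
# set before the driver JVM (and libmesos) is launched.
os.environ["LIBPROCESS_PORT"] = "37000"  # the port you whitelist

spark = (
    SparkSession.builder
    .appName("Simple kafka application")
    .master("mesos://mesos-master.example.com:5050")  # placeholder Mesos master URL
    .getOrCreate()
)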
I'm trying to set up a local Spark cluster. When I add the IP addresses of the workers to spark/conf/workers it tries to ssh into them on the default port 22 when I run sbin/start-all.sh. I have my ssh ports set differently for security reasons. Is there an option I can use to configure spark to use alternate ports for ssh from master to workers, etc?
You should add the following option to /path/to/spark/conf/spark-env.sh:
# Change 2222 to whatever port you're using
SPARK_SSH_OPTS="-p 2222"
When I view the master node's web ui, it shows all my current workers attached to the cluster.
https://spark.apache.org/docs/3.0.0-preview/web-ui.html
The issue that I am having though is that the IP address it uses for the worker nodes in the web ui is incorrect. Is there a way to change the worker's web ui host/ip that is used in the master's web ui?
Reading through the documentation, there appears to be "SPARK_WORKER_WEBUI_PORT" which sets the port for the worker but there doesn't seem to be a "SPARK_WORKER_WEBUI_HOST".
http://spark.apache.org/docs/latest/spark-standalone.html
To provide more context, I currently have a Spark cluster that is deployed in standalone mode. The Spark cluster (master and slaves) is entirely behind a router (NAT). The workers bind to the master using their internal IP addresses. I set up port forwarding to route external traffic to each of the master and slaves. The issue is that since my workers bind to the master using their internal IP addresses, the master uses those internal IP addresses in its web UI. This makes the worker nodes' web UIs inaccessible to everyone outside of my NAT. If there is a way to specifically set the IP address to use for each of my workers' web UI, then this would resolve the problem. Thanks!
After more research, I determined that the environment variable I was looking for was: SPARK_PUBLIC_DNS
http://spark.apache.org/docs/latest/spark-standalone.html
This allowed me to set a different external host name for my workers.
I have 10 Cassandra nodes running on Kubernetes on my server and 1 contact point that exposes the service on port 10023.
However, when the DataStax driver tries to establish a connection with the other nodes of the cluster, it uses the exposed port instead of the default one and I get the following error:
com.datastax.driver.core.ConnectionException: [/10.210.1.53:10023] Pool was closed during initialization
Is there a way to expose one single contact point and have it communicate with the other nodes on the standard port (9042)?
I checked the DataStax documentation for anything related to this but didn't find much.
This is how I connect to the cluster:
Cluster.Builder builder = Cluster.builder()
    .addContactPoints(address)
    .withPort(Integer.valueOf(10023))
    .withCredentials(user, password)
    .withMaxSchemaAgreementWaitSeconds(600)
    .withSocketOptions(
        new SocketOptions()
            .setConnectTimeoutMillis(Integer.valueOf(timeout))
            .setReadTimeoutMillis(Integer.valueOf(timeout))
    );
Cluster cluster = builder.withoutJMXReporting().build();
Session session = cluster.connect();
After the driver contacts the first node, it fetches information about the cluster and uses it; this information includes the ports on which Cassandra listens.
To implement what you want, Cassandra needs to listen on the corresponding port - this is configured via the native_transport_port parameter of cassandra.yaml.
Also, by default the Cassandra driver will try to connect to all nodes in the cluster because it uses the DCAware/TokenAware load balancing policy. If you want to use only one node, then you need to use WhiteListPolicy instead of the default policy (see the sketch below), but this is not optimal from a performance point of view.
I would suggest re-thinking how you expose Cassandra to clients.
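For illustration only, here is a sketch of the whitelist idea with the DataStax Python driver (cassandra-driver); the question uses the Java driver, where the analogous class is WhiteListPolicy. The address, port, and credentials are placeholders.

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import WhiteListRoundRobinPolicy

contact_point = "10.210.1.53"  # the single exposed node

cluster = Cluster(
    contact_points=[contact_point],
    port=10023,
    auth_provider=PlainTextAuthProvider("user", "password"),
    # Restrict the driver to the whitelisted host instead of the whole ring.
    load_balancing_policy=WhiteListRoundRobinPolicy([contact_point]),
)
session = cluster.connect()

As noted above, this funnels all requests through one node, so it trades performance for simpler network exposure.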
I have a spark master running on amazon ec2.
I tried to connect to it from another EC2 instance using pyspark, as follows:
spark = SparkSession.builder.appName("MyApp") \
.master("spark_url_as_obtained_in_web_ui") \
.getOrCreate()
The following were the errors:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-04-04 20:03:04 WARN Utils:66 - Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
............
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
I tried all the solutions as described here but to no avail:
Connecting to a remote Spark master - Java / Scala
All masters are unresponsive ! ? Spark master is not responding with datastax architecture
Spark Standalone Cluster - Slave not connecting to Master
Spark master-machine:7077 not reachable
spark submit "Service 'Driver' could not bind on port" error
https://community.hortonworks.com/questions/8257/how-can-i-resolve-it.html
What could be going wrong??
Set spark.driver.bindAddress to your local IP like 127.0.0.1.
pyspark -c spark.driver.bindAddress=127.0.0.1
While creating the Spark session, set the below configurations:
spark = SparkSession.builder.appName(str(name)) \
    .master("local[*]") \
    .config("spark.driver.memory", "5g") \
    .config("spark.driver.host", "10.51.4.110") \
    .config("spark.driver.bindAddress", "10.51.4.110") \
    .getOrCreate()
Sometimes, in addition to the bind address, we need to set the host address as well. In my case the system host address had changed to the system name and Spark showed a timeout error; after setting the host and bind address to the same value, it works fine.
*10.51.4.110 - Local Machine IP
I had a similar issue on Windows 10 recently.
Resolved it by following the steps below:
Set a user environment variable SPARK_LOCAL_IP = 127.0.0.1
Restart the command line as an admin
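The same variable can also be set from the pyspark script itself before the session is created, which avoids touching system settings. A minimal sketch, assuming a local master; the app name is arbitrary:

import os
from pyspark.sql import SparkSession

# Equivalent to defining SPARK_LOCAL_IP as a user environment variable;
# it must be set before the driver JVM is launched.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"

spark = (
    SparkSession.builder
    .appName("local-ip-example")
    .master("local[*]")
    .getOrCreate()
)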
On macOS, with a standalone Spark cluster, setting SPARK_MASTER_HOST to 'localhost' instead of '127.0.0.1' solved the problem for me.
export SPARK_MASTER_HOST='localhost'
What is the difference between all of these, and when is each used?
spark.local.ip
spark.driver.host
spark.driver.bindAddress
spark.driver.hostname
How do I fix a machine as the driver in a Spark standalone cluster?
Short Version
the ApplicationMaster connects to the Spark driver via spark.driver.host
the Spark driver binds to spark.driver.bindAddress on the client machine
By examples
1. Example of port binding
.config('spark.driver.port','50243')
then netstat -ano on Windows shows:
TCP 172.18.1.194:50243 0.0.0.0:0 LISTENING 15332
TCP 172.18.1.194:50243 172.18.7.122:54451 ESTABLISHED 15332
TCP 172.18.1.194:50243 172.18.7.124:37412 ESTABLISHED 15332
TCP 172.18.1.194:50243 172.18.7.142:41887 ESTABLISHED 15332
TCP [::]:4040 [::]:0 LISTENING 15332
The nodes in the cluster, 172.18.7.1xx, are in the same network as my development machine, 172.18.1.194, as my netmask is 255.255.248.0.
2. Example of specifying the IP the ApplicationMaster uses to reach the driver
.config('spark.driver.host','192.168.132.1')
then netstat -ano shows:
TCP 192.168.132.1:58555 0.0.0.0:0 LISTENING 9480
TCP 192.168.132.1:58641 0.0.0.0:0 LISTENING 9480
TCP [::]:4040 [::]:0 LISTENING 9480
However, the ApplicationMaster cannot connect and reports the error
Caused by: java.net.NoRouteToHostException: No route to host
because this IP is a VM bridge interface on my development machine.
3. Example of binding to a different IP
.config('spark.driver.host','172.18.1.194')
.config('spark.driver.bindAddress','192.168.132.1')
then netstat -ano shows:
TCP 172.18.1.194:63937 172.18.7.101:8032 ESTABLISHED 17412
TCP 172.18.1.194:63940 172.18.7.102:9000 ESTABLISHED 17412
TCP 172.18.1.194:63952 172.18.7.121:50010 ESTABLISHED 17412
TCP 192.168.132.1:63923 0.0.0.0:0 LISTENING 17412
TCP [::]:4040 [::]:0 LISTENING 17412
Detailed Version
Before explaining in detail, note that there are only these three related conf variables:
spark.driver.host
spark.driver.port
spark.driver.bindAddress
There are NO variables like spark.driver.hostname or spark.local.ip, but there IS an environment variable called SPARK_LOCAL_IP.
And before explaining the variables, we first have to understand the application submission process.
Main Roles of computers:
development machine
master node (YARN / Spark Master)
worker node
There is an ApplicationMaster for each application, which takes care of resource requests from the cluster and status monitoring of jobs (stages).
The ApplicationMaster is in the cluster, always.
Place of spark Driver
development machine: client mode
within the cluster: cluster mode, same place as the ApplicationMaster
Let's say we are talking about client mode
The Spark application can be submitted from a development machine, which acts as a client machine of the application as well as a client machine of the cluster.
The Spark application can alternatively be submitted from a node within the cluster (a master node, a worker node, or just a specific machine with no resource manager role).
The client machine might not be placed in the same subnet as the cluster, and this is one case these variables try to deal with. Think about your internet connection: it is often not possible for your laptop to be accessed from anywhere around the globe the way google.com can.
At the beginning of the application submission process, spark-submit on the client side uploads the necessary files to the Spark master or YARN and negotiates resource requests. In this step the client connects to the cluster, and the cluster address is the destination address that the client tries to connect to.
Then the ApplicationMaster starts on the allocated resource.
The resources allocated for the ApplicationMaster are by default random and cannot be controlled by these variables; they are controlled by the scheduler of the cluster, if you're curious about this.
Then the ApplicationMaster tries to connect BACK to the Spark driver. This is where these conf variables take effect.
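Putting the three variables together, here is a minimal pyspark sketch of the client-mode setup described above; the addresses, port, and YARN master are placeholders for whatever your cluster can actually route to:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-address-example")
    .master("yarn")                                  # or spark://..., mesos://...
    .config("spark.driver.host", "172.18.1.194")     # address the cluster connects back to
    .config("spark.driver.bindAddress", "0.0.0.0")   # local interface the driver listens on
    .config("spark.driver.port", "50243")            # fixed port, e.g. for firewall rules
    .getOrCreate()
)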