Issues using a Spark cluster from outside the LAN - apache-spark

I am trying to use a Spark cluster from outside the cluster itself.
The problem is that Spark binds to my local machine's private IP. It can connect to the master, but the workers then fail to connect back to my machine (the driver) because they only see my private IP.
I can see this in the workers' logs:
"--driver-url" "spark://CoarseGrainedScheduler@PRIVATE_IP_MY_LAPTOP:34355"
Any help?

Try setting spark.driver.host (see the Spark configuration documentation for more info) to your public IP; the workers will then use that address instead of the (automatically resolved) private IP.
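A minimal PySpark sketch of that suggestion (the master URL and the public IP below are placeholders, not values from the question):

from pyspark.sql import SparkSession

# Advertise an address the workers can actually reach, instead of the
# automatically resolved private IP of the laptop.
spark = (
    SparkSession.builder
    .master("spark://MASTER_HOST:7077")           # placeholder master URL
    .config("spark.driver.host", "203.0.113.10")  # placeholder public IP
    .getOrCreate()
)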

Try setting spark.driver.bindAddress to 0.0.0.0 so that the driver program listens on all interfaces.
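Note that spark.driver.bindAddress controls the local socket the driver listens on, while spark.driver.host is the address advertised to the cluster, so the two are often set together when the driver sits behind NAT. A hedged sketch (placeholder values again):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://MASTER_HOST:7077")             # placeholder master URL
    .config("spark.driver.bindAddress", "0.0.0.0")  # listen on all local interfaces
    .config("spark.driver.host", "203.0.113.10")    # placeholder: address workers dial back to
    .getOrCreate()
)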

Related

Cluster configuration problems in cassandra.yaml for a multi-node cluster where only one public IP is known

I would like to know how to set the following cassandra.yaml configuration parameters on individual nodes in a particular scenario:
listen_address
broadcast_address
rpc_address
broadcast_rpc_address
Scenario: a 6-node cluster where each node has its own private IP, but only one node has a public IP.
Requirement: a remote Python application must be able to access the cluster.
What I have tried on each node:
listen_address: the node's private IP
broadcast_address: left blank
rpc_address: left blank, except on the node with the public IP, where it is 0.0.0.0
broadcast_rpc_address: left blank, except on the node with the public IP, where it is that public IP
From my application I issued Cluster(['public ip'], port=9042), but I received the following warning, which eventually led to my application shutting down:
WARNING:cassandra.cluster:Failed to create connection pool for new host 192.xxx.xx.3:
I recommend giving each machine two network interfaces: one for listen_address and one for rpc_address. With this approach you do not need broadcast_rpc_address. However, if you want to use public IPs, you have to give every node a public address; you cannot have just one node with a public address.
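A client-side workaround is also worth knowing about: the warning in the question appears because the driver discovers the other nodes' private 192.x addresses from the cluster metadata and tries to open connection pools to them. You can instead pin the Python driver to the single reachable node. A minimal sketch, assuming cassandra-driver 3.x (PUBLIC_IP is a placeholder):

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

# Only open connections to the one node we can actually reach.
profile = ExecutionProfile(
    load_balancing_policy=WhiteListRoundRobinPolicy(['PUBLIC_IP'])
)
cluster = Cluster(
    ['PUBLIC_IP'],  # placeholder contact point
    port=9042,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

Note that this routes all requests through that single node, trading load balancing for reachability.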

Binding Multiple IP addresses to Master in Spark

I am trying to set up Apache Spark with the following systems:
1 Master Node (having a public IP and a local IP)
Slave Node-3 (having a public IP and a local IP)
Slave Node-2 (having a local IP only)
The configuration is such that the Master Node and Slave Node-3 communicate via public IPs, whereas Slave Node-2 communicates with the other two nodes via local IPs.
The problem I am facing is that, since the Master Node binds to a public IP, Slave Node-2 is unable to connect to the master via its local IP, giving a connection refused error; Slave Node-3, however, is able to communicate with the Master Node without any difficulty.
Is there a way to allow communication between the Master Node and Slave Node-2, or to bind multiple addresses to the Master Node? Such a configuration is possible in Hadoop, for example, where the namenode can bind to multiple hosts.
Thank you
If you bind the master to 0.0.0.0, i.e. all local addresses, then it should be able to communicate with Slave Node-2 via the private network and with Slave Node-3 via the public network.
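For a standalone master this is typically done in conf/spark-env.sh, with a single line such as SPARK_MASTER_HOST=0.0.0.0 (that is the variable name in recent Spark releases; older releases used SPARK_MASTER_IP, as the answers to the next question show).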

Spark standalone cluster doesn't accept connections

I'm trying to run the simplest Spark standalone cluster on an Azure VM. I'm running a single master, with a single worker running on the same machine. I can access the Web UI perfectly, and can see that the worker is registered with the master.
But I can't connect to this cluster using spark-shell on my laptop. When I looked in the logs, I saw
15/09/27 12:03:33 ERROR ErrorMonitor: dropping message [class akka.actor.ActorSelectionMessage]
for non-local recipient [Actor[akka.tcp://sparkMaster@40.113.XXX.YYY:7077/]]
arriving at [akka.tcp://sparkMaster@40.113.XXX.YYY:7077] inbound addresses
are [akka.tcp://sparkMaster@somehostname:7077]
akka.event.Logging$Error$NoCause$
I think the reason this is happening is that on Azure, every virtual machine sits behind a kind of firewall/load balancer. I'm trying to connect using the public IP that Azure reports (40.113.XXX.YYY), but Spark refuses to accept connections because this is not the IP of a local interface.
Since this IP is not assigned to the machine itself, I can't bind to it either.
How can I get Spark to accept these packets as well?
Thanks!
You can try setting SPARK_MASTER_IP in spark-env.sh to the IP address instead of the hostname.
I ran into the same problem and was able to resolve it by finding the --ip parameter in the command line that runs Spark:
$ ps aux | grep spark
[bla bla...] org.apache.spark.deploy.master.Master --ip YOUR_CONFIGURED_IP [bla bla...]
Then I was able to connect to my cluster by using exactly that string in place of YOUR_CONFIGURED_IP:
spark-shell --master spark://YOUR_CONFIGURED_IP:7077

What is the difference between broadcast_address and broadcast_rpc_address in cassandra.yaml?

GOAL: I am trying to understand the best way to configure my Cassandra cluster so that several different drivers across several different networking scenarios can communicate with it properly.
PROBLEM/QUESTION: After reading the documentation, it is not entirely clear to me what the difference is between these two settings, broadcast_address and broadcast_rpc_address, as it pertains to the way a driver connects to and interacts with the cluster. Which one, or which combination of these settings, should I point at my node's accessible network endpoint (a DNS record resolvable by the clients/drivers)?
Here is the documentation for broadcast_address from datastax:
(Default: listen_address) The IP address a node tells other nodes in the cluster to contact it by. It allows public and private addresses to be different. For example, use the broadcast_address parameter in topologies where not all nodes have access to other nodes by their private IP addresses.
If your Cassandra cluster is deployed across multiple Amazon EC2 regions and you use the EC2MultiRegionSnitch, set the broadcast_address to the public IP address of the node and the listen_address to the private IP.
Here is the documentation for broadcast_rpc_address from datastax:
(Default: unset) RPC address to broadcast to drivers and other Cassandra nodes. This cannot be set to 0.0.0.0. If blank, it is set to the value of rpc_address or rpc_interface. If rpc_address or rpc_interface is set to 0.0.0.0, this property must be set.
EDIT: This question pertains to Cassandra version 2.1, and may not be relevant in the future.
One of the users of #cassandra on freenode was kind enough to provide an answer to this question:
The rpc family of settings pertains to drivers that use the Thrift protocol to communicate with Cassandra. For drivers that use the native transport, the broadcast_address is reported and used.
My test case confirms this.
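One way to check what a node actually broadcasts is to inspect the system tables from a client. A minimal sketch with the Python driver (NODE_IP is a placeholder; system.peers has these columns in Cassandra 2.1, the version this question pertains to):

from cassandra.cluster import Cluster

cluster = Cluster(['NODE_IP'])  # placeholder contact point
session = cluster.connect()

# 'peer' reflects each remote node's broadcast_address; 'rpc_address'
# is what that node advertises for client connections.
for row in session.execute('SELECT peer, rpc_address FROM system.peers'):
    print(row.peer, row.rpc_address)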

Apache Cassandra remote access

I have installed Apache Cassandra on a remote Ubuntu server. How do I allow remote access to the Apache Cassandra database, and how do I make a connection?
Remote access to Cassandra is via its Thrift port (although note that the JMX port can be used to perform some limited operations).
The Thrift port is defined in cassandra.yaml by the rpc_port parameter, which defaults to 9160. Your Cassandra node should be bound to the IP address of your server's network card. It shouldn't be 127.0.0.1 or localhost, which is the loopback interface's IP; binding to that will prevent direct remote access. You configure the bound address with the rpc_address parameter in cassandra.yaml. Setting this to 0.0.0.0 means "listen on all network interfaces", which may or may not be suitable for you.
To make a connection you can use:
The cassandra-cli in the Cassandra distribution's bin directory, which provides simple get/set/list operations and depends on Java
The cqlsh shell, which provides CQL access to Cassandra and depends on Python
A higher-level interface such as Apollo
For anyone finding this question now, the top answer is out of date.
Apache Cassandra's thrift interface is deprecated and will be removed in Cassandra 4.0. The default client port is now 9042.
As noted by Tyler Hobbs, you will need to ensure that the rpc_address parameter is not set to 127.0.0.1 or localhost (it is localhost by default). If you set it to 0.0.0.0 to listen on all interfaces, you will also need to set broadcast_rpc_address to either the node's public or private IP address (depending on how you plan to connect to Cassandra).
Cassandra-cli is also deprecated and Apollo is no longer active. Use cqlsh in lieu of cassandra-cli and the Java driver in lieu of Apollo.
I do not recommend making the JMX port accessible remotely unless you secure it properly by enabling SSL and strong authentication.
Hope this is helpful.
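To illustrate the driver side of this, a minimal sketch with the current Python driver over the native protocol (the host and credentials are placeholders, and password authentication is assumed to be enabled on the cluster):

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Connect over the native protocol port (9042) with password authentication.
auth = PlainTextAuthProvider(username='my_user', password='my_pass')  # placeholders
cluster = Cluster(['SERVER_IP'], port=9042, auth_provider=auth)       # placeholder host
session = cluster.connect()
print(session.execute('SELECT release_version FROM system.local').one())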
Cassandra 3.11.3
I did the following to get mine working. Changes in cassandra.yaml:
start_rpc: true
rpc_address: 0.0.0.0
broadcast_rpc_address: ***.***.***.***
broadcast_rpc_address is the address of the machine where Cassandra is installed.
seed_provider:
  - class_name: ...
    parameters:
      - seeds: "127.0.0.1, ***.***.***.***"
In seeds I appended the IP address of the machine where Cassandra was running.
I accessed it from Windows using TablePlus. In TablePlus, I entered the IP address of the Cassandra machine, set the port to 9042, and used the same username and password that I use for the SSH connection.
For anyone using Azure, the issue may be that you need to create a public IP address, since the virtual IP points to the cloud service itself and not to the virtual machine. You can find more info in this post.
