Spark Clusters: worker info doesn't show on web UI - apache-spark

I have installed Spark in standalone mode on a set of machines, and I tried to launch the cluster through the cluster launch scripts. I added the slaves' IP addresses to the conf/slaves file. The master connects to all slaves through password-less SSH.
After running the ./bin/start-slaves.sh script, I get the following message:
starting org.apache.spark.deploy.worker.Worker, logging to /root/spark-0.8.0-incubating/bin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-jbosstest2.out
But the master's web UI (localhost:8080) is not showing any information about the workers. However, when I add a localhost entry to my conf/slaves file, the worker info for localhost is shown.
There are no error messages; the terminal output says the worker has started, but the web UI shows no workers.

I had the same problem. I noticed that I could not telnet master:port from the slaves. In my /etc/hosts file (on the master) I had a 127.0.0.1 master entry (before my 192.168.0.x master entry). Once I removed the 127.0.0.1 entry, telnet worked, and when I ran start-slaves.sh (from the master) my slaves connected.
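A minimal way to verify both points, assuming the master's hostname is master and the standalone master listens on the default port 7077 (the 192.168.0.x address is an example):
# Run from a slave; this should connect rather than be refused:
telnet master 7077
# On the master, /etc/hosts should map the hostname only to the LAN address:
#   192.168.0.10  master    <- keep
#   127.0.0.1     master    <- remove this line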

When you run the cluster, run the jps command on the worker nodes to check whether the Worker process is actually up, and match its PID against the worker logs.
Alternatively, set the following, restart the cluster, and check whether the configured ports are up:
export SPARK_MASTER_WEBUI_PORT=5050
export SPARK_WORKER_WEBUI_PORT=4040
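A quick sanity check covering both suggestions; the port numbers match the exports above:
jps                                # should list a Worker process on each slave
ss -tlnp | grep -E ':5050|:4040'   # confirm the configured web UI ports are listening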

Check your /etc/hosts and look at the bindings for the master.
If your master hostname is bound to localhost as well as to an IP address (e.g. 192.168.x.x), remove the localhost entry. If you leave it intact, the master will be mapped to localhost, which won't allow slaves to connect to the master's IP address.

You can use ./start-master.sh --host 192.168.x.x instead of changing the /etc/hosts file.

I met the same issue and finally solved it by adding the following line to $SPARK_HOME/conf/spark-env.sh:
SPARK_MASTER_HOST=your_master_ip_address
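A minimal sketch of applying this, assuming the master's LAN address is 192.168.1.10 (substitute your own):
echo 'SPARK_MASTER_HOST=192.168.1.10' >> $SPARK_HOME/conf/spark-env.sh
$SPARK_HOME/sbin/stop-master.sh
$SPARK_HOME/sbin/start-master.sh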

Related

Running a Taurus command from the master node on Azure containers, which is unable to reach the slave node due to a java.rmi.MarshalException

Error at master node trying to connect to remote JMeter slave node in the same network
You need to ensure that at least port 1099 is open; check out the How to open ports to a virtual machine with the Azure portal article for more details.
Apart from port 1099 you need to open:
The port you specify as server.rmi.localport on the slaves
The port you specify as client.rmi.localport on the master (see the sketch below)
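A hedged sketch of the corresponding jmeter.properties entries; the port numbers are illustrative, not values from the original post:
# on each slave:
server.rmi.localport=50000
# on the master:
client.rmi.localport=60000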
More information:
Remote hosts and RMI configuration
JMeter Distributed Testing with Docker
JMeter Remote Testing: Using a different port

Disable Spark master's check for hostname equality

I have a Spark master running in a Docker container, which in turn is executed on a remote server. Next to the Spark master there are containers running Spark slaves on the same Docker host.
Server <---> Docker Host <---> Docker Container
In order to let the slaves find the master, I set the master's hostname in Docker to SPARKMASTER, which the slaves use to connect to the master. So far, so good.
I use the SPARK_MASTER_IP environment variable to let the master bind to that name.
I also exposed the Spark port 7077 to the Docker host and forwarded this port on the physical server host. The port is open and available.
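A hedged sketch of the container setup described so far; the image name and start script path are assumptions, while the SPARKMASTER hostname, the SPARK_MASTER_IP variable, and port 7077 come from the question:
docker run -d --name spark-master -h SPARKMASTER \
  -e SPARK_MASTER_IP=SPARKMASTER -p 7077:7077 \
  your-spark-image /path/to/sbin/start-master.sh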
Now on my machine I can connect to the Server using its IP, say 192.168.1.100. When my Spark program connects to the server on port 7077 I get a connection, which is disassociated by the master:
15/10/09 17:13:47 INFO AppClient$ClientEndpoint: Connecting to master spark://192.168.1.100:7077...
15/10/09 17:13:47 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@192.168.1.100:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
I already learned that the reason for this disconnection is that the host IP 192.168.1.100 doesn't match the hostname SPARKMASTER.
I could add a host to my /etc/hosts file which would probably work. But I don't want to do that. Is there a way I can completely disable this check for hostname equality?
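For illustration only, the hosts-file workaround mentioned above would look roughly like this on the client machine (the IP and hostname are the ones from the question):
echo '192.168.1.100  SPARKMASTER' | sudo tee -a /etc/hosts
# then connect using the name the master registered under:
# spark://SPARKMASTER:7077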

Azure Hortonworks CloudBreak hosts file not correct

I have created a cluster using Cloudbreak and that all works; I can log into the servers just fine. The problem I am having is that the network setup on the host OS and in the Docker containers does not seem to be right. The /etc/hosts file on the host OS and in the containers looks like this:
cloudbreak# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
This causes a problem for the Hadoop cluster because the nodes then don't know how to communicate with each other. If I change the hosts files to contain the other nodes, then things start to work. However, this does not seem like something I should have to do. It will also be a problem when creating new clusters, and autoscaling will not work if I have to change the hosts file on every host and Docker container.
Any help would be helpful, thanks.
Cloudbreak does not use the hosts file to resolve the other nodes in the cluster; it uses Swarm and Consul for discovery.

KairosDB not running

I am trying to run KairosDB and Cassandra, but KairosDB shuts down after I get the following error. I believe KairosDB is not able to establish a connection with Cassandra. Cassandra seems to be running fine, and I cannot understand why this error appears:
18:33:08.463 [main] ERROR [HConnectionManager.java:71] - Could not start connection pool for host localhost(127.0.0.1):9160
Error injecting constructor, org.kairosdb.core.exception.DatastoreException: me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client. ...
Also, I noticed that the kairos_cache directory mentioned in this link is not created. I changed the ownership of the /tmp/ folder from root to my user, but it is still not working.
Open your cassandra.yaml file and do the following:
Check that the Apache Thrift RPC server is enabled and listening on the port KairosDB connects to:
start_rpc: true
rpc_address: localhost
rpc_port: 9160
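A quick check, assuming the settings above, to confirm something is listening on the Thrift port KairosDB uses:
nc -z localhost 9160 && echo 'Thrift RPC is up'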
The message means that KairosDB cannot reach Cassandra.
Probably your Cassandra instance is not listening on 127.0.0.1 (loopback).
Check your cassandra.yaml file; it is probably using the IP address of your network interface as listen_address rather than 127.0.0.1.
Cassandra only listens on one address, by default the local hostname or IP.
Otherwise you may check your port just in case, but the listen_address is usually the source of this problem.
I had the same issue with a Docker deployment of Cassandra with KairosDB.
As @JVasques said in his answer, the parameter start_rpc is disabled (set to false) by default in the newest cassandra.yaml file.
If anyone needs a default/standard YAML configuration file, it is recommended to download the latest release, or the version you are using, from the official Cassandra package on the Apache website: http://cassandra.apache.org/download/
The file is located under conf/cassandra.yaml.
Beware: configuration files of older Cassandra versions might not be compatible!
It worked with the following settings in Docker for me:
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
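A hedged one-liner for the official cassandra image; the CASSANDRA_START_RPC environment variable is also mentioned in an answer further down, and the image tag is an assumption:
docker run -d --name cassandra -p 9160:9160 -e CASSANDRA_START_RPC=true cassandra:3.11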

Can't connect to cassandra node from different host

I have a Cassandra node on a machine. When I access cqlsh from the same machine, it works properly.
But when I try to connect to its cqlsh using 192.x.x.x from another machine, I get an error saying:
Connection error: ('Unable to connect to any servers', {'192.x.x.x': error(111, "Tried connecting to [('192.x.x.x', 9042)]. Last error: Connection refused")})
What is the reason for this? How can I fix it?
Probably the remote Cassandra node is bound not to the external network interface but to the loopback one (this is the default configuration). You can verify this by running "telnet thecassandrahost 9042" from the remote machine; it should fail.
In order to bind Cassandra to the external network interface you need to edit the cassandra.yaml configuration file and set the properties "listen_address" and "rpc_address" to your remote IP or "0.0.0.0" (not all versions of Cassandra support wildcard addresses).
Check also that the firewall is properly configured or disabled (sudo service iptables stop).
Set the following config parameters in cassandra.yaml (possibly located at /etc/cassandra/cassandra.yaml):
cassandra.yaml
listen_address: 192.x.x.x
rpc_address: 192.x.x.x
Then, restart the service.
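On a packaged install, a typical restart looks like this (a sketch; the service name can differ by distribution):
sudo service cassandra restart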
1. Update ./conf/cassandra.yaml:
rpc_address: 0.0.0.0 (0.0.0.0 allows connections from any IP, but you can specify a single address instead)
# broadcast_rpc_address: 1.2.3.4 (uncomment and set this if rpc_address is 0.0.0.0)
2. Restart:
./bin/cassandra
Background: I ran into this problem when I could not access Cassandra remotely from a Java client.
I got the same issue and followed the answers in this post, but unfortunately had no luck making them work. After some research it works now. Here are my changes.
Environment
Ubuntu Server 16.04.3 LTS on VirtualBox,
DSE version 5.1
Install DSE
I installed DSE following this page:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/install/installGUIdse.html
Go to /etc/dse/cassandra/cassandra.yaml and make the following changes (consolidated in the sketch after this list):
Change 'seeds' from 127.0.0.1 to the server IP, e.g. mine is 172.20.10.9
listen_address: the DSE 5.1 official documentation says 'Never specify 0.0.0.0; it is always wrong.' What I did was comment out the 'listen_address' setting
Change 'rpc_address' from localhost to 0.0.0.0
Change 'broadcast_address' to the server IP; mine is 172.20.10.9
Change 'broadcast_rpc_address' to the server IP
Restart DSE and wait a couple of minutes; it takes a while to come up. If it still does not work, restart the machine.
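A consolidated sketch of the resulting cassandra.yaml entries, using the server IP from this answer (172.20.10.9):
# in the seed_provider section:
- seeds: "172.20.10.9"
# listen_address is commented out, as described:
# listen_address:
rpc_address: 0.0.0.0
broadcast_address: 172.20.10.9
broadcast_rpc_address: 172.20.10.9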
This is my log over about 15 seconds:
ubuntu08@ubuntu08:~$ nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
ubuntu08@ubuntu08:~$ nodetool status
Error: The node does not have system_traces yet, probably still bootstrapping
ubuntu08@ubuntu08:~$ nodetool status
Datacenter: SearchGraphAnalytics
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.0.0.44 278.51 KiB 32 ? 19db0016-df63-4470-9921-f3b5fe4e9341 rack1
I can now access it by running 'cqlsh 172.20.10.9' locally or from another machine.
Here is another relevant document for Cassandra: https://docs.datastax.com/en/developer/java-driver/3.3/manual/address_resolution/
I had the same issue: I was not allowed to listen on 0.0.0.0, and Cassandra was running on a VM with a bridged network. The solution I found was to let the VM SSH to itself, forwarding the port on the bridged network interface to localhost:
ssh -L 192.168.x.x:9042:127.0.0.1:9042 myvmuser@localhost
Since the IP of the bridged network card would change (depending on which developer's machine it was run on), the SSH command first had to determine the IP; this snippet worked for that:
ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'
Also, since this should happen on boot, you have to create an SSH key for the VM user and trust it via .ssh/authorized_keys.
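A sketch combining the steps above into a single boot script; the myvmuser name and the grep pipeline come from this answer, the rest is an assumption:
#!/bin/sh
# Find the bridged interface's address, excluding loopback:
IP=$(ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1' | head -n1)
# Forward the CQL port on that address to the loopback listener:
ssh -f -N -L "$IP":9042:127.0.0.1:9042 myvmuser@localhost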
Even after I set rpc_address, it didn't work for me until I set the -e CASSANDRA_START_RPC=true option.
It was always set to false in my case. I tried this with Ubuntu, Docker, and Cassandra.
Set the following config parameter in the cassandra.yaml file (for CentOS it is located in /etc/cassandra/default.conf):
rpc_address: 0.0.0.0
Verify that the following values are the same as below (usually they are the defaults):
start_native_transport: true
native_transport_port: 9042
As a last step on CentOS, update the firewall configuration to allow incoming connections on port 9042 (a command-line alternative follows these steps):
Access the firewall from "System / Administration / Firewall" in the CentOS menu
Add the port under "Other Ports"
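A command-line alternative to the GUI steps above, assuming a CentOS version that ships firewalld:
sudo firewall-cmd --permanent --add-port=9042/tcp
sudo firewall-cmd --reload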
I edited cassandra.yaml and set listen_address and rpc_address to the machine's IP address, which resolved the problem.
