Why do I see 20 nodes in YARN but 30 workers in spark? - apache-spark

I spun up 30 AWS machines.
When I check the YARN UI at the master node's IP on port 8088 and click on "Nodes", I see the following:
under "Active Nodes" I see 20,
under "Lost Nodes" I see 0.
When I navigate to the Spark master UI at port 18080, it tells me Alive Workers: 30 at the top of the page.
I restarted all of the services on the master node and the slaves, but the same thing keeps happening.
How do I get YARN to recognize all of the nodes?

Check your DataNodes by running the command below on your NameNode:
sudo yarn node -list -all
and if you can't find all 30 nodes, run the command below on each missing DataNode:
sudo service hadoop-yarn-nodemanager start
and then run the command below on your NameNode:
sudo service hadoop-yarn-resourcemanager restart
Or, check /etc/hadoop/conf/slaves on your NameNode,
and check the setting below in /etc/hadoop/conf/yarn-site.xml on all of your nodes:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>your namenode name</value>
</property>
Or, write all of your nodes' hostnames and IP addresses in every node's /etc/hosts,
for example,
127.0.0.1 localhost.localdomain localhost
192.168.1.10 test1
192.168.1.20 test2
and then run the command:
/etc/rc.d/init.d/network reload
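If many NodeManagers are down at once, one way to restart them all is to loop over the slaves file from the NameNode. This is only a rough sketch, assuming passwordless SSH to each node and the Debian-style service name used above:
for host in $(cat /etc/hadoop/conf/slaves); do
  # restart the NodeManager on each host listed in the slaves file
  ssh "$host" sudo service hadoop-yarn-nodemanager restart
done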

Related

Multi-node multi-datacenter CASSANDRA

I am trying to setup a multi-node multi-datacenter cluster in Cassandra 3.11
For data-center 1 I have Cassandra running on 3 nodes (e.g. 10.90.22.11, 10.90.22.12 and 10.90.22.13), and for data-center 2 I have Cassandra running on 2 nodes (e.g. 10.90.22.21 and 10.90.22.22).
The ring is up, but they are working separately. To make them work together, I updated the endpoint_snitch to GossipingPropertyFileSnitch and also set the dc and rack in cassandra-rackdc.properties to DC1 and DC2 for the respective nodes, following the steps mentioned in this link.
After these changes, when I restart Cassandra the service status shows as running; however, when I check the ring with nodetool status I receive an error:
nodetool: Failed to connect to '127.0.0.1:7199'
ConnectException: 'Connection refused (Connection refused)'
What am I missing?
The error you posted indicates that nodetool couldn't connect to JMX, which is supposed to be listening on port 7199:
Failed to connect to '127.0.0.1:7199'
Verify that Cassandra is running and check that the process is bound to various ports including 7199, 9042 and 7000. You can try running one of these commands:
$ netstat -tnlp
$ sudo lsof -nPi | grep LISTEN | grep java
Cheers!
You should try the nodetool command with the host/IP that you put in your cassandra.yaml. Also, check that port 7199 (or your custom port, if you set one) is open/allowed through the firewall.
nodetool -h hostname/ip status
You can also pass a username/password if you have authentication enabled. Please refer to the link below for more details:
http://cassandra.apache.org/doc/latest/tools/nodetool/status.html
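For example, against one of the DC1 nodes from the question (the credentials are placeholders for whatever JMX username/password you configured):
nodetool -h 10.90.22.11 -p 7199 -u <jmx_username> -pw <jmx_password> status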

Spark uses Random Port even after defining executor port

I have a small cluster set up for development purposes, consisting of 3 VMs with Spark 2.3 installed on all of them. I started the master on VM1 and the slaves (pointing at the master's IP address) on the other 2 VMs. We have a firewall up on all 3 VMs and have opened the port range 38001:38113 in the firewall.
Before starting the VMs, we put the following configuration in place.
In Master, Worker 1 & Worker 2 Nodes
The spark-defaults.conf file was updated with the following properties:
spark.blockManager.port 38001
spark.broadcast.port 38018
spark.driver.port 38035
spark.executor.port 38052
spark.fileserver.port 38069
spark.replClassServer.port 38086
spark.shuffle.service.port 38103
In Worker 1 & Worker 2 Nodes
The spark-env.sh file was updated with the following property:
SPARK_WORKER_PORT=38112 -- for worker-1
SPARK_WORKER_PORT=38113 -- for worker-2
When we start the spark-shell and run a sample CSV file read, the executor launched on the worker starts with a random port for the Spark driver.
E.g:
Spark Executor Command: "/usr/java/jdk1.8.0_171-amd64/jre/bin/java" "-cp" "/opt/spark/2.3.0/conf/:/opt/spark/2.3.0/jars/*" "-Xmx1024M" "-Dspark.driver.port=34573" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@293.72.254.89:34573" "--executor-id" "1" "--hostname" "293.72.146.384" "--cores" "4" "--app-id" "app-20180706072052-0000" "--worker-url" "spark://Worker@293.72.146.384:38112"
As you can see in the command above, the executor starts with spark.driver.port set to 34573, and this value is always chosen randomly. Because of this my program fails, as it is unable to communicate with the driver.
Can anyone help me with a configuration that can be used in a locked-down network environment where all other ports are blocked?
Thanks in advance.
Start worker:
./start-slave.sh spark://hostname:port -p [Worker Port]
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
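Also worth checking: spark.driver.port is read from the driver's own configuration, so it needs to be in effect on the machine where spark-shell is launched (or be passed explicitly), not only in the workers' spark-defaults.conf. A rough sketch, reusing the port numbers from the question (the master address is a placeholder):
spark-shell --master spark://<master-ip>:7077 \
  --conf spark.driver.port=38035 \
  --conf spark.blockManager.port=38001 \
  --conf spark.port.maxRetries=16
Here spark.port.maxRetries bounds how many consecutive ports above the configured base Spark will try, which helps keep retries inside an opened firewall range.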

Why is Cassandra not connecting to the server?

I have installed Cassandra and worked with it, and it worked properly. Now it is showing this:
localhost/<> is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
Fatal configuration error; unable to start server. See log for stacktrace.
INFO 09:17:02 Announcing shutdown
INFO 09:17:02 Compacted 4 sstables to [./../data/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-33,]. 6,485 bytes to 5,751 (~88% of original) in 223ms = 0.024595MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
INFO 09:17:04 Waiting for messaging service to quiesce
user#inblrlt-user:~/dev/Cassandra/apache-cassandra-2.1.7/bin$ ./cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1': OperationTimedOut('errors=None, last_host=None',)})
How do I change my server address so that the issue is resolved?
Your localhost address/port is already in use. Follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process id shown by jps. In this example, the process id for CassandraDaemon is 1112.
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see that CassandraDaemon is no longer running:
9170 Jps
Then remove your saved_caches and commitlog directories and start Cassandra again.
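A rough sketch of that cleanup, assuming the default tarball layout visible in the log above (run from the bin directory, and adjust if you configured custom commitlog/saved_caches locations):
$ rm -rf ../data/commitlog ../data/saved_caches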
If you want to change the listen_address from localhost to a private or public IP, you need to make the following changes (a minimal sketch follows the list):
change seeds: in cassandra.yaml
change listen_address: in cassandra.yaml
change rpc_address: in cassandra.yaml
set JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<place_your_ip_here>" in cassandra-env.sh
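For example, a minimal cassandra.yaml excerpt, assuming the node's address is 192.168.1.50 (a placeholder; use your own IP):
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.50"
listen_address: 192.168.1.50
rpc_address: 192.168.1.50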

Spark High Availability

I'm using Spark 1.2.1 on three nodes that run three workers in the slave configuration, and I run daily jobs using:
./spark-1.2.1/sbin/start-all.sh
//crontab configuration:
./spark-1.2.1/bin/spark-submit --master spark://11.11.11.11:7077 --driver-class-path home/ubuntu/spark-cassandra-connector-java-assembly-1.2.1-FAT.jar --class "$class" "$jar"
I want to keep the Spark master and the slave workers available at all times, and even if one of them fails I need it to be restarted like a service (as Cassandra is).
Is there any way to do it?
EDIT:
I looked into the start-all.sh script, and it only contains calls to the start-master.sh and start-slaves.sh scripts.
I tried to create a supervisor configuration file for it, and I only get the errors below:
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.12: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.12: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
There are tools like monit and supervisor (or even systemd) that can monitor and restart failed processes.
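For instance, a minimal systemd unit for the standalone master might look like the sketch below. The paths and user are assumptions; adjust them to your install, and create an analogous unit for each worker that runs org.apache.spark.deploy.worker.Worker with the master URL:
# /etc/systemd/system/spark-master.service -- hypothetical paths
[Unit]
Description=Spark standalone master
After=network.target

[Service]
User=ubuntu
# Run the master in the foreground so systemd can supervise and restart it
ExecStart=/home/ubuntu/spark-1.2.1/bin/spark-class org.apache.spark.deploy.master.Master
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target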

Cassandra nodetool connection timed out

I'm trying to use nodetool to check the status of my cluster, but it's unable to connect.
My cassandra.yaml is configured with listen_address and rpc_address set to the server IP (e.g. 10.10.10.266).
I'm able to connect through cqlsh and cassandra-cli using the same IP, but when I connect with nodetool it doesn't work.
/bin$ nodetool -h 10.10.10.266 ring
Failed to connect to '10.10.10.266:7199': Connection has timed out
I don't think I have a firewall enabled on the server (Ubuntu). I'm running this directly on the server in question, so I wouldn't have thought it would be a firewall issue anyway.
You probably need to uncomment the following parameter in cassandra-env.sh:
-Djava.rmi.server.hostname=<public name>
Replace <public name> with the address of the interface you want the JMX interface to listen on.
nodetool connects through the JMX interface. By default it listens on port 7199 (other tools use the RPC interface, which listens on port 9160 by default). Check the JMX settings in the cassandra-env.sh file. Most likely the JMX server is listening on the wrong interface (probably the loopback interface).
The default JMX configuration section (Cassandra ver. 1.1.5) contains a link to a troubleshooting guide:
# jmx: metrics and administration interface
#
# add this if you're having trouble connecting:
# JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"
#
# see
# https://blogs.oracle.com/jmxetc/entry/troubleshooting_connection_problems_in_jconsole
# for more on configuring JMX through firewalls, etc. (Short version:
# get it working with no firewall first.)
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS"
It is also worth listing all network interfaces using ifconfig and trying to telnet to port 7199 on each of them.
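For example (the address is a placeholder; try each interface's address in turn):
$ ifconfig -a
$ telnet <interface-address> 7199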
I was facing the same timeout issue. However, I found that my cluster was not starting properly because of a token issue, and I was getting "Host ID collision between active endpoint" errors. Once I deleted the data directory and restarted the cluster, nodetool started working fine.
I also saw this same issue but it turned out to be some weirdness in my hosts file that was preventing JMX from binding to the interfaces.
Specifically, the hosts file had an entry mapping the hostname to the external IP address. Our servers had two interfaces, one external and one for an internal network. Removing that hosts entry did the trick.
As someone mentioned, it connects to the JMX port.
You can find the JMX port:
In /etc/cassandra/cassandra-env.sh (this won't work for ccm-based local clusters), OR
(my favourite) by looking at the command line of the Cassandra node process running on the node.
My case was a cluster created locally using ccm, so all my nodes were running on the same host with different JMX ports.
vagrant@triforce:~$ ps -eaf | grep cassandra | grep -o " [^ ]*jmx.local.port[^ ]* "
-Dcassandra.jmx.local.port=7100
-Dcassandra.jmx.local.port=7300
-Dcassandra.jmx.local.port=7200
vagrant@triforce:~$
This is because I have 3 nodes running on localhost.
vagrant@triforce:~$ nodetool -p 7100 ring
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token
3074457345618258602
127.0.0.1 rack1 Up Normal 64.65 MB 33.33% -9223372036854775808
127.0.0.2 rack1 Up Normal 65.26 MB 33.33% -3074457345618258603
127.0.0.3 rack1 Up Normal 65.92 MB 33.33% 3074457345618258602
vagrant@triforce:~$
