Spark uses a random port even after defining the executor port - apache-spark

I have a small cluster set up for development, consisting of 3 VMs with Spark 2.3 installed on each. I started the master on VM1 and the slaves on the other 2 VMs, pointing them at the master's IP address. A firewall is up on all 3 VMs, with the port range 38001:38113 opened.
Before starting the VMs we did the following configuration.
On the Master, Worker 1 & Worker 2 nodes
The spark-defaults.conf file was given the following properties:
spark.blockManager.port 38001
spark.broadcast.port 38018
spark.driver.port 38035
spark.executor.port 38052
spark.fileserver.port 38069
spark.replClassServer.port 38086
spark.shuffle.service.port 38103
On the Worker 1 & Worker 2 nodes
The spark-env.sh file was given the following property:
SPARK_WORKER_PORT=38112   # for worker-1
SPARK_WORKER_PORT=38113   # for worker-2
When we start spark-shell and execute a sample CSV file read, the executor launched on the worker starts with a random port for the Spark driver.
E.g:
Spark Executor Command: "/usr/java/jdk1.8.0_171-amd64/jre/bin/java" "-cp" "/opt/spark/2.3.0/conf/:/opt/spark/2.3.0/jars/*" "-Xmx1024M" "-Dspark.driver.port=34573" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@293.72.254.89:34573" "--executor-id" "1" "--hostname" "293.72.146.384" "--cores" "4" "--app-id" "app-20180706072052-0000" "--worker-url" "spark://Worker@293.72.146.384:38112"
As you can see in the above command, the executor started with spark.driver.port set to 34573, and this port is always chosen randomly. Because of this my program fails, as it is unable to communicate with the driver.
Can anyone help me with a configuration that works in a network-tight environment where all other ports are blocked?
Thanks in advance.

Start worker:
./start-slave.sh spark://hostname:port -p [Worker Port]
Options:
  -c CORES, --cores CORES    Number of cores to use
  -m MEM, --memory MEM       Amount of memory to use (e.g. 1000M, 2G)
  -d DIR, --work-dir DIR     Directory to run apps in (default: SPARK_HOME/work)
  -i HOST, --ip IP           Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST       Hostname to listen on
  -p PORT, --port PORT       Port to listen on (default: random)
  --webui-port PORT          Port for web UI (default: 8081)
  --properties-file FILE     Path to a custom Spark properties file.
                             Default is conf/spark-defaults.conf.
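For example, a minimal sketch using the 38112 worker port from the question (master-vm:7077 is a placeholder for your master's address, and 38114 is a hypothetical web UI port inside the opened range):

./start-slave.sh spark://master-vm:7077 --port 38112 --webui-port 38114

Since spark.driver.port is read by the driver JVM, it can also help to pin it explicitly when launching the shell, e.g.:

spark-shell --conf spark.driver.port=38035 --conf spark.blockManager.port=38001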

Related

Why do I see 20 nodes in YARN but 30 workers in spark?

I spun up 30 AWS machines.
When I check the YARN UI at the master node's IP on port 8088 and click on "Nodes", I see the following:
under "Active Nodes" I see 20
under "Lost Nodes" I see 0.
When I navigate to the Spark master UI at port 18080, it tells me Alive Workers: 30 at the top of the page.
I restarted all of the services on the master node and slaves, but the same thing keeps happening.
How do I get YARN to recognize all of the nodes?
Check your DataNodes with the command below on your NameNode:
sudo yarn node -list -all
If you can't find all 30 nodes, run the command below on each missing DataNode:
sudo service hadoop-yarn-nodemanager start
and then run this on your NameNode:
sudo service hadoop-yarn-resourcemanager restart
Or, check /etc/hadoop/conf/slaves on your NameNode,
and check the setting below in /etc/hadoop/conf/yarn-site.xml on all of your nodes:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>your namenode name</value>
</property>
Or, write all of your nodes' names and IP addresses in every node's /etc/hosts,
for example:
127.0.0.1 localhost.localdomain localhost
192.168.1.10 test1
192.168.1.20 test2
and then run the command:
/etc/rc.d/init.d/network reload
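For reference, the slaves file mentioned above is simply a list of worker hostnames, one per line; a sketch using the example names from the hosts file above:

test1
test2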

Is it possible that spark worker require more resource than the cluster?

I have a standalone cluster running on one node:
root 23053 1 0 Apr01 ? 00:25:00 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
root 23182 1 0 Apr01 ? 00:19:30 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
From the ps -ef output we can see that the memory for the master and worker is small: -Xms1g -Xmx1g -XX:MaxPermSize=256m.
But when submitting a program to the cluster, you can specify that the driver uses 4G and the executors use 8G. How can a program running on the cluster acquire more memory than the cluster daemons themselves have?
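For example, a submit command along the lines described (the class and jar names are hypothetical):

./bin/spark-submit --master spark://ES01:7077 --driver-memory 4G --executor-memory 8G --class my.Main my-app.jar

Note that the 1g heaps in the ps output bound only the master and worker daemons; executors are separate JVMs spawned on demand, so a job can request larger limits for them (up to what the machine physically has).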

Spark High Availability

I'm using Spark 1.2.1 on three nodes that run three workers in slave configuration, and I run daily jobs using:
./spark-1.2.1/sbin/start-all.sh
//crontab configuration:
./spark-1.2.1/bin/spark-submit --master spark://11.11.11.11:7077 --driver-class-path /home/ubuntu/spark-cassandra-connector-java-assembly-1.2.1-FAT.jar --class "$class" "$jar"
I want to keep the Spark master and slave workers available at all times, and even if one fails I need it to be restarted like a service (as Cassandra does).
Is there any way to do it?
EDIT:
I looked into the start-all.sh script, and it only contains calls to the start-master.sh and start-slaves.sh scripts.
I tried to create a supervisor configuration file for it, and got only the errors below:
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.11: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.12: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
11.11.11.12: ssh: connect to host 11.11.11.12 port 22: No route to host
11.11.11.13: ssh: connect to host 11.11.11.13 port 22: No route to host
11.11.11.11: org.apache.spark.deploy.worker.Worker running as process 14627. Stop it first.
There are tools like monit and supervisor (or even systemd) that can monitor and restart failed processes.
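For instance, a minimal systemd unit sketch for a worker; the install path, user, and master URL below are assumptions taken from the question, and the start-slave.sh argument form varies across Spark versions, so check your sbin scripts:

# /etc/systemd/system/spark-worker.service (hypothetical paths)
[Unit]
Description=Spark standalone worker
After=network.target

[Service]
Type=forking
User=ubuntu
# start-slave.sh daemonizes, hence Type=forking
ExecStart=/home/ubuntu/spark-1.2.1/sbin/start-slave.sh spark://11.11.11.11:7077
ExecStop=/home/ubuntu/spark-1.2.1/sbin/stop-slave.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now spark-worker, and systemd will restart the worker process if it dies.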

run two cassandra versions in the same machine

I have a scenario where I need to run two different versions of Cassandra on the same machine, on different ports.
I started one instance with the following Cassandra config, on port 9161:
# TCP port, for commands and data
storage_port: 7000
# SSL port, for encrypted communication. Unused unless enabled in
# encryption_options
ssl_storage_port: 7004
# port for the CQL native transport to listen for clients on
native_transport_port: 9043
# port for Thrift to listen for clients on
rpc_port: 9161
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "127.0.0.1"
It runs well:
$ /usr/local/apache-cassandra-2.1.1/bin/cassandra -f
...
INFO 05:08:42 Loading settings from file:/usr/local/apache-cassandra-2.1.1/conf/cassandra.yaml
...
INFO 05:09:29 Starting listening for CQL clients on localhost/127.0.0.1:9043...
INFO 05:09:29 Binding thrift service to localhost/127.0.0.1:9161
INFO 05:09:29 Listening for thrift clients...
INFO 05:19:25 No files to compact for user defined compaction
$ jps
5866 CassandraDaemon
8848 Jps
However, when starting another Cassandra instance configured to run on port 9160 with this config,
# TCP port, for commands and data
storage_port: 7000
# SSL port, for encrypted communication. Unused unless enabled in
# encryption_options
ssl_storage_port: 7004
# port for the CQL native transport to listen for clients on
native_transport_port: 9042
# port for Thrift to listen for clients on
rpc_port: 9160
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "127.0.0.1"
it fails with the message:
$ /usr/local/apache-cassandra-2.0.11/bin/cassandra -f
Unable to bind JMX, is Cassandra already running?
How can I run two different versions of Cassandra on the same machine?
The problem is that I don't have permission to stop the previous version. Neither can I use https://github.com/pcmanus/ccm
The problem is that your new Cassandra instance is also trying to use port 7199 for JMX monitoring. Change the JMX port to some other unused port and it will start. On Windows, the JMX port can be changed in the file cassandraFolder/bin/cassandra.bat, which contains the line
-Dcom.sun.management.jmxremote.port=7199^
Change that port to some other unused port.
If you are running Cassandra in a Linux environment, the JMX configuration is located in the file cassandraFolder/conf/cassandra-env.sh, which contains the line
JMX_PORT="7199"
Change this to some other unused port.
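For example, a one-line sketch for the Linux case (cassandraFolder is the placeholder path from above, and 7200 an arbitrary unused port):

sed -i 's/JMX_PORT="7199"/JMX_PORT="7200"/' cassandraFolder/conf/cassandra-env.sh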
But your question is a little unclear to me.
Are you trying to run the new Cassandra instance so that it joins the existing cluster?
If yes, changing the JMX port will be sufficient.
Are you trying to run the new Cassandra instance in standalone mode?
If yes, change the following configuration in the yaml file:
seeds: "127.0.0.2"
listen_address: 127.0.0.2
rpc_address: 127.0.0.2
and add the following entry,
127.0.0.1 127.0.0.2
to the /etc/hosts file if you are running on Linux. If you are running on Windows, add the above entry to the C:\Windows\System32\drivers\etc\hosts file. If your intention is to run in standalone mode, be careful with your configuration: if you get anything wrong, your new Cassandra instance will join the existing cluster.
This link helps you to run a Cassandra cluster on a single Windows machine.
Well, I fixed it by changing a few more settings: storage_port/ssl_storage_port in conf/cassandra.yaml and JMX_PORT in conf/cassandra-env.sh.
conf/cassandra.yaml
# TCP port, for commands and data
storage_port: 7004
# SSL port, for encrypted communication. Unused unless enabled in
# encryption_options
ssl_storage_port: 7005
# port for the CQL native transport to listen for clients on
native_transport_port: 9043
# port for Thrift to listen for clients on
rpc_port: 9161
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "127.0.0.1"
conf/cassandra-env.sh
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="7200"
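As a quick sanity check that both instances are listening on their own ports (assuming a Linux box with ss available; netstat works similarly):

ss -ltn | grep -E ':(9160|9161|7199|7200)'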

Cassandra nodetool connection timed out

I'm trying to use nodetool to check the status of my cluster, but it's unable to connect.
My cassandra.yaml is configured with listen_address and rpc_address set to the server IP (e.g. 10.10.10.266).
I'm able to connect through cqlsh and cassandra-cli using the same IP, but when I connect with nodetool it doesn't work.
/bin$ nodetool -h 10.10.10.266 ring
Failed to connect to '10.10.10.266:7199': Connection has timed out
I don't think I have a firewall enabled on the server (Ubuntu). I'm running this directly on the server in question, so I wouldn't have thought it would be a firewall issue anyway.
You probably need to uncomment the following parameter in cassandra-env.sh:
-Djava.rmi.server.hostname=<public name>
Replace <public name> with the address of the interface you want the JMX interface to listen on.
nodetool connects through the JMX interface. By default it listens on port 7199 (other tools use the RPC interface, which listens on port 9160 by default). Check the JMX settings in the cassandra-env.sh file. Most likely the JMX server is listening on the wrong interface (probably the loopback interface).
The default JMX configuration section (Cassandra ver. 1.1.5) contains a link to a troubleshooting guide:
# jmx: metrics and administration interface
#
# add this if you're having trouble connecting:
# JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"
#
# see
# https://blogs.oracle.com/jmxetc/entry/troubleshooting_connection_problems_in_jconsole
# for more on configuring JMX through firewalls, etc. (Short version:
# get it working with no firewall first.)
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS"
It is also worth listing all network interfaces using ifconfig and trying to telnet port 7199 on each interface.
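For example (using the IP from the question):

ifconfig -a
telnet 10.10.10.266 7199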
I was facing the same timeout issue. However, I found that my cluster was not starting properly because of a token issue, and I was getting "Host ID collision between active endpoint". Once I deleted the data directory and restarted the cluster, nodetool started working fine.
I also saw this same issue, but it turned out to be some weirdness in my hosts file that was preventing JMX from binding to the interfaces.
Specifically, the hosts file had an entry mapping the external IP address to the hostname. Our servers had two interfaces, one external and one for an internal network. Removing that hosts entry did the trick.
As someone mentioned, nodetool connects to the JMX port.
You can find the JMX port:
in /etc/cassandra/cassandra-env.sh (this won't work for ccm-based local clusters), OR
(my favorite) by looking at the command line of the Cassandra node process running on the node.
My case was a cluster created locally using ccm, so all my nodes were running on the same host with different JMX ports.
vagrant@triforce:~$ ps -eaf | grep cassandra | grep -o " [^ ]*jmx.local.port[^ ]* "
-Dcassandra.jmx.local.port=7100
-Dcassandra.jmx.local.port=7300
-Dcassandra.jmx.local.port=7200
vagrant@triforce:~$
This is because I have 3 nodes running on localhost.
vagrant@triforce:~$ nodetool -p 7100 ring
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token
3074457345618258602
127.0.0.1 rack1 Up Normal 64.65 MB 33.33% -9223372036854775808
127.0.0.2 rack1 Up Normal 65.26 MB 33.33% -3074457345618258603
127.0.0.3 rack1 Up Normal 65.92 MB 33.33% 3074457345618258602
vagrant@triforce:~$
