Is it possible that a Spark worker requires more resources than the cluster has? - apache-spark

I have a standalone cluster running on one node:
root 23053 1 0 Apr01 ? 00:25:00 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
root 23182 1 0 Apr01 ? 00:19:30 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
From the ps -ef output we can see that the memory for the master and the worker is small: -Xms1g -Xmx1g -XX:MaxPermSize=256m.
But when I submit a program to the cluster, I can specify that the driver use 4G and the executor use 8G. Why can a program running on the cluster acquire more memory than the cluster itself?
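For reference, a minimal sketch of the kind of submit command the question describes; the class name and jar are hypothetical placeholders, not from the original post:
# com.example.MyApp and my-app.jar are placeholders
spark-submit \
  --master spark://ES01:7077 \
  --driver-memory 4G \
  --executor-memory 8G \
  --class com.example.MyApp \
  my-app.jar
Note that the -Xmx1g seen in ps -ef only bounds the master and worker daemon JVMs themselves; executors run as separate JVMs whose heap comes from --executor-memory, which is why an application can ask for more memory than the daemons use.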

Related

Cassandra nodes become unreachable to each other

I have 3 Elassandra nodes running in Docker containers.
The containers were created like this:
Host 10.0.0.1 : docker run --name elassandra-node-1 --net=host -e CASSANDRA_SEEDS="10.0.0.1" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest
Host 10.0.0.2 : docker run --name elassandra-node-2 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest
Host 10.0.0.3 : docker run --name elassandra-node-3 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2,10.0.0.3" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest
The cluster was working fine for a couple of days after it was created; Elasticsearch and Cassandra were both perfect.
Currently, however, all Cassandra nodes have become unreachable to each other.
nodetool status on every node looks like this:
Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.0.0.3 11.95 GiB 8 100.0% 7652f66e-194e-4886-ac10-0fc21ac8afeb r1
DN 10.0.0.2 11.92 GiB 8 100.0% b91fa129-1dd0-4cf8-be96-9c06b23daac6 r1
UN 10.0.0.1 11.9 GiB 8 100.0% 5c1afcff-b0aa-4985-a3cc-7f932056c08f r1
where UN is the current host, 10.0.0.1.
The output is the same on all other nodes.
nodetool describecluster on 10.0.0.1 shows:
Cluster Information:
Name: BD Storage
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
24fa5e55-3935-3c0e-9808-99ce502fe98d: [10.0.0.1]
UNREACHABLE: [10.0.0.2,10.0.0.3]
When attached to the first node, it just keeps repeating these log lines:
2018-12-09 07:47:32,927 WARN [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager.setupDefaultRole(CassandraRoleManager.java:361) CassandraRoleManager skipped default role setup: some nodes were not ready
2018-12-09 07:47:32,927 INFO [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager$4.run(CassandraRoleManager.java:400) Setup task failed with error, rescheduling
2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.2] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.2
2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.3] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.3
After a while, when some node is restarted:
2018-12-09 07:52:21,972 WARN [MigrationStage:1] org.apache.cassandra.service.MigrationTask.runMayThrow(MigrationTask.java:67) Can't send schema pull request: node /10.0.0.2 is down.
Tried so far:
Restarting all containers at the same time
Restarting all containers one after another
Restarting Cassandra in all containers, like: service cassandra restart
Nodetool disablegossip, then enabling it again
Nodetool repair: Repair command #1 failed with error Endpoint not alive: /10.0.0.2
It seems that the node schemas are different, but I still don't understand why they are marked as down to each other.
If the nodes run different Cassandra versions, nodetool repair will not pull the data; keep the same Cassandra version on every node. Sometimes a node shows as down or unreachable because gossip is not happening properly. The reason may be the network, high load on that node, or the node being very busy with lots of ongoing I/O such as repair or compaction.
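As a hedged starting point (not part of the original answer), these standard nodetool commands can confirm whether the versions match and what gossip and compaction are doing on each node:
nodetool version          # should report the same version on every node
nodetool gossipinfo       # each node's view of the gossip state
nodetool compactionstats  # shows heavy ongoing compactions
nodetool tpstats          # shows thread-pool backlog and dropped messages
Run them inside each container, e.g. docker exec -it elassandra-node-1 nodetool version.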

Spark uses a random port even after defining the executor port

I have a small cluster set up for development purposes: 3 VMs, each with Spark 2.3 installed. I started the master on VM1 and the slaves, pointing at the master's IP address, on the other two VMs. The firewall is up on all 3 VMs, and the port range 38001:38113 has been opened in it.
Before starting the VMs, the following configuration was done.
On the master, worker 1 & worker 2 nodes
The spark-defaults.conf file had the following properties added:
spark.blockManager.port 38001
spark.broadcast.port 38018
spark.driver.port 38035
spark.executor.port 38052
spark.fileserver.port 38069
spark.replClassServer.port 38086
spark.shuffle.service.port 38103
On the worker 1 & worker 2 nodes
The spark-env.sh file had the following property added:
SPARK_WORKER_PORT=38112 -- for worker-1
SPARK_WORKER_PORT=38113 -- for worker-2
When we started spark-shell and executed a sample CSV file read, the executor launched on the worker started with a random port for the Spark driver.
E.g:
Spark Executor Command: "/usr/java/jdk1.8.0_171-amd64/jre/bin/java" "-cp" "/opt/spark/2.3.0/conf/:/opt/spark/2.3.0/jars/*" "-Xmx1024M" "-Dspark.driver.port=34573" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@293.72.254.89:34573" "--executor-id" "1" "--hostname" "293.72.146.384" "--cores" "4" "--app-id" "app-20180706072052-0000" "--worker-url" "spark://Worker@293.72.146.384:38112"
As you can see in the command above, the executor started with spark.driver.port set to 34573, and this port is always chosen randomly. Because of this my program fails, as it is unable to communicate with the driver.
Can anyone help me with a configuration that can be used in a network-restricted environment where all other ports are blocked?
Thanks in advance.
Start worker:
./start-slave.sh spark://hostname:port -p [Worker Port]
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
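The options above pin the worker's own port. As a hedged sketch (assuming Spark 2.3 standalone mode), the driver port seen by executors is taken from the application's configuration, so passing the same values when launching spark-shell keeps the driver inside the opened range:
# master-host is a placeholder for the master VM's address
./bin/spark-shell --master spark://master-host:7077 \
  --conf spark.driver.port=38035 \
  --conf spark.blockManager.port=38001 \
  --conf spark.port.maxRetries=16
spark.driver.port, spark.blockManager.port and spark.port.maxRetries are standard Spark properties; with maxRetries=16 the retried ports stay within 38035-38051, still inside the opened 38001:38113 range.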

Why do I see 20 nodes in YARN but 30 workers in spark?

I spun up 30 AWS machines.
When I check the YARN UI at the master node's IP on port 8088 and click on "Nodes", I can see the following:
under "Active Nodes" I see 20,
under "Lost Nodes" I see 0.
When I navigate to the Spark master at port 18080, I can see that pyspark is telling me Alive Workers: 30 at the top of the page.
I restarted all of the services on the master node and the slaves, but the same thing is still happening.
How do I get YARN to recognize all of the nodes?
Check your datanodes with the command below on your namenode:
sudo yarn node -list -all
If you can't find all 30 nodes, run the command below on each missing datanode:
sudo service hadoop-yarn-nodemanager start
and then run the command below on your namenode:
sudo service hadoop-yarn-resourcemanager restart
Or, check /etc/hadoop/conf/slaves on your namenode,
and check the setting below in /etc/hadoop/conf/yarn-site.xml on all your nodes:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>your namenode name</value>
</property>
Or, write all of your nodes' names and IP addresses in every node's /etc/hosts,
for example:
127.0.0.1 localhost.localdomain localhost
192.168.1.10 test1
192.168.1.20 test2
and then run the command:
/etc/rc.d/init.d/network reload
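For reference, a hedged sketch (hostnames here are placeholders) of what the /etc/hadoop/conf/slaves file mentioned above normally contains, one worker hostname per line; a host missing from it will not get a NodeManager started by the cluster start scripts:
test1
test2
test3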

Cassandra is not running as a service

The system is Ubuntu 14.04.1 Linux x86_64, with 200 GB of disk space and 8 GB of memory. Everything was done as both root and a normal user. We installed Cassandra version 3.6.0 from DataStax using the following commands (following the instructions from the website: http://docs.datastax.com/en/cassandra/3.x/cassandra/install/installDeb.html):
$ apt-get update
$ apt-get install datastax-ddc
However, Cassandra does not start as a service.
root@e7:~# nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
root@e7:~# service cassandra start
root@e7:~# service cassandra status
* Cassandra is not running
We can start Cassandra manually using the command:
$ cassandra -R -f
...
INFO 18:45:02 Starting listening for CQL clients on /127.0.0.1:9042 (unencrypted)...
INFO 18:45:02 Binding thrift service to /127.0.0.1:9160
INFO 18:45:02 Listening for thrift clients...
INFO 18:45:12 Scheduling approximate time-check task with a precision of 10 milliseconds
root@e7:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 153.45 KiB 256 100.0% 28ba16df-1e4c-4a40-a786-ebee140364bf rack1
However, we need to start Cassandra as a service. Any suggestions on how to fix the problem?
Try using http://docs.datastax.com/en/cassandra/3.0/cassandra/install/installDeb.html instead.
It is more stable and I have tried it.
I think the ports are not opened.
Try opening the following ports:
Cassandra inter-node ports
Port number Description
7000 Cassandra inter-node cluster communication.
7001 Cassandra SSL inter-node cluster communication.
7199 Cassandra JMX monitoring port.
Cassandra client port
Port number Description
9042 Cassandra client port.
9160 Cassandra client port (Thrift).
Also, what type of snitch is defined in the cassandra.yaml file?
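A hedged sketch of opening the ports listed above, assuming the node uses either ufw or plain iptables (pick whichever applies; neither command is from the original answer):
# ufw
sudo ufw allow 7000,7001,7199,9042,9160/tcp
# iptables
sudo iptables -A INPUT -p tcp -m multiport --dports 7000,7001,7199,9042,9160 -j ACCEPT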

How to check status of Spark (Standalone) services on cloudera-quickstart-vm?

I am trying to get the status of the services, namely spark-master and spark-slaves, of the Spark (Standalone) service running on my local VM.
However, running sudo service spark-master status is not working.
Can anybody provide some hints on how to check the status of Spark services?
I use jps -lm as the tool to get the status of any JVMs on a box, Spark's included. Consult the jps documentation for more details besides the -lm command-line options.
If, however, you want to filter out the JVM processes that really belong to Spark, you should pipe the output through OS-specific tools like grep.
➜ spark git:(master) ✗ jps -lm
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080
397
669 org.jetbrains.idea.maven.server.RemoteMavenServer
1198 sun.tools.jps.Jps -lm
➜ spark git:(master) ✗ jps -lm | grep -i spark
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080
You can also check out ./sbin/spark-daemon.sh status, but my limited understanding of the tool doesn't let me recommend it.
When you start Spark Standalone using the scripts under sbin, PIDs are stored in the /tmp directory by default. ./sbin/spark-daemon.sh status can read them and do the "boilerplate" for you, i.e. check the status of a PID.
➜ spark git:(master) ✗ jps -lm | grep -i spark
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080
➜ spark git:(master) ✗ ls /tmp/spark-*.pid
/tmp/spark-jacek-org.apache.spark.deploy.master.Master-1.pid
➜ spark git:(master) ✗ ./sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1
org.apache.spark.deploy.master.Master is running.
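The same invocation pattern should work for a worker daemon too (a hedged example, assuming the default PID file location under /tmp and instance number 1):
./sbin/spark-daemon.sh status org.apache.spark.deploy.worker.Worker 1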
ps -ef | grep spark also works and gives details of all the PIDs.

Resources