Apache Spark behavior when a node in a cluster fails

What's the behavior when a partition is sent to a node and the node crashes right before executing a job? If a new node is introduced into the cluster, what's the entity that detects the addition of this new machine? Does the new machine get assigned the partition that didn't get processed?

The master considers a worker to have failed if it has not received a heartbeat message from it in the last 60 seconds (controlled by spark.worker.timeout). In that case the partition is assigned to another worker (remember that a lost RDD partition can be reconstructed from its lineage).
As for the question of introducing a new node into the cluster: the Spark master will not detect a new machine once the slaves have been started. Before the application is submitted, sbin/start-master.sh starts the master and sbin/start-slaves.sh reads the conf/slaves file (which contains the IP addresses of all slaves) on the master machine and starts a slave instance on each machine listed there. The master does not re-read this configuration file after it has started, so a new node added to conf/slaves will not be picked up automatically once all the slaves are running.
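As an illustration, a minimal sketch of that standalone startup flow (hostnames are placeholders, and exact script arguments vary slightly between Spark versions); a worker can still be attached to a running master by starting it by hand on the new machine:

# conf/slaves on the master machine, read only when sbin/start-slaves.sh runs
worker-host-1
worker-host-2

# bring the cluster up (conf/slaves is read at this point only)
sbin/start-master.sh
sbin/start-slaves.sh

# a worker on a new machine can be pointed at the running master by hand
# (the script is named start-worker.sh on newer Spark versions)
sbin/start-slave.sh spark://master-host:7077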

Related

On kubernetes my spark worker pod is trying to access thrift pod by name

Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
So this setup is working pretty well for everything except the setup of the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the thrift server pod). By "working well" I mean that from outside my cluster I can create Spark jobs and submit them to the master, and then using the Web UI I can see my test app run to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh it submits a job to the cluster with itself as the driver (I believe, and that is correct behavior). When I look at the related Spark job via the Web UI I see that it has workers, and they are repeatedly launched and then exit shortly thereafter. When I look at the workers' stderr logs I see that every worker launches and tries to connect back to the thrift server pod at spark.driver.port. This is correct behavior, I believe. The gotcha is that the connection fails with an unknown host exception: it uses a raw Kubernetes pod name (not a service name, and with no IP in the name) of the thrift server pod and says it can't find the thrift server that initiated the connection. Kubernetes DNS stores service names, and pod names only in their IP-prefixed form; the raw name of a pod (without an IP) is never registered with DNS. That is not how Kubernetes works.
So my question. I am struggling to figure out why the Spark worker pod is using a raw pod name to try to find the thrift server. It seems it should never do this and that it should be impossible to ever satisfy that request. I have wondered if there is some Spark config setting that would tell the workers that the (thrift) driver they need to be looking for is actually spark-thriftserver.my-namespace.svc, but I can't find anything despite much searching.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a worker log that fails. Notice the raw pod name of the thrift server in the driver-url. If anyone has any clue what steps to take to fix this, please let me know. I'll edit this post and share settings etc. as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker#172.17.0.6:37369"
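For reference, a hedged sketch of the kind of setting being asked about: Spark can advertise a different driver hostname to executors via spark.driver.host (while binding locally via spark.driver.bindAddress), so something along these lines could be passed to start-thriftserver.sh. The service name, namespace, and master URL below are assumptions taken from the question's own guess, not from a verified setup:

./sbin/start-thriftserver.sh \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-thriftserver.my-namespace.svc.cluster.local \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=41617 \
  --conf spark.blockManager.port=41618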

Changing Kafka Host name entry in zookeeper and persisting it across storm topology restart

Background
6 node Kafka Cluster
3 node Zookeeper Cluster
3 node Nimbus Cluster
Apache Storm Worker hosts dynamically adjusted using amazon spot fleet
Scenario
For a particular topology for a given partition it subscribes to, the Zookeeper entry looks as follows
{"topology":{"id":"Topology_Name-25-1520374231","name":"Topology_Name"},"offset":217233,"partition":0,"broker":{"host":"Zk_host_name","port":9092},"topic":"topic1"}
Now, for the worker hosts to access Zk_host_name, a mapping of the form "ip ZK_host_name" is added to the /etc/hosts file on each worker host.
We have now decided to move to Route 53, the DNS management service provided by AWS. That way a fixed name such as QA-ZK-Host1 can be mapped to the corresponding IP, and the IP can be changed in the future, which gives us flexibility.
The original entry above therefore needed to be changed for consistency. The corresponding topology was stopped (to avoid ongoing changes to the offset) and the hostname value was changed using the set command:
set /node_path {"topology":{"id":"Topology_Name-25-1520374231","name":"Topology_Name"},"offset":217233,"partition":0,"broker":{"host":"QA-ZK-Host1","port":9092},"topic":"topic1"}
Problem
The above command works fine and a get on the path returns the changed value. But the moment the topology is restarted, the old name is restored.
So how can the change be made to persist across a topology restart?
The object you are referencing is being written to Storm's Zookeeper here https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/PartitionManager.java#L341.
The "broker" property is created at https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java#L186. As you can see, the host property is not your Zookeeper host, but the host running Kafka. The value is being read from Kafka's Zookeeper (see point 3 at https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper).
If you want to change the value, you'll likely need to do it in Kafka. Take a look at http://kafka.apache.org/090/documentation.html (or whatever version you're using) and search for "advertised.host.name", I think that's the setting you want to change.
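As an illustration (a sketch only; the hostname is the Route 53 name from the question, and the property names are from the 0.9-era broker configuration):

# config/server.properties on the Kafka broker
advertised.host.name=QA-ZK-Host1
advertised.port=9092
# restart the broker so the advertised values are re-registered in Zookeeper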

Spark standalone master HA jobs in WAITING status

We are trying to set up HA on the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are using for Spark HA as well.
Configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
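For reference, a hedged sketch of the surrounding configuration; the spark.deploy.zookeeper.dir entry is an optional extra (not part of the original setup), and master1/master2/7077 are placeholders:

# spark-env.sh on both master hosts
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181 -Dspark.deploy.zookeeper.dir=/spark"

# applications and workers should list both masters so they can fail over
spark-shell --master spark://master1:7077,master2:7077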
Started both the masters.
Started the shell; the status of the job is RUNNING.
master1 is ALIVE and master2 is in STANDBY status.
Killed master1; master2 took over and all the workers appeared alive on master2.
The shell that was already running moved to the new master. However, the application is in WAITING status and the executors are in LOADING status.
There are no errors in the worker or executor logs, apart from the notification that they connected to the new master.
I can see the worker re-registered, but the executor does not seem to be started. Is there anything I am missing?
My Spark version is 1.5.0.

How to start/restart the Cassandra node efficiently with auto_bootstrap property

My understanding of auto_bootstrap
Below is my understanding of the auto_bootstrap property. First, please correct me if I am wrong on any point.
Initially the property ‘auto_bootstrap’ is not present in the cassandra.yaml file, which means the default value ‘true’ applies.
true - bootstrap/stream the data to the node from all the other nodes when starting/restarting
false - do not stream the data when starting/restarting
Where do we need ‘auto_bootstrap: true’
1) When a new node needs to be added to an existing cluster, this needs to be set to ‘true’ so that the data is bootstrapped automatically from all the other nodes in the cluster. This can take a considerable amount of time (depending on the current load of the cluster) before the new node is fully added, but it balances the load in the cluster automatically.
Where do we need ‘auto_bootstrap: false’
1) When a new node needs to be added quickly to an existing cluster without bootstrapping the data, this needs to be set to ‘false’. The new node is added quickly irrespective of the current load of the cluster, and the data has to be streamed to it manually later to balance the cluster load.
2) When initializing a fresh cluster with no data, this needs to be set to ‘false’. At a minimum, the first seed node started in the fresh cluster should have the value set to ‘false’.
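For illustration, this is what the line looks like when it is set explicitly (a sketch; by default the line is simply absent from cassandra.yaml):

# cassandra.yaml (the property is absent by default, which behaves like true)
auto_bootstrap: false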
My Question is
We are using Cassandra 2.0.3 with six nodes across two data centers (3 nodes each). Our Cassandra runs as a stand-alone process (not a service). I am going to change a few properties in the cassandra.yaml file for one node. Obviously the node has to be restarted after updating cassandra.yaml for the changes to take effect. Our cluster is loaded with a huge amount of data.
How to restart the node
After killing the node, I can simply restart it as below:
$ cd install_location
$ bin/cassandra
This restarts the node with no auto_bootstrap property (so the default, true, applies).
with 'true'
1) The node to be restarted already holds a huge amount of its own data. Will the node bootstrap all of its data again and replace the existing data?
2) Will it take more time for the node to join the cluster again?
with 'false'
I do not want to bootstrap the data, so:
3) Can I add the property auto_bootstrap: false and restart the node as mentioned above?
4) After a successful restart I will delete the auto_bootstrap property. Is that okay?
Else
5) Since I am restarting the node with the same IP address, will the cluster automatically identify it as an existing node through gossip info and hence restart it without streaming any data, regardless of whether auto_bootstrap is set to true or is not present in cassandra.yaml?
As you are restarting an existing node with the same IP address, the restart will happen without streaming any data, regardless of the value of auto_bootstrap. So you can simply restart the existing node without touching any parameters; option 5 fits here.
First of all, you should always run
nodetool drain
on the node before killing Cassandra so that client connections/ongoing operations have a chance to gracefully complete.
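A minimal sketch of the sequence being recommended here, assuming the stand-alone install_location layout from the question (the pid is a placeholder):

$ cd install_location
$ bin/nodetool drain        # let in-flight operations finish and flush memtables
$ kill <cassandra_pid>      # stop the JVM
$ # edit conf/cassandra.yaml as needed, leaving auto_bootstrap untouched
$ bin/cassandra             # start the node again; it rejoins via gossip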
Assuming that the node was fully bootstrapped & had status "Up" and "Joined": when you start Cassandra up again, the node will not need to bootstrap again, since it has already joined the cluster & taken ownership of certain sets of tokens. However, it will need to catch up with the data that was mutated while it was down: the other replicas replay the writes it missed (hinted handoff) and the changes are applied. So it will take much less time to start up once it has bootstrapped. Just don't leave it down for too long (beyond the hint window), or a repair will be needed to catch up fully.
You should not set auto_bootstrap to false unless you're creating the first seed node for a new cluster.
The node will be identified as a pre-existing node which has tokens assigned to it by virtue of the host id that is assigned to it when it joins the cluster. The IP address does not matter unless it is a seed node.

Cassandra 2.1.2 node stuck on joining the cluster

I'm trying but failing to join a new (well old, but wiped out) node to an existing cluster.
Currently the cluster consists of 2 nodes and runs C* 2.1.2. I start a third node with 2.1.2; it gets to the joining state and bootstraps, i.e. streams some data as shown by nodetool netstats, but after some time it gets stuck. From that point on nothing gets streamed and the new node stays in the joining state. I have restarted the node twice; every time it streamed some more data but then got stuck again. (I'm currently on a third round like that.)
Other facts:
I don't see any errors in the log on any of the nodes.
Connectivity seems fine: I can ping and netcat to port 7000 in all directions.
I have 267 GB load per running node, replication 2, 16 tokens.
The load on the new node is around 100 GB now.
I'm guessing that after a few rounds of restarts the node will finally pull in all of the data from the running nodes and join the cluster, but that is definitely not the way it should work.
EDIT: I discovered some more info:
The bootstrapping process stops in the middle of streaming some table, always after sending exactly 10MB of some SSTable, e.g.:
$ nodetool netstats | grep -P -v "bytes\(100"
Mode: NORMAL
Bootstrap e0abc160-7ca8-11e4-9bc2-cf6aed12690e
/192.168.200.16
Sending 516 files, 124933333900 bytes total
/home/data/cassandra/data/leadbullet/page_view-2a2410103f4411e4a266db7096512b05/leadbullet-page_view-ka-13890-Data.db 10485760/167797071 bytes(6%) sent to idx:0/192.168.200.16
Read Repair Statistics:
Attempted: 2016371
Mismatch (Blocking): 0
Mismatch (Background): 168721
Pool Name Active Pending Completed
Commands n/a 0 55802918
Responses n/a 0 425963
I can't diagnose the error & I'll be grateful for any help!
Try to telnet from one node to another using the correct port.
Make sure you are joining a cluster with the correct cluster name.
Try using: nodetool repair
You might be pinging the external IP addresses while your cluster communicates over internal IP addresses.
If you are running on Amazon AWS, make sure the firewall is open for the internal IP addresses.
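A few concrete commands for those checks (a sketch; the hostname is a placeholder and 7000 is the inter-node storage port mentioned in the question):

$ nc -vz other-node 7000           # is the storage/streaming port reachable both ways?
$ bin/nodetool describecluster     # the cluster name must match on every node
$ bin/nodetool repair              # run repair if data looks inconsistent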
