I noticed that the Spark master becomes unresponsive when I kill the leader ZooKeeper node (I have, of course, delegated leader election to ZooKeeper). Below is the error log I see on the Spark master node. Do you have any suggestions to resolve it?
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x24dc5a319b40090, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 WARN ConnectionStateManager: There are no ConnectionStateListeners registered.
15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master shutting down.
This is the expected behaviour. You have to set up 'n' masters, and you need to specify the ZooKeeper URL in spark-env.sh on every master:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
Note that ZooKeeper maintains a quorum. This means you need an odd number of ZooKeeper nodes, and the ZooKeeper cluster is only up while the quorum is maintained. Since Spark depends on ZooKeeper, it follows that the Spark cluster will not be up unless the ZooKeeper quorum is maintained.
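To put numbers on that: the quorum size is floor(n/2) + 1, so a 3-node ensemble (quorum 2) survives one failure, while a 2-node ensemble (quorum also 2) survives none; losing either node takes ZooKeeper, and with it Spark master election, down.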
When you set up two (or n) masters and bring down a ZooKeeper node, the current master will go down, a new master will be elected, and all the worker nodes will attach to the new master.
You should have started your workers with both master URLs:
./start-slave.sh spark://master1:port1,master2:port2
Expect to wait one to two minutes before you notice the failover.
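For the same reason, applications should be pointed at both masters too, so the driver can fail over. A minimal sketch, assuming the masters listen on the default port 7077:
./bin/spark-shell --master spark://master1:7077,master2:7077
spark-submit accepts the same comma-separated --master value.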
I'm using Spark Streaming on AWS EMR to connect to a Kafka cluster on AWS MSK. I'm using spark-sql-kafka-0-10 with Spark 2.4.3.
If the security groups are not correctly configured, the Spark Streaming jobs get stuck for hours with the following warning:
20/06/29 14:10:42 WARN NetworkClient: [Consumer clientId=consumer-1, groupId=spark-kafka-source...] Connection to node -1 could not be established. Broker may not be available.
I would expect the job to fail if the connection cannot be established.
Is there a way I can make the job fail? All the timeout values are set to the default values.
This warning message occurs because you don't have connectivity to one or more of the brokers in your Kafka cluster (it can also happen when new brokers are added to the existing cluster and you are not aware of it).
Before setting up a job, I would recommend checking connectivity between the machine running your client (the Spark job, or a producer) and all of the Kafka brokers, using telnet.
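A minimal sketch of such a check, run from the machine that executes the Spark job, with a hypothetical broker hostname and the default plaintext port 9092:
telnet b-1.mycluster.kafka.us-east-1.amazonaws.com 9092
Repeat for every broker; a hang or "connection refused" here points at security groups or routing rather than at Spark.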
I have a single-node Cassandra server. It had been working well for a long time, until a server restart I did yesterday.
Now nodetool status gives me the following error:
error: null
-- StackTrace --
java.lang.ClassCastException
Cassandra itself seems to be running. A subset of the logs:
...
INFO 03:58:18 Node /10.0.0.4 state jump to NORMAL
INFO 03:58:18 Waiting for gossip to settle before accepting client requests...
INFO 03:58:26 No gossip backlog; proceeding
Release version is 3.7
I haven't been able to crack this. I'd be very thankful for any help.
Let me know if I can provide any more useful information.
I have a cluster in Amazon EC2 composed of:
- Master: t2.large
- 2xSlaves: t2.micro
I just changed the port in my spark-env.sh:
export SPARK_MASTER_WEBUI_PORT=8888
And in the slaves file I wrote my two slaves' IPs.
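For reference, a minimal sketch of what that conf/slaves file contains, with hypothetical private IPs standing in for mine:
172.31.10.11
172.31.10.12
One worker host or IP per line; start-all SSHes to each entry and launches a worker there.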
That's all the configuration I set up. After that, I start the cluster with ./start-all, and I can see my master on port 8888.
But when I try to run an application I get the following WARN:
17/02/23 13:57:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/23 13:57:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/23 13:57:32 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/23 13:57:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
When I check my cluster, I can see Spark killing executors and creating new ones. I tried using larger instances and it still doesn't work.
What is happening? How can I fix it?
Basically, the master node also acts as one of the slaves. Once the slave running on the master finished, it called SparkContext.stop(), and that command propagated to all the slaves, which stopped their execution in the middle of processing.
Error log from one of the workers:
INFO SparkHadoopMapRedUtil: attempt_201612061001_0008_m_000005_18112: Committed
INFO Executor: Finished task 5.0 in stage 8.0 (TID 18112). 2536 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Check your resource manager UI; if you see that any executor failed, it will give details about a memory error. However, if no executor failed but the driver still called for a shutdown, that is usually due to driver memory, so please try increasing the driver memory. Let me know how it goes.
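A minimal sketch of raising the driver memory at submit time; the 4g figure and the application name are only placeholders:
./bin/spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar
The same setting can also be passed as --conf spark.driver.memory=4g or put in spark-defaults.conf.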
We are trying to set up HA on the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are using for Spark HA as well.
Configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
Started both the masters.
Started the shell, and the status of the job is RUNNING.
master1 is in ALIVE and master2 is in STANDBY status.
Killed master1, and master2 took over; all the workers appeared alive on master2.
The shell that was already running was moved to the new master. However, the application is in WAITING status and the executors are in LOADING status.
There is no error in the worker or executor logs, except a notification that they connected to the new master.
I can see the workers re-registered, but the executors do not seem to start. Is there anything I am missing?
My Spark version is 1.5.0.