Spark Standalone Cluster: TaskSchedulerImpl: Initial job has not accepted any resources - apache-spark

I have a cluster in Amazon EC2 composed of:
- Master: t2.large
- 2xSlaves: t2.micro
I just changed the web UI port in my spark-env.sh:
export SPARK_MASTER_WEBUI_PORT=8888
And in the slaves file I wrote my two slaves' IPs.
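For reference, the whole setup amounts to roughly this (the worker IPs below are placeholders for my two slaves):
# conf/spark-env.sh
export SPARK_MASTER_WEBUI_PORT=8888
# conf/slaves -- one worker IP per line
10.0.0.11
10.0.0.12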
That's all the configuration I set up. After that, I started the cluster with ./start-all.sh, and I can see my master UI on port 8888.
But when I try to run an application I get the following WARN:
17/02/23 13:57:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/23 13:57:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/23 13:57:32 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/23 13:57:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
When I check my cluster, I can see Spark killing executors and creating new ones. I tried using bigger instances and it still doesn't work.
What is happening? How can I fix it?

Related

Spark app fails after ACCEPTED state for a long time. Log says Socket timeout exception

I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes and 1 resource manager node. I tried to run the SparkPi example in cluster mode. The spark-submit is done from my local machine. YARN accepts the job, but the application UI shows it stuck in the ACCEPTED state. Further, in the terminal where I submitted the job it says
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until it fails; upon failure it prints a socket timeout exception.
I tried increasing spark.executor.heartbeatInterval to 3600 seconds. Still no luck. I also tried running the code from the name node, thinking there must be some connection issue with my local machine. I'm still unable to run it.
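My submission looked roughly like this (the jar path is a placeholder; only the heartbeat setting differs from the default run):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.executor.heartbeatInterval=3600s \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 100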
I found the answer, albeit I don't know why it works! Adding the private IP address to the security group in AWS did the trick.

Spark workers stopped after driver commanded a shutdown

Basically, the master node also acts as one of the slaves. Once the slave on the master completed, it called the SparkContext to stop, and this command propagated to all the slaves, stopping execution in the middle of processing.
Error log in one of the worker:
INFO SparkHadoopMapRedUtil: attempt_201612061001_0008_m_000005_18112: Committed
INFO Executor: Finished task 5.0 in stage 8.0 (TID 18112). 2536 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Check your resource manager user interface; if you see that any executor failed, it will give details about a memory error. However, if no executor has failed but the driver still called for a shutdown, this is usually due to driver memory, so please try to increase the driver memory. Let me know how it goes.
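For example, if you submit with spark-submit, raising the driver memory is a one-flag change; the value, class, and jar below are placeholders to adapt to your job:
# bump the driver memory up from the 1g default
spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar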

Spark standalone master HA jobs in WAITING status

We are trying to set up HA for the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are using for Spark HA as well.
We configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
Started both the masters, then started the shell; the status of the job is RUNNING.
master1 is ALIVE and master2 is in STANDBY status.
I killed master1; master2 took over and all the workers appeared alive under master2.
The shell which was already running was moved to the new master. However, the application is in WAITING status and the executors are in LOADING status.
There is no error in the worker log or executor log, except a notification that they connected to the new master.
I could see the workers re-registered, but the executors do not seem to start. Is there anything I am missing?
My Spark version is 1.5.0.

Election of new zookeeper leader shuts down the Spark Master

I realized that the Spark master becomes unresponsive when I kill the ZooKeeper leader (of course I assigned the leader-election task to ZooKeeper). The following is the error log that I see on the Spark master node. Do you have any suggestions to resolve it?
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x24dc5a319b40090, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 WARN ConnectionStateManager: There are no ConnectionStateListeners registered.
15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master shutting down.
This is the expected behaviour. You have to set up 'n' masters, and you need to specify the ZooKeeper URL in every master's spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
Note that ZooKeeper maintains a quorum. This means you need an odd number of ZooKeeper nodes, and the ZooKeeper cluster is up only while the quorum is maintained. Since Spark depends on ZooKeeper, it follows that the Spark cluster will not be up unless the ZooKeeper quorum is maintained.
When you set up two (or n) masters and bring down a ZooKeeper node, the current master will go down, a new master will be elected, and all the worker nodes will attach to the new master.
You should have started your workers with:
./start-slave.sh spark://master1:port1,master2:port2
You have to wait 1-2 minutes to notice this failover.
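Similarly, submit applications (or start the shell) against both masters so the driver can fail over too; a sketch assuming the default port and a placeholder app:
./bin/spark-submit --master spark://master1:7077,master2:7077 --class com.example.MyApp my-app.jar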

How to find spark master URL on Amazon EMR

I am new to Spark and trying to install Spark 1.3.1 on an Amazon cluster. When I do
SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("local[2]");
it does work for me; however, I came to know that this is for testing purposes, where I can set local[2].
When I tried to use cluster mode I changed it to
SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("spark://localhost:7077");
With this I am getting the below error:
Tried to associate with unreachable remote address [akka.tcp://sparkMaster@localhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused
15/06/10 15:22:21 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@localhost:7077/user/Master..
Could someone please tell me how to set the master URL?
If you are using the bootstrap action from https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, the configuration is set up for Spark on YARN. So just set the master to yarn-client or yarn-cluster. Be sure to define the number of executors along with their memory and cores. More details about Spark on YARN are at https://spark.apache.org/docs/latest/running-on-yarn.html
Addition regarding executor settings for memory and core sizing:
Take a look at the default YARN node manager configs for each instance type at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html, specifically yarn.scheduler.maximum-allocation-mb. You can determine the number of cores from the basic EC2 instance info page (http://aws.amazon.com/ec2/instance-types/). The maximum executor memory has to fit within the max allocation less Spark's overhead, in increments of 256MB. A good description of this calculation is at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Don't forget that a little over half the executor memory can be used for the RDD cache.
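Putting it together, a submission might look roughly like this; the executor numbers are only illustrative and must fit under yarn.scheduler.maximum-allocation-mb for your instance type, and the class and jar are placeholders:
spark-submit --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2304m \
  --executor-cores 2 \
  --class com.example.SparkSQLTest \
  spark-sql-test.jar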
