Spark initial job has not accepted any resources - python-3.x

I am having trouble getting my program to run on my Spark cluster. I set the cluster up with 1 master and 4 slaves. I started the master, then the slaves, and they show up in the master's web UI.
I then start a small Python script to check whether jobs can be executed:
from pyspark import *  # SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from files import files
import sys

if __name__ == "__main__":
    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)
    conf.set("spark.driver.cores", "1")
    conf.set("spark.driver.memory", "1g")
    conf.set("spark.executor.cores", "1")
    conf.set("spark.executor.memory", "4g")
    conf.set("spark.python.worker.memory", "256m")
    conf.set("spark.cores.max", "4")
    conf.set("spark.shuffle.service.enabled", "true")
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.dynamicAllocation.maxExecutors", "1")

    for k, v in conf.getAll():
        print(k + ":" + v)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory", "1g").getOrCreate()

    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()

    print("#############")
    print("Test finished")
    print("#############")
But as soon as something has to come back from the cluster (the line "spark.createDataFrame(l).collect()"), Spark seems to hang. After a while, I see the message:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources: check your cluster UI to ensure that workers are registered and have sufficient resources"
So I check the cluster UI:
worker-20171027105227-xx.x.x.x6-35309 10.0.2.56:35309 ALIVE 4 (0 Used) 6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433 10.0.2.10:43433 ALIVE 16 (1 Used) 30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126 10.0.2.65:45126 ALIVE 8 (0 Used) 30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477 10.0.2.64:42477 ALIVE 16 (0 Used) 30.4 GB (0.0 B Used)
It looks like there are plenty of resources for the small task I created. I can also see the application actually running there. When I click on it, I see that it was launched with 5 executors, and all but one EXITED. When I open the log of one of the exited executors, I see the following error message:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443#CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, geissler); groups with view permissions: Set(); users with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more
To me this looks as if the slaves cannot deliver their results back to the master, but I don't know what to do at this point. The slaves are in the same network layer as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can reach the master server? Are there any configuration settings I overlooked when setting up the cluster?
Spark version: 2.1.2 (on master, nodes and pyspark)

The error here was that the Python script was executed locally. Always launch your Spark scripts through spark-submit; never just run them as a normal program. The same is true for Java Spark programs.
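For example, a minimal submission of the script above to the standalone master could look like the following (the script filename is a placeholder):
spark-submit --master spark://10.0.2.55:7077 spark_example.py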

Related

Using `spark-submit` to start a job in a single node standalone spark cluster

I have a single-node Spark cluster (4 CPU cores and 15 GB of memory) configured with a single worker. I can access the web UI and see the worker node. However, I am having trouble submitting jobs using spark-submit. I have a couple of questions.
I have an uber-jar file stored on the cluster. I used the following command to submit a job:
spark-submit --class Main --deploy-mode cluster --master spark://cluster:7077 uber-jar.jar
This starts the job, but it fails immediately with the following log messages.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/13 01:19:47 INFO SecurityManager: Changing view acls to: admin
19/11/13 01:19:47 INFO SecurityManager: Changing modify acls to: admin
19/11/13 01:19:47 INFO SecurityManager: Changing view acls groups to:
19/11/13 01:19:47 INFO SecurityManager: Changing modify acls groups to:
19/11/13 01:19:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(admin); groups with view permissions: Set(); users with modify permissions: Set(admin); groups with modify permissions: Set()
19/11/13 01:19:48 INFO Utils: Successfully started service 'driverClient' on port 46649.
19/11/13 01:19:48 INFO TransportClientFactory: Successfully created connection to cluster/10.10.10.10:7077 after 37 ms (0 ms spent in bootstraps)
19/11/13 01:19:48 INFO ClientEndpoint: Driver successfully submitted as driver-20191113011948-0010
19/11/13 01:19:48 INFO ClientEndpoint: ... waiting before polling master for driver state
19/11/13 01:19:53 INFO ClientEndpoint: ... polling master for driver state
19/11/13 01:19:53 INFO ClientEndpoint: State of driver-20191113011948-0010 is FAILED
19/11/13 01:19:53 INFO ShutdownHookManager: Shutdown hook called
19/11/13 01:19:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-4da02cd2-5cfc-4a2a-ad10-41a594569ea1
What am I doing wrong, and how do I correctly submit the job?
If my uber-jar file is on my local computer, how do I correctly use spark-submit to submit a Spark job with that uber-jar to the cluster from my local computer? I've experimented with running spark-shell on my local computer pointed at the standalone cluster, using spark-shell --master spark://cluster:7077. This starts a Spark shell on my local computer, and I can see (in the Spark web UI) that the worker in the cluster gets memory assigned to it. However, if I try to perform a task in the shell, I get the following error message:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
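For the second question, a hedged sketch (the jar path is a placeholder): with --deploy-mode client the driver runs on the submitting machine and the jar can stay local, but the workers must then be able to reach the driver over the network, otherwise the same warning appears:
spark-submit --class Main --deploy-mode client --master spark://cluster:7077 /local/path/to/uber-jar.jar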

Check your cluster UI to ensure that workers are registered and have sufficient resources

I created a program to test data selection on Cassandra.
Here is my code.
It simply selects all the data and shows it in the console.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

import local_settings


def get_spark_context(app_name, max_cores=120):
    # checkpointDirectory = ""
    conf = SparkConf().setMaster(local_settings.SPARK_MASTER).setAppName(app_name) \
        .set("spark.cores.max", max_cores) \
        .set("spark.jars.packages", "datastax:spark-cassandra-connector:2.0.0-s_2.11") \
        .set("spark.cassandra.connection.host", local_settings.CASSANDRA_MASTER)
    # set up the Spark context
    sc = SparkContext.getOrCreate(conf=conf)
    sc.setCheckpointDir(local_settings.CHECKPOINT_DIRECTORY)
    return sc


def get_sql_context(sc):
    sqlc = SQLContext.getOrCreate(sc)
    return sqlc


def run():
    sc = get_spark_context("Select data")
    sql_context = get_sql_context(sc)
    sql_context.read.format("org.apache.spark.sql.cassandra") \
        .options(table="text", keyspace="data") \
        .load().show()
However, the console gets stuck on the following log message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
and it never ends.
19/02/21 09:09:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/21 09:09:23 WARN Utils: Your hostname, osboxes resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
19/02/21 09:09:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/02/21 09:09:44 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/02/21 09:09:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
So I checked the log of my spark-worker.
The error log is the following:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/02/21 08:58:18 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 15264#mm_h01
19/02/21 08:58:18 INFO SignalUtils: Registered signal handler for TERM
19/02/21 08:58:18 INFO SignalUtils: Registered signal handler for HUP
19/02/21 08:58:18 INFO SignalUtils: Registered signal handler for INT
19/02/21 08:58:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/21 08:58:19 INFO SecurityManager: Changing view acls to: hadoop,osboxes
19/02/21 08:58:19 INFO SecurityManager: Changing modify acls to: hadoop,osboxes
19/02/21 08:58:19 INFO SecurityManager: Changing view acls groups to:
19/02/21 08:58:19 INFO SecurityManager: Changing modify acls groups to:
19/02/21 08:58:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop, osboxes); groups with view permissions: Set(); users with modify permissions: Set(hadoop, osboxes); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 11 more
19/02/21 09:00:19 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
What does this mean? Is there no communication between the master and the workers?
Thanks a lot.
This means that the job has been submitted to YARN, but due to insufficient resources YARN currently cannot provide what was requested and therefore cannot launch it.
Go to the Ambari/Cloudera UI and see whether any other jobs are running.
Check the container size configured for YARN.
Check whether the resources configured for the job exceed the total available to YARN/Mesos (see the sketch below).
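As a sketch for the last point, reusing the local_settings module from the question (the values are only illustrative, not taken from any real cluster): the warning typically means no worker can satisfy the request, for example because spark.executor.memory is larger than any single worker offers or another application already holds all the cores, so shrinking the request is a quick test:
from pyspark import SparkConf, SparkContext

import local_settings  # the same settings module used in the question

conf = (SparkConf()
        .setMaster(local_settings.SPARK_MASTER)
        .setAppName("Select data")
        # ask only for what the workers can actually offer (illustrative values)
        .set("spark.cores.max", "4")
        .set("spark.executor.memory", "1g"))
sc = SparkContext.getOrCreate(conf=conf)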

Remote Spark standalone executor error

I am running a Spark (2.0.1) standalone cluster on a remote server (Microsoft Azure). I am able to connect my Spark app to this cluster; however, the tasks get stuck without any execution, with the following warning:
WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What I have tried:
I have ensured that the memory and CPU requirements for my app do not exceed the server config.
Have supplied these variables to my spark-env.sh: SPARK_PUBLIC_DNS, SPARK_DRIVER_HOST, SPARK_LOCAL_IP, SPARK_MASTER_HOST.
Can see the master / worker / application webui on the browser.
Have all the ports open on the remote server (for my IP and the vpn).
Disabled ufw.
As far as I can tell, my workers are not able to relay back to the master. Executors are timing out after 120s with the following stderr:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/19 18:15:09 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 17261#sparkmasternew
16/11/19 18:15:09 INFO SignalUtils: Registered signal handler for TERM
16/11/19 18:15:09 INFO SignalUtils: Registered signal handler for HUP
16/11/19 18:15:09 INFO SignalUtils: Registered signal handler for INT
16/11/19 18:15:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/19 18:15:10 INFO SecurityManager: Changing view acls to: ubuntu,user1
16/11/19 18:15:10 INFO SecurityManager: Changing modify acls to: ubuntu,user1
16/11/19 18:15:10 INFO SecurityManager: Changing view acls groups to:
16/11/19 18:15:10 INFO SecurityManager: Changing modify acls groups to:
16/11/19 18:15:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu, user1); groups with view permissions: Set(); users with modify permissions: Set(ubuntu, user1); groups with modify permissions: Set()
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:174)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:270)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more
I am using my VM's private IP for SPARK_DRIVER_HOST, SPARK_LOCAL_IP and SPARK_MASTER_HOST, and the public IP as SPARK_PUBLIC_DNS and to connect to the master. The master and workers are running on the same VM, and this exact setup works on an EC2 instance. Any help would be appreciated.
UPDATE: I am able to run spark-shell normally from within the machine. The problem seems to be similar to this one: the executors can't interact with the driver, although I have the ports open on the VM. Is there a way to bind the driver to the public IP of my instance/laptop?
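A hedged sketch of the properties involved (the addresses are placeholders, and spark.driver.bindAddress only exists from Spark 2.1 on, so on 2.0.1 only spark.driver.host is available): the driver must advertise an address the executors can actually reach, which is what the 120-second ask timeout above suggests is failing.
from pyspark import SparkConf

conf = SparkConf()
conf.setMaster("spark://<master-address>:7077")                     # placeholder
conf.setAppName("connectivity-check")
# address the executors use to call back to the driver; must be reachable from the workers
conf.set("spark.driver.host", "<address-reachable-by-executors>")   # placeholder
# Spark 2.1+ only: bind to a local interface while advertising spark.driver.host above
# conf.set("spark.driver.bindAddress", "0.0.0.0")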

spark-cassandra connector in local gives Spark cluster looks down

I am very new to Spark and Cassandra. I am trying a simple Java program that adds new rows to a Cassandra table using the spark-cassandra-connector provided by DataStax.
I am running DSE on my laptop. Using Java, I am trying to save the data to the Cassandra DB through Spark. Following is the code:
Map<String, String> extra = new HashMap<String, String>();
extra.put("city", "bangalore");
extra.put("dept", "software");
List<User> products = Arrays.asList(new User(1, "vamsi", extra));
JavaRDD<User> productsRDD = sc.parallelize(products);
javaFunctions(productsRDD, User.class).saveToCassandra("test", "users");
When I execute this code, I get the following error:
16/03/26 20:57:31 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:44 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:57:51 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:59 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:58:11 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
16/03/26 20:58:11 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
16/03/26 20:58:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/26 20:58:11 INFO scheduler.DAGScheduler: Failed to run runJob at RDDFunctions.scala:48
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Looks like you need to fix your Spark configuration...see this:
http://www.datastax.com/dev/blog/common-spark-troubleshooting

spark 1.3 workers accepting jobs but console says resources not available

I am trying to run Apache Spark 1.3 on Amazon EMR with Amazon's Hadoop 2.4, in standalone mode with 2 workers. But when I do, I get the following message:
[TaskSchedulerImpl] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am setting the following parameters:
SparkConf conf = new SparkConf();
conf.setAppName("SVM Classifier Example");
conf.set("spark.executor.memory", "1024m");
conf.set("spark.cores.max", "1");
But when I run the same thing locally (with Apache Hadoop 2.4 and Spark 1.3), I am able to execute it in a few seconds.
I checked that each worker machine has plenty of free memory, around 1.6 GB in both cases, so that is not the issue.
Here is what the logs on the worker say:
15/03/26 20:54:27 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/03/26 20:54:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/26 20:54:29 INFO spark.SecurityManager: Changing view acls to: root
15/03/26 20:54:29 INFO spark.SecurityManager: Changing modify acls to: root
15/03/26 20:54:29 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/26 20:54:30 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/26 20:54:31 INFO Remoting: Starting remoting
15/03/26 20:54:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#ip-XXXX.ec2.internal:50899]
15/03/26 20:54:31 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 50899.
15/03/26 20:54:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#ip-XXXX.ec2.internal:49161] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://sparkDriver#ip-XXXX.ec2.internal:49161]].
I am not able to figure out what is wrong. Any input and suggestions are appreciated.
EDIT: I am not able to upload a screenshot of my console, but here are the details:
> Worker Id   Cores        Memory
> 1           8 (8 Used)   1172.0 MB (1024.0 MB Used)
> 2           8 (8 Used)   1536.0 MB (1024.0 MB Used)
>
> Running Applications
> ID   Cores   Memory per Node   User   State     Duration
> 1    16      1024.0 MB         root   Running   1.5h
So it turns out the problem was the firewall on my system. The firewall policy allowed the workers to communicate with the master but not with the driver. Opening the ports for two-way communication resolved my problem.
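For reference, the driver and block manager listen on random ports by default, which makes firewall rules awkward. A hedged sketch of pinning them (the port numbers are arbitrary examples; the same properties can equally be set with conf.set(...) in Java or via --conf on spark-submit):
from pyspark import SparkConf

conf = SparkConf()
# fixed ports so the firewall can explicitly allow worker-to-driver traffic
conf.set("spark.driver.port", "40000")        # arbitrary example port
conf.set("spark.blockManager.port", "40001")  # arbitrary example port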
