Can't connect from application to the standalone cluster - apache-spark

I'm trying to connect from an application to Spark's standalone cluster, with everything running on one machine.
I start the standalone master with this command:
bash start-master.sh
Then I start one worker with this command:
bash spark-class org.apache.spark.deploy.worker.Worker spark://PC:7077 -m 512m
(I allocated 512 MB to it.)
In the master's web UI at:
http://localhost:8080
I can see that the master and the worker are running.
Then I try to connect to the cluster from the application with the following line:
JavaSparkContext sc = new JavaSparkContext("spark://PC:7077", "myapplication");
When I run the application, it crashes with the following error message:
14/11/01 22:53:26 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:53:26 INFO spark.SparkContext: Starting job: collect at App.java:115
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Got job 0 (collect at App.java:115) with 2 output partitions (allowLocal=false)
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at App.java:115)
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Missing parents: List()
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at App.java:109), which has no missing parents
14/11/01 22:53:27 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at App.java:109)
14/11/01 22:53:27 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/01 22:53:42 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:53:46 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:53:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:54:06 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:54:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:54:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/11/01 22:54:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/01 22:54:26 INFO scheduler.DAGScheduler: Failed to run collect at App.java:115
Exception in thread "main" 14/11/01 22:54:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
Any ideas what is going on?
P.S. I'm using a pre-built version of Spark - spark-1.1.0-bin-hadoop2.4.
Thank You.

Make sure that both the standalone workers and the Spark driver are connected to the Spark master on the exact address listed in its web UI / printed in its startup log message. Spark uses Akka for some of its control-plane communication and Akka can be really picky about hostnames, so these need to match exactly.
There are several options to control which hostnames / network interfaces the driver and master will bind to. Probably the simplest option is to set the SPARK_LOCAL_IP environment variable to control the address that the Master / Driver will bind to. See http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html for an overview of the other settings that affect network address binding.
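To make the "match exactly" point concrete, here is a minimal Python sketch. It is not part of Spark, and the function names are hypothetical. It compares the master URL shown in the web UI with the one hard-coded in the application, treating the host as a literal string with no DNS resolution or case folding, on the assumption that this is how strict the match needs to be.

```python
def split_master_url(url):
    """Split 'spark://host:port' into (host, port) without normalising the host."""
    prefix = "spark://"
    if not url.startswith(prefix):
        raise ValueError("not a spark:// URL: %r" % url)
    host, sep, port = url[len(prefix):].rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected spark://host:port, got %r" % url)
    return host, int(port)


def same_master(url_from_ui, url_in_app):
    """True only if host and port are literally identical."""
    return split_master_url(url_from_ui) == split_master_url(url_in_app)


print(same_master("spark://PC:7077", "spark://PC:7077"))         # True
print(same_master("spark://PC:7077", "spark://localhost:7077"))  # False
```

If the two URLs differ, even if they resolve to the same machine, it is probably worth changing the application to use the string from the web UI verbatim before looking at anything else.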

Related

All executors finish with state KILLED and exitStatus 1

I am trying to set up a local Spark cluster. I am using Spark 2.4.4 on a Windows 10 machine.
To start the master and one worker I run:
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker 172.17.1.230:7077
After submitting an application to the cluster, it finishes successfully but in the Spark web admin UI it says that the application is KILLED. It's also what I get from worker logs. I have tried running my own examples and examples included in the Spark installation. They all get killed with exitStatus 1.
To start the JavaSparkPi example from the Spark installation folder, I run:
Spark> spark-submit --master spark://172.17.1.230:7077 --class org.apache.spark.examples.JavaSparkPi .\examples\jars\spark-examples_2.11-2.4.4.jar
Part of the log output after the calculation finishes:
20/01/19 18:55:11 INFO DAGScheduler: Job 0 finished: reduce at JavaSparkPi.java:54, took 4.183853 s
Pi is roughly 3.13814
20/01/19 18:55:11 INFO SparkUI: Stopped Spark web UI at http://Nikola-PC:4040
20/01/19 18:55:11 INFO StandaloneSchedulerBackend: Shutting down all executors
20/01/19 18:55:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/01/19 18:55:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/01/19 18:55:11 WARN TransportChannelHandler: Exception in connection from /172.17.1.230:58560
java.io.IOException: An existing connection was forcibly closed by the remote host
The stderr log of the completed application ends with:
20/01/19 18:55:11 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 910 bytes result sent to driver
20/01/19 18:55:11 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 910 bytes result sent to driver
20/01/19 18:55:11 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
The worker log shows:
20/01/19 18:55:06 INFO ExecutorRunner: Launch command: "C:\Program Files\Java\jdk1.8.0_231\bin\java" "-cp" "C:\Users\nikol\Spark\bin\..\conf\;C:\Users\nikol\Spark\jars\*" "-Xmx1024M" "-Dspark.driver.port=58484" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@Nikola-PC:58484" "--executor-id" "0" "--hostname" "172.17.1.230" "--cores" "12" "--app-id" "app-20200119185506-0001" "--worker-url" "spark://Worker@172.17.1.230:58069"
20/01/19 18:55:11 INFO Worker: Asked to kill executor app-20200119185506-0001/0
20/01/19 18:55:11 INFO ExecutorRunner: Runner thread for executor app-20200119185506-0001/0 interrupted
20/01/19 18:55:11 INFO ExecutorRunner: Killing process!
20/01/19 18:55:11 INFO Worker: Executor app-20200119185506-0001/0 finished with state KILLED exitStatus 1
I have tried Spark 2.4.4 built for Hadoop 2.6 and for Hadoop 2.7. The problem remains in both cases.
This problem is the same as this one.

How to deploy spark on Kubeedge?

I tried to use the k8s deployment mode to deploy spark-2.4.3 on Kubeedge 1.1.0, but it failed (docker version 19.03.4, k8s version 1.16.1).
SPARK_DRIVER_BIND_ADDRESS=10.4.20.34
SPARK_IMAGE=spark:2.4.3
SPARK_MASTER="k8s://http://127.0.0.1:8080"
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--conf "spark.kubernetes.container.image=${SPARK_IMAGE}"
--conf "spark.executor.instances=1"
--conf "spark.kubernetes.executor.limit.cores=1"
--deploy-mode client
--master ${SPARK_MASTER}
--name spark-pi
--class org.apache.spark.examples.SparkPi
--driver-memory 1G
--executor-memory 1G
--num-executors 1
--executor-cores 1
file://${PWD}/spark-examples_2.11-2.4.3.jar
)
${CMD[@]}
Node status is normal.
kubectl get nodes
NAME             STATUS   ROLES    AGE    VERSION
edge-node-001    Ready    edge     6d1h   v1.15.3-kubeedge-v1.1.0-beta.0.178+c6a5aa738261e7-dirty
ubuntu-ms-7b89   Ready    master   6d4h   v1.16.1
But I got some errors
19/11/17 21:45:12 INFO k8s.ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/11/17 21:45:12 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46571.
19/11/17 21:45:12 INFO netty.NettyBlockTransferService: Server created on 10.4.20.34:46571
19/11/17 21:45:12 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.4.20.34:46571 with 366.3 MB RAM, BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@451882b2{/metrics/json,null,AVAILABLE,@Spark}
19/11/17 21:45:42 INFO k8s.KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
19/11/17 21:45:42 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Missing parents: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.4.20.34:46571 (size: 1256.0 B, free: 366.3 MB)
19/11/17 21:45:42 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
19/11/17 21:45:42 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/11/17 21:45:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:27 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:42 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:47:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Is it possible to deploy Spark on Kubeedge in Kubernetes deployment mode? Or should I try standalone deployment mode?
I'm so confused.

Spark UI's kill is not killing Driver

I am trying to kill my spark-kafka streaming job from the Spark UI. It kills the application, but the driver is still running.
Can anyone help me with this? My other streaming jobs are fine; only one of them has this problem, and it happens every time.
I can't kill the driver from the command line or from the Spark UI. The Spark master is alive.
The output I collected from the logs is:
16/10/25 03:14:25 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
16/10/25 03:14:25 INFO SparkUI: Stopped Spark web UI at http://***:4040
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/10/25 03:14:35 INFO AppClient: Stop request to Master timed out; it may already be shut down.
16/10/25 03:14:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/25 03:14:35 INFO MemoryStore: MemoryStore cleared
16/10/25 03:14:35 INFO BlockManager: BlockManager stopped
16/10/25 03:14:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/25 03:14:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/25 03:14:35 INFO SparkContext: Successfully stopped SparkContext
16/10/25 03:14:35 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:172)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/25 03:14:35 WARN NettyRpcEnv: Ignored message: true
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
Get the running driver ID from the Spark UI, then send a POST request to the Spark master's REST port (for example, 6066) to kill the driver. I have tested this with Spark 1.6.1.
curl -X POST http://localhost:6066/v1/submissions/kill/driverId
Hope it helps...
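If you would rather script this than call curl by hand, a minimal Python sketch of the same POST could look like the following. The function names and the driver ID are hypothetical, and the host and port defaults are simply the ones from the curl command above, so adjust them to your master.

```python
import urllib.request


def build_kill_request(driver_id, master_host="localhost", rest_port=6066):
    """Build (but do not send) the POST request that asks the master to kill a driver."""
    url = "http://%s:%d/v1/submissions/kill/%s" % (master_host, rest_port, driver_id)
    return urllib.request.Request(url, data=b"", method="POST")


def kill_driver(driver_id, **kwargs):
    """Send the request and return the master's raw response body."""
    request = build_kill_request(driver_id, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Hypothetical driver ID; copy the real one from the master's web UI.
req = build_kill_request("driver-20161025031400-0000")
print(req.get_method(), req.full_url)
```

Calling kill_driver(...) with the real ID sends the request. Building the request separately makes it easy to check the URL before anything is actually killed.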

Spark not doing any work on slave: Initial job has not accepted any resources

I am trying to do a very simple setup with Spark using SSH tunneling and I can't make it work.
I have the master running on my PC, started with ./sbin/start-master.sh -h localhost -p 7077 (unless stated otherwise, everything else is default).
On my slave PC (IP 192.168.0.222), which is in another domain and on which I don't have root access, I ran ssh -N -L localhost:7078:localhost:7077 myMasterPCSSHalias and started the slave with ./sbin/start-slave.sh spark://localhost:7078. I can now see this slave on the dashboard at http://localhost:8080/ in my browser, and it shows 14 GB of free memory.
When I then try e.g. this example:
./bin/spark-submit --master spark://localhost:7077 examples/src/main/python/pi.py 10
it hangs on this message until I kill it (you can see the full log message below):
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am sure I am not requesting more resources than are available: the problem persists even with --executor-memory 512m, and the executor just stays in the RUNNING state. The only thing in the error log is this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:45:44 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/05/09 22:45:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:45:45 INFO SecurityManager: Changing view acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: Changing modify acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1, dan); users with modify permissions: Set(hnykdan1, dan)
and the slave log contains this:
16/05/09 22:48:56 INFO Worker: Asked to launch executor app-20160509224034-0013/0 for PythonPi
16/05/09 22:48:56 INFO SecurityManager: Changing view acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: Changing modify acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1); users with modify permissions: Set(hnykdan1)
16/05/09 22:48:56 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/home/hnykdan1/spark/conf/:/home/hnykdan1/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/hnykdan1/spark/lib/datanucleus-core-3.2.10.jar:/home/hnykdan1/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hnykdan1/spark/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=37450" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.222:37450" "--executor-id" "0" "--hostname" "147.32.8.103" "--cores" "8" "--app-id" "app-20160509224034-0013" "--worker-url" "spark://Worker@147.32.8.103:54894"
Everything looks quite normal and I don't know where the problem might be. Do I need to tunnel in the other direction as well? It runs fine when I run the slave locally in exactly the same fashion. Thanks
Full Log from console
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:28:21 INFO SparkContext: Running Spark version 1.6.1
16/05/09 22:28:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:28:22 INFO SecurityManager: Changing view acls to: dan
16/05/09 22:28:22 INFO SecurityManager: Changing modify acls to: dan
16/05/09 22:28:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dan); users with modify permissions: Set(dan)
16/05/09 22:28:22 INFO Utils: Successfully started service 'sparkDriver' on port 34508.
16/05/09 22:28:23 INFO Slf4jLogger: Slf4jLogger started
16/05/09 22:28:23 INFO Remoting: Starting remoting
16/05/09 22:28:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.222:44359]
16/05/09 22:28:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44359.
16/05/09 22:28:23 INFO SparkEnv: Registering MapOutputTracker
16/05/09 22:28:23 INFO SparkEnv: Registering BlockManagerMaster
16/05/09 22:28:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-db4c3293-423f-4966-a479-b69a90439da9
16/05/09 22:28:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/09 22:28:23 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/09 22:28:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/09 22:28:24 INFO SparkUI: Started SparkUI at http://192.168.0.222:4040
16/05/09 22:28:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/httpd-aa031b8c-f605-41c3-aabe-fc4fe01bdcf8
16/05/09 22:28:24 INFO HttpServer: Starting HTTP Server
16/05/09 22:28:24 INFO Utils: Successfully started service 'HTTP file server' on port 41770.
16/05/09 22:28:24 INFO Utils: Copying /home/hnykdan1/spark/examples/src/main/python/pi.py to /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/userFiles-14720bed-cd41-4b15-9bd3-38dbf4f268ff/pi.py
16/05/09 22:28:24 INFO SparkContext: Added file file:/home/hnykdan1/spark/examples/src/main/python/pi.py at http://192.168.0.222:41770/files/pi.py with timestamp 1462825704629
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160509222824-0011
16/05/09 22:28:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44617.
16/05/09 22:28:24 INFO NettyBlockTransferService: Server created on 44617
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Executor added: app-20160509222824-0011/0 on worker-20160509214654-147.32.8.103-54894 (147.32.8.103:54894) with 8 cores
16/05/09 22:28:24 INFO BlockManagerMaster: Trying to register BlockManager
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160509222824-0011/0 on hostPort 147.32.8.103:54894 with 8 cores, 1024.0 MB RAM
16/05/09 22:28:24 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.222:44617 with 511.1 MB RAM, BlockManagerId(driver, 192.168.0.222, 44617)
16/05/09 22:28:24 INFO BlockManagerMaster: Registered BlockManager
16/05/09 22:28:25 INFO AppClient$ClientEndpoint: Executor updated: app-20160509222824-0011/0 is now RUNNING
16/05/09 22:28:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/05/09 22:28:25 INFO SparkContext: Starting job: reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39
16/05/09 22:28:25 INFO DAGScheduler: Got job 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39) with 10 output partitions
16/05/09 22:28:25 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:25 INFO DAGScheduler: Parents of final stage: List()
16/05/09 22:28:25 INFO DAGScheduler: Missing parents: List()
16/05/09 22:28:25 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39), which has no missing parents
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KB, free 4.0 KB)
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 6.7 KB)
16/05/09 22:28:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.222:44617 (size: 2.7 KB, free: 511.1 MB)
16/05/09 22:28:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/09 22:28:26 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
16/05/09 22:28:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:28:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Since you checked that you have the resources, the next most likely problem is that the executor cannot connect back to the driver. When submitting a job, the driver starts a server that the executor will connect to in order to download the jar(s).
Yes, the error message (Initial job has not accepted any resources...) does not look related to a network problem. This is a known issue, discussed for example here:
https://github.com/databricks/spark-knowledgebase/issues/9
It's probably related to the network (security group rules). It's a silly test, but I just made it work by opening the master and the workers to all TCP traffic (inbound/outbound).
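Before opening everything up, a quick way to test the "executor cannot connect back to the driver" theory is to try a plain TCP connection from the worker machine to the driver's host and port. Below is a minimal Python sketch; the function name is hypothetical, and the host and port in the comment are just the values from the "--driver-url" in the worker's Launch command line above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Example, run on the worker machine while the job is hanging
# (values taken from the worker log; yours will differ):
# can_connect("192.168.0.222", 37450)
```

If this returns False from the worker while the driver is running, the executor most likely cannot reach the driver either, which would be consistent with the job hanging on "Initial job has not accepted any resources".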

Can't run spark-submit with an application jar on a Mesos cluster

Mesosphere did a great job of simplifying the process of running Spark on Mesos. I am using this guide to set up a development Mesos cluster on Google Cloud Compute.
https://mesosphere.com/docs/tutorials/run-spark-on-mesos/
I can run the example from the guide using spark-shell (finding numbers less than 10). However, when I attempt to submit an application that otherwise works fine with Spark locally, it blows up with TASK_FAILED messages (e.g. CoarseMesosSchedulerBackend: Mesos task 4 is now TASK_FAILED).
Here's the command I'm using with the provided Spark Pi example.
./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://10.173.40.36:5050 ~/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar 100
And the output:
jclouds@development-5159-d9:~/learning-spark$ ~/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://10.173.40.36:5050 ~/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar 100
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/22 16:44:02 INFO SparkContext: Running Spark version 1.3.0
15/03/22 16:44:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/22 16:44:03 INFO SecurityManager: Changing view acls to: jclouds
15/03/22 16:44:03 INFO SecurityManager: Changing modify acls to: jclouds
15/03/22 16:44:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jclouds); users with modify permissions: Set(jclouds)
15/03/22 16:44:03 INFO Slf4jLogger: Slf4jLogger started
15/03/22 16:44:03 INFO Remoting: Starting remoting
15/03/22 16:44:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@development-5159-d9.c.learning-spark.internal:60301]
15/03/22 16:44:03 INFO Utils: Successfully started service 'sparkDriver' on port 60301.
15/03/22 16:44:03 INFO SparkEnv: Registering MapOutputTracker
15/03/22 16:44:03 INFO SparkEnv: Registering BlockManagerMaster
15/03/22 16:44:03 INFO DiskBlockManager: Created local directory at /tmp/spark-27fad7e3-4ad7-44d6-845f-4a09ac9cce90/blockmgr-a558b7be-0d72-49b9-93fd-5ef8731b314b
15/03/22 16:44:03 INFO MemoryStore: MemoryStore started with capacity 265.0 MB
15/03/22 16:44:04 INFO HttpFileServer: HTTP File server directory is /tmp/spark-de9ac795-381b-4acd-a723-a9a6778773c9/httpd-7115216c-0223-492b-ae6f-4134ba7228ba
15/03/22 16:44:04 INFO HttpServer: Starting HTTP Server
15/03/22 16:44:04 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/22 16:44:04 INFO AbstractConnector: Started SocketConnector@0.0.0.0:36663
15/03/22 16:44:04 INFO Utils: Successfully started service 'HTTP file server' on port 36663.
15/03/22 16:44:04 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/22 16:44:04 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/22 16:44:04 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/03/22 16:44:04 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/22 16:44:04 INFO SparkUI: Started SparkUI at http://development-5159-d9.c.learning-spark.internal:4040
15/03/22 16:44:04 INFO SparkContext: Added JAR file:/home/jclouds/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://10.173.40.36:36663/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1427042644934
Warning: MESOS_NATIVE_LIBRARY is deprecated, use MESOS_NATIVE_JAVA_LIBRARY instead. Future releases will not support JNI bindings via MESOS_NATIVE_LIBRARY.
Warning: MESOS_NATIVE_LIBRARY is deprecated, use MESOS_NATIVE_JAVA_LIBRARY instead. Future releases will not support JNI bindings via MESOS_NATIVE_LIBRARY.
I0322 16:44:05.035423 308 sched.cpp:137] Version: 0.21.1
I0322 16:44:05.038136 309 sched.cpp:234] New master detected at master@10.173.40.36:5050
I0322 16:44:05.039261 309 sched.cpp:242] No credentials provided. Attempting to register without authentication
I0322 16:44:05.040351 310 sched.cpp:408] Framework registered with 20150322-040336-606645514-5050-2744-0019
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Registered as framework ID 20150322-040336-606645514-5050-2744-0019
15/03/22 16:44:05 INFO NettyBlockTransferService: Server created on 44177
15/03/22 16:44:05 INFO BlockManagerMaster: Trying to register BlockManager
15/03/22 16:44:05 INFO BlockManagerMasterActor: Registering block manager development-5159-d9.c.learning-spark.internal:44177 with 265.0 MB RAM, BlockManagerId(<driver>, development-5159-d9.c.learning-spark.internal, 44177)
15/03/22 16:44:05 INFO BlockManagerMaster: Registered BlockManager
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now TASK_RUNNING
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now TASK_RUNNING
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_RUNNING
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/03/22 16:44:05 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now TASK_RUNNING
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 4 is now TASK_RUNNING
15/03/22 16:44:05 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 100 output partitions (allowLocal=false)
15/03/22 16:44:05 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
15/03/22 16:44:05 INFO DAGScheduler: Parents of final stage: List()
15/03/22 16:44:05 INFO DAGScheduler: Missing parents: List()
15/03/22 16:44:05 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos slave value: "20150322-040336-606645514-5050-2744-S1"
due to too many failures; is Spark installed on it?
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 4 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos slave value: "20150322-040336-606645514-5050-2744-S0"
due to too many failures; is Spark installed on it?
15/03/22 16:44:05 INFO MemoryStore: ensureFreeSpace(1848) called with curMem=0, maxMem=277842493
15/03/22 16:44:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1848.0 B, free 265.0 MB)
15/03/22 16:44:05 INFO MemoryStore: ensureFreeSpace(1296) called with curMem=1848, maxMem=277842493
15/03/22 16:44:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1296.0 B, free 265.0 MB)
15/03/22 16:44:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on development-5159-d9.c.learning-spark.internal:44177 (size: 1296.0 B, free: 265.0 MB)
15/03/22 16:44:05 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/22 16:44:05 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 5 is now TASK_RUNNING
15/03/22 16:44:05 INFO DAGScheduler: Submitting 100 missing tasks from Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/03/22 16:44:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Mesos task 5 is now TASK_FAILED
15/03/22 16:44:05 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos slave value: "20150322-040336-606645514-5050-2744-S2"
due to too many failures; is Spark installed on it?
15/03/22 16:44:20 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I suspect it has something to do with the Mesos slave nodes not finding the application jar, but when I put the jar in HDFS and pass its URL instead, spark-submit warns "Skip remote jar" and then cannot load the main class:
jclouds@development-5159-d9:~/learning-spark$ ~/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://10.173.40.36:5050 hdfs://10.173.40.36/tmp/spark-examples-1.3.0-hadoop2.4.0.jar 100
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Warning: Skip remote jar hdfs://10.173.40.36/tmp/spark-examples-1.3.0-hadoop2.4.0.jar.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
--
EDIT: Just to bring this to a conclusion: hbogert on the Spark user list pointed me toward the Spark logs on one of my slave nodes, and the problem was as clear as day. The executor's stderr shows that bin/spark-class does not exist on the slave at the path the driver uses:
jclouds@development-5159-d3d:/tmp/mesos/slaves/20150322-040336-606645514-5050-2744-S1/frameworks/20150322-040336-606645514-5050-2744-0037/executors/1/runs/latest$ cat stderr
I0329 20:34:26.107267 10026 exec.cpp:132] Version: 0.21.1
I0329 20:34:26.109591 10031 exec.cpp:206] Executor registered on slave 20150322-040336-606645514-5050-2744-S1
sh: 1: /home/jclouds/spark-1.3.0-bin-hadoop2.4/bin/spark-class: not found
jclouds@development-5159-d3d:/tmp/mesos/slaves/20150322-040336-606645514-5050-2744-S1/frameworks/20150322-040336-606645514-5050-2744-0037/executors/1/runs/latest$ cat stdout
Registered executor on 10.217.7.180
Starting task 1
Forked command at 10036
sh -c ' "/home/jclouds/spark-1.3.0-bin-hadoop2.4/bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@development-5159-d9.c.learning-spark.internal:54746/user/CoarseGrainedScheduler --executor-id 20150322-040336-606645514-5050-2744-S1 --hostname 10.217.7.180 --cores 10 --app-id 20150322-040336-606645514-5050-2744-0037'
Command exited with status 127 (pid: 10036)
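Exit status 127 is what sh returns when the command it was asked to run cannot be found, which matches the "not found" line in stderr. A quick way to confirm this on each slave might look like the sketch below. The function name check_spark_home is made up for illustration, and the only thing it checks is that bin/spark-class exists and is executable under the directory you pass in.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or Mesos): report whether
# bin/spark-class is present and executable under a given Spark directory.
check_spark_home() {
    spark_home="$1"
    if [ -x "$spark_home/bin/spark-class" ]; then
        echo "ok: $spark_home/bin/spark-class"
    else
        echo "missing: $spark_home/bin/spark-class"
        return 1
    fi
}
```

Run on a slave as `check_spark_home /home/jclouds/spark-1.3.0-bin-hadoop2.4`, it should print a "missing:" line on any node where the executor would fail the same way.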
Related:
How to run Hadoop on a Mesos cluster?
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-failing-on-cluster-td8369.html
It's hard to tell without seeing the stderr output in the Mesos sandbox logs, but usually you need to set MESOS_NATIVE_LIBRARY (in spark-env.sh) and point spark.executor.uri (in spark-defaults.conf) at a Spark tarball that the slaves can download. If you don't set spark.executor.uri, Spark has to be installed at the same location on every slave.
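As a rough sketch of those two settings, assuming libmesos lives at /usr/local/lib/libmesos.so and that a Spark tarball has been uploaded to HDFS at the path shown (both are assumptions, adjust them to your cluster). It uses MESOS_NATIVE_JAVA_LIBRARY because the log above warns that MESOS_NATIVE_LIBRARY is deprecated, and it passes spark.executor.uri with --conf, which should have the same effect as a line in spark-defaults.conf.

```shell
# conf/spark-env.sh on the machine that runs the driver.
# Assumed location of libmesos; it depends on how Mesos was installed.
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

# Submit with spark.executor.uri pointing at a Spark tarball the slaves can fetch.
# The tarball path is hypothetical; the application jar is the local one from the log.
~/spark-1.3.0-bin-hadoop2.4/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://10.173.40.36:5050 \
  --conf spark.executor.uri=hdfs://10.173.40.36/tmp/spark-1.3.0-bin-hadoop2.4.tgz \
  ~/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar 100
```

With this, each slave downloads and unpacks the tarball into its sandbox instead of looking for Spark at the driver's install path.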
