Spark: How can I add more storage memory? - apache-spark

Hi,
I often get this error when I use a larger dataset with MLlib (ALS).
The dataset has 3 columns (user, movie and rating) and 1,200,000 rows.
WARN TaskSetManager: Stage 0 contains a task of very large size (116722 KB). The maximum recommended task size is 100 KB.
Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: Java heap space
My machine now has 8 cores, 240 GB of RAM and a 100 GB disk (50 GB free).
I want to add more storage memory and more executors, so I set the following (I'm using the Spyder IDE):
conf = SparkConf()
conf.set("spark.executor.memory", "40g")
conf.set("spark.driver.memory","20g")
conf.set("spark.executor.cores","8")
conf.set("spark.num.executors","16")
conf.set("spark.python.worker.memory","40g")
conf.set("spark.driver.maxResultSize","0")
sc = SparkContext(conf=conf)
But I still have this:
What did I do wrong?
This is how I'm launching Spark (PySpark, Spyder IDE):
import sys
import os
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
os.environ['SPARK_HOME']="C:/Apache/spark-1.6.0"
sys.path.append("C:/Apache/spark-1.6.0/python/")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.set("spark.executor.memory", "25g")
conf.set("spark.driver.memory","10g")
conf.set("spark.executor.cores","8")
conf.set("spark.python.worker.memory","30g")
conf.set("spark.driver.maxResultSize","0")
sc = SparkContext(conf=conf)
Result
16/02/12 18:37:47 INFO SparkContext: Running Spark version 1.6.0
16/02/12 18:37:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/12 18:37:48 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2136)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Unknown Source)
16/02/12 18:37:48 INFO SecurityManager: Changing view acls to: rmalveslocal
16/02/12 18:37:48 INFO SecurityManager: Changing modify acls to: rmalveslocal
16/02/12 18:37:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rmalveslocal); users with modify permissions: Set(rmalveslocal)
16/02/12 18:37:48 INFO Utils: Successfully started service 'sparkDriver' on port 64280.
16/02/12 18:37:49 INFO Slf4jLogger: Slf4jLogger started
16/02/12 18:37:49 INFO Remoting: Starting remoting
16/02/12 18:37:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.10.5.105:64293]
16/02/12 18:37:49 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 64293.
16/02/12 18:37:49 INFO SparkEnv: Registering MapOutputTracker
16/02/12 18:37:49 INFO SparkEnv: Registering BlockManagerMaster
16/02/12 18:37:49 INFO DiskBlockManager: Created local directory at C:\Users\rmalveslocal\AppData\Local\Temp\1\blockmgr-4bd2f97f-8b4d-423d-a4e3-06f08ecdeca9
16/02/12 18:37:49 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/02/12 18:37:49 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/12 18:37:50 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/12 18:37:50 INFO SparkUI: Started SparkUI at http://10.10.5.105:4040
16/02/12 18:37:50 INFO Executor: Starting executor ID driver on host localhost
16/02/12 18:37:50 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 64330.
16/02/12 18:37:50 INFO NettyBlockTransferService: Server created on 64330
16/02/12 18:37:50 INFO BlockManagerMaster: Trying to register BlockManager
16/02/12 18:37:50 INFO BlockManagerMasterEndpoint: Registering block manager localhost:64330 with 511.1 MB RAM, BlockManagerId(driver, localhost, 64330)
16/02/12 18:37:50 INFO BlockManagerMaster: Registered BlockManager

You didn't specify which cluster mode you are using (standalone, YARN, Mesos), but I assume you are using standalone mode (on a single server).
Three concepts are in play here:
Worker node - a host that runs one or more executors
Executor - a container that hosts tasks
Task - a unit of work that runs in an executor (tasks are parts of stages, which together form a job; neither of these terms matters for this discussion)
The default in standalone mode is to allocate all available cores to an executor. In your case you also set it to 8, which equals all your cores. The result is that you have one executor that uses all the cores, and since you also set the executor memory to 40 GB, you are using only a fraction of your memory for it (40 of 240 GB).
You can either increase the memory for the executor to allow more tasks to run in parallel (with more memory each), or set the number of cores per executor to 1 so that you can host 8 executors (in which case you would probably want to set the memory to a smaller number, since 8 * 40 GB = 320 GB, which exceeds your 240 GB).
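The trade-off between one large executor and several small ones can be worked through with a little arithmetic. The sketch below is only an illustration: the helper name plan_executors and the 16 GB reserved for the operating system and the driver are assumptions, not values from the question. The property names spark.executor.cores, spark.executor.memory and spark.cores.max are standard Spark settings.

```python
def plan_executors(total_cores, total_mem_gb, cores_per_executor, reserved_gb=16):
    """Split one machine's cores and RAM evenly across executors.

    reserved_gb is an assumed amount kept back for the OS and the driver.
    """
    if cores_per_executor < 1 or cores_per_executor > total_cores:
        raise ValueError("cores_per_executor must be between 1 and total_cores")
    # How many executors fit on the machine with this many cores each.
    num_executors = total_cores // cores_per_executor
    # Share the remaining memory equally between the executors.
    mem_per_executor_gb = (total_mem_gb - reserved_gb) // num_executors
    return {
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": "%dg" % mem_per_executor_gb,
        "spark.cores.max": str(num_executors * cores_per_executor),
    }


# One executor that owns all 8 cores of the 240 GB machine.
one_big = plan_executors(8, 240, 8)
# Eight single-core executors on the same machine.
many_small = plan_executors(8, 240, 1)
print(one_big)
print(many_small)
```

Each returned key/value pair could then be passed to conf.set(key, value) before the SparkContext is created.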

Related

Jupyter notebook error when starting Spark application using pyspark kernel

I've been trying to configure Jupyter Notebook with the PySpark kernel. I am new to this and to Ubuntu. When I tried to run some code in the notebook using the PySpark kernel, I received the error log below.
Note that it worked before, but without SQL magic. This started after I installed sparkmagic to use SQL magic.
I'd appreciate your help, thanks.
ID YARN Application ID Kind State Spark UI Driver log Current session?
1 None pyspark idle ✔
The code failed because of a fatal error:
Session 1 unexpectedly reached final status 'error'. See logs:
stdout:
stderr:
19/10/12 16:47:57 WARN Utils: Your hostname, majd-desktop resolves to a loopback address: 127.0.1.1; using 192.168.1.6 instead (on interface enp1s0)
19/10/12 16:47:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/10/12 16:47:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/12 16:48:00 INFO SparkContext: Running Spark version 2.4.4
19/10/12 16:48:00 INFO SparkContext: Submitted application: livy-session-1
19/10/12 16:48:00 INFO SecurityManager: Changing view acls to: majd
19/10/12 16:48:00 INFO SecurityManager: Changing modify acls to: majd
19/10/12 16:48:00 INFO SecurityManager: Changing view acls groups to:
19/10/12 16:48:00 INFO SecurityManager: Changing modify acls groups to:
19/10/12 16:48:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(majd); groups with view permissions: Set(); users with modify permissions: Set(majd); groups with modify permissions: Set()
19/10/12 16:48:00 INFO Utils: Successfully started service 'sparkDriver' on port 33779.
19/10/12 16:48:00 INFO SparkEnv: Registering MapOutputTracker
19/10/12 16:48:00 INFO SparkEnv: Registering BlockManagerMaster
19/10/12 16:48:00 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/10/12 16:48:00 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/10/12 16:48:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d9d22c37-be4c-4498-b115-2011ee176dbf
19/10/12 16:48:00 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/10/12 16:48:00 INFO SparkEnv: Registering OutputCommitCoordinator
19/10/12 16:48:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/10/12 16:48:00 INFO Utils: Successfully started service 'SparkUI' on port 4041.
19/10/12 16:48:00 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.6:4041
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/rsc/target/jars/livy-api-0.4.0-incubating-SNAPSHOT.jar at spark://192.168.1.6:33779/jars/livy-api-0.4.0-incubating-SNAPSHOT.jar with timestamp 1570888080918
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/rsc/target/jars/livy-rsc-0.4.0-incubating-SNAPSHOT.jar at spark://192.168.1.6:33779/jars/livy-rsc-0.4.0-incubating-SNAPSHOT.jar with timestamp 1570888080919
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/rsc/target/jars/netty-all-4.0.29.Final.jar at spark://192.168.1.6:33779/jars/netty-all-4.0.29.Final.jar with timestamp 1570888080919
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/repl/scala-2.11/target/jars/commons-codec-1.9.jar at spark://192.168.1.6:33779/jars/commons-codec-1.9.jar with timestamp 1570888080919
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/repl/scala-2.11/target/jars/livy-core_2.11-0.4.0-incubating-SNAPSHOT.jar at spark://192.168.1.6:33779/jars/livy-core_2.11-0.4.0-incubating-SNAPSHOT.jar with timestamp 1570888080920
19/10/12 16:48:00 INFO SparkContext: Added JAR file:///home/majd/anaconda3/share/apache-livy-0.4.0.60ee047/repl/scala-2.11/target/jars/livy-repl_2.11-0.4.0-incubating-SNAPSHOT.jar at spark://192.168.1.6:33779/jars/livy-repl_2.11-0.4.0-incubating-SNAPSHOT.jar with timestamp 1570888080920
19/10/12 16:48:00 INFO Executor: Starting executor ID driver on host localhost
19/10/12 16:48:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38259.
19/10/12 16:48:01 INFO NettyBlockTransferService: Server created on 192.168.1.6:38259
19/10/12 16:48:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/12 16:48:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.6, 38259, None)
19/10/12 16:48:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.6:38259 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.6, 38259, None)
19/10/12 16:48:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.6, 38259, None)
19/10/12 16:48:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.6, 38259, None).
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.

Spark Standalone on Kubernetes - application finishes after consecutive master then driver failure

I am trying to achieve high availability of the Spark master using ZooKeeper, with Spark driver resiliency using metadata checkpointing into GlusterFS.
Some information:
Using Spark 2.2.0 (prebuilt binary)
Submitting a streaming app with --deploy-mode cluster and --supervise from a separate Spark client pod
Spark components on Kubernetes are of type StatefulSet for dynamic volume provisioning (previously using ReplicationController/Deployment)
Created 3 GlusterFS shared PVCs - spark-master-pvc, spark-worker-pvc, spark-ckp-pvc
The following scenarios work: only master failure, only driver failure, consecutive master and driver failure, and driver failure then master failure. But the scenario submit a job -> master failure (works fine) -> driver failure (i.e. worker pod failure) does not work.
Log of the new ALIVE master:
18/06/11 10:23:16 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
18/06/11 10:23:16 INFO Master: I have been elected leader! New state: RECOVERING
18/06/11 10:23:16 INFO Master: Trying to recover app: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:36203 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 3 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:38994 after 12 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.170.85:39447 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO Master: Application has been re-registered: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO Master: Recovery complete - resuming operations!
18/06/11 10:24:37 INFO Master: Received unregister request from application app-20180611102123-0001
18/06/11 10:24:37 INFO Master: Removing app app-20180611102123-0001
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/0
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/1
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: Removing worker worker-20180611101834-10.1.53.142-36203 on 10.1.53.142:36203
18/06/11 10:24:38 INFO Master: Re-launching driver-20180611102017-0000
18/06/11 10:24:38 INFO Master: Launching driver driver-20180611102017-0000 on worker worker-20180611101834-10.1.185.87-38235
18/06/11 10:24:38 INFO Master: 10.1.53.142:59142 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:43 INFO Master: Registering worker 10.1.53.143:35156 with 8 cores, 30.3 GB RAM
The driver remains in a halted state. Driver error log:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/06/11 19:32:14 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing view acls groups to:
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls groups to:
18/06/11 19:32:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'Driver' on port 40594.
18/06/11 19:32:15 INFO WorkerWatcher: Connecting to worker spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 44 ms (0 ms spent in bootstraps)
18/06/11 19:32:15 INFO WorkerWatcher: Successfully connected to spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint files found: file:/ckp/checkpoint-1528712675000,file:/ckp/checkpoint-1528712675000.bk,file:/ckp/checkpoint-1528712670000,file:/ckp/checkpoint-1528712670000.bk,file:/ckp/checkpoint-1528712665000,file:/ckp/checkpoint-1528712665000.bk,file:/ckp/checkpoint-1528712660000,file:/ckp/checkpoint-1528712660000.bk,file:/ckp/checkpoint-1528712655000,file:/ckp/checkpoint-1528712655000.bk
18/06/11 19:32:15 INFO CheckpointReader: Attempting to load checkpoint from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO Checkpoint: Checkpoint for time 1528712675000 ms validated
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint successfully loaded from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint was generated at time 1528712675000 ms
18/06/11 19:32:15 INFO SparkContext: Running Spark version 2.2.0
18/06/11 19:32:15 INFO SparkContext: Submitted application: SparkStreamingWithCheckPointAndZK
18/06/11 19:32:15 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing view acls groups to:
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls groups to:
18/06/11 19:32:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'sparkDriver' on port 46544.
18/06/11 19:32:15 INFO SparkEnv: Registering MapOutputTracker
18/06/11 19:32:15 INFO SparkEnv: Registering BlockManagerMaster
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/06/11 19:32:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-623c4b9e-8045-4a19-a746-96a3b23c1184
18/06/11 19:32:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/06/11 19:32:16 INFO SparkEnv: Registering OutputCommitCoordinator
18/06/11 19:32:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/06/11 19:32:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.1.185.87:4040
18/06/11 19:32:16 INFO SparkContext: Added JAR file:///opt/spark/jars/spark-0.0.1-SNAPSHOT.jar at spark://10.1.185.87:46544/jars/spark-0.0.1-SNAPSHOT.jar with timestamp 1528745536460
18/06/11 19:32:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:33:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
18/06/11 19:33:16 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
18/06/11 19:33:16 INFO SparkUI: Stopped Spark web UI at http://10.1.185.87:4040
18/06/11 19:33:16 INFO StandaloneSchedulerBackend: Shutting down all executors
18/06/11 19:33:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46323.
18/06/11 19:33:16 INFO NettyBlockTransferService: Server created on 10.1.185.87:46323
18/06/11 19:33:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/06/11 19:33:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/06/11 19:33:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
18/06/11 19:33:16 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.185.87:46323 with 366.3 MB RAM, BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/06/11 19:33:16 INFO MemoryStore: MemoryStore cleared
18/06/11 19:33:16 INFO BlockManager: BlockManager stopped
18/06/11 19:33:16 INFO BlockManagerMaster: BlockManagerMaster stopped
18/06/11 19:33:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/06/11 19:33:16 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
18/06/11 19:33:16 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
... 6 more
Am I choosing the right resource controller for Spark, i.e. Kubernetes StatefulSets?
I'm new to this environment; any help would be highly appreciated.
It seems your driver is not able to reach the master node. Here is the relevant log line:
18/06/11 19:33:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
Try to telnet to the master's IP and port from your client machine.
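If telnet is not installed in the pod, the same reachability test can be scripted. This is a minimal sketch using only the Python standard library; the helper name can_connect and the 3-second timeout are assumptions made for illustration, and the address in the comment is the master URL from the driver log in the question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example, run from the driver or client pod:
# can_connect("10.1.170.81", 7077)
```

A False result from the pod where the driver runs would point to the master address being stale or unreachable, rather than to a problem with the checkpoint itself.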

Spark not doing any work on slave: Initial job has not accepted any resources

I am trying to do a very simple setup with Spark using SSH tunneling and I can't make it work.
I have the master running on my PC, started with ./sbin/start-master.sh -h localhost -p 7077 (unless stated otherwise, everything else is default).
On my slave PC (IP is 192.168.0.222), which is in another domain and to which I don't have root access, I ran ssh -N -L localhost:7078:localhost:7077 myMasterPCSSHalias and started the slave with ./sbin/start-slave.sh spark://localhost:7078. I can now see this slave on the dashboard at http://localhost:8080/ in my browser. I see that it has 14 GB of free memory.
When I then try e.g. this example:
./bin/spark-submit --master spark://localhost:7077 examples/src/main/python/pi.py 10
it hangs on this message until I kill it (you can see the full log message below):
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am sure I am not using more resources than I have available; the problem persists even when I use --executor-memory 512m, and the running executor just reports the RUNNING state. The only thing in the error log is this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:45:44 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/05/09 22:45:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:45:45 INFO SecurityManager: Changing view acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: Changing modify acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1, dan); users with modify permissions: Set(hnykdan1, dan)
and in the slave log is this:
16/05/09 22:48:56 INFO Worker: Asked to launch executor app-20160509224034-0013/0 for PythonPi
16/05/09 22:48:56 INFO SecurityManager: Changing view acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: Changing modify acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1); users with modify permissions: Set(hnykdan1)
16/05/09 22:48:56 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/home/hnykdan1/spark/conf/:/home/hnykdan1/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/hnykdan1/spark/lib/datanucleus-core-3.2.10.jar:/home/hnykdan1/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hnykdan1/spark/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=37450" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.222:37450" "--executor-id" "0" "--hostname" "147.32.8.103" "--cores" "8" "--app-id" "app-20160509224034-0013" "--worker-url" "spark://Worker@147.32.8.103:54894"
Everything looks quite normal and I don't know where the problem might be. Do I need to tunnel the other way around as well? It runs fine when I run the slave locally in exactly the same fashion. Thanks
Full log from the console:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:28:21 INFO SparkContext: Running Spark version 1.6.1
16/05/09 22:28:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:28:22 INFO SecurityManager: Changing view acls to: dan
16/05/09 22:28:22 INFO SecurityManager: Changing modify acls to: dan
16/05/09 22:28:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dan); users with modify permissions: Set(dan)
16/05/09 22:28:22 INFO Utils: Successfully started service 'sparkDriver' on port 34508.
16/05/09 22:28:23 INFO Slf4jLogger: Slf4jLogger started
16/05/09 22:28:23 INFO Remoting: Starting remoting
16/05/09 22:28:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.222:44359]
16/05/09 22:28:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44359.
16/05/09 22:28:23 INFO SparkEnv: Registering MapOutputTracker
16/05/09 22:28:23 INFO SparkEnv: Registering BlockManagerMaster
16/05/09 22:28:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-db4c3293-423f-4966-a479-b69a90439da9
16/05/09 22:28:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/09 22:28:23 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/09 22:28:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/09 22:28:24 INFO SparkUI: Started SparkUI at http://192.168.0.222:4040
16/05/09 22:28:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/httpd-aa031b8c-f605-41c3-aabe-fc4fe01bdcf8
16/05/09 22:28:24 INFO HttpServer: Starting HTTP Server
16/05/09 22:28:24 INFO Utils: Successfully started service 'HTTP file server' on port 41770.
16/05/09 22:28:24 INFO Utils: Copying /home/hnykdan1/spark/examples/src/main/python/pi.py to /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/userFiles-14720bed-cd41-4b15-9bd3-38dbf4f268ff/pi.py
16/05/09 22:28:24 INFO SparkContext: Added file file:/home/hnykdan1/spark/examples/src/main/python/pi.py at http://192.168.0.222:41770/files/pi.py with timestamp 1462825704629
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160509222824-0011
16/05/09 22:28:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44617.
16/05/09 22:28:24 INFO NettyBlockTransferService: Server created on 44617
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Executor added: app-20160509222824-0011/0 on worker-20160509214654-147.32.8.103-54894 (147.32.8.103:54894) with 8 cores
16/05/09 22:28:24 INFO BlockManagerMaster: Trying to register BlockManager
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160509222824-0011/0 on hostPort 147.32.8.103:54894 with 8 cores, 1024.0 MB RAM
16/05/09 22:28:24 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.222:44617 with 511.1 MB RAM, BlockManagerId(driver, 192.168.0.222, 44617)
16/05/09 22:28:24 INFO BlockManagerMaster: Registered BlockManager
16/05/09 22:28:25 INFO AppClient$ClientEndpoint: Executor updated: app-20160509222824-0011/0 is now RUNNING
16/05/09 22:28:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/05/09 22:28:25 INFO SparkContext: Starting job: reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39
16/05/09 22:28:25 INFO DAGScheduler: Got job 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39) with 10 output partitions
16/05/09 22:28:25 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:25 INFO DAGScheduler: Parents of final stage: List()
16/05/09 22:28:25 INFO DAGScheduler: Missing parents: List()
16/05/09 22:28:25 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39), which has no missing parents
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KB, free 4.0 KB)
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 6.7 KB)
16/05/09 22:28:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.222:44617 (size: 2.7 KB, free: 511.1 MB)
16/05/09 22:28:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/09 22:28:26 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
16/05/09 22:28:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:28:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Since you checked that you have the resources, the next most likely problem is that the executor cannot connect back to the driver. When submitting a job, the driver starts a server that the executor will connect to in order to download the jar(s).
Yes, the error message (Initial job has not accepted any resources...) does not look related to a network problem. This is a known issue, discussed for example here:
https://github.com/databricks/spark-knowledgebase/issues/9
It's probably related to the network (security group rules). It's a silly test, but I just made it work by opening the master and workers to all TCP traffic (inbound/outbound).
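If the network is the culprit, one option is to make the driver's address and ports predictable, so that firewall or security group rules can be narrowed instead of opening all TCP traffic. In the log above the driver listens on 192.168.0.222 while the executor runs on 147.32.8.103, so it is worth checking that the advertised address is reachable from the worker. The sketch below only builds `--conf` arguments for spark-submit. `spark.driver.host`, `spark.driver.port` and `spark.blockManager.port` are standard Spark properties; the helper names and the port numbers are placeholders chosen for illustration, and other services shown in the log (for example the HTTP file server) may need the same treatment depending on the Spark version.

```python
def driver_network_conf(driver_host, driver_port, block_manager_port):
    """Return Spark properties that pin the driver's address and ports."""
    return {
        "spark.driver.host": driver_host,
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }


def to_submit_args(conf):
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


if __name__ == "__main__":
    # Address taken from the log above; 40000 and 40001 are arbitrary example ports.
    conf = driver_network_conf("192.168.0.222", 40000, 40001)
    print(" ".join(to_submit_args(conf)))
```

The printed arguments can be appended to the spark-submit command line, after which only those ports need to be opened between the workers and the driver machine.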

Kafka message consumption with Spark

I am using the HDP 2.3 sandbox to consume Kafka messages by running a spark-submit job.
I am putting some messages into Kafka as follows:
kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic webevent
OR
kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test --new-producer < myfile.txt
Now I need to consume the above messages from a Spark job, as shown below:
./bin/spark-submit --master spark://192.168.255.150:7077 --executor-memory 512m --class org.apache.spark.examples.streaming.JavaDirectKafkaWordCount lib/spark-examples-1.4.1-hadoop2.4.0.jar 192.168.255.150:2181 webevent 10
where 2181 is the ZooKeeper port.
I am getting the error shown below (please guide me on how to consume these messages from Kafka):
16/05/02 15:21:30 INFO SparkContext: Running Spark version 1.3.1
16/05/02 15:21:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/02 15:21:31 INFO SecurityManager: Changing view acls to: root
16/05/02 15:21:31 INFO SecurityManager: Changing modify acls to: root
16/05/02 15:21:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/02 15:21:31 INFO Slf4jLogger: Slf4jLogger started
16/05/02 15:21:31 INFO Remoting: Starting remoting
16/05/02 15:21:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@sandbox.hortonworks.com:53950]
16/05/02 15:21:32 INFO Utils: Successfully started service 'sparkDriver' on port 53950.
16/05/02 15:21:32 INFO SparkEnv: Registering MapOutputTracker
16/05/02 15:21:32 INFO SparkEnv: Registering BlockManagerMaster
16/05/02 15:21:32 INFO DiskBlockManager: Created local directory at /tmp/spark-c70b08b9-41a3-42c8-9d83-bc4258e299c6/blockmgr-c2d86de6-34a7-497c-8018-d3437a100e87
16/05/02 15:21:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
16/05/02 15:21:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-a8f7ade9-292c-42c4-9e54-43b3b3495b0c/httpd-65d36d04-1e2a-4e69-8d20-295465100070
16/05/02 15:21:32 INFO HttpServer: Starting HTTP Server
16/05/02 15:21:32 INFO Server: jetty-8.y.z-SNAPSHOT
16/05/02 15:21:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:37014
16/05/02 15:21:32 INFO Utils: Successfully started service 'HTTP file server' on port 37014.
16/05/02 15:21:32 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/02 15:21:32 INFO Server: jetty-8.y.z-SNAPSHOT
16/05/02 15:21:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/05/02 15:21:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/02 15:21:32 INFO SparkUI: Started SparkUI at http://sandbox.hortonworks.com:4040
16/05/02 15:21:33 INFO SparkContext: Added JAR file:/usr/hdp/2.3.0.0-2130/spark/lib/spark-examples-1.4.1-hadoop2.4.0.jar at http://192.168.255.150:37014/jars/spark-examples-1.4.1-hadoop2.4.0.jar with timestamp 1462202493866
16/05/02 15:21:34 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@192.168.255.150:7077/user/Master...
16/05/02 15:21:34 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160502152134-0000
16/05/02 15:21:34 INFO AppClient$ClientActor: Executor added: app-20160502152134-0000/0 on worker-20160502150437-sandbox.hortonworks.com-36920 (sandbox.hortonworks.com:36920) with 1 cores
16/05/02 15:21:34 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160502152134-0000/0 on hostPort sandbox.hortonworks.com:36920 with 1 cores, 512.0 MB RAM
16/05/02 15:21:34 INFO AppClient$ClientActor: Executor updated: app-20160502152134-0000/0 is now RUNNING
16/05/02 15:21:34 INFO AppClient$ClientActor: Executor updated: app-20160502152134-0000/0 is now LOADING
16/05/02 15:21:34 INFO NettyBlockTransferService: Server created on 43440
16/05/02 15:21:34 INFO BlockManagerMaster: Trying to register BlockManager
16/05/02 15:21:34 INFO BlockManagerMasterActor: Registering block manager sandbox.hortonworks.com:43440 with 265.4 MB RAM, BlockManagerId(<driver>, sandbox.hortonworks.com, 43440)
16/05/02 15:21:34 INFO BlockManagerMaster: Registered BlockManager
16/05/02 15:21:35 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/05/02 15:21:35 INFO VerifiableProperties: Verifying properties
16/05/02 15:21:35 INFO VerifiableProperties: Property group.id is overridden to
16/05/02 15:21:35 INFO VerifiableProperties: Property zookeeper.connect is overridden to
16/05/02 15:21:35 INFO SimpleConsumer: Reconnect due to socket error: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
Error: application failed with exception
org.apache.spark.SparkException: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:416)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:416)
at scala.util.Either.fold(Either.scala:97)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:415)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:532)
at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
at org.apache.spark.examples.streaming.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
OR
When I use this:
./bin/spark-submit --master spark://192.168.255.150:7077 --executor-memory 512m --class org.apache.spark.examples.streaming.JavaDirectKafkaWordCount lib/spark-examples-1.4.1-hadoop2.4.0.jar 192.168.255.150:6667 webevent 10
where 6667 is the Kafka broker port used for producing messages, I get this error:
16/05/02 15:27:26 INFO SimpleConsumer: Reconnect due to socket error: java.nio.channels.ClosedChannelException
Error: application failed with exception
org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:416)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:416)
I don't know if this helps:
./bin/spark-submit --class consumer.kafka.client.Consumer --master spark://192.168.255.150:7077 --executor-memory 1G lib/kafka-spark-consumer-1.0.6.jar 10

Spark worker cannot connect to Master

While starting the worker node I get the following error:
Spark Command: /usr/lib/jvm/default-java/bin/java -cp /home/ubuntu/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ip-1-70-44-5:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/16 19:19:10 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/10/16 19:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/16 19:19:11 INFO SecurityManager: Changing view acls to: ubuntu
15/10/16 19:19:11 INFO SecurityManager: Changing modify acls to: ubuntu
15/10/16 19:19:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
15/10/16 19:19:12 INFO Slf4jLogger: Slf4jLogger started
15/10/16 19:19:12 INFO Remoting: Starting remoting
15/10/16 19:19:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@1.70.44.4:55126]
15/10/16 19:19:12 INFO Utils: Successfully started service 'sparkWorker' on port 55126.
15/10/16 19:19:12 INFO Worker: Starting Spark worker 1.70.44.4:55126 with 2 cores, 2.9 GB RAM
15/10/16 19:19:12 INFO Worker: Running Spark version 1.5.1
15/10/16 19:19:12 INFO Worker: Spark home: /home/ubuntu/spark-1.5.1-bin-hadoop2.6
15/10/16 19:19:12 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/10/16 19:19:12 INFO WorkerWebUI: Started WorkerWebUI at http://1.70.44.4:8081
15/10/16 19:19:12 INFO Worker: Connecting to master ip-1-70-44-5:7077...
15/10/16 19:19:24 INFO Worker: Retrying connection to master (attempt # 1)
15/10/16 19:19:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-5,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1c5651e9 rejected from java.util.concurrent.ThreadPoolExecutor@671ba687[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:211)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:210)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters(Worker.scala:210)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:288)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/10/16 19:19:24 INFO ShutdownHookManager: Shutdown hook called
I have added the hostnames to the conf/slaves file. I don't know which environment variables to set in spark-env.sh, so right now it is not being used.
Any pointers to the solution?
Also, if I should use spark-env.sh, which environment variables should I set?
Setup details:
Two Ubuntu 14 machines with 2 cores each.
Please advise.
thanks
So, after some tinkering I found that the slave was not able to communicate with the master on the given port. I changed the security access rules and enabled all TCP traffic on all ports. This solved the problem.
To check whether the port is open:
telnet master.ip master.port
The default port is 7077.
My spark-env.sh:
export SPARK_WORKER_INSTANCES=2
export SPARK_MASTER_IP=<ip address>
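The telnet check above can also be scripted, which is handy when telnet is not installed on the worker. This is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and the host and port are supplied by the caller (7077 being the default master port mentioned above).

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    sock.close()
    return True


if __name__ == "__main__" and len(sys.argv) >= 2:
    # Usage: python check_port.py <master-host> [port]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else 7077
    print(can_connect(sys.argv[1], target_port))
```

Run it from the worker machine against the master address; a result of False points to a firewall, security group, or addressing problem rather than a Spark misconfiguration.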
I'm afraid your hostname may be invalid to Spark, and you have to change your spark-env.sh.
You can set the variable SPARK_MASTER_IP to the real IP of the master instead of its hostname.
e.g.
export SPARK_MASTER_IP=1.70.44.5
instead of
export SPARK_MASTER_IP=ip-1-70-44-5
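Before switching to the raw IP, it can help to confirm whether the hostname resolves at all on the worker. The sketch below is a standard-library check, not part of Spark; `resolve_host` is a hypothetical helper name, and `ip-1-70-44-5` is simply the hostname from the question.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address hostname resolves to, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the lookup fails.
        return None


if __name__ == "__main__":
    # Run this on the worker; None means the worker cannot resolve the master's name.
    print(resolve_host("ip-1-70-44-5"))
```

If this prints None, or an address other than the master's real IP, setting SPARK_MASTER_IP to the IP as suggested above (or fixing /etc/hosts) is the likely remedy.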

Resources