Spark streaming application on a single virtual machine, standalone mode - apache-spark

I have created a Spark Streaming application, which worked fine when the deploy mode was client.
On my virtual machine I have a master and only one worker.
When I tried to change the mode to "cluster", it failed. In the web UI, I see that the driver is running, but the application has failed.
EDITED
In the log, I see the following:
16/03/23 09:06:25 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
16/03/23 09:06:25 INFO Master: Launching driver driver-20160323090625-0001 on worker worker-20160323085541-10.0.2.15-36648
16/03/23 09:06:32 INFO Master: metering.dev.enerbyte.com:37168 got disassociated, removing it.
16/03/23 09:06:32 INFO Master: 10.0.2.15:59942 got disassociated, removing it.
16/03/23 09:06:32 INFO Master: metering.dev.enerbyte.com:37166 got disassociated, removing it.
16/03/23 09:06:46 INFO Master: Registering app wibeee-pipeline
16/03/23 09:06:46 INFO Master: Registered app wibeee-pipeline with ID app-20160323090646-0007
16/03/23 09:06:46 INFO Master: Launching executor app-20160323090646-0007/0 on worker worker-20160323085541-10.0.2.15-36648
16/03/23 09:06:50 INFO Master: Received unregister request from application app-20160323090646-0007
16/03/23 09:06:50 INFO Master: Removing app app-20160323090646-0007
16/03/23 09:06:50 WARN Master: Got status update for unknown executor app-20160323090646-0007/0
16/03/23 09:06:50 INFO Master: metering.dev.enerbyte.com:37172 got disassociated, removing it.
16/03/23 09:06:50 INFO Master: 10.0.2.15:45079 got disassociated, removing it.
16/03/23 09:06:51 INFO Master: Removing driver: driver-20160323090625-0001
So what happens is that the master launches the driver on the worker, the application gets registered, and then the master tries to launch an executor on the same worker, which fails (although I have only one worker!).
EDIT
Could the issue be related to the fact that I use checkpointing? I have an "updateStateByKey" transformation in my code, so the checkpoint directory is set to "/tmp", but I always get a warning that "/tmp" needs to change when running in cluster mode. How should I set it?
Could that be the reason for my problem?
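For context, here is a minimal PySpark sketch of the kind of setup I mean (the batch interval, the socket source, and the update function are placeholders, not my actual code; only the app name and the "/tmp" checkpoint directory match what I described above):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="wibeee-pipeline")      # app name as it appears in the master log
ssc = StreamingContext(sc, 10)                    # placeholder batch interval of 10 seconds
ssc.checkpoint("/tmp")                            # the checkpoint directory I am unsure about in cluster mode

def update_fn(new_values, running_count):
    # running total per key; state function required by updateStateByKey
    return sum(new_values) + (running_count or 0)

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source, not my real input
counts = lines.map(lambda w: (w, 1)).updateStateByKey(update_fn)
counts.pprint()

ssc.start()
ssc.awaitTermination()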
Thank you

According to the log you have provided, it may not be because of a properties file, but check this:
spark-submit only copies the jar file to the driver when running in cluster mode, so if your application tries to read a properties file kept on the machine from which you are running spark-submit, the driver cannot find it in cluster mode.
Reading from a properties file works in client mode because the driver starts on the same machine where you are executing spark-submit.
You can copy the properties file to the same directory on all nodes, or keep the properties file in the Cassandra file system and read it from there.
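For example, if the file is copied to the same path on every node, the driver can read it from that local path no matter which worker it runs on. A minimal sketch, assuming a placeholder path /etc/myapp/app.properties with simple key=value lines:
# Placeholder path, assumed to exist on every node in the cluster
PROPS_PATH = "/etc/myapp/app.properties"

props = {}
with open(PROPS_PATH) as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):   # skip blanks and comments
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()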

Related

Spark - Master: got disassociated, removing it

I am deploying a Spark cluster with 1 master node and 3 worker nodes. Moments after deploying the master and worker nodes, the master starts spamming the logs with the following messages:
19/07/17 12:56:51 INFO Master: I have been elected leader! New state: ALIVE
19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.
19/07/17 12:57:02 INFO Master: Registering worker 172.26.140.169:44249 with 1 cores, 2.0 GB RAM
19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.132:56355 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.194:62157 got disassociated, removing it.
19/07/17 12:57:07 INFO Master: 172.26.140.163:59266 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: 172.26.140.132:56376 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: Registering worker 172.26.140.204:43921 with 1 cores, 2.0 GB RAM
19/07/17 12:57:08 INFO Master: 172.26.140.194:62203 got disassociated, removing it.
19/07/17 12:57:12 INFO Master: 172.26.140.163:59342 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.132:56392 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.194:62268 got disassociated, removing it.
19/07/17 12:57:17 INFO Master: 172.26.140.163:59417 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.132:56415 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.194:62296 got disassociated, removing it.
19/07/17 12:57:22 INFO Master: 172.26.140.163:59472 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.132:56483 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.194:62323 got disassociated, removing it.
The worker nodes seem to connect to the master correctly and log the following:
19/07/17 12:56:56 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
19/07/17 12:56:56 INFO Worker: Starting Spark worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:56 INFO Worker: Running Spark version 2.4.3
19/07/17 12:56:56 INFO Worker: Spark home: /opt/spark
19/07/17 12:56:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/07/17 12:56:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-worker-0.spark-worker-service.default.svc.cluster.local:8081
19/07/17 12:56:56 INFO Worker: Connecting to master spark-master-service.default.svc.cluster.local:7077...
19/07/17 12:56:56 INFO TransportClientFactory: Successfully created connection to spark-master-service.default.svc.cluster.local/10.0.179.236:7077 after 49 ms (0 ms spent in bootstraps)
19/07/17 12:56:56 INFO Worker: Successfully registered with master spark://172.26.140.196:7077
But the master still logs the disassociation message, for three separate nodes, every 5 seconds.
What is strange is that the IP addresses listed in the master's logs all belong to the kube-proxy pods:
kube-system kube-proxy-5vp9r 1/1 Running 0 39h 172.26.140.163 aks-agentpool-31454219-2 <none> <none>
kube-system kube-proxy-kl695 1/1 Running 0 39h 172.26.140.132 aks-agentpool-31454219-1 <none> <none>
kube-system kube-proxy-xgjws 1/1 Running 0 39h 172.26.140.194 aks-agentpool-31454219-0 <none> <none>
My questions are twofold:
1) Why are the kube-proxy pods connecting to the master? Or why does the master node think that the kube-proxy pods are taking part in this cluster?
2) What setting do I need to change in order to clear this message from my log files?
Here are the contents of my spark-defaults.conf file:
spark.master=spark://spark-master-service:7077
spark.submit.deploy-mode=cluster
spark.executor.cores=1
spark.driver.memory=500m
spark.executor.memory=500m
spark.eventLog.enabled=true
spark.eventLog.dir=/mnt/eventLog
I cannot find any meaningful reason why this is occurring and any assistance would be greatly appreciated.
I had the same problem with my Spark cluster in Kubernetes; I tested Spark 2.4.3 and Spark 2.4.4, as well as Kubernetes 1.16.0 and 1.13.0.
This is the solution.
This is how I originally created my Spark session:
spark = SparkSession.builder.appName('Kubernetes-Spark-app').getOrCreate()
The issue was resolved by using the cluster IP of the Spark master instead:
spark = SparkSession.builder.master('spark://10.0.106.83:7077').appName('Kubernetes-Spark-app').getOrCreate()
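(Here 10.0.106.83 is the cluster IP of the Spark master service in my Kubernetes cluster; you would substitute your own.)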
This works with the following Helm chart:
helm install microsoft/spark --generate-name

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds of not seeing any heartbeat message from the workers, the master prints this message to the log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code, and from what I can tell, as long as the executor is alive it should send a heartbeat message back, since it uses a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue; increasing the timeout values worked.
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
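For reference, the two application-level settings above can also be set when building the session; here is a minimal PySpark sketch using the same values (the app name is a placeholder). spark.worker.timeout is read by the standalone master itself, so it stays in the master's spark-defaults.conf rather than in the application config.
from pyspark.sql import SparkSession

# Same values as in spark-defaults.conf above; the app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("heartbeat-tuning-example")
    .config("spark.network.timeout", "50000")
    .config("spark.executor.heartbeatInterval", "5000")
    .getOrCreate()
)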

How to resolve the "master IP got disassociated" issue?

17/04/29 11:39:36 INFO Master: Launching driver driver-20170429113936-0000 on worker worker-20170429113809-192.168.5.197-7078
17/04/29 11:39:42 INFO Master: 192.168.5.5:35660 got disassociated, removing it.
17/04/29 11:39:42 INFO Master: 192.168.5.5:35658 got disassociated, removing it.
17/04/29 11:39:42 INFO Master: 192.168.5.5:39706 got disassociated, removing it.
I get this most of the time when I start the Spark standalone cluster. What could be the reason the master IP always gets disassociated, and how can I resolve this?

Can't spark-submit to analytics node on DataStax Enterprise

I have a 6-node cluster, one of which is Spark-enabled.
I also have a Spark job that I would like to submit to the cluster / that node, so I enter the following command:
spark-submit --class VDQConsumer --master spark://node-public-ip:7077 target/scala-2.10/vdq-consumer-assembly-1.0.jar
It launches the Spark UI on that node, but eventually gets here:
15/05/14 14:19:55 INFO SparkContext: Added JAR file:/Users/cwheeler/dev/git/vdq-consumer/target/scala-2.10/vdq-consumer-assembly-1.0.jar at http://node-ip:54898/jars/vdq-consumer-assembly-1.0.jar with timestamp 1431627595602
15/05/14 14:19:55 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:19:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#node-ip:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/05/14 14:20:15 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:20:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:20:55 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/14 14:20:55 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
15/05/14 14:20:55 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
Does anyone have any idea what just happened?

Spark driver program launching in `cluster` mode failed in a weird way

I'm new to Spark. I have encountered a problem: when I launch a program on a standalone Spark cluster with the following command line:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
It throws the following error:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
The master of the cluster outputs the following log:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient#bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
And the corresponding worker that launches the driver program outputs:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker#bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
My spark-env.sh is:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
And my spark-defaults.conf is:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
However, when I launch the program in client mode with the following command, it works fine.
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar
The reason it works in "client" mode and not in "cluster" mode is that there is no support for "cluster" mode in a standalone cluster (as mentioned in the Spark documentation).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.
If you look at the "Submitting Applications" section in the Spark documentation, it is clearly mentioned that support for cluster mode is not available for standalone clusters.
Reference link: http://spark.apache.org/docs/1.2.0/submitting-applications.html
Go to the above link and have a look at the "Launching Applications with spark-submit" section.
I think it will help. Thanks.
