How to autostart an Apache Spark cluster using Supervisord? - apache-spark

Starting an Apache Spark cluster is usually done through the shell scripts shipped with Spark. The problem is that every time the cluster shuts down and comes back up, you need to run those scripts by hand to start the Spark cluster again.
Supervisord is great for managing processes and seems like a good candidate for starting the spark processes automatically after reboot.
However, after starting the master process via
command=/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp :/path/spark-1.3.0-bin-cdh4/sbin/../conf:/path/spark-1.3.0-bin-cdh4/lib/spark-assembly-1.3.0-hadoop2.0.0-mr1-cdh4.2.0.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar:etc/hadoop/conf -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip master.mydomain.com --port 7077 --webui-port 18080
and the worker process by
command=/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp :/path/spark-1.3.0-bin-cdh4/sbin/../conf:/path/spark-1.3.0-bin-cdh4/lib/spark-assembly-1.3.0-hadoop2.0.0-mr1-cdh4.2.0.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/path/spark-1.3.0-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar:etc/hadoop/conf -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master.mydomain.com:7077
I end up with the following error after I submit my spark application:
15/06/05 17:16:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 3
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 5
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 7
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 8
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 9
15/06/05 17:16:32 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
15/06/05 17:16:32 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED
Does anyone know how to manage the spark processes through supervisord?
I'm also open to alternative solutions.

The Spark master can be run in the foreground, which is what supervisord expects of the processes it manages, with:
command=/path/spark-1.3.0-bin-cdh4/sbin/../bin/spark-class org.apache.spark.deploy.master.Master --ip master.mydomain.com --port 7077 --webui-port 18080
And the worker with:
command=/path/spark-1.3.0-bin-cdh4/sbin/../bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.mydomain.com:7077
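To show how the two commands above might sit in a full supervisord configuration, here is a minimal sketch that renders the `[program:...]` sections with Python's standard `configparser`. The program names `spark-master` and `spark-worker`, the helper `render_supervisor_conf`, and the `autostart`, `autorestart`, `stopasgroup` and `killasgroup` settings are assumptions added for illustration; only the two `command` lines come from the answer (with `sbin/../bin` shortened to the equivalent `bin`).

```python
import configparser
import io


def render_supervisor_conf(spark_home: str, master_host: str) -> str:
    """Render supervisord [program:...] sections for a Spark master and worker.

    Both processes are started through bin/spark-class so that they stay
    in the foreground, as in the answer above.
    """
    conf = configparser.ConfigParser(interpolation=None)
    conf.optionxform = str  # keep option names exactly as written

    spark_class = f"{spark_home}/bin/spark-class"

    # Settings shared by both programs (assumed, not from the answer).
    common = {
        "autostart": "true",     # start when supervisord starts, e.g. after reboot
        "autorestart": "true",   # restart the process if it exits
        "stopasgroup": "true",   # also stop child processes of the launcher
        "killasgroup": "true",
    }

    conf["program:spark-master"] = {
        "command": (
            f"{spark_class} org.apache.spark.deploy.master.Master "
            f"--ip {master_host} --port 7077 --webui-port 18080"
        ),
        **common,
    }
    conf["program:spark-worker"] = {
        "command": (
            f"{spark_class} org.apache.spark.deploy.worker.Worker "
            f"spark://{master_host}:7077"
        ),
        **common,
    }

    buf = io.StringIO()
    conf.write(buf, space_around_delimiters=False)  # emit key=value lines
    return buf.getvalue()
```

Calling `print(render_supervisor_conf("/path/spark-1.3.0-bin-cdh4", "master.mydomain.com"))` prints text that could be saved as a supervisord include file. In practice the master section would go on the master host and the worker section on each worker host.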

Related

WARN - Running Spark Locally with Docker - Initial job has not accepted any resources

I launched a Spark master and worker on my laptop using a Docker bridge network named spark:
docker network create spark
I ran the following commands:
docker run -ti -p 8080:8080 -p 7077:7077 -p 4040:4040 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-master apache/spark:v3.3.0 /opt/spark/sbin/start-master.sh
docker run -ti -p 8080:8080 -p 7077:7077 -p 4040:4040 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-master apache/spark:v3.3.0 /opt/spark/sbin/start-worker.sh spark://<master>:7077
Once they have both started, I try to launch the following application code from my IDE (written in Kotlin, but the same applies in Java):
var sparkSession = SparkSession.builder()
    .appName("mapreduce")
    .master("spark://localhost:7077")
    .config("spark.dynamicAllocation.enabled", "false")
    .orCreate
var dataset: Dataset<String> = sparkSession.createDataset(
    listOf("Banana", "Car", "Glass", "Banana", "Computer", "Car"),
    Encoders.STRING())
dataset = dataset.map(MapFunction { c: String -> "word: " + c }, Encoders.STRING())
dataset.show()
The code works if the master is local. I opened localhost:8080 and localhost:8081 and I can see the job getting registered. So why am I getting the following warning?
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/1 is now EXITED (Command exited with code 1)
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Executor app-20220905001717-0003/1 removed: Command exited with code 1
22/09/05 01:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/09/05 01:17:38 INFO BlockManagerMaster: Removal of executor 1 requested
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220905001717-0003/2 on worker-20220905000436-172.18.0.3-8082 (172.18.0.3:8082) with 8 core(s)
22/09/05 01:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20220905001717-0003/2 on hostPort 172.18.0.3:8082 with 8 core(s), 1024.0 MiB RAM
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/2 is now RUNNING
22/09/05 01:17:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
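When a driver log like the one above scrolls by quickly, it can help to count how often executors change state and why. The following is a minimal sketch that assumes the log lines have the exact `Executor updated: <app-id>/<n> is now <STATE>` shape shown above. The names `EXECUTOR_UPDATE` and `summarize_executor_states` are hypothetical helpers, not part of Spark.

```python
import re
from collections import Counter

# Matches driver log lines such as:
#   ... Executor updated: app-20220905001717-0003/1 is now EXITED (Command exited with code 1)
#   ... Executor updated: app-20220905001717-0003/2 is now RUNNING
EXECUTOR_UPDATE = re.compile(
    r"Executor updated: (?P<app>app-\d+-\d+)/(?P<executor>\d+) "
    r"is now (?P<state>[A-Z]+)(?: \((?P<detail>[^)]*)\))?"
)


def summarize_executor_states(log_lines):
    """Count (state, detail) pairs over all 'Executor updated' lines."""
    counts = Counter()
    for line in log_lines:
        match = EXECUTOR_UPDATE.search(line)
        if match:
            counts[(match.group("state"), match.group("detail") or "")] += 1
    return counts
```

A high count for `("EXITED", "Command exited with code 1")` means executors are started and then die repeatedly. The reason for the exit is normally found in the executor's own stderr on the worker, not in the driver log.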

ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up

I'm new to PySpark and I want to launch a Spark standalone cluster. I launched the Spark master using bin\spark-class2.cmd org.apache.spark.deploy.master.Master and it worked well; I checked on http://localhost:8080/.
I then wanted to launch spark-shell using spark-shell --master spark://192.168.43.78:7077 (the URL of the Spark master) and I got this error:
ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
22/05/24 20:28:59 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:92)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:577)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2589)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:937)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
at $line3.$read$$iw$$iw.<init>(<console>:15)
...
How can I fix that?
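One thing that may be worth ruling out with an "All masters are unresponsive" error is whether the master's port can be reached at all from the machine running the shell. The sketch below is a plain TCP check using only the standard library. The function name `can_reach` is a hypothetical helper, and a successful connection only shows that the port is open, not that the master will accept the application.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For the setup in the question this would be called as `can_reach("192.168.43.78", 7077)`. If it returns False, the master URL, the address the master is bound to, or a firewall are the first things to check.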

pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

I set up Airflow and a Spark standalone cluster with docker-compose.
Airflow runs spark-submit tasks in Spark client mode, so they are submitted directly to the Spark master. However, when I execute a spark-submit task, the task gets stuck.
spark-submit command:
spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py
What I see in the spark-submit driver logs:
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What I see from one of the Spark workers:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1 | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker@172.27.0.11:35093"
Versions:
Airflow image: apache/airflow:2.2.3
Spark driver version: 3.1.2
Spark server: 3.2.0
Network
All containers (airflow-scheduler, airflow-webserver, spark-master, spark-worker-n) are connected to the same external network.
The Spark driver is installed in the Airflow containers (scheduler, webserver), because the corresponding DAGs and tasks are executed by airflow-scheduler.
UPDATE
After replacing the driver's Spark version to match the master's (3.2.0), the issue disappeared. So in my particular case the issue was not due to connectivity between the different Spark actors (driver, master, worker/executor), but due to a version mismatch. For some reason the Spark workers do not log a corresponding error, which is misleading.
Most of the threads pointed to connectivity issues. However, in my case the issue was due to a mismatch between the Spark driver's version and the master/worker version.
After replacing the driver's Spark version to match the master's (3.2.0), and ensuring the same Python version on both the driver and executor sides (3.9.10), the issue disappeared. So the cause was a version mismatch rather than connectivity between the Spark actors, and the workers logged nothing that pointed to it.
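Since the workers gave no hint of the mismatch, a small pre-flight check before submitting can make it visible. This is a minimal sketch, assuming the four version strings are collected by hand or by whatever means is available in your setup. The function names are hypothetical, and the rule that Spark versions must match exactly is a conservative assumption based on the fix described above. For Python, only the major and minor numbers are compared.

```python
def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '3.9.10'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def check_versions(driver_spark: str, cluster_spark: str,
                   driver_python: str, executor_python: str) -> list:
    """Return a list of human-readable mismatch descriptions (empty if none)."""
    problems = []
    # Assumption: require identical Spark versions, as in the fix above.
    if driver_spark != cluster_spark:
        problems.append(
            f"Spark version mismatch: driver {driver_spark} vs cluster {cluster_spark}"
        )
    # Python is compared on major.minor only.
    if major_minor(driver_python) != major_minor(executor_python):
        problems.append(
            f"Python version mismatch: driver {driver_python} vs executor {executor_python}"
        )
    return problems
```

With the versions from the question, `check_versions("3.1.2", "3.2.0", "3.9.10", "3.9.10")` reports the Spark mismatch, and after the fix (`"3.2.0"` on both sides) the returned list is empty.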

How do I use the portable runner and spark-submit to submit Beam's wordcount Python example to a remote Spark cluster on EMR running YARN?

I am trying to submit Beam's wordcount Python example to a remote Spark cluster on EMR running YARN as its resource manager. According to the spark documentation this needs to be done using the portable runner.
Following the portable runner instructions, I have started the job service endpoint, and it appears to start correctly:
$ docker run --net=host apache/beam_spark_job_server:latest --spark-master-url=spark://*.***.***.***:7077
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: ArtifactStagingService started on localhost:8098
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Java ExpansionService started on localhost:8097
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: JobService started on localhost:8099
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Job server now running, terminate with Ctrl+C
Now I try to submit the job using spark-submit; the input is a plain-text version of Sherlock Holmes:
$ spark-submit --master=yarn --deploy-mode=cluster wordcount.py --input data/sherlock.txt --output output --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER --environment_config=apachebeam/python3.7_sdk
20/08/31 12:19:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/31 12:19:40 INFO RMProxy: Connecting to ResourceManager at ip-***-**-**-***.ec2.internal/***.**.**.***:8032
20/08/31 12:19:40 INFO Client: Requesting a new application from cluster with 2 NodeManagers
20/08/31 12:19:40 INFO Configuration: resource-types.xml not found
20/08/31 12:19:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/08/31 12:19:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
20/08/31 12:19:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/08/31 12:19:40 INFO Client: Setting up container launch context for our AM
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: /usr/lib/spark/python/lib/pyspark.zip not found; cannot run pyspark application in YARN mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.deploy.yarn.Client.$anonfun$findPySparkArchives$2(Client.scala:1167)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:1163)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:858)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:178)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1134)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/08/31 12:19:40 INFO ShutdownHookManager: Shutdown hook called
20/08/31 12:19:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-ee751413-e29d-4b1f-8a16-fb8650b1ca10
It appears to want pyspark to be installed. I am fairly new to submitting Beam jobs to a Spark cluster; is there a reason why pyspark would need to be installed when submitting a Beam job? I have a feeling my spark-submit command is wrong, but I am having a hard time finding more concrete documentation on how to do what I am trying to do.

Can't spark-submit to analytics node on DataStax Enterprise

I have a 6-node cluster, and one of the nodes is Spark-enabled.
I also have a Spark job that I would like to submit to the cluster (that node), so I enter the following command:
spark-submit --class VDQConsumer --master spark://node-public-ip:7077 target/scala-2.10/vdq-consumer-assembly-1.0.jar
It launches the Spark UI on that node, but eventually gets here:
15/05/14 14:19:55 INFO SparkContext: Added JAR file:/Users/cwheeler/dev/git/vdq-consumer/target/scala-2.10/vdq-consumer-assembly-1.0.jar at http://node-ip:54898/jars/vdq-consumer-assembly-1.0.jar with timestamp 1431627595602
15/05/14 14:19:55 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@node-ip:7077/user/Master...
15/05/14 14:19:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@node-ip:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/05/14 14:20:15 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@node-ip:7077/user/Master...
15/05/14 14:20:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@node-ip:7077/user/Master...
15/05/14 14:20:55 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/14 14:20:55 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
15/05/14 14:20:55 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
Does anyone have any idea what just happened?
