How to start, stop and re-start cluster manually in Bluemix Spark? - apache-spark

What is the simplest way to start and stop Spark clusters manually in Bluemix? Basically, I want to run the same things that
sbin/start-all.sh
and
sbin/stop-all.sh
do on a standalone Spark installation.

You can't. The Apache Spark service in Bluemix runs clusters that are shared by many users. No user is allowed to shut down or start these clusters.

Related

What's the most elegant/right way to stop a spark job running on a Kubernetes cluster?

I'm new to Apache Spark and I'm trying to run a Spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop Spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!
If you delete an executor pod, it will be recreated and the Spark application will keep running. However, if you delete the driver pod, the application stops.
So killing the driver pod is actually the way to stop the Spark application during execution.
As you are new to Spark and you want to run it on Kubernetes, you should check this tutorial.
At present the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context which is able to manipulate it). Since all other job-related resources are linked to the Spark driver pod via so-called ownerReferences, they will be removed automatically by Kubernetes.
This also means Kubernetes cleans things up automatically when the job completes.
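As a concrete illustration, here is a hedged sketch of deleting the driver pod with kubectl. The namespace and pod name are placeholders; spark-role=driver is the label spark-submit normally attaches to driver pods on Kubernetes, but check the labels on your own pods first.

# List driver pods in the (hypothetical) spark-jobs namespace.
kubectl get pods -n spark-jobs -l spark-role=driver

# Deleting the driver pod stops the application; executor pods owned by it
# (via ownerReferences) are garbage-collected by Kubernetes.
kubectl delete pod my-app-driver-pod -n spark-jobs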

Can You Use a Script to Start Spark Cluster Nodes?

I'm running Hadoop and Spark on a four-node cluster in AWS EC2.
After doing a lot of web research, it seems the accepted way to start Spark on a cluster (once Hadoop is running) is to:
1) Log into the master node and run start-master.sh.
2) Log into each slave node and run start-slave.sh, passing it the DNS and port information for the master node.
My question is: if there are, say, 20 nodes, this is pretty tedious and time-consuming. Is there a way to start Spark from a single, central location the way Hadoop is started? When you run Hadoop from the master node, it starts all the slave nodes remotely. I'm looking for a solution like that, or for a Python script that can SSH into the nodes and start them.
You could use Apache Ambari to manage the whole cluster, which would SSH to all nodes for you.
Otherwise, you could use a system like Ansible to configure and start all the services.
It sounds like you're only using Spark Standalone, though, not YARN, because there is no start-slaves script for YARN.
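It's also worth noting that the standalone scripts already cover this case: sbin/start-all.sh on the master reads the host list in conf/slaves (renamed conf/workers in Spark 3.x) and starts a worker on each host over passwordless SSH. A minimal sketch, with placeholder hostnames:

# On the master node; assumes passwordless SSH from the master to every worker host.
# One hostname per line in conf/slaves (Spark 2.x) or conf/workers (Spark 3.x):
cat > $SPARK_HOME/conf/slaves <<'EOF'
worker1.example.internal
worker2.example.internal
worker3.example.internal
EOF

# Starts the master locally and a worker on every listed host via SSH.
$SPARK_HOME/sbin/start-all.sh

# Stops the master and all listed workers the same way.
$SPARK_HOME/sbin/stop-all.sh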

How to run a Spark Standalone master on Kubernetes that will use the Kubernetes Cluster Manager to start workers

I have an application that currently uses standalone mode locally to use Spark functionality via the SparkContext. We are not using spark-submit to upload our jobs; we are running our application in a container on Kubernetes, so we would like to take advantage of the dynamic scheduling that Kubernetes provides to run the jobs.
We started out looking for a Helm chart to create a standalone cluster running on Kubernetes, similar to how you would have run a standalone cluster on machines (VMs or actual machines) a few years ago, and came across the following:
https://github.com/helm/charts/tree/master/stable/spark
Issues:
Very old versions of Spark
It does not use the container images provided by Spark
This setup wastes a lot of resources if you need to keep large worker nodes reserved and running all the time, regardless of your actual need
Next, we started looking at the spark-operator approach here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Issues:
It doesn't support the way we interact with Spark; it takes the approach that all apps are standalone apps that are pushed to the cluster to run
There is no long-running master that would allow us to take advantage of cached resources in the cluster
Along this journey we discovered that Spark now supports a Kubernetes cluster manager (similar to the way it does with YARN and Mesos), so we think this might be the best approach, but it still does not provide a standalone master that would allow for in-memory caching. I have looked to see whether there is a way to get org.apache.spark.deploy.master.Master to start and use
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.
So what I'm trying to ask is: does anyone have any experience running a Standalone Master that uses the Kubernetes backend, such as the KubernetesClusterManager, so that worker nodes are dynamically created as pods running executors, while a permanent Standalone Master allows a SparkContext to connect to it remotely in client mode?
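For reference, a hedged sketch of how the native Kubernetes scheduler backend is usually driven in client mode (no standalone master involved): the SparkContext connects to the Kubernetes API server and executors are created as pods. The API server address, namespace, image, class name, and jar path below are placeholders.

# Client-mode submission against the native Kubernetes scheduler backend.
# The driver runs wherever spark-submit runs and must be reachable from the executor pods.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode client \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=<registry>/spark:3.1.2 \
  --conf spark.executor.instances=4 \
  --conf spark.driver.host=$(hostname -i) \
  --class com.example.MyApp \
  /opt/app/my-app.jar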

How to make a Spark Streaming job run perpetually on HDInsight (YARN)?

I am developing a Spark application running in an HDInsight cluster (YARN based) with IntelliJ. Currently, I submit jobs through the Azure HDInsight plug-in directly from IntelliJ. This, in turn, uses the Livy API to submit the job remotely.
When I am done developing the code, I would like the streaming job to run perpetually. Currently, if the job fails five times, the program stops and doesn't restart itself. Is there any way to change this behavior? Or what solution do most people use to make Spark restart after failing?
Restarts of Spark jobs on YARN are controlled by YARN settings, so you need to increase the number of allowed restarts for the Spark application (the YARN application master). I believe the relevant setting is yarn.resourcemanager.am.max-attempts.
In HDInsight, go to the Ambari UI and change this setting under YARN -> Configs -> Advanced yarn-site.
To submit a production job you can use the Livy APIs directly, as described here: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-eventhub-streaming#run-the-application-remotely-on-a-spark-cluster-using-livy
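As an illustration, here is a hedged sketch of a Livy batch submission that also raises the per-application attempt limit via the Spark-on-YARN settings spark.yarn.maxAppAttempts and spark.yarn.am.attemptFailuresValidityInterval. The cluster name, credentials, jar path, and class name are placeholders.

# POST a batch job to the HDInsight Livy endpoint; -k skips certificate checks for brevity.
# Note: spark.yarn.maxAppAttempts cannot exceed yarn.resourcemanager.am.max-attempts in YARN.
curl -k --user "admin:<password>" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
        "file": "wasbs:///example/jars/my-streaming-app.jar",
        "className": "com.example.StreamingApp",
        "conf": {
          "spark.yarn.maxAppAttempts": "10",
          "spark.yarn.am.attemptFailuresValidityInterval": "1h"
        }
      }' \
  "https://<clustername>.azurehdinsight.net/livy/batches"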

DataStax Spark SQL Thriftserver with a Spark Application

I have an analytics node running with the Spark SQL Thriftserver on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How do I configure the DSE node to be able to run both?
The Spark SQL Thriftserver is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
1) Allocate only part of your resources to each application. This is done by setting spark.cores.max to a smaller value than the maximum resources in your cluster. See the Spark docs.
2) Dynamic allocation, which allows applications to change the amount of resources they use depending on how much work they are doing. See the Spark docs.
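For example, a hedged sketch of both options. The class names, jar paths, and core counts are placeholders, and the dse wrapper commands may differ between DSE versions; spark.cores.max, spark.dynamicAllocation.enabled, and spark.shuffle.service.enabled are standard Spark settings.

# Option 1: cap each application at a fixed share of the cluster's cores.
dse spark-sql-thriftserver start --conf spark.cores.max=4
dse spark-submit --conf spark.cores.max=4 --class com.example.MyJob my-job.jar

# Option 2: let applications grow and shrink with demand (the external shuffle
# service is needed so executors can be released safely).
dse spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MyJob my-job.jar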
