How to auto-scale a Spark job in a Kubernetes cluster - apache-spark

Need advice on running Spark on Kubernetes. I have Spark 2.3.0, which comes with native Kubernetes support. I am trying to run the Spark job using spark-submit, with the master parameter set to "kubernetes-apiserver:port" and the other required parameters, such as the Spark image, as mentioned here.
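For reference, a rough sketch of the spark-submit invocation I am describing (host, port, image name, class, and jar path are placeholders):
bin/spark-submit \
  --master k8s://https://<kubernetes-apiserver>:<port> \
  --deploy-mode cluster \
  --name my-spark-job \
  --class com.example.MyJob \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/jars/my-spark-job.jar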
How do I enable auto scaling / increase the number of worker nodes based on load? Is there a sample document I can follow? A basic example or document would be very helpful.
Or is there any other way to deploy Spark on Kubernetes that can help me achieve auto scaling based on load?

Basically, Apache Spark 2.3.0 does not officially support auto scaling on a K8s cluster, as you can see in the future work planned after 2.3.0.
BTW, it is still a work-in-progress feature, but you can try the k8s fork for Spark 2.2.

Related

What is Databricks Spark cluster manager? Can it be changed?

The original Spark distribution supports several cluster managers, such as YARN, Mesos, Spark Standalone, and K8s.
I can't find what is under the hood in Databricks Spark: which cluster manager is it using, and is it possible to change it?
What is the Databricks Spark architecture?
Thanks.
You can't check the cluster manager in Databricks, and you really don't need to, because that part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means - init scripts, Spark configuration parameters, etc. See the documentation for more details.
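For example, Spark configuration parameters can be set on the cluster as plain key-value pairs (the keys are standard Spark settings; the values here are only illustrative):
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 64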

Run parallel jobs on-prem dynamic spark clusters

I am new to Spark, and we have a requirement to set up a dynamic Spark cluster to run multiple jobs. Going by some articles, we can achieve this by using the Amazon EMR service.
Is there any way to do the same setup locally (on-premises)?
Once Spark clusters are available, with services running on different ports on different servers, how do we point Mist to a new Spark cluster for each job?
Thanks in advance.
Yes, you can use the Standalone cluster manager that Spark provides, where you set up the Spark cluster yourself (master and slave nodes). There are also Docker containers that can be used to achieve that. Take a look here.
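As a rough sketch, a minimal standalone cluster can be brought up with the scripts shipped in the Spark distribution (the hostname is a placeholder):
# on the master node
./sbin/start-master.sh
# on each worker node, pointing at the master URL
./sbin/start-slave.sh spark://<master-host>:7077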
Another option would be to deploy a Hadoop distribution locally, like MapR, Hortonworks, or Cloudera.

How can we set the execution parameters for an apache spark application

We have set up a 4-node multi-node cluster for testing the Spark application.
Each node has 250 GB RAM and 48 cores.
The master runs on one node and the other 3 run as slaves.
We have developed a Spark application using Scala.
We use spark-submit to run the job.
Here is the point where we are stuck and need more clarification to proceed.
Query 1:
Which is the better option for running a Spark job, and what is the difference between them?
a) Spark standalone as the master
b) YARN as the master
Query 2:
While running any Spark job we can provide options like the number of executors, the number of cores, executor memory, etc.
Could you please advise what the optimal values for these parameters would be for better performance in my case?
Any help would be very much appreciated, since it would also be useful for anyone starting with Spark :)
Thanks!
Query 1: YARN is a more capable resource manager and supports more features than the Spark standalone master. For more, you can visit
Apache Spark Cluster Managers
Query 2: You can only assign resources at job initialization time. There are command-line flags available. Also, if you don't wish to pass command-line flags to spark-submit, you can set them when creating the Spark configuration in code (see the sketch below).
You can see the available flags using
spark-submit --help
For more information, visit Spark Configuration.
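For example, a sketch of setting these programmatically; the numbers are only illustrative starting points for nodes of this size, not recommended optimal values:
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only - tune them for your data size and workload.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.instances", "12")  // a few executors per worker node
  .set("spark.executor.cores", "5")       // well below the 48 physical cores per node
  .set("spark.executor.memory", "30g")    // leave headroom for the OS and overhead
val sc = new SparkContext(conf)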
Choosing resources mostly depends on the size of the data you want to process and the complexity of the problem.
Please see 5 mistakes to avoid while writing Spark applications.

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) this works fine, but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
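For example (class and jar path are placeholders):
dse spark-submit --class com.example.MyJob /path/to/my-job.jar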
Resource Manager Changes
The DSE Spark resource manager is different from the OSS Spark Standalone resource manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC, but as noted it does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class,
DseConfiguration,
which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and will set these options for you.
Example
Docs
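As a rough sketch only (the import path for DseConfiguration and the exact dse:// URI depend on your DSE version, so check the docs above):
import org.apache.spark.SparkConf
// NOTE: import DseConfiguration from the package documented for your DSE version.

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("dse://<dse-host>:<port>")               // DSE resource manager URI instead of spark://
val dseConf = DseConfiguration.enableDseSupport(conf)  // adds the DSE-specific options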
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1; it has to be submitted with dse spark-submit.
Once submitted, it works perfectly. To communicate with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.

Use SparkLauncher to programmatically submit a spark job to the dse spark cluster

I am relatively new to Spark and DSE, and I am trying to submit a Spark job to the DSE Spark cluster programmatically.
I am using the org.apache.spark.launcher.SparkLauncher API. I tried following the documentation for SparkLauncher.
Process launcher = new SparkLauncher().setAppName("appName")
.setAppResource("spark-job.jar")
.setSparkHome("spark-home")
.setMainClass("main-class")
.setVerbose(true).launch();
launcher.waitFor();
But it doesn't seem to launch the job on the DSE cluster. I can trigger the job manually using the dse spark-submit command.
Any help would be appreciated. Thanks!
I believe this has something to do with not setting your Spark home. Identify your Spark home in DSE and then add
.setSparkHome("sparkHomeDir")
Also, rather use a SparkAppHandle than a blocking wait:
SparkAppHandle handle = launcher.startApplication();
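Putting it together, a minimal sketch (paths and class names are placeholders; the DSE Spark home location depends on your install):
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppName("appName")
  .setAppResource("/path/to/spark-job.jar")
  .setMainClass("com.example.MainClass")
  .setSparkHome("/path/to/dse/resources/spark")  // DSE's bundled Spark home
  .setVerbose(true)
  .startApplication()

// Poll the handle instead of blocking on the launched process.
while (!handle.getState.isFinal) {
  Thread.sleep(1000)
}
println(s"Final state: ${handle.getState}")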
