DynamicAllocation enabled with Spark on Kubernetes?

The latest documentation for Spark 2.4.5 lists "Dynamic Resource Allocation and External Shuffle Service" under Future Work; however, I have also found older documentation for Spark 2.2.0 suggesting it is supported after setting up an external shuffle service.
Have you successfully enabled Spark dynamic allocation on Kubernetes? If so, what challenges did you face and which documentation did you reference?
We are currently using the AWS EMR service for Spark and would like to try out Spark on Kubernetes with dynamic allocation enabled.
Thanks!

The older docs belong to the older Spark fork repository, which was used as the basis and proof of concept for the Kubernetes-related work in the main Apache Spark repository. If you want this feature enabled, you are restricted to that older Spark 2.2.0 fork. Note that it is not recommended for production.
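For reference, the switch itself is just standard Spark properties plus the external shuffle service. A minimal Scala sketch, assuming you are on a build that actually ships the shuffle service on K8s (i.e. that older 2.2.0 fork); the executor bounds are illustrative values:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: the standard dynamic-allocation properties.
// On Kubernetes they only take effect on builds that ship the external
// shuffle service (e.g. the older 2.2.0 fork mentioned above).
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // an external shuffle service must be running
  .set("spark.dynamicAllocation.minExecutors", "1")    // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "10")

val spark = SparkSession.builder().config(conf).getOrCreate()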

Related

What is Databricks Spark cluster manager? Can it be changed?

The original Spark distribution supports several cluster managers, such as YARN, Mesos, Spark Standalone, and K8s.
I can't find what is under the hood in Databricks Spark: which cluster manager is it using, and is it possible to change it?
What's Databricks Spark architecture?
Thanks.
You can't check the cluster manager in Databricks, and you really don't need to, because this part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means - init scripts, configuration parameters, etc. (a small example follows below). See the documentation for more details.
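As one illustration of the configuration-parameter route, a short notebook-style Scala sketch (it assumes the spark session Databricks already provides; the property and value are just examples of a runtime-settable option):

// In a Databricks notebook a SparkSession named `spark` is already available.
// Runtime-settable options (mostly spark.sql.*) can be adjusted like this:
spark.conf.set("spark.sql.shuffle.partitions", "64")   // illustrative value

// Read it back to confirm it took effect.
println(spark.conf.get("spark.sql.shuffle.partitions"))

Cluster-wide or JVM-level settings still have to go through the cluster configuration UI or init scripts rather than the running session.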

Spark JobServer can use Cassandra as SharedDb

I have been doing research on configuring the Spark JobServer backend (SharedDb) with Cassandra.
I saw in the SJS documentation that Cassandra is cited as one of the shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example? Or can help me to configure it?
Edited:
I want to use Cassandra to store metadata and jobs from Spark JobServer, so that I can hit any of the servers through a proxy sitting in front of them.
Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
But the Cassandra DAO was recently deprecated and removed from the project, because it was not really used or maintained by the community.
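For anyone on one of those older releases (e.g. 0.8.0), the wiring was just a small change in the Jobserver config file. A sketch only; take the exact Cassandra connection keys from the linked application.conf rather than from here:

# Jobserver .conf (HOCON), on a release that still ships the Cassandra DAO
spark.jobserver {
  # select the Cassandra-backed DAO instead of the default SQL one
  jobdao = spark.jobserver.io.JobCassandraDAO

  # Cassandra contact points, keyspace, etc. go here; copy the exact key names
  # from the 0.8.0 application.conf linked above (they are not reproduced here)
}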

how to auto scale spark job in kubernetes cluster

I need some advice on running Spark on Kubernetes. I have Spark 2.3.0, which comes with native Kubernetes support. I am trying to run a Spark job using spark-submit with the master set to "kubernetes-apiserver:port" and the other required parameters, like the Spark image, as mentioned here.
How do I enable autoscaling / increase the number of worker nodes based on load? Is there a sample document I can follow? A basic example/document would be very helpful.
Or is there any other way to deploy Spark on Kubernetes that can help me achieve autoscaling based on load?
Basically, Apache Spark 2.3.0 does not officially support autoscaling on a K8s cluster, as you can see in the Future Work section after 2.3.0.
BTW, it is still a work in progress, but you can try it on the K8s fork for Spark 2.2.
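Until that lands, the practical fallback is a fixed executor count per job. A rough Scala sketch of the relevant properties (the image name, API server address and count are placeholders; on 2.3.0 you would normally pass these to spark-submit as --conf values in cluster mode):

import org.apache.spark.SparkConf

// Sketch: static sizing, since dynamic allocation is not available on K8s in 2.3.0.
val conf = new SparkConf()
  .setMaster("k8s://https://kubernetes-apiserver:port")        // placeholder API server address
  .set("spark.kubernetes.container.image", "my-spark:2.3.0")   // placeholder image name
  .set("spark.executor.instances", "5")                        // fixed number of executors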

Can I run spark 2.0.* artifact on a spark 2.2.* stand-alone cluster?

I am aware that with a change of Spark's major version (i.e. from 1.* to 2.*) there will be compile-time failures due to changes in existing APIs.
To my knowledge, Spark guarantees that with a minor version update (i.e. 2.0.* to 2.2.*), changes will be backward compatible.
Although this eliminates the possibility of compile-time failures on upgrade, would it be safe to assume that there won't be any runtime failures either if I submit a job to a Spark 2.2.* standalone cluster using an artifact (jar) built with 2.0.* dependencies?
would it be safe to assume that there won't be any runtime failures either if I submit a job to a 2.2.* cluster using an artifact (jar) built with 2.0.* dependencies?
Yes.
I'd even say that there's no concept of a Spark cluster unless we talk about the built-in Spark Standalone cluster.
In other words, you deploy a Spark application to a cluster, e.g. Hadoop YARN or Apache Mesos, as an application jar that may or may not contain the Spark jars, and so may disregard what's already available in the environment.
If, however, you do mean Spark Standalone, things may have broken between releases, even between 2.0 and 2.2, as the jars in your Spark application have to be compatible with the ones on the JVM of the Spark workers (they are already pre-loaded).
I would not claim full compatibility between releases of Spark Standalone.
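A common way to reduce that risk on YARN/Mesos is to not bundle Spark in the application jar at all, so the cluster's own version is what runs. A hedged sbt sketch (versions and module list are illustrative):

// build.sbt: mark Spark as "provided" so the assembly jar does not bundle Spark
// and whatever version the cluster ships is used at runtime.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)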

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) this works fine, but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC but, as noted, does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class, DseConfiguration, which can be applied to your SparkConf with DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and will set these options for you.
Example
Docs
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
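Putting the pieces above together, a rough Scala sketch of the programmatic route (the package of DseConfiguration is not shown because it depends on your DSE version; take it from the Docs link above, and prefer dse spark-submit as recommended):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
// plus the DseConfiguration import from the DSE jars (exact package per the DSE 5.1 docs)

val conf = new SparkConf()
  .setAppName("dse-programmatic-job")
  .setMaster("dse://localhost")                  // DSE resource-manager URI instead of spark://host:7077
  .setJars(Seq("target/my-app-assembly.jar"))    // placeholder path to the application jar

// Per the answer above (DSE 5.1.3+): adds the DSE-specific options to the conf;
// it can also be pulled in via the implicit instead of the direct call.
DseConfiguration.enableDseSupport(conf)

val spark = SparkSession.builder().config(conf).getOrCreate()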
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application against DSE 5.1; it has to be sent with dse spark-submit.
Once sent, it works perfectly. For communication with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.
