Persistent Spark cluster on EKS - apache-spark

I used spark-submit as described in the Spark docs, but I'm looking for something different for my deployment: I want a cluster that is always on, meaning the master and the workers stay up, and I want to connect to the Spark master from inside a Python script using the master's IP.
What is the best way to achieve this with EKS? Is the Spark Operator a good option?
EDIT: using PySpark
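Roughly, the goal looks like this minimal PySpark sketch, assuming the standalone master is reachable at a fixed address; the Service DNS name and port below are placeholders, not anything actually running:

```python
from pyspark.sql import SparkSession

# Placeholder address: a standalone master exposed as a Kubernetes Service
# (e.g. "spark-master" in the "spark" namespace) on the default port 7077.
spark = (
    SparkSession.builder
    .appName("always-on-client")
    .master("spark://spark-master.spark.svc.cluster.local:7077")
    .getOrCreate()
)

spark.range(10).show()
spark.stop()
```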

Related

Suggestions for using Spark in AWS without any cluster?

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements around the Spark context, etc. Is it possible for me to just use the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that uses Spark. I was thinking I could somehow build a fat jar with the dependency, but I'm not quite sure what I'm looking for.
Yes, you need a cluster to run Spark, but a cluster is nothing more than a platform on which Spark is installed.
I think your question should really be "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes: Spark can run on a single node, because Spark ships with its own standalone cluster manager.
"but I see there are requirements on Spark context":
The SparkContext is the entry point of a Spark application; you need to create one in order to use any Spark functionality.
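As a minimal illustration, here is a local-mode sketch that needs no cluster beyond the current machine:

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark inside this single process using all local cores;
# the SparkContext is created for you underneath the SparkSession.
spark = SparkSession.builder.master("local[*]").appName("no-cluster").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950

spark.stop()
```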

Run parallel jobs on-prem dynamic spark clusters

I am new to Spark, and we have a requirement to set up a dynamic Spark cluster to run multiple jobs. From the articles I have referred to, this can be achieved with Amazon's EMR service.
Is there any way the same setup can be done locally (on-premises)?
Once the Spark clusters are available, with services running on different ports on different servers, how do I point Mist to the new Spark cluster for each job?
Thanks in advance.
Yes, you can use the standalone cluster manager that Spark provides, which lets you set up a Spark cluster (master and worker nodes) yourself. There are also Docker images that can be used to achieve that; take a look here.
Another option would be to deploy a Hadoop distribution locally, such as MapR, Hortonworks, or Cloudera.
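As a rough sketch of pointing each job at whichever standalone cluster it should use, you can simply parameterize the master URL; the hosts and ports below are placeholders:

```python
import sys
from pyspark.sql import SparkSession

# Placeholder: pass the standalone master of the target cluster as the first
# argument, e.g. spark://host-a:7077 or spark://host-b:7078.
master_url = sys.argv[1] if len(sys.argv) > 1 else "spark://localhost:7077"

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("per-job-cluster")
    .getOrCreate()
)
spark.range(5).show()
spark.stop()
```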

how to auto scale spark job in kubernetes cluster

I need advice on running Spark on Kubernetes. I have Spark 2.3.0, which comes with native Kubernetes support. I am trying to run a Spark job using spark-submit, with the master parameter set to "k8s://<kubernetes-apiserver>:<port>" and the other required parameters, such as the Spark container image, as mentioned here.
How do I enable auto-scaling / increase the number of worker nodes based on load? Is there a sample document I can follow? A basic example or document would be very helpful.
Or is there any other way to deploy Spark on Kubernetes that would let me auto-scale based on load?
Basically, Apache Spark 2.3.0 does not officially support auto-scaling on a Kubernetes cluster, as you can see in the future work listed after 2.3.0.
By the way, it is still a work-in-progress feature, but you can try it on the Kubernetes fork for Spark 2.2.
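For reference, the kind of submission described in the question looks roughly like the sketch below, wrapped in Python only to keep the examples in one language; the API server address, image name, and jar path are placeholders, and spark.executor.instances is a static count, not anything auto-scaled:

```python
import subprocess

# Placeholders: replace the API server address, container image, and jar path.
cmd = [
    "spark-submit",
    "--master", "k8s://https://kubernetes-apiserver:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=3",                      # fixed executor count
    "--conf", "spark.kubernetes.container.image=<spark-image>",  # image built from the Spark distribution
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
]
subprocess.run(cmd, check=True)
```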

Airflow + Kubernetes VS Airflow + Spark

Like an article that I previously read, which said that the new Kubernetes version already includes Spark capabilities, but used in a somewhat different way, such as using the KubernetesPodOperator instead of the BashOperator / PythonOperator to do spark-submit.
Is the best practice for combining Airflow + Kubernetes to remove Spark and use the KubernetesPodOperator to execute the task?
Would that give better performance, since Kubernetes has auto-scaling that Spark doesn't have?
I need someone experienced with Kubernetes to help me explain this. I'm still a newbie with Kubernetes, Spark, and Airflow.
Thank you.
the new Kubernetes version already includes Spark capabilities
I think you got that backwards. New versions of Spark can run tasks in a Kubernetes cluster.
using the KubernetesPodOperator instead of the BashOperator / PythonOperator to do spark-submit
Using Kubernetes would allow you to run containers with whatever isolated dependencies you wanted.
Meaning:
With the BashOperator, you must distribute the files to some shared filesystem or to all the nodes that run the Airflow tasks. For example, spark-submit must be available on all Airflow nodes.
Similarly with the PythonOperator, you have to ship out zip or egg files that include your pip/conda dependency environment.
remove Spark and use the KubernetesPodOperator to execute the task
There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container that executes spark-submit against the Kubernetes cluster from inside the container. This way, you only need Docker installed, not Spark (and all its dependencies).
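A hedged sketch of that pattern with the KubernetesPodOperator; the image, namespace, and application jar are placeholders, and the import path is the Airflow 1.10 contrib one (newer Airflow versions ship this operator in a provider package):

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG("spark_on_k8s", start_date=datetime(2019, 1, 1), schedule_interval=None)

# The pod runs spark-submit against the cluster it is scheduled into, so the
# Airflow workers only need access to Kubernetes, not a Spark installation.
submit_job = KubernetesPodOperator(
    task_id="spark_submit_in_pod",
    name="spark-driver",
    namespace="spark",                                # placeholder namespace
    image="my-registry/spark-driver:latest",          # placeholder image with spark-submit baked in
    cmds=["spark-submit"],
    arguments=[
        "--master", "k8s://https://kubernetes.default.svc",
        "--deploy-mode", "cluster",
        "--conf", "spark.kubernetes.container.image=my-registry/spark:latest",
        "local:///opt/app/my-job.jar",                # placeholder application artifact
    ],
    get_logs=True,
    dag=dag,
)
```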
Kubernetes has auto-scaling that Spark doesn't have
Spark does have Dynamic Resource Allocation...
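For reference, enabling it is a handful of configs along these lines (the min/max values are illustrative; the classic form also needs the external shuffle service):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")    # illustrative bounds
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.shuffle.service.enabled", "true")        # required for classic dynamic allocation
    .getOrCreate()
)
```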
One more solution which may help you is to use Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) together with the Airflow HttpOperator.
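A rough sketch of that route, assuming an Airflow HTTP connection (here named livy_http) that points at the Livy server and a placeholder jar/class; Livy's batch submission endpoint is POST /batches:

```python
import json
from datetime import datetime
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

dag = DAG("spark_via_livy", start_date=datetime(2019, 1, 1), schedule_interval=None)

submit_batch = SimpleHttpOperator(
    task_id="submit_spark_via_livy",
    http_conn_id="livy_http",                         # placeholder Airflow HTTP connection
    endpoint="batches",
    method="POST",
    headers={"Content-Type": "application/json"},
    data=json.dumps({
        "file": "local:///opt/app/my-job.jar",        # placeholder application jar
        "className": "com.example.MyJob",             # placeholder main class
        "args": ["--date", "{{ ds }}"],
    }),
    dag=dag,
)
```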

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice about this? It looks like it's a little complicated to run spark-submit remotely from another cluster, and that will create some duplication of configuration files.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an ApplicationMaster is deployed within YARN, Spark is running locally to the Hadoop cluster.
If you really want, you could also add an hdfs-site.xml and hive-site.xml to be submitted from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command there, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task I perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
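A sketch of that approach, assuming an Airflow SSH connection (here called spark_edge_node) to an edge host that already has the YARN client configs and spark-submit installed; the job path is a placeholder:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

dag = DAG("spark_over_ssh", start_date=datetime(2019, 1, 1), schedule_interval=None)

# spark-submit runs on the edge node, which already has yarn-site.xml and
# friends, so nothing Hadoop-specific has to live on the Airflow machine.
submit_on_edge = SSHOperator(
    task_id="spark_submit_over_ssh",
    ssh_conn_id="spark_edge_node",                    # placeholder Airflow SSH connection
    command=(
        "spark-submit --master yarn --deploy-mode client "
        "--num-executors 4 /opt/jobs/my_job.py"       # placeholder job script
    ),
    dag=dag,
)
```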
There are a variety of options for remotely performing spark-submit via Airflow:
EMR Step
Apache Livy (see this for a hint)
SSH
Do note that none of these are plug-and-play ready, and you'll have to write your own operators to get things done.
