Airflow + Kubernetes vs. Airflow + Spark

In an article I read previously, it said that newer Kubernetes versions already include Spark capabilities, but done in a different way, such as using KubernetesPodOperator instead of BashOperator / PythonOperator to do the spark-submit.
Is the best practice for combining Airflow + Kubernetes to remove Spark and use KubernetesPodOperator to execute the tasks?
Would that give better performance, since Kubernetes has autoscaling that Spark doesn't have?
I need someone experienced with Kubernetes to help explain this. I'm still a newbie with Kubernetes, Spark, and Airflow.
Thank you.

newer Kubernetes versions already include Spark capabilities
I think you got that backwards. New versions of Spark can run tasks in a Kubernetes cluster.
using KubernetesPodOperator instead of BashOperator / PythonOperator to do the spark-submit
Using Kubernetes would allow you to run containers with whatever isolated dependencies you wanted.
Meaning:
With BashOperator, you must distribute the files to some shared filesystem or to all the nodes that run the Airflow tasks. For example, spark-submit must be available on all Airflow nodes.
Similarly with PythonOperator, you ship out zip or egg files that include your pip/conda dependency environment.
remove Spark and use KubernetesPodOperator to execute the tasks
There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container and execute spark-submit inside that container against the Kubernetes cluster. This way, you only need Docker installed on the Airflow workers, not Spark (and all of its dependencies).
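For illustration, here is a minimal sketch of that idea as an Airflow task, assuming a hypothetical image my-registry/spark-driver:latest that already has spark-submit and the job's dependencies baked in (the image names, namespace, and script path are placeholders, not anything from the original posts):

```python
from datetime import datetime

from airflow import DAG
# Note: in older provider versions the import path is
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="spark_in_k8s_pod", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    submit_job = KubernetesPodOperator(
        task_id="spark_submit_in_container",
        name="spark-driver",
        namespace="data-jobs",                        # placeholder namespace
        image="my-registry/spark-driver:latest",      # image with spark-submit + deps baked in
        cmds=["spark-submit"],
        arguments=[
            "--master", "k8s://https://kubernetes.default.svc",  # Spark-on-Kubernetes master URL
            "--deploy-mode", "cluster",
            "--conf", "spark.kubernetes.container.image=my-registry/spark-job:latest",
            "local:///opt/jobs/etl_job.py",           # job script shipped inside the image
        ],
        get_logs=True,
    )
```

The point is that the Airflow worker only needs to be able to launch a pod; everything Spark-specific lives in the container image.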
Kubernetes has autoscaling that Spark doesn't have
Spark does have Dynamic Resource Allocation, which grows and shrinks the number of executors with the workload...
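For context, dynamic allocation is just a handful of Spark configuration flags; a minimal sketch in PySpark (the specific values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Dynamic allocation lets Spark request and release executors as the workload changes.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")    # illustrative bounds
    .config("spark.dynamicAllocation.maxExecutors", "10")
    # On Kubernetes, shuffle tracking is the usual substitute for an external shuffle service.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```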

One more solution which may help you is to use Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) with Airflow HttpOperator.
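Roughly, that means having Airflow POST the job to Livy's /batches REST endpoint; a hedged sketch using SimpleHttpOperator (the connection id and job file below are placeholders):

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(dag_id="livy_spark_submit", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    submit_batch = SimpleHttpOperator(
        task_id="submit_spark_batch",
        http_conn_id="livy",                         # Airflow connection pointing at the Livy server
        endpoint="batches",                          # Livy's batch-session REST endpoint
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            "file": "local:///opt/jobs/etl_job.py",  # placeholder job file
            "name": "etl-job",
        }),
    )
```

There is also a dedicated LivyOperator in the Apache Livy provider package that wraps this same REST API, which may be more convenient than hand-rolling HTTP calls.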

Related

Consistent Spark cluster on EKS

I used spark-submit as described in the Spark docs.
But I was looking for something else for my deployment. I want a cluster that is always on, meaning that the master and the workers are always up. I want to connect to the Spark master from inside a Python script using the master's IP.
What is the best way to achieve this with EKS? Is the Spark Operator a good option?
EDIT: using PySpark
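For reference, "connect to the Spark master from inside a Python script using the master's IP" typically looks like the following in PySpark, assuming a standalone master reachable on its default port 7077 (the address below is a placeholder):

```python
from pyspark.sql import SparkSession

# Point the driver at an always-on standalone master by address
# (a Kubernetes Service name or IP; 7077 is the standalone master's default port).
spark = (
    SparkSession.builder
    .master("spark://spark-master.default.svc.cluster.local:7077")  # placeholder address
    .appName("eks-standalone-client")
    .getOrCreate()
)

df = spark.range(10)
print(df.count())
```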

Unable to run Hop pipelines on Spark running on Kubernetes

I am looking for help running Hop pipelines on a Spark cluster that runs on Kubernetes.
I have a Spark master deployed with 3 worker nodes on Kubernetes.
I am using the hop-run.sh command to run a pipeline on Spark running on Kubernetes.
I'm facing the exception below:
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.s3.AmazonS3ClientBuilder
It looks like the fat jar is not getting associated with Spark when running the hop-run.sh command.
I tried running the same with the spark-submit command too, but I'm not sure how to pass references to the pipelines and workflows to Spark running on Kubernetes, though I am able to add the fat jar to the classpath (as can be seen in the logs).
Any kind of help is appreciated.
Thanks
Could it be that you are using version 1.0?
We had a missing jar for the S3 VFS, which was resolved in 1.1:
https://issues.apache.org/jira/browse/HOP-3327
For more information on how to use spark-submit, you can take a look at the following documentation:
https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-spark-pipeline-engine.html#_running_with_spark_submit
The locations of the fat jar, the pipeline, and the required metadata export can all be VFS locations, so there is no need to place those on the cluster itself.

What is Databricks Spark cluster manager? Can it be changed?

The original Spark distribution supports several cluster managers, like YARN, Mesos, Spark Standalone, and Kubernetes.
I can't find what is under the hood in Databricks Spark: which cluster manager is it using, and is it possible to change it?
What is the Databricks Spark architecture?
Thanks.
You can't check the cluster manager in Databricks, and you really don't need to, because this part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means - init scripts, configuration parameters, etc. See the documentation for more details.

Suggestions for using Spark in AWS without any cluster?

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements on Spark context, etc. Is it possible for me to just leverage the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that leverages Spark. I was thinking I could somehow fat-jar this dependency or something, but I'm not quite sure what I'm looking for.
Yes, you need a cluster to run Spark.
A cluster is nothing more than a platform on which Spark is installed.
I think your question should be "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes, Spark can run on a standalone node, since Spark ships with its own standalone cluster manager.
"but I see there are requirements on Spark context":
SparkContext is the entry point of Spark; you need to create it in order to use any Spark functionality.
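To make that concrete, the usual way to run Spark on a single node without any external cluster is local mode, where the "cluster" is just threads inside the driver process; a minimal PySpark sketch:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark entirely inside this process, using all available cores;
# no separate master or worker daemons are required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("single-node-job")
    .getOrCreate()
)

# The underlying SparkContext (the entry point mentioned above) is created for you.
sc = spark.sparkContext

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```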

Airflow and Spark/Hadoop - a single cluster, or one for Airflow and another for Spark/Hadoop?

I'm trying to figure out the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice about this? It looks like it's a little complicated to run spark-submit remotely from another cluster, and that will create some configuration-file duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an ApplicationMaster is deployed within YARN, Spark is running locally to the Hadoop cluster.
If you really want, you could also add an hdfs-site.xml and hive-site.xml to be submitted from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task you perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
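A minimal sketch of that SSHOperator approach, assuming an Airflow SSH connection (here called spark_edge_node, a placeholder) that points at a host where spark-submit and the YARN client configs already exist:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(dag_id="remote_spark_submit", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    spark_job = SSHOperator(
        task_id="spark_submit_via_ssh",
        ssh_conn_id="spark_edge_node",     # placeholder connection to the Hadoop edge node
        command=(
            "spark-submit "
            "--master yarn --deploy-mode client "
            "/opt/jobs/etl_job.py"         # placeholder job path on the remote host
        ),
    )
```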
There are a variety of options for remotely performing spark-submit via Airflow.
EMR Step
Apache Livy (see this for a hint)
SSH
Do note that none of these are plug-and-play ready, and you'll have to write your own operators to get things done.
