Suggestions for using Spark in AWS without any cluster? - apache-spark

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements around a Spark context, etc. Is it possible for me to just use the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that leverages Spark. I was thinking I could somehow fat-jar this dependency or something, but I'm not quite sure what I'm looking for.

Yes, you need a cluster to run Spark; a cluster is nothing more than the platform Spark is installed on.
I think your question should really be "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes: Spark can run on a single node, because Spark ships with its own standalone cluster manager. Spark also has a local mode (master "local[*]"), which runs the driver and executors in a single JVM with no cluster at all.
"but I see there are requirements on Spark context":
SparkContext is the entry point of Spark; you need to create one in order to use any Spark functionality.

Related

What is Databricks Spark cluster manager? Can it be changed?

The original Spark distribution supports several cluster managers: YARN, Mesos, Spark Standalone, and Kubernetes.
I can't find what is under the hood in Databricks Spark: which cluster manager it is using, and whether it is possible to change it.
What's Databricks Spark architecture?
Thanks.
You can't inspect the cluster manager in Databricks, and you really don't need to, because that part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means: init scripts, configuration parameters, etc. See the documentation for more details.
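As a sketch of what such configuration looks like, here is a cluster spec of the kind you would send to the Databricks Clusters REST API; the field names follow the public API, but the concrete values (node type, version, script path) are illustrative assumptions:

```python
# Hypothetical Databricks cluster spec: spark_conf and init_scripts are the
# two configuration hooks mentioned above. Values are placeholders.
cluster_spec = {
    "cluster_name": "example-cluster",
    "spark_version": "7.3.x-scala2.12",          # assumed runtime version
    "node_type_id": "i3.xlarge",                 # assumed node type
    "num_workers": 2,
    "spark_conf": {"spark.speculation": "true"}, # example Spark parameter
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/init/setup.sh"}}  # assumed script path
    ],
}
print(cluster_spec["cluster_name"])
```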

Run parallel jobs on-prem dynamic spark clusters

I am new to Spark, and we have a requirement to set up a dynamic Spark cluster to run multiple jobs. Referring to some articles, we can achieve this by using the EMR (Amazon) service.
Is there any way the same setup can be done locally?
Once Spark clusters are available with services running on different ports on different servers, how do I point Mist to a new Spark cluster for each job?
Thanks in advance.
Yes, you can use the standalone cluster that Spark provides, where you set up a Spark cluster yourself (master and worker nodes). There are also Docker containers that can be used to achieve that. Take a look here.
Another option would be to deploy a Hadoop ecosystem locally, such as MapR, Hortonworks, or Cloudera.

Airflow + Kubernetes VS Airflow + Spark

An article that I previously read said that new Kubernetes versions already include Spark capabilities, but with a somewhat different approach, such as using KubernetesPodOperator instead of BashOperator / PythonOperator to do the spark-submit.
Is the best practice for combining Airflow + Kubernetes to remove Spark and use KubernetesPodOperator to execute the task?
Would that have better performance, since Kubernetes has autoscaling that Spark doesn't have?
I need someone experienced with Kubernetes to help me explain this. I'm still a newbie with Kubernetes, Spark, and Airflow. :slight_smile:
Thank you.
in new Kubernetes version, already include Spark capabilities
I think you got that backwards. New versions of Spark can run tasks in a Kubernetes cluster.
using KubernetesPodOperator instead of using BashOperator / PythonOperator to do SparkSubmit
Using Kubernetes would allow you to run containers with whatever isolated dependencies you wanted.
Meaning
With BashOperator, you must distribute the files to some shared filesystem or to all of the nodes that run the Airflow tasks. For example, spark-submit must be available on all Airflow nodes.
Similarly with PythonOperator, you ship out zip or egg files that include your pip/conda dependency environment.
remove Spark and using KubernetesPodOperator to execute the task
There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container that executes spark-submit inside a container against the Kubernetes cluster. This way, you only need Docker installed, not Spark (and all of its dependencies).
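As a sketch, the spark-submit call inside such a driver container might look like the following; the API server address, container image, and job path are all assumptions:

```python
# Hypothetical spark-submit invocation against a Kubernetes cluster, built
# as the argument list a driver container could execute.
submit_cmd = [
    "spark-submit",
    "--master", "k8s://https://kube-apiserver:6443",   # assumed API server URL
    "--deploy-mode", "cluster",
    "--conf", "spark.kubernetes.container.image=my-org/spark:3.0.0",  # assumed image
    "local:///opt/jobs/convert.py",                    # assumed job file in the image
]
print(" ".join(submit_cmd))
```

Spark then schedules its executors as pods on the cluster, so the only Spark installation lives inside the container image.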
Kubernetes have AutoScaling that Spark doesn’t have
Spark does have Dynamic Resource Allocation...
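For reference, here is a sketch of the configuration that turns Dynamic Resource Allocation on; the property names are Spark's, but the executor bounds are illustrative:

```python
# Spark settings enabling Dynamic Resource Allocation, expressed as the
# --conf pairs you would hand to spark-submit; min/max values are examples.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.shuffle.service.enabled": "true",  # external shuffle service is required
}
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(dynamic_allocation.items()))
print(flags)
```

With this enabled, Spark grows and shrinks the executor count with the workload, which is the closest analogue to Kubernetes autoscaling.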
One more solution that may help you is to use Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) together with the Airflow HttpOperator.
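For a sense of what that looks like, here is a sketch of the JSON body an Airflow HttpOperator could POST to Livy's /batches endpoint; the field names follow Livy's batch API, but the file path, image, and args are assumptions:

```python
import json

# Hypothetical Livy batch submission body. "file", "args", and "conf" are
# real fields of Livy's POST /batches API; the values are placeholders.
payload = {
    "file": "local:///opt/jobs/my_job.py",
    "args": ["--date", "2020-01-01"],
    "conf": {"spark.kubernetes.container.image": "my-org/spark:3.0.0"},
}
body = json.dumps(payload)
print(body)
```

Livy then handles the spark-submit on your behalf, so Airflow only needs HTTP access to the Livy server.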

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code using a Jupyter notebook calling pyspark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say YES, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
1. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
2. Spark standalone mode requires its own worker processes, which cannot run non-Spark applications, whereas YARN isolates work in containers, so adopting another compute framework becomes a code change instead of an infrastructure + code change. The cluster can therefore be shared among different frameworks.
3. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
4. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it for an Impala query and the rest for a Spark application, without any changes in configuration.
I would say points 1, 2, and 3 are suitable for the scenario you mention, but not point 4, since we assumed no other frameworks are going to use the cluster.
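In practice, switching between the two managers is mostly a matter of the --master flag on spark-submit; the master host and queue name below are assumptions:

```python
# Sketch: the same job submitted to a standalone master vs. to YARN.
standalone_cmd = ["spark-submit", "--master", "spark://master-host:7077", "job.py"]
yarn_cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
            "--queue", "default", "job.py"]   # assumed YARN queue name
print(" ".join(yarn_cmd))
```

The --queue flag is where the YARN scheduler features from point 1 come in: each queue can have its own capacity and priority.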
source

Is spark or spark with mesos the easiest to start with?

If I want a simple setup that gives me a quick start: would a combination of apache-spark and Mesos be the easiest? Or would apache-spark alone be better, because, e.g., Mesos would add complexity to the process given what it does, or because Mesos does so many things that it would be hard to deal with compared to Spark alone, etc.?
All I want is to be able to submit jobs and manage the cluster and jobs easily, nothing fancy for now. Is Spark, or Spark/Mesos, better, or something else?
The easiest way to start using Spark is to launch a standalone Spark cluster on EC2.
It is as easy as running a single script, spark-ec2, and it will do the rest for you.
The only case where a standalone cluster may not suit you is if you want to run more than a single Spark job at a time (at least that was the case with Spark 1.1).
For me personally, the standalone Spark cluster was good enough for a long time when I was running ad-hoc jobs, analyzing the company's logs on S3 and learning Spark, and then destroying the cluster.
If you want to run more than one Spark job at a time, I would go with Mesos.
An alternative would be to install CDH from Cloudera, which is relatively easy (they provide install scripts and instructions) and available for free.
CDH provides powerful tools to manage the cluster.
For running Spark, CDH uses YARN, and we have one issue or another from time to time with running Spark on YARN.
The main disadvantage to me: CDH provides its own build of Spark, so it is usually one minor version behind, which is a lot for such a rapidly progressing project as Spark.
So I would try Mesos for running Spark if I needed to run more than one job at a time.
Just for completeness, Hortonworks provides a downloadable HDP sandbox VM and also supports Spark on HDP. It is a good starting point as well.
Additionally, you can spin up your own cluster. I do this on my laptop, not for real big-data use cases but for learning with a moderate amount of data.
import subprocess as s
from time import sleep

# Path to spark-class on my machine; adjust to your Spark installation
cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"
master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl = "spark://BigData:7077"  # host name of the master machine

masterProcess = [cmd, master]
workerProcess = [cmd, worker, masterUrl]
noWorkers = 3

# Start the master, give it a moment to come up, then attach the workers
pMaster = s.Popen(masterProcess)
sleep(3)
pWorkers = []
for i in range(noWorkers):
    pw = s.Popen(workerProcess)
    pWorkers.append(pw)
The code above starts a master and 3 workers, which I can monitor using the web UI. This is just to get going if you need a quick local setup.
