How to set up Spark cluster on Windows machines? - apache-spark

I am trying to set up a Spark cluster on Windows machines.
The way to go here is using the Standalone mode, right?
What are the concrete disadvantages of not using Mesos or YARN? And how much pain would it be to use either one of those? Does anyone have some experience here?

FYI, I got an answer in the user-group: https://groups.google.com/forum/#!topic/spark-users/SyBJhQXBqIs
The standalone mode is indeed the way to go. Mesos does not work on Windows, and YARN probably doesn't either.

Quick note: YARN should eventually work on Windows via the Hortonworks Data Platform (the 2.0 beta is on YARN, but it is Linux-only at this time). Another potential route is to run against Hadoop 1.1 (Hortonworks Data Platform for Windows 1.1), but your approach of running it in standalone mode is definitely the easiest way to get off the ground.
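For reference, here is a minimal sketch of bringing a standalone cluster up by hand on Windows. The sbin launch scripts assume a Unix shell, so the master and worker classes are typically started directly via spark-class; the host name and port below are placeholders.

:: On the machine that will act as the master (run from the Spark installation directory)
bin\spark-class org.apache.spark.deploy.master.Master --host MASTER_IP --port 7077

:: On each worker machine, pointing it at the master's URL
bin\spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_IP:7077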

Related

Is it OK to force the ApplicationMaster to be hosted on the same node every time (YARN)?

I am submitting Spark applications to my 3-node Hadoop cluster and would like to force the ApplicationMaster to always be hosted on the client machine (in both client and cluster mode).
Thanks for clarifying this.
No, it's not "OK".
One of the core principles behind Spark is resilience. If you force one node to host the ApplicationMaster, you introduce a bottleneck and a single point of failure. You are using YARN; there is no reason to pin the ApplicationMaster to a specific node.
If this is just for you and it works, go for it.
You aren't following a "normal" Spark strategy for a YARN cluster. Is that "OK"? If you have a good reason, yes, it's OK.
Would I use this in production? No.
Are there simpler, more common ways of running a cluster? Yes.
You are mixing the strategies of running Spark Standalone and Spark on YARN. These are two fundamentally different architectures. If you can make the two work together, that's fun, but you may hit some weird problems, and since this is a custom set of settings you may not find a lot of support to help you.
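That said, if you really do need to pin the ApplicationMaster, the mechanism YARN itself provides for this is node labels rather than a hybrid Standalone/YARN setup. A rough sketch, assuming a YARN node label (here called "am-node", a made-up name) has already been created and assigned to the target machine; your_app.py is just a placeholder application:

spark-submit \
  --master yarn \
  --conf spark.yarn.am.nodeLabelExpression=am-node \
  your_app.py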

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt I have found so far (though for Spark v2.0.0), which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. It is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via a REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
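To make that concrete, here is a rough sketch of installing the chart and submitting a job through the Livy REST endpoint it exposes. The repo URL, release name, Livy host, and example path are assumptions; check the chart's README for the real values.

# Add the repo and install the chart (names/URLs are assumptions; see the chart page for the actual repo)
helm repo add microsoft https://microsoft.github.io/charts/repo
helm install my-spark microsoft/spark

# Submit a batch job through the Livy REST API (replace LIVY_HOST with the Livy
# service address; the file path must be visible from inside the cluster)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "local:///opt/spark/examples/src/main/python/pi.py"}' \
  http://LIVY_HOST:8998/batches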

Is Apache Spark recommended to run on Windows?

I have a requirement to run Spark on Windows in a production environment. I would like advice on whether Apache Spark on Windows is recommended and, if not, I would like to understand the reasons why.

Running two versions of Apache Spark in cluster mode

I want to be able to run Spark 2.0 and Spark 1.6.1 in cluster mode on a single cluster so that they can share resources. What are the best practices for doing this? The reason is that I want to shield a certain set of applications that rely on 1.6.1 from code changes, while others move to Spark 2.0.
Basically the cluster could rely on dynamic allocation for Spark 2.0 but maybe not for 1.6.1 - this is flexible.
This is possible with Docker: you can run applications built against different Spark versions, since Docker runs each application in isolation.
Docker is an open platform for developing, shipping, and running applications. With Docker you can separate your applications from your infrastructure and treat your infrastructure like a managed application.
The industry is adopting Docker because it provides the flexibility to run different versions of an application side by side on the same hosts, among other benefits.
Mesos also allows you to run Docker containers using Marathon.
For more information, please refer to:
https://www.docker.com/
https://mesosphere.github.io/marathon/docs/native-docker.html
Hope this helps!
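As a rough illustration of the idea (the image names, master URL, and application paths below are placeholders, not real published images):

# Each job runs against the Spark version baked into its image, so 1.6.1 and 2.0
# applications can share the same hosts without conflicting installations.
docker run --rm -v "$(pwd)/apps:/apps" my-registry/spark:1.6.1 \
  spark-submit --master $MASTER_URL /apps/legacy_job.py

docker run --rm -v "$(pwd)/apps:/apps" my-registry/spark:2.0.0 \
  spark-submit --master $MASTER_URL /apps/new_job.py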

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do as little work as possible. They don't need to set up a cluster; they only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304MB unpacked. More than half of that, 175MB, is spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark code; you can't get rid of this unless you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar at 113MB. Removing that and zipping the package back up is harmless and already saves you a lot.
However, tools that spare them from working with the Spark package directly, such as spark-jobserver (I have never used it, and have never heard anyone speak very positively about its current state) or spark-kernel (it still needs your own code to interface with it, or, when used with a notebook (see below), it is limited compared to the alternatives), as suggested by Reactormonk, make it even easier for them.
A popular option along those lines is to set up access to a notebook. Since you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for working in Scala.
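For the IPython route, a minimal sketch (the paths and master URL are placeholders; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are the standard environment variables PySpark checks when choosing its driver front end):

# Point PySpark at the (possibly trimmed-down) Spark directory and at the cluster,
# and have it launch IPython, or a Jupyter notebook, as the driver front end.
export SPARK_HOME=~/spark-1.5.0-bin-hadoop2.6
export PYSPARK_DRIVER_PYTHON=ipython
# export PYSPARK_DRIVER_PYTHON=jupyter
# export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$SPARK_HOME/bin/pyspark --master spark://MASTER_IP:7077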
