Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode? - apache-spark

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt so far (but for Spark v2.0.0) that I have found, but which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
SO thread: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...

see https://hub.helm.sh/charts/microsoft/spark this is based off https://github.com/helm/charts/tree/master/stable/spark and uses spark 2.4.6 with hadoop 3.1. You can check the source for this chat at https://github.com/dbanda/charts. The Livy service makes it easy to submit spark jobs via REST API. You can also submit jobs using Zeppelin. We made this chart as alternative way to run spark on K8s without using the spark-submit k8s mode. I hope it helps.

Related

How can we connect to remote spark cluster via jupyterhub?

First of all my agenda is to be able to use spark codes inside jupyterhub. In other words I want to connect a remote spark cluster to jupyterhub. After searching about it I came up with two solutions:1)Livy and 2)spark magic. I have tried Livy but since it doesn't support spark version 3 I have put that one aside, also another way we considered was via spark magic but we couldn't install or even find a good documentation about it.
what came into our minds was somehow merge the two image of spark and jupyterhub together to gain what we need.
Does anyone have any idea how it can be done or a better suggestion?
Even a good documentation about spark magic that we can use would be wonderful.
thank you for your help.

Importing vs Installing spark

I am new to the spark world and to some extent coding.
This question might seem too basic but please clear my confusion.
I know that we have to import spark libraries to write spark application. I use intellij and sbt.
After writing the application , I can also run them and see the output on "run".
My question is, why should I install spark separately on my machine(local) if I can just import them as libraries and run them.
Also what is the need for it to be installed on the cluster since we can just submit the jar file and jvm is already present in all the machines of the clustor
Thank you for the help!
I understand your confusion.
Actually you don't really need to install spark on your machine if you are for example running it on scala/java and you can just import spark-core or any other dependancies into your project and once you start your spark job on mainClass it will create an standalone spark runner on your machine and run your job on if (local[*]).
There are many reasons for having spark on your local machine.
One of them is for running spark job on pyspark which requires spark/python/etc libraries and a runner(local[] or remote[]).
Another reason can be if you want to run your job on-premise.
It might be easier to create cluster on your local datacenter and maybe appoint your machine as master and the other machines connected to your master as worker.(this solution might be abit naive but you asked for basics so this might spark your curiosity to read more about infrastructure design of a data processing system more)

HDInsight SparkHistory on Azure shows no applications

I have created a Spark HDInsight Cluster on Azure. The cluster was used to run different jobs (either Spark or Hive).
Until a month ago, the history of the jobs could be seen in the Spark History Server dashboard. It seems that following the update that introduced Spark 1.6.0, this dashboard is no longer showing any applications.
I have also tried to bypass this issue by executing the PowerShell cmdlet for get-azurehdinsightjob as sugested here. The output is again an empty list of applications.
I would appreciate any help as this dashboard used to work and now all my experiments are stalled.
I managed to solve the issue by deleting everything inside wasb:///hdp/spark-events. Maybe the issue was related to the size of the folder, as no other log files could be appended.
All the following jobs are now appearing successfully in the Spark History Server dashboard.

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do the least amount of work as possible. They don't need to setup a cluster. They only need to submit jobs to it. Having them downloading all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304MB unpacked. More than half, 175MB is made up of spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark stuff. You can't get rid of this unless you want to compile your own package maybe. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, 113MB. Removing this and zipping back up is harmless and saves you a lot already.
However, using some tools such that they don't have to work with the spark package directly, like spark-jobserver (never used but never heard somebody very positive about the current state) or spark-kernel (needs your own code still to interface with it, or when used with notebook (see below) limited compared to alternatives) as suggested by Reactormonk makes it even easier for them.
A popular thing to do in that sense is set up access to a notebook. As you're using Python, IPython with a PySpark profile would be most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.

installing spark 1.4 on google cloud cluster

I set up a google compute cluster with Click to Deploy
I want to use spark 1.4 but I get spark 1.1.0
Anyone know if it is possible to set up a cluster with spark 1.4?
I also had issues with this. These are the steps I took:
Download a copy of GCE's bdutil from github https://github.com/GoogleCloudPlatform/bdutil
Download the spark version you want, in this case spark-1.4.1 from the spark website and store it into a google compute bucket that you control. Make sure it's a spark that supports the hadoop you'll also be deploying with bdutil
Edit the spark env file https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/spark/spark_env.sh
Change SPARK_HADOOP2_TARBALL_URI='gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz' to SPARK_HADOOP2_TARBALL_URI='gs://[YOUR SPARK PATH]' I'm assuming you want hadoop 2, if you want hadoop 1 make sure you change the right variable.
Once that's all done, from the modified bdutil, build your hadoop+spark cluster, you should have a modern version of spark after this
You'll have to make sure you execute the spark_env.sh with the -e command when executing bdutil, you'll need to also add the hadoop_2 env if you're installing hadoop2 which I was as well.
One option would be to try http://spark-packages.org/package/sigmoidanalytics/spark_gce , this deploys Spark 1.2.0 but you could edit the file to deploy 1.4.0.

Resources