Installing Spark 1.4 on a Google Cloud cluster

I set up a Google Compute Engine cluster with Click to Deploy.
I want to use Spark 1.4, but I get Spark 1.1.0.
Does anyone know if it is possible to set up a cluster with Spark 1.4?

I also had issues with this. These are the steps I took:
Download a copy of GCE's bdutil from GitHub: https://github.com/GoogleCloudPlatform/bdutil
Download the Spark version you want, in this case spark-1.4.1, from the Spark website and store it in a Google Cloud Storage bucket that you control. Make sure it is a Spark build that supports the Hadoop version you'll also be deploying with bdutil.
Edit the Spark env file https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/spark/spark_env.sh
Change SPARK_HADOOP2_TARBALL_URI='gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz' to SPARK_HADOOP2_TARBALL_URI='gs://[YOUR SPARK PATH]'. I'm assuming you want Hadoop 2; if you want Hadoop 1, make sure you change the right variable.
Once that's all done, build your Hadoop + Spark cluster from the modified bdutil; you should have a modern version of Spark after this.
Make sure you pass spark_env.sh with the -e flag when running bdutil, and also add the hadoop2 env file if you're installing Hadoop 2, which I was as well.
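Put together, a rough sketch of those steps looks like the following; the bucket name, file names, and the exact bdutil invocation are assumptions, so check ./bdutil --help against your copy:

# Stage the Spark tarball you want in a bucket you control.
gsutil cp spark-1.4.1-bin-hadoop2.6.tgz gs://my-spark-bucket/

# In extensions/spark/spark_env.sh, point the Spark extension at it:
#   SPARK_HADOOP2_TARBALL_URI='gs://my-spark-bucket/spark-1.4.1-bin-hadoop2.6.tgz'

# Deploy a Hadoop 2 + Spark cluster with both env files enabled.
./bdutil -e hadoop2_env.sh,extensions/spark/spark_env.sh deploy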

One option would be to try http://spark-packages.org/package/sigmoidanalytics/spark_gce ; it deploys Spark 1.2.0, but you could edit the file to deploy 1.4.0.

Related

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt I have found so far (but for Spark v2.0.0), which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. It is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via a REST API, and you can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
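For reference, a rough, unverified sketch of installing the chart and submitting a batch through Livy follows; the Helm repo URL, release name, and example jar path are assumptions, so treat https://github.com/dbanda/charts as the authoritative instructions.

# Add the repo that hosts the chart and install it (Helm 3 syntax).
helm repo add microsoft https://microsoft.github.io/charts/repo   # assumed repo URL
helm repo update
helm install my-spark microsoft/spark

# Submit a batch through the bundled Livy REST API (Livy's default port is
# 8998; port-forward or expose the Livy service first).
curl -s -X POST http://localhost:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{"file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar",
       "className": "org.apache.spark.examples.SparkPi"}'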

Installing Spark/Zeppelin on Standalone node

I have a Cloudera cluster which is being managed by an admin team. However there is no Zeppelin installed in the cluster.
I would like to install Zeppelin on a separate node and connect it to the Cloudera cluster.
Is it feasible to install Zeppelin on a node that is not part of the cluster and submit Spark jobs to the cluster from it?
Any reference is really appreciated.
Thanks
Zeppelin is just another Spark client.
For example, on the machine where you want to run Zeppelin, you should first make sure that spark-shell and spark-submit work as expected; once they do, configuring Zeppelin becomes much easier.
An easy way to manage that would be to have the admins use Cloudera Manager to install the Spark (and Hive and Hadoop) client libraries and configuration onto this standalone node; then I assume they either give you SSH access, or you tell them how to install Zeppelin.
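As a hedged sketch of what that looks like in practice (the parcel path, examples jar location, and config paths below are assumptions for a typical Cloudera edge node; substitute whatever the admins actually deploy):

# 1. Sanity-check the Spark client on this node first (adjust the examples
#    jar path to your Spark version).
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 10

# 2. Then point Zeppelin at the same client install in conf/zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark   # assumed parcel path
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER=yarn-client
# (newer Zeppelin versions set the master in the Spark interpreter settings instead)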

Spark Standalone Cluster: Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just bundled with the Hadoop client libraries for the specified version; you still need to install Hadoop itself if you want to use HDFS.
A2. Running with YARN means using YARN as Spark's resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So in cases where you don't need a DFS, for example when you're only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes to use its resource-management functionality.
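A minimal sketch of A1/A2, assuming a separately installed Hadoop; hostnames, ports, and file names are placeholders:

# A1: with Hadoop installed, start HDFS and put your data in it...
"$HADOOP_HOME"/sbin/start-dfs.sh
hdfs dfs -mkdir -p /data
hdfs dfs -put events.csv /data/

# ...then have every node in the standalone cluster read/write through
# hdfs:// URIs instead of local paths:
spark-submit --master spark://master-host:7077 my_job.py \
  hdfs://namenode-host:9000/data/events.csv

# A2: to let YARN manage resources instead of the standalone master, submit
# with --master yarn (requires HADOOP_CONF_DIR to point at your Hadoop conf).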

If I already have Hadoop installed, should I download Apache Spark WITH Hadoop or WITHOUT Hadoop?

I already have Hadoop 3.0.0 installed. Should I now install the with-hadoop or without-hadoop version of Apache Spark from this page?
I am following this guide to get started with Apache Spark.
It says
Download the latest version of Apache Spark (Pre-built according to your Hadoop version) from this link:...
But I am confused. If I already have an instance of Hadoop running in my machine, and then I download, install and run Apache-Spark-WITH-Hadoop, won't it start another additional instance of Hadoop?
First off, Spark does not yet support Hadoop 3, as far as I know. You'll notice this because there is no "your Hadoop version" option for Hadoop 3 on the download page.
You can try setting HADOOP_CONF_DIR and HADOOP_HOME in your spark-env.sh, though, regardless of which you download.
You should always download the version without Hadoop if you already have it.
won't it start another additional instance of Hadoop?
No. You still would need to explicitly configure and start that version of Hadoop.
That with-Hadoop Spark option is already configured to use its bundled Hadoop libraries, I believe.
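For the "without Hadoop" builds, Spark documents a hook for pointing at an existing Hadoop installation via SPARK_DIST_CLASSPATH; a sketch (paths are assumptions, adjust to wherever Hadoop 3.0.0 actually lives):

# conf/spark-env.sh on the Spark machine
export HADOOP_HOME=/opt/hadoop-3.0.0
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# The "without Hadoop" builds ship no Hadoop jars, so hand Spark the
# classpath of the existing installation instead.
export SPARK_DIST_CLASSPATH="$("$HADOOP_HOME"/bin/hadoop classpath)"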
This is in addition to the answer by @cricket_007.
If you have Hadoop installed, do not download Spark with Hadoop. However, since your Hadoop version (3.0.0) is not yet supported by any Spark release, in this case you will need to download the one with Hadoop, and you will then have to configure the bundled Hadoop version on your machine for Spark to run on. This means that all of the data in your Hadoop 3 installation will be LOST, so if you need that data, please take a backup before beginning your downgrade/re-configuration. I do not think you will be able to host two instances of Hadoop on the same system because of conflicting environment variables.

Accessing Cassandra from Google Cloud Dataproc

I just set up a Spark cluster in Google Cloud using DataProc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the Datastax spark-cassandra connector so I can connect to Cassandra from Spark. How can I do this?
The connector can be downloaded here:
https://github.com/datastax/spark-cassandra-connector
The instructions on building are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md
sbt is needed to build it.
Where can I find sbt in the DataProc installation?
Would it be under $SPARK_HOME/bin? Where is Spark installed for DataProc?
I'm going to follow up on the really helpful comment @angus-davis made not too long ago.
Where can I find sbt in the DataProc installation?
At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need sbt on your clusters, I highly recommend you create an initialization action that installs sbt when you create a cluster. After some research, it looks like sbt is covered under a BSD-3 license, which means we can probably (no promises) include it in Cloud Dataproc clusters.
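For illustration, an untested sketch of such an initialization action follows; the sbt version and download URL are assumptions to verify against the sbt releases page. Stage the script in a GCS bucket and pass it via --initialization-actions when you create the cluster.

#!/usr/bin/env bash
# install-sbt.sh -- untested sketch of a Dataproc initialization action.
set -euxo pipefail

SBT_VERSION=1.3.13   # assumed version; check the sbt releases page
curl -fsSL -o /tmp/sbt.tgz \
  "https://github.com/sbt/sbt/releases/download/v${SBT_VERSION}/sbt-${SBT_VERSION}.tgz"
tar -xzf /tmp/sbt.tgz -C /opt          # unpacks to /opt/sbt
ln -sf /opt/sbt/bin/sbt /usr/local/bin/sbt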
Would it be under $SPARK_HOME/bin? Where is Spark installed for DataProc?
The answer to this depends on what you mean:
binaries - /usr/bin
config - /etc/spark/conf
SPARK_HOME - /usr/lib/spark
Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
I would like to install the Datastax spark-cassandra connector so I can connect to Cassandra from Spark. How can I do this?
The Stack Overflow answer Angus sent is probably the easiest way if the connector can be used as a Spark package. Based on what I can find, however, this is probably not an option, which means you will need to install sbt and build and install the connector manually.
You can use Cassandra along with the mentioned jar and the connector from Datastax. You can simply download the jar and pass it to the Dataproc cluster. You can find a Google-provided template, which I contributed to, at this link [1]. It explains how you can use the template to connect to Cassandra using Dataproc.
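As a hedged example of passing the connector to a Dataproc job (the cluster name, Cassandra host, job class/jar, and connector version are placeholders; pick the connector release that matches your Spark and Scala versions):

gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --class=com.example.MyCassandraJob \
  --jars=gs://my-bucket/my-cassandra-job.jar \
  --properties=spark.jars.packages=com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,spark.cassandra.connection.host=10.128.0.5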
