Accessing Cassandra from Google Cloud Dataproc - apache-spark

I just set up a Spark cluster in Google Cloud using DataProc and I have a standalone installation of Cassandra running on a separate VM. I would like to install the Datastax spark-cassandra connector so I can connect to Cassandra from spark. How can I do this ?
The connector can be downloaded here:
https://github.com/datastax/spark-cassandra-connector
The instructions on building are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md
sbt is needed to build it.
Where can I find sbt for the DataProc installation ?
Would it be under $SPARK_HOME/bin ? Where is spark installed for DataProc ?

I'm going to follow up the really helpful comment #angus-davis made not too long ago.
Where can I find sbt for the DataProc installation ?
At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need to re-install sbt on your clusters, I highly recommend you create an init action to install sbt when you create a cluster. After some research, it looks like SBT is covered under a BSD-3 license, which means we can probably (no promise) include it in Cloud Dataproc clusters.
Would it be under $SPARK_HOME/bin ? Where is spark installed for DataProc ?
The answer to this is it depends on what you mean.
binaries - /usr/bin
config - /etc/spark/conf
spark_home - /usr/lib/spark
Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
I would like to install the Datastax spark-cassandra connector so I can connect to Cassandra from spark. How can I do this ?
The Stack Overflow answer Angus sent is probably the easiest way if it can be used as a Spark package. Based on what I can find, however, this is probably not an option. This means you will need to install sbt and manually install.

You can use cassandra along with the mentioned jar and connector from datastax. You can simply download the jar and pass it to dataproc cluster. You can find Google provided template, I contributed to, in this link [1]. This explains how you can use the template to connect to Cassandra using Dataproc.

Related

Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that Spark app (CLUSTER MODE) already have spark-hive (2.44) dependency (PROVIDED)
I compiled and assembled "custom" spark-hive jar that I want to use in Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!

Installing Spark/Zeppelin on Standalone node

I have a Cloudera cluster which is being managed by an admin team. However there is no Zeppelin installed in the cluster.
I would like to install Zeppelin on a separate node and connect with the Cloudera cluster?
Is it feasible to install zeppelin on a node which is not part of the cluster and submit spark jobs to it?
Any reference is really appreciated?
Thanks
Zeppelin is just another Spark client.
For example, on the machine that you want to use Zeppelin on, you should first make sure that spark shell and spark submit work as expected, then Zeppelin configurations become much easier
An easy way to manage that would be to have the admins use Cloudera Manager to install Spark (and Hive and Hadoop) client libraries into this standalone node, then I assume they give you SSH access, or you tell them how to install it

Spark Standalone Cluster :Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just packed with hadoop client with specified version and still you need to install hadoop if you want to use hdfs.
A2. Running with yarn means you're using resource manager of spark as yarn. (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications) So, when the case you don't need DFS, like when you're only running spark streaming applications, you still can install Hadoop but only run yarn processes to use its resource management functionality.

How to setup YARN with Spark in cluster mode

I need to setup spark cluster (1 Master and 2 slaves nodes) on centos7 along with resource manager as YARN. I am new to all this and still exploring. Can somebody share me detailed steps of setting up Spark with Yarn in cluster mode.
Afterwards i have to integrate Livy too(an open source REST interface for using Spark from anywhere).
Inputs are welcome.Thanks
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the page on the Hadoop Cluster Setup.
Then you can utilize the this documentation to learn about Spark on YARN.
Another method to quickly learn about Hadoop, YARN and Spark is to utilize Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using the similar setup in aws. AWS EMR is costly hence
we setup our own cluster using ec2 machines with the help of Hadoop Cookbook. The cookbook supports multiple distributions, however we choose HDP.
The setup included following.
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on manually installing can be found in HDP Documentation
You can see the part of that automation in here.

installing spark 1.4 on google cloud cluster

I set up a google compute cluster with Click to Deploy
I want to use spark 1.4 but I get spark 1.1.0
Anyone know if it is possible to set up a cluster with spark 1.4?
I also had issues with this. These are the steps I took:
Download a copy of GCE's bdutil from github https://github.com/GoogleCloudPlatform/bdutil
Download the spark version you want, in this case spark-1.4.1 from the spark website and store it into a google compute bucket that you control. Make sure it's a spark that supports the hadoop you'll also be deploying with bdutil
Edit the spark env file https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/spark/spark_env.sh
Change SPARK_HADOOP2_TARBALL_URI='gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz' to SPARK_HADOOP2_TARBALL_URI='gs://[YOUR SPARK PATH]' I'm assuming you want hadoop 2, if you want hadoop 1 make sure you change the right variable.
Once that's all done, from the modified bdutil, build your hadoop+spark cluster, you should have a modern version of spark after this
You'll have to make sure you execute the spark_env.sh with the -e command when executing bdutil, you'll need to also add the hadoop_2 env if you're installing hadoop2 which I was as well.
One option would be to try http://spark-packages.org/package/sigmoidanalytics/spark_gce , this deploys Spark 1.2.0 but you could edit the file to deploy 1.4.0.

Resources