What setup is needed to use the Spark Cassandra Connector with Spark Job Server - apache-spark

I am working with Spark and Cassandra, and in general things are straightforward and working as intended; in particular, the spark-shell and running .scala scripts to get results.
I'm now looking at using the Spark Job Server; I have the Job Server up and running and working as expected, both for the test items and for some initial, simple .scala programs I have developed.
However, I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server so it can be accessed via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around Cassandra, and the build (sbt compile; sbt package) fails to produce a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent of the spark-shell --packages switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt compile) I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message board advice, but with no real change. If that is the answer, what should be added to build.sbt, or to the base Job Server, to enable use of the Cassandra connector?
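For reference, the kind of build.sbt dependency block involved looks roughly like this; the coordinates and versions are assumptions based on the --packages switch above and would need to match the Scala and Spark versions the Job Server itself was built against:

// build.sbt - a rough sketch; versions are assumptions
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark itself is "provided" because the Job Server supplies it at runtime
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.1" % "provided",
  // Maven coordinates behind the --packages switch; this is what makes
  // import com.datastax.spark.connector._ compile
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1"
)

The connector classes would presumably also need to reach the Job Server at run time, e.g. via a fat jar built with sbt-assembly or by placing the connector jar on the Job Server's classpath, since the Job Server does not apply the --packages switch for you.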

I think you need spark-submit to do this. I am also working with Spark and Cassandra, but only for a month, so I've had to read a lot of information. I have compiled that info in a repository; maybe it can help you, though it is an alpha version, sorry about that.

Related

How to run spark sql queries against Hadoop without running a spark job

I develop Spark SQL queries to run against Hadoop. Today I must run a Spark job that invokes my query. Is there another way to do this? I find I spend too much time fixing side issues with running the jobs in Spark. Ideally I want to be able to compose and execute Spark SQL queries directly against Hadoop/HBase and bypass the Spark job altogether. This would permit much more rapid iteration when debugging or trying alternate queries.
Note that my queries are often 100 lines long or more, so working from the command line is challenging.
I have to do this from a Windows workstation.
The best thing you can use for HBase is Apache Phoenix. It provides an SQL interface.
As an example, on my last project I used NiFi with Phoenix to read and mutate HBase data. It worked great from the command line. I did discover a bug in my usage of it.
See https://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html. You can use an SQL file. In addition you can use Hue.
Never tried the following for Windows, but it is possible. See https://community.cloudera.com/t5/Community-Articles/How-to-connect-a-Windows-JDBC-client-to-Cluster-enabled-with/ta-p/247787
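As an illustrative sketch (not from the original answer), querying Phoenix over JDBC from Scala looks roughly like this; the ZooKeeper host and table name are placeholders, and the Phoenix client jar has to be on the classpath:

import java.sql.DriverManager

object PhoenixQuerySketch {
  def main(args: Array[String]): Unit = {
    // Older drivers may need explicit registration; newer ones auto-register.
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
    // URL format: jdbc:phoenix:<zookeeper quorum>; "zk-host" is a placeholder.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    try {
      val stmt = conn.createStatement()
      // MY_TABLE is a hypothetical Phoenix table mapped over HBase.
      val rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")
      while (rs.next()) println(rs.getLong(1))
    } finally {
      conn.close()
    }
  }
}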

Importing vs Installing spark

I am new to the spark world and to some extent coding.
This question might seem too basic but please clear my confusion.
I know that we have to import Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing an application, I can also run it and see the output using "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my applications?
Also, what is the need for it to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually you don't really need to install Spark on your machine if, for example, you are running it with Scala/Java: you can just import spark-core or any other dependencies into your project, and once you start your Spark job from the main class it will create a standalone Spark runner on your machine and run your job on it (local[*]). A minimal sketch of this is shown at the end of this answer.
There are many reasons for having Spark on your local machine.
One of them is running a Spark job with PySpark, which requires the Spark/Python/etc. libraries and a runner (local or remote).
Another reason can be if you want to run your job on-premises.
It might be easier to create a cluster in your local datacenter, appoint your machine as the master and the other machines connected to it as workers. (This solution might be a bit naive, but you asked for basics, so this might spark your curiosity to read more about the infrastructure design of data processing systems.)
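To make the "just import it" case concrete, here is a minimal sketch, assuming a build.sbt dependency such as "org.apache.spark" %% "spark-sql" % "2.2.1" (the version is an assumption). Running this from the IDE or sbt spins up an embedded local[*] Spark and needs no separate installation; for a jar you later spark-submit to a cluster you would typically mark the Spark dependencies as "provided" so the cluster's own installation supplies them.

import org.apache.spark.sql.SparkSession

// Runs entirely inside the JVM started by the IDE or sbt; no installed
// Spark distribution is required because spark-sql is on the classpath.
object LocalDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("local-demo")
      .master("local[*]")   // embedded local runner
      .getOrCreate()
    spark.range(10).show()  // small sanity check
    spark.stop()
  }
}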

Standalone Spark application in IntelliJ

I am trying to run a spark application (written in Scala) on a local server for debug. It seems that YARN is the default in the version of spark (2.2.1) that I have in the sbt build definitions, and according to an error I'm consistently getting, there is no spark/YARN server listening:
Client:920 - Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number
According to netstat, there is indeed no port 8032 on my local server in a listening state.
How would I typically accomplish running my spark application locally, in a way bypassing this problem? I only need the application to process a small amount of data for debug, and hence would like to be able to run locally, without reliance on specific SPARK/YARN installations and setups on the local server ― that would be an ideal debug setup.
Is that possible?
My sbt definitions already bring in all the necessary spark and spark.yarn jars. The problem also reproduces when running the same project in sbt, outside of IntelliJ.
You can add this property to the VM options in your debug configuration instead of hardcoding it inside the code:
-Dspark.master=local[2]
You could run the Spark application in local mode with .master("local[*]") if you have to test the pipeline with minuscule data.
Full code:
val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()
For spark-submit, use --master local[*] as one of the arguments. Refer to this: https://spark.apache.org/docs/latest/submitting-applications.html
Note: Do not hard-code the master in your codebase; always try to supply these variables from the command line. This makes the application reusable for local/test/mesos/kubernetes/yarn/whatever.
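For example, a sketch of the pattern that note suggests: with no .master(...) in the code, the master URL comes from spark-submit's --master flag or from a -Dspark.master VM option, as mentioned above.

import org.apache.spark.sql.SparkSession

// No .master(...) here: the master is supplied externally, e.g.
//   spark-submit --master local[*] ...   or   -Dspark.master=local[2]
val spark = SparkSession
  .builder
  .appName("myapp")
  .getOrCreate()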

why Livy or spark-jobserver instead of a simple web framework?

I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit seems to work fine:
import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        return user_output

cherrypy.quickstart(doStuff())
But googling around I see things like Livy and spark-jobserver. I read these projects' documentation and a couple of tutorials but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script with CherryPy or Flask or any other web framework. Is it about scalability? Context management? What am I missing here? If what I want is a simple RESTful API with not many users, are Livy or spark-jobserver worth the trouble? If so, why?
If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.
If you use Livy or spark-jobserver, you can programmatically upload the file and run the job. You can add additional applications that connect to the same cluster and upload a jar with the next job.
What's more, Livy and spark-jobserver allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
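As a rough illustration of the "programmatically" part, here is a sketch of submitting a batch job through Livy's REST /batches endpoint from Scala, using Java 11's built-in HTTP client; the Livy host, jar path, and class name are placeholders.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySubmitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: adjust the host, jar location and class name for your cluster.
    val payload =
      """{"file": "hdfs:///jobs/my-app.jar", "className": "com.example.MyJob"}"""

    val request = HttpRequest.newBuilder(URI.create("http://livy-host:8998/batches"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // Livy replies with JSON describing the created batch session (id, state, ...)
    println(response.body())
  }
}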
I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding the Spark context directly in your application:
Security, with the main focus on reducing the exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from getting access to your data to executing arbitrary code on your cluster, if the cluster is not correctly configured.
Stability. Spark is a complex framework and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from the application allows you to handle Spark issues gracefully, without full downtime of your application.
Responsiveness. The user-facing Spark API is mostly (in PySpark, exclusively) synchronous. Using an external service basically solves this problem for you.

Running spark streaming forever on production

I am developing a Spark Streaming application which basically reads data off Kafka and saves it periodically to HDFS.
I am running PySpark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client, what would you do in order to keep it running forever? You wouldn't just give this file and say "Run this on the terminal". It's too unprofessional.
What I want to do is submit the job to the cluster (or to local processors) and never have to see logs on the console, or use a solution like Linux screen to run it in the background (because it seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation here: spark-jobserver.
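To give a feel for what such a job looks like, here is a rough sketch against the classic spark.jobserver.SparkJob trait; the exact trait and validation types vary between job server versions, so treat the signatures as approximate rather than definitive.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// A minimal job the server can run once the jar has been uploaded over REST.
object WordCountJob extends SparkJob {

  // Cheap sanity check the server runs before scheduling the job.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // The actual work; the returned value ends up in the REST response.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val words = sc.parallelize(Seq("spark", "job", "server", "spark"))
    words.countByValue()
  }
}

Streaming jobs follow the same pattern but run against a streaming context, which the job server has to be configured to create; see the documentation linked above for details.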
