Importing vs Installing Spark - apache-spark

I am new to the Spark world and, to some extent, to coding.
This question might seem too basic, but please clear up my confusion.
I know that we have to import Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing an application, I can also run it and see the output by hitting "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my applications?
Also, why does it need to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!

I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are, for example, writing in Scala/Java: you can just import spark-core or any other dependencies into your project, and once you start your Spark job from the main class it will create a standalone Spark runner on your machine and run the job there (with master local[*]).
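For example, here is a minimal sketch of such a self-contained job (the object name and input file are just placeholders): it only depends on the spark-core library pulled in through sbt and runs entirely in the JVM that launches it, because the master is set to local[*].

import org.apache.spark.{SparkConf, SparkContext}

// A self-contained Spark application: the Spark classes come from the
// spark-core dependency in the build, and setMaster("local[*]") makes the
// job run in-process on all local cores, so no Spark installation is needed.
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // placeholder input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}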
There are many reasons for having Spark installed on your local machine.
One of them is running Spark jobs with PySpark, which requires the Spark/Python/etc. libraries and a runner (local[*] or a remote master).
Another reason can be that you want to run your job on-premises.
It might be easier to create a cluster in your local datacenter, appoint your machine as the master, and have the other machines connected to it act as workers. (This solution might be a bit naive, but you asked for basics, so it might spark your curiosity to read more about the infrastructure design of data processing systems.)

Related

Spark Standalone Cluster :Configuring Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just bundled with the Hadoop client for the specified version; you still need to install Hadoop if you want to use HDFS.
A2. Running with YARN means using YARN as Spark's resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So, in the case where you don't need a DFS, for example when you're only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes to use its resource-management functionality.
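To make A1 a bit more concrete, here is a sketch (the namenode host, port and paths are made up) of what changes once your executors run on other machines: the input has to live on a file system they can all reach, which is what HDFS gives you.

import org.apache.spark.sql.SparkSession

// Sketch only; "namenode" and the paths below are placeholders.
object FileSystemsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FileSystemsDemo").getOrCreate()

    // Fine on local[*], where the driver and executors share one local disk:
    val localLines = spark.read.textFile("file:///data/input.txt")

    // On a standalone cluster the executors are on other machines, so the
    // data must be on a shared file system such as HDFS:
    val sharedLines = spark.read.textFile("hdfs://namenode:8020/data/input.txt")

    println(s"local: ${localLines.count()}, shared: ${sharedLines.count()}")
    spark.stop()
  }
}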

What setup is needed to use the Spark Cassandra Connector with Spark Job Server

I am working with Spark and Cassandra, and in general things are straightforward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at utilisation of the Spark Job Server; I have the Job Server up and running and working as expected for both the test items, as well as some initial, simple .scala programs I developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent to the spark shell package switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) on the Spark Job Server so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently, when I attempt to build (sbt compile), I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message-board advice, but with no real change; if that is the answer, what should be added to the base Job Server setup to enable use of the Cassandra connector?
I think that you need spark-submit to do this. I have also been working with Spark and Cassandra, but only for a month, so I've needed to read a lot of information. I have compiled this info in a repository; maybe it could help you, though it is an alpha version, sorry about that.
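For what it's worth, the sbt equivalent of that --packages switch is a plain library dependency in build.sbt. A minimal sketch, assuming Scala 2.11 and a Spark 2.0.x build (adjust the versions to match the Spark your Job Server runs against):

// build.sbt (sketch; align the versions with the Job Server's Spark build)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "2.0.1" % "provided",
  "org.apache.spark"   %% "spark-sql"                 % "2.0.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1"
)

With that in place, sbt compile / sbt package should resolve import com.datastax.spark.connector._; the connector classes still have to be visible to the Job Server at run time, for example via an assembly (fat) jar.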

Can we submit a spark job from spark itself (may be from another spark program)?

Can anyone clarify a question that was asked in one of my interviews? It could be that the question itself is wrong, I am not sure. However, I have searched everywhere and could not find anything related to it. The question is:
Can we run a spark job from another spark program?
Yup, you are right, it doesn't make much sense. We can run our application from our driver program, but that is the same as running it from any application using a Spark launcher (https://github.com/phalodi/Spark-launcher). The exception is that we can't launch an application inside RDD closures, because they run on worker nodes, so it will not work.
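If a programmatic launch is what the interviewer meant, Spark itself ships an org.apache.spark.launcher.SparkLauncher API, which launchers like the one linked above wrap. A rough sketch; the jar path, class name and master URL are placeholders:

import org.apache.spark.launcher.SparkLauncher

// Launches a second Spark application as a child process from the driver
// (or from any plain JVM program). Paths and names below are placeholders.
object LaunchOtherJob {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/other-app.jar")
      .setMainClass("com.example.OtherApp")
      .setMaster("spark://master:7077")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()        // returns a SparkAppHandle to monitor state

    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Finished with state: ${handle.getState}")
  }
}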
Can we run a spark job from another spark program?
I'd focus on another part of the question since the following holds for any Spark program:
Can we run a Spark job from any Spark program?
That means that either there was a follow-up question or some introduction to the one you were asked.
If I were you and heard the question, I'd say "Yes, indeed!"
A Spark application is in other words a launcher of Spark jobs and the only reason to have a Spark application is to run Spark jobs sequentially or in parallel.
Any Spark application does this and nothing more.
A Spark application is a Scala application (when Scala is the programming language), and as such it is possible to run a Spark program from another Spark program (whether it makes sense in general I put aside, as there could be conflicts with multiple SparkContexts in a single JVM).
Given that the other Spark application is a separate executable jar, you could launch it using the Process API in Scala (as you would any other application):
scala.sys.process: "This package handles the execution of external processes."
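A minimal sketch of that approach, assuming spark-submit is on the PATH and with placeholder jar, class and master values:

import scala.sys.process._

// Runs another Spark application as an external process via spark-submit.
// All paths and names are placeholders.
object RunAsProcess {
  def main(args: Array[String]): Unit = {
    val exitCode = Seq(
      "spark-submit",
      "--class", "com.example.OtherApp",
      "--master", "spark://master:7077",
      "/path/to/other-app.jar"
    ).!                                    // blocks until the child process exits

    println(s"spark-submit exited with code $exitCode")
  }
}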

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do as little work as possible. They don't need to set up a cluster; they only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304MB unpacked. More than half of that, 175MB, is made up of spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark stuff. You can't get rid of this, unless maybe you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, at 113MB. Removing this and zipping the package back up is harmless and already saves you a lot.
However, using tools so that they don't have to work with the Spark package directly, as suggested by Reactormonk, makes it even easier for them: for example spark-jobserver (never used it, and I've never heard anybody very positive about its current state) or spark-kernel (still needs your own code to interface with it, or, when used with a notebook (see below), is limited compared to the alternatives).
A popular thing to do in that sense is to set up access to a notebook. As you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.

Is spark or spark with mesos the easiest to start with?

If I want a simple setup that would give me a quick start: would a combination of Apache Spark and Mesos be the easiest? Or would Apache Spark alone be better, because, say, Mesos would add complexity to the process given what it does, or because Mesos does so many things that it would be harder than dealing with Spark alone, etc.?
All I want is to be able to submit jobs and manage the cluster and jobs easily, nothing fancy for now. Is Spark or Spark/Mesos better, or something else?
The easiest way to start using Spark is to start a standalone Spark cluster on EC2.
It is as easy as running a single script, spark-ec2, which will do the rest for you.
The only case where a standalone cluster may not suit you is if you want to run more than a single Spark job at a time (at least that was the case with Spark 1.1).
For me personally, the standalone Spark cluster was good enough for a long time when I was running ad-hoc jobs: analyzing the company's logs on S3 and learning Spark, and then destroying the cluster.
If you want to run more than one Spark job at a time, I would go with Mesos.
An alternative would be to install CDH from Cloudera, which is relatively easy (they provide install scripts and instructions) and available for free.
CDH would provide you with powerful tools to manage the cluster.
When using CDH for running Spark, they use YARN, and we have one issue or another from time to time with running Spark on YARN.
The main disadvantage to me is that CDH provides its own build of Spark, so it is usually one minor version behind, which is a lot for such a rapidly progressing project as Spark.
So I would try Mesos for running Spark if I need to run more than one job at a time.
Just for completeness, Hortonworks provides a downloadable HDP sandbox VM and also supports Spark on HDP. It is a good starting point as well.
Additionally, you can spin up your own cluster. I do this on my laptop, not for real big data use cases but for learning with a moderate amount of data.
import subprocess as s
from time import sleep

# Path to the spark-class launcher script of the local Spark installation
cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"
master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl = "spark://BigData:7077"

masterProcess = [cmd, master]
workerProcess = [cmd, worker, masterUrl]
noWorkers = 3

# Start the master first, give it a few seconds to come up,
# then start the workers and point them at the master URL.
pMaster = s.Popen(masterProcess)
sleep(3)
pWorkers = []
for i in range(noWorkers):
    pw = s.Popen(workerProcess)
    pWorkers.append(pw)
The code above starts the master and 3 workers, which I can monitor using the UI. It is just enough to get going when you need a quick local setup.
