I am using a notebook environment to try out some commands against Spark. Can someone explain how the overall flow works when we run a cell from the notebook? In a notebook environment, which component acts as the driver?
Also, can we call the code snippets we run from a notebook a "Spark Application", or is a code snippet a "Spark Application" only when we submit it to Spark with spark-submit? Basically, I am trying to find out what qualifies as a "Spark Application".
A notebook environment like Zeppelin creates a SparkContext the first time a cell is executed. Once the SparkContext exists, all further cell executions are submitted to that same SparkContext.
Where the driver program starts depends on how the notebook connects to Spark. In standalone (client) mode, the driver program is started on the host where the notebook is running. In cluster mode, where a resource manager is managing your Spark cluster, it is started on one of the nodes in the cluster.
You can consider each running SparkContext on the cluster as a separate application. Notebooks like Zeppelin can share the same SparkContext for all cells across all notebooks, or you can configure one to be created per notebook.
Internally, most notebooks simply call spark-submit.
Related
I need help because I don't know whether Jupyter notebook kernels can be used with a Spark cluster.
With my local Spark I use the kernel below and have no problems.
I am using this Kernel for PySpark : https://github.com/Anchormen/pyspark-jupyter-kernels
I am using a Standalone Spark cluster with three nodes, without YARN.
Best regards.
You can connect to your standalone Spark cluster from the Python kernel by using the master's IP address.
import pyspark
sc = pyspark.SparkContext(master='spark://<public-ip>:7077', appName='<your_app_name>')
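To make the connection step above easier to reuse across notebook cells, here is a minimal sketch. The helper names standalone_master_url and connect are hypothetical, 7077 is assumed to be the port the standalone master listens on, and pyspark is imported lazily so that the URL helper works even where Spark is not installed.

```python
def standalone_master_url(host, port=7077):
    """Build a Spark Standalone master URL such as spark://10.0.0.5:7077."""
    if not host:
        raise ValueError("host must not be empty")
    return "spark://{}:{}".format(host, port)


def connect(host, app_name, port=7077):
    """Create (or reuse) a SparkContext bound to the standalone master.

    Requires pyspark to be installed in the notebook kernel's environment.
    """
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = (SparkConf()
            .setMaster(standalone_master_url(host, port))
            .setAppName(app_name))
    # getOrCreate returns the context that is already running in this kernel,
    # if there is one, so re-running the cell does not try to start a second one.
    return SparkContext.getOrCreate(conf)
```

In a notebook cell this would be called as sc = connect('<public-ip>', '<your_app_name>'), after which the application should be listed on the master's web UI.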
References
How to connect Jupyter Notebook to remote spark clusters
Set up an Apache Spark cluster and integrate with Jupyter Notebook
Deploy Application from Jupyter Lab to a Spark Standalone Cluster
I am new to Spark. I have developed a pyspark script through the Jupyter notebook interactive UI installed in our HDInsight cluster. As of now I have run the code from Jupyter itself, but now I have to automate the script. I tried to use Azure Data Factory but could not find a way to run the pyspark script from there. I also tried to use Oozie but could not figure out how to use it. I have tried saving the notebook, reopening it, and running all cells, but that is still a manual process.
Please help me schedule a pyspark job in Microsoft Azure.
I found a discussion about best practices for running scheduled, crontab-like jobs with Apache Spark for pyspark, which you may already have reviewed.
Without Oozie, a simple idea is to save the Jupyter notebook locally as a Python script and write a shell script that submits it to HDInsight Spark via Livy, with Linux crontab as the scheduler. For reference, see the following.
IPython Notebook save location
How can I configure pyspark on livy to use anaconda python instead of the default one
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
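As a rough sketch of the Livy idea above, the following builds the HTTP request that a cron-triggered script could send to Livy's batches endpoint. The function name build_livy_batch_request, the cluster URL, the script path, and the credentials are all hypothetical placeholders; the /livy/batches path and the X-Requested-By header follow the HDInsight Livy documentation as I understand it, so check them against your cluster before relying on this.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_url, script_path, user, password, args=None):
    """Build (but do not send) a POST request that asks Livy to run a batch job.

    script_path must point to a .py file the cluster can read, for example a
    file in the cluster's default storage account.
    """
    payload = {"file": script_path}
    if args:
        payload["args"] = list(args)

    # HTTP basic authentication with the cluster login.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    return urllib.request.Request(
        url=cluster_url.rstrip("/") + "/livy/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": user,
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

A scheduled script would then call urllib.request.urlopen(request) on the returned object and inspect the JSON response for the batch id and state.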
Hope it helps.
EMR has native support for Spark. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, basically an automated spark-submit after cluster startup.
I've been wondering how to specify the master node in the SparkConf within the application when starting the EMR cluster and submitting the jar file through the designated EMR step.
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then used the information to build into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.
Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on a random cluster node rather than on the EMR master.
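To make the advice above concrete, here is a small sketch that assembles the spark-submit command line for an EMR step without any master URL. The function name emr_spark_submit_args, the jar location, and the class name are hypothetical; only the --deploy-mode and --class flags come from spark-submit itself. The master is deliberately left out on the assumption that EMR's spark-defaults.conf already sets it to yarn, as described above.

```python
def emr_spark_submit_args(app_jar, main_class, deploy_mode="cluster", app_args=()):
    """Return the spark-submit argument list for a jar-based application.

    No --master is passed: the cluster's own configuration is expected to
    supply it, so the application never needs to know the master's address.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ] + list(app_args)
```

With this approach the application code would create its SparkConf with setAppName only and leave setMaster out entirely.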
The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running the commands through pyspark they also run on the "cluster", right? They do not run on the master node only, right?
There is no practical difference between the two. Unless configured otherwise, both execute code in local mode. If a master is configured (either by the --master command line parameter or the spark.master configuration), the corresponding cluster is used to execute the program.
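The rule above can be pictured with a simplified model of how the effective master is chosen. This is only an illustration, not Spark's actual code: the function name resolve_master and its parameters are hypothetical, and the order shown (value set in application code, then the --master flag, then spark-defaults.conf, then a local fallback) is my reading of the documented precedence.

```python
def resolve_master(code_conf=None, cli_master=None, defaults_conf=None):
    """Pick the master URL the way the precedence rules are commonly described.

    code_conf      -- properties set on SparkConf inside the application
    cli_master     -- value of --master passed to pyspark or spark-submit
    defaults_conf  -- properties read from spark-defaults.conf
    """
    candidates = (
        (code_conf or {}).get("spark.master"),
        cli_master,
        (defaults_conf or {}).get("spark.master"),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    # Nothing configured anywhere: run locally, using all available cores.
    return "local[*]"
```

The same function applies to both tools, which is the point of the answer: pyspark and spark-submit differ in interactivity, not in where the work can run.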
If you are using EMR, there are three options:
1. using pyspark (or spark-shell)
2. using spark-submit without --master and --deploy-mode
3. using spark-submit with --master and --deploy-mode (set to cluster)
Although all three run the application on the Spark cluster, they differ in how the driver program works.
In the 1st and 2nd, the driver runs in client mode, whereas in the 3rd the driver also runs in the cluster.
In the 1st and 2nd, you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
..when I am running the commands through pyspark they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but the driver runs on a different machine from the one where you call spark-submit.
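To illustrate the split between driver-side Python and cluster-side operations described above, here is a short sketch. The function name sum_of_squares is hypothetical; it expects an existing SparkContext (the sc object that pyspark and most notebooks provide) and uses only the parallelize, map, and collect calls.

```python
def sum_of_squares(sc, n):
    """Square the numbers 0..n-1 on the cluster and add them up on the driver."""
    # Plain Python: this list is built in the driver process.
    numbers = list(range(n))

    # Hand the data to Spark so it can be split into partitions.
    rdd = sc.parallelize(numbers)

    # Transformation: only registered here, nothing runs yet. When it does run,
    # the lambda is executed by the executors, not by the driver.
    squared = rdd.map(lambda x: x * x)

    # Action: triggers the job and brings the results back to the driver.
    results = squared.collect()

    # Plain Python again: the final sum is computed in the driver process.
    return sum(results)
```

The function behaves the same whether it is called from the pyspark shell or from a script launched with spark-submit; only the location of the driver changes.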
PySpark differs from Scala and Java Spark in a few ways. For example, Spark Standalone does not support cluster deploy mode for Python applications, so to run a Python driver on the cluster you need a resource manager such as YARN.
If you are running Python Spark on a local machine, you can use pyspark. On a cluster, use spark-submit.
If your Python Spark job has any dependencies, package them in a zip file for submission (spark-submit accepts it via --py-files).
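For the dependency zip mentioned above, a minimal sketch using only the standard library might look like this. The function name zip_python_package and the example paths are hypothetical; the archive keeps the package's top-level folder name so that imports resolve the same way once the zip is on the Python path.

```python
import os
import zipfile


def zip_python_package(package_dir, zip_path):
    """Write every .py file under package_dir into zip_path and return zip_path."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent folder, so the
                    # archive contains "mypkg/module.py" rather than "module.py".
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

The resulting file can then be shipped with the job, for example spark-submit --py-files deps.zip main.py, where deps.zip and main.py are placeholder names.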
I just added the example project to my Zeppelin Notebook from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data"). The problem I have now is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the application no longer returns any results when I run the same SQL statement. Am I doing something wrong? I have already restarted the Zeppelin interpreter, the whole Zeppelin daemon, and the Spark cluster, but nothing solved the issue. Can someone help?
I use the following installation:
Spark 1.5.1 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
EDIT
Also the following installation won't work for me:
Spark 1.5.0 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
Screenshot: local setting (works!)
Screenshot: cluster setting (won't work!)
The job seems to run correctly in cluster mode:
I got it after 2 days of trying around!
The difference between the local Zeppelin Spark interpreter and the Spark cluster seems to be that the local one includes the Twitter Utils needed to run the Twitter Streaming example, while the Spark cluster does not have this library by default.
Therefore you have to add the dependency manually in the Zeppelin Notebook before starting the application with the Spark cluster as master. So the first paragraph of the Notebook must be:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")
If an error occurs when running this paragraph, try restarting the Zeppelin server via ./bin/zeppelin-daemon.sh stop (and then start).