Spark shell for Databricks

Notebooks are nice, but a REPL is sometimes more useful. Is there some way to run a spark-shell that executes on Databricks? Something like:
spark-shell --master https://adb-5022.2.azuredatabricks.net/
I looked through available tools related to Databricks (databricks connect, dbx, ...), but it seems there's no such functionality.

Databricks Connect is the tool you need if you want to execute code from your local machine on a Databricks cluster. As with spark-shell, the driver runs on your local machine while the executors are remote. The databricks-connect package installs a modified distribution of Apache Spark, so you can use spark-shell, pyspark, spark-submit, etc. - just make sure that the directory containing those scripts is on your PATH.
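For example, something like this should work once databricks-connect is installed and configured for your workspace (a minimal sketch, not an official Databricks example; the cluster and workspace details come from the databricks-connect configuration, not from the code):
# Assumes `pip install databricks-connect` and `databricks-connect configure` have already
# been run, so the bundled pyspark distribution points at the remote cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # driver runs locally, executors run on Databricks

# This count is computed by the Databricks cluster's executors.
print(spark.range(1000).count())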
P.S. But I really don't understand why notebooks don't work for you - spark-shell doesn't have any features that they lack.

Related

Do we need to install Spark on YARN to read data from HDFS into PySpark?

I have a Hadoop 3.1.1 multi-node cluster and I want to use PySpark to read files from HDFS, run ETL operations on them, and then load the results into target MySQL databases.
Given below is the ask:
Can I install Spark in standalone mode?
Do I need to install Spark on YARN first?
If not, how can I install Spark separately?
You can use any mode for communicating with HDFS and MySQL, including Kubernetes. Or you can just use --master="local[*]" and not need a cluster scheduler at all, which is useful, for example, from a Jupyter Notebook.
YARN would be recommended as you already have HDFS, and therefore the scripts to start YARN processes as well.
You don't really "install Spark on YARN". Applications from clients get submitted to the YARN cluster, and the archive referenced by the spark.yarn.archive HDFS path gets unpacked into the classes necessary to run the job.
Refer to https://spark.apache.org/docs/latest/running-on-yarn.html
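For the read-from-HDFS, load-into-MySQL part of the question, a minimal PySpark sketch might look like the following (hostnames, paths, table names, and credentials are placeholders, and the MySQL JDBC driver has to be available to Spark, e.g. via spark.jars.packages):
from pyspark.sql import SparkSession

# local[*] is enough to read from HDFS; swap in "yarn" to run on the cluster instead.
spark = (SparkSession.builder
         .appName("hdfs-to-mysql")
         .master("local[*]")
         .getOrCreate())

# Read input files from HDFS (namenode host/port and path are placeholders).
df = spark.read.option("header", "true").csv("hdfs://namenode:8020/data/input/")

# ... ETL transformations on df ...

# Write the result to MySQL over JDBC (URL, table, and credentials are placeholders).
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://mysql-host:3306/target_db")
   .option("dbtable", "target_table")
   .option("user", "etl_user")
   .option("password", "...")
   .mode("append")
   .save())

spark.stop()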

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about this? It looks a little complicated to submit Spark remotely from another cluster, and it would create some duplication of configuration files.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add hdfs-site.xml and hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not every NodeManager necessarily has a Hive client installed on it).
I prefer submitting Spark jobs using SSHOperator and running the spark-submit command on the cluster itself, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task you perform is running Spark jobs; a single VM with LocalExecutor should be fine.
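A minimal sketch of that SSHOperator approach (assuming Airflow 2.x with the apache-airflow-providers-ssh package installed; the DAG id, connection id, and job path are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# The DAG logs into an edge node of the Spark/Hadoop cluster and runs spark-submit there,
# so no Hadoop client configuration is needed on the Airflow machine itself.
with DAG(
    dag_id="spark_job_via_ssh",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = SSHOperator(
        task_id="spark_submit_on_edge_node",
        ssh_conn_id="spark_edge_node",  # Airflow connection pointing at the cluster's edge node
        command=(
            "spark-submit --master yarn --deploy-mode client "
            "/opt/jobs/my_etl_job.py"
        ),
    )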
There are a variety of options for remotely performing spark-submit via Airflow:
EMR Steps
Apache Livy
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
the following error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only run
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can access HDFS files?
I have tried configuring yarn-site.xml in Hadoop following the Tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm and specifying HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 built for Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still occurs.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try this approach and see if it solves the error.
Inspired by the post here, I solved the problem myself.
The map-reduce job depends on a Serializable class, so when running in Spark local mode this class can be found and the job executes correctly.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and submitted the jar with spark-submit - it works like a charm!

Configure external jars with HDI Jupyter Spark (Scala) notebook

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case as I was able to successfully include a spark-package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!
In case people come here looking to add jars on EMR:
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
Contrary to the documentation, using "jars" directly won't work; pass the jar via spark.jars under "conf" as above.

Spark pyspark vs spark-submit

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?
There is no practical difference between these two. If not configured otherwise, both will execute code in local mode. If a master is configured (either via the --master command-line parameter or the spark.master configuration), the corresponding cluster will be used to execute the program.
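One quick way to see which master an interactive session actually ended up with (a small sketch; the printed value depends entirely on how the shell was launched or configured):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints e.g. "local[*]", "yarn", or "spark://host:7077" depending on the
# --master flag / spark.master setting used to start the session.
print(spark.sparkContext.master)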
If you are using EMR, there are three ways of running an application:
using pyspark (or spark-shell)
using spark-submit without --master and --deploy-mode
using spark-submit with --master and --deploy-mode
Although all three run the application on the Spark cluster, there is a difference in how the driver program works:
in the 1st and 2nd, the driver runs in client mode, whereas in the 3rd the driver is also inside the cluster;
in the 1st and 2nd, you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
..when I am running commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but it runs on a different machine from the one you call spark-submit from.
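As a small illustration of that driver/executor split (a sketch; only the DataFrame action is actually distributed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain Python: runs only on the driver process.
driver_side_total = sum(range(100))

# DataFrame action: planned on the driver, executed as tasks on the executors.
cluster_side_total = spark.range(100).selectExpr("sum(id)").collect()[0][0]

print(driver_side_total, cluster_side_total)  # both print 4950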
PySpark has some significant differences compared to Scala and Java Spark; Python Spark only supports YARN for scheduling on a cluster.
If you are running Python Spark on a local machine, you can use pyspark; on a cluster, use spark-submit.
If your Python Spark job has any dependencies, you need to package them into a zip file for submission.
