Configure external jars with HDI Jupyter Spark (Scala) notebook - apache-spark

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case as I was able to successfully include a spark-package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy

Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!

in case people come here for adding jars on EMR.
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
contrary to the document, use jars directly won't work.

Related

Spark shell for Databricks

Notebooks are nice, but REPL is sometimes more useful. Am I somehow able to run spark-shell that executes on Databricks? Like:
spark-shell --master https://adb-5022.2.azuredatabricks.net/
I looked through available tools related to Databricks (databricks connect, dbx, ...), but it seems there's no such functionality.
Databricks connect is the tool that you need if you want to execute code from you local machine in the Databricks cluster. Same as the spark-shell, the driver will be on your local machine, and executors are remove. The databricks-connect package installs the modified distribution of the Apache Spark so you can use spark-shell, pyspark, spark-submit, etc. - just make sure that that directory is in the PATH.
P.S. but I really don't understand why notebooks doesn't work for you - spark-shell doesn't have any superior features compared to them.

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
Error throws
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only
start-dfs.sh
in Hadoop and does not really config anything in Spark. Do I need to run Spark using YARN cluster manager instead so that Spark and Hadoop are using the same cluster manager, hence can get access to HDFS files?
I have tried to config yarn-site.xml in Hadoop following tutorialspoint https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error throws. Am I missing some other configurations?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 with Hadoop 2.7. Tried to download hadoop-2.7.4 but the same error still exists.
The question here suggests this as a java syntax issue rather than spark hdfs issue. I will try this approach and see if this solves the error here.
Inspired by the post here, solved the problem by myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode, this serializable class can be found and the map-reduce job can be executed dependently.
When running in Spark standalone cluster mode, the best is to submit the application through spark-submit, rather than running in an IDE. Packaged everything in jar and spark-submit the jar, works as a charm!

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 works fine (Spark 1.6.3) but now fails with DSE 5.1.0 getting this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the use-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within spark, seems to be RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource manager is different than the OSS Spark Standalone Resource Manager. The DSE method uses a different uri "dse://" because under the hood it actually is performing a CQL based request. This has a number of benefits over the Spark RPC but as noted does not match some of the submission
mechanisms possible in OSS Spark.
There are several articles on this on the Datastax Blog as well as documentation notes
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars" you must also add the DSE specific jars and config options to talk with the resource manager. In DSE 5.1.3+ there is a class provided
DseConfiguration
Which can be applied to your Spark Conf DseConfiguration.enableDseSupport(conf) (or invoked via implicit) which will set these options for you.
Example
Docs
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think is impossible to run a Spark job within an Application within DSE 5.1. Has to be sent with dse spark-submit
Once sent, it works perfectly. In order to do the communications to the job I used Apache Kafka.
If you don't want to use a job, you can always go back to a Apache Spark.

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help, I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to run Hive off MapReduce on Dataproc running Tez, with this initialization action would probably be easier.

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in Virtual Box 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services, there is Spark but in standalone mode. So I click on "Add a new service", select "Spark". Then I have to select the set of dependencies for this service, I have no choices I must pick HDFS/YARN/zookeeper.
Next step I have to choose a History Server and a Gateway, I run the VM in local mode so I can only choose localhost.
I click on "Continue" and this error occures (+ 69 traces) :
A server error as occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know if an internet connection is needed but I precise that I can't connect to the internet with the VM. (EDIT : Even with an internet connection I get the same error)
I have no ideas how to add this service, I tried with or without gateway, many network options but it never worked. I checked the known issues; nothing...
Someone knows how I can solve this error or how I can work around ? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different formats: (1) local, (2) Spark's own stand-alone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out-of-the-box with CHD 5 for (1) and (2). You can initiate a local
interactive spark session in Scala using the spark-shell command
or pyspark for Python without passing any arguments. I find the interactive Scala and Python
interpreters help learning to program with Resilient Distributed
Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I didn't mean to take credit for the bug you discovered, but I posted to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, see if all of the Spark daemons are running using the following command (you can do this inside the Cloudera Manager (CM) UI):
[cloudera#quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set the HADOOP_CONF_DIR to the root of the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CHD5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify you are using Hadoop YARN.
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare to the Spark History server. Hue should show the job placed in a generic Yarn container and Spark History should not have a record of the submitted job.
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn

Resources