How to add external dependencies on HDInsight Jupyter?

I am using HDInsight Spark clusters on Azure, and Jupyter fails to add external dependencies: I load the phantom package in a %%configure cell, but the subsequent import still fails. However, if I make an intentional mistake in the version:
%%configure
{ "packages":["com.websudos:phantom_2.10:1.27.111111111111"] }
the cell fails with a resolution error. So the mechanism is resolving packages, just not loading them?
Is there any other way to make this work?

The package you are using is not the right one. The intentional mistake is actually telling you that it cannot resolve that package.
It seems the package you actually want is com.websudos:phantom-spark, since that's the artifact they built their Spark support on.
%%configure -f
{ "packages":["com.websudos:phantom-spark_2.10:1.8.0"] }
and then you can import
import com.websudos.phantom.spark._
However, if what you want is a Spark-Cassandra connector, the DataStax connector seems to be the one to use (a sketch of loading it follows below).
I should say I've never used Spark with Cassandra before, so please do follow the online tutorials on how to set it up.
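For example, loading the DataStax connector would follow the same %%configure pattern; the coordinates below are illustrative, so match the Scala suffix and version to your cluster's Spark version:
%%configure -f
{ "packages":["com.datastax.spark:spark-cassandra-connector_2.10:1.6.0"] }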

This article from HDInsight site might help you:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-notebook-use-external-packages/

Related

How can we connect to a remote Spark cluster via JupyterHub?

First of all, my goal is to be able to run Spark code inside JupyterHub; in other words, I want to connect a remote Spark cluster to JupyterHub. After searching, I came up with two solutions: 1) Livy and 2) sparkmagic. I tried Livy, but since it doesn't support Spark 3 I put it aside; we also considered sparkmagic, but we couldn't install it or even find good documentation about it.
What came to our minds was to somehow merge the two images of Spark and JupyterHub to get what we need.
Does anyone have any idea how this can be done, or a better suggestion?
Even good documentation about sparkmagic that we could use would be wonderful.
Thank you for your help.
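One caveat: sparkmagic itself talks to the cluster through a Livy endpoint, so it shares Livy's version constraints. For reference, it is configured via ~/.sparkmagic/config.json; a minimal sketch, where the Livy URL is a placeholder for your cluster's endpoint:
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-host:8998",
    "auth": "None"
  }
}
With that in place, the sparkmagic PySpark kernel (or the %%spark magic loaded via %load_ext sparkmagic.magics) runs code on the remote cluster through Livy.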

SparkSessionExtensions injectFunction in Databricks environment

SparkSessionExtensions injectFunction works locally, but I can't get it working in the Databricks environment.
The itachi project defines Catalyst expressions, like age, which I can successfully use locally via spark-sql:
bin/spark-sql --packages com.github.yaooqinn:itachi_2.12:0.1.0 --conf spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions
spark-sql> select age(timestamp '2000', timestamp'1990');
10 years
I'm having trouble getting this working in the Databricks environment.
I started up a Databricks community cluster with the spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions configuration option set.
Then I attached the library.
The array_append function that's defined in itachi isn't accessible as I expected it to be: calling it from a notebook still fails, even though I confirmed that the configuration option is properly set on the cluster.
spark-alchemy takes another approach that works in the Databricks environment. Do we need to mess around with Spark internals to get this working, or is there a way to get injectFunction working in Databricks?
spark.sql.extensions works just fine on full Databricks (unless the extension reaches too deep into Spark internals, where incompatibilities sometimes arise), but not on Community Edition. The problem is that spark.sql.extensions is applied during session initialization, while a library specified in the UI is installed afterwards, so installation happens after (or in parallel with) initialization. On full Databricks this is worked around with an init script that installs the library before the cluster starts, but that functionality is not available on Community Edition.
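On full Databricks, such an init script can be as simple as fetching the jar before the session starts. This is a sketch: the URL just follows from the Maven coordinates above, and /databricks/jars as the directory picked up on the driver classpath is an assumption to verify for your runtime:
#!/bin/bash
# Fetch itachi before Spark initializes, so spark.sql.extensions can
# find org.apache.spark.sql.extra.PostgreSQLExtensions at session startup.
wget -O /databricks/jars/itachi_2.12-0.1.0.jar \
  https://repo1.maven.org/maven2/com/github/yaooqinn/itachi_2.12/0.1.0/itachi_2.12-0.1.0.jar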
On Community Edition, the workaround is to register the functions explicitly, like this:
%scala
import org.apache.spark.sql.catalyst.expressions.postgresql.{Age, ArrayAppend, ArrayLength, IntervalJustifyLike, Scale, SplitPart, StringToArray, UnNest}
import org.apache.spark.sql.extra.FunctionAliases
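// Each .fd / alias value below is a FunctionDescription, i.e. a
// (FunctionIdentifier, ExpressionInfo, FunctionBuilder) triple, which is why
// its three components line up with registerFunction's parameters.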
spark.sessionState.functionRegistry.registerFunction(Age.fd._1, Age.fd._2, Age.fd._3)
spark.sessionState.functionRegistry.registerFunction(FunctionAliases.array_cat._1, FunctionAliases.array_cat._2, FunctionAliases.array_cat._3)
spark.sessionState.functionRegistry.registerFunction(ArrayAppend.fd._1, ArrayAppend.fd._2, ArrayAppend.fd._3)
spark.sessionState.functionRegistry.registerFunction(ArrayLength.fd._1, ArrayLength.fd._2, ArrayLength.fd._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyDays._1, IntervalJustifyLike.justifyDays._2, IntervalJustifyLike.justifyDays._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyHours._1, IntervalJustifyLike.justifyHours._2, IntervalJustifyLike.justifyHours._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyInterval._1, IntervalJustifyLike.justifyInterval._2, IntervalJustifyLike.justifyInterval._3)
spark.sessionState.functionRegistry.registerFunction(Scale.fd._1, Scale.fd._2, Scale.fd._3)
spark.sessionState.functionRegistry.registerFunction(SplitPart.fd._1, SplitPart.fd._2, SplitPart.fd._3)
spark.sessionState.functionRegistry.registerFunction(StringToArray.fd._1, StringToArray.fd._2, StringToArray.fd._3)
spark.sessionState.functionRegistry.registerFunction(UnNest.fd._1, UnNest.fd._2, UnNest.fd._3)
After that it works:
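For instance, the same query from the local spark-sql test above should now resolve in a notebook cell:
%sql
-- same check as the local spark-sql run above; expected output: 10 years
select age(timestamp '2000', timestamp '1990')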
It's not as handy as extensions, but that's a limitation of Community Edition.

Importing vs. installing Spark

I am new to the Spark world and, to some extent, to coding.
This question might seem too basic, but please clear up my confusion.
I know that we have to import Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing the application, I can also run it and see the output with "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my application?
Also, why does it need to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if, for example, you are running it from Scala/Java: you can just import spark-core (or any other Spark dependency) into your project, and when you start the job from your main class it will create a standalone Spark runtime inside your JVM and run the job there, provided the master is local[*].
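A minimal sketch of that local-run setup, assuming sbt and a Spark 3.x version (adjust versions to your own):
// build.sbt: Spark is pulled in as an ordinary library dependency
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"

// src/main/scala/LocalApp.scala
import org.apache.spark.sql.SparkSession

object LocalApp {
  def main(args: Array[String]): Unit = {
    // master("local[*]") starts an embedded Spark runtime inside this JVM,
    // so no separate Spark installation is needed for a local run.
    val spark = SparkSession.builder()
      .appName("local-demo")
      .master("local[*]")
      .getOrCreate()
    spark.range(10).show()
    spark.stop()
  }
}
Running this from the IDE's "Run" works because everything, including the Spark runtime, lives inside the one JVM; spark-submit and a cluster installation only come into play once you leave local mode.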
There are still reasons for having Spark installed on your local machine.
One of them is running jobs with PySpark, which requires the Spark/Python/etc. libraries and a runner (local[*] or a remote master URL).
Another reason is if you want to run your jobs on-premise.
It might be easier to create a cluster in your local datacenter, appoint your machine as the master, and connect the other machines to it as workers. (This solution might be a bit naive, but you asked for basics, so this might spark your curiosity to read more about the infrastructure design of data processing systems.)

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: the reason for needing v2.4 is to enable client mode, because I would like to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client mode (including exposing the service)?
The closest attempt I have found so far (though for Spark v2.0.0), which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about Jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. It is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via its REST API, and you can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
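As an illustration of the Livy route, a batch submission is a single REST call. This is a sketch: the service hostname and jar path are placeholders to adapt to the chart's actual service name and image layout:
# Submit SparkPi as a Livy batch job via the chart's Livy endpoint.
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar",
       "className": "org.apache.spark.examples.SparkPi"}' \
  http://<livy-service>:8998/batches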

Not able to find the HBaseContext class in the hbase-spark package

I am trying to use "HBaseContext" with Spark, but I can't find any details; the API docs at
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/spark/example/hbasecontext/
come up blank.
I am trying to implement some of the methods explained here:
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
Can anyone who has implemented any of these help?
Although HBaseContext is already implemented and documented in the HBase Reference Guide, the community has not released it yet. You can see from the HBaseContext commit history that work is still ongoing (and the older SparkOnHBase project has not been updated for a long time); no downloadable version of HBase includes the hbase-spark module at all.
This is a big point of confusion for beginners, and hopefully the community improves it. In the meantime, to access HBase from a Spark RDD you can treat HBase as a normal Hadoop data source: HBase provides TableInputFormat and TableOutputFormat for exactly this purpose.
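A minimal sketch of the TableInputFormat route (the table name is a placeholder, and the HBase client jars plus an hbase-site.xml pointing at your cluster are assumed to be on the classpath):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

// Each row arrives as a (row key, Result) pair, like any Hadoop InputFormat.
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(rdd.count())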
