How to give JupyterHub access to Hive tables through Spark in EMR - apache-spark

The default installation of JupyterHub in EMR has no access to the Hive context in Spark. How can I fix this?

To grant Spark access to the Hive context, edit the livy.conf file (/etc/livy/conf.dist/livy.conf) and set:
livy.repl.enableHiveContext = true
Then restart your notebook and the Livy service. Following the instructions here, that is basically:
sudo stop livy-server
sudo start livy-server
An easy way to check that it's working is to list the databases from your Spark notebook:
spark.sql("show databases").show
You may want to configure this at EMR boot time by using EMR's standard configuration features: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
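As a rough sketch of the boot-time approach, the Python below builds a configuration JSON for the cluster. It assumes that the EMR "livy-conf" classification maps onto livy.conf and reuses the property name from the answer above; the helper name is made up for illustration.

```python
import json


def livy_hive_configuration():
    """Return an EMR configuration list that enables the Hive context in Livy."""
    return [
        {
            # Assumption: the "livy-conf" classification is written to livy.conf.
            "Classification": "livy-conf",
            # Property name taken from the manual livy.conf edit above.
            "Properties": {"livy.repl.enableHiveContext": "true"},
        }
    ]


# Serialize to the JSON form that EMR expects for application configuration.
config_json = json.dumps(livy_hive_configuration(), indent=2)
print(config_json)
```

The printed JSON could be saved to a file and supplied at cluster creation, for example through the --configurations option of aws emr create-cluster.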

Related

Do we need to install Spark on YARN to read data from HDFS into PySpark?

I have a Hadoop 3.1.1 multi-node cluster. I want to use PySpark to read files from my HDFS for ETL operations and then load the results into target MySQL databases.
My questions are:
Can I install Spark in standalone mode?
Do I need to install Spark on my YARN cluster first?
If not, how can I install Spark separately?
You can use any mode for communicating with HDFS and MySQL, including Kubernetes. Or you can just use --master="local[*]", in which case you don't need a scheduler at all. This is useful, for example, from a Jupyter Notebook.
YARN would be recommended, since you already have HDFS and therefore the scripts to start the YARN processes as well.
You don't really "install Spark on YARN". Client applications get submitted to the YARN cluster. The archive at the spark.yarn.archive HDFS path gets unpacked to provide the classes necessary to run the job.
Refer to https://spark.apache.org/docs/latest/running-on-yarn.html
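To illustrate that the choice of mode is only a submit-time option, here is a small sketch that builds the spark-submit command line for either YARN or local mode. The script name etl_job.py and the helper function are hypothetical; only the --master and --deploy-mode flags come from spark-submit itself.

```python
import shlex


def spark_submit_command(script, master="yarn", deploy_mode=None):
    """Build a spark-submit argument list for the given master."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        # "client" or "cluster"; only meaningful with a cluster manager.
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(script)
    return cmd


# Same (hypothetical) PySpark script, two different masters.
yarn_cmd = spark_submit_command("etl_job.py", master="yarn", deploy_mode="client")
local_cmd = spark_submit_command("etl_job.py", master="local[*]")

# shlex.join quotes arguments such as local[*] so they are safe to paste in a shell.
print(shlex.join(yarn_cmd))
print(shlex.join(local_cmd))
```

Either list could be passed to subprocess.run on a machine where the Spark client is installed; nothing about the job code itself has to change between the two.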

Installing Spark/Zeppelin on Standalone node

I have a Cloudera cluster that is managed by an admin team. However, Zeppelin is not installed in the cluster.
I would like to install Zeppelin on a separate node and connect it to the Cloudera cluster.
Is it feasible to install Zeppelin on a node that is not part of the cluster and submit Spark jobs to it?
Any reference would be really appreciated.
Thanks
Zeppelin is just another Spark client.
On the machine where you want to use Zeppelin, first make sure that spark-shell and spark-submit work as expected; the Zeppelin configuration then becomes much easier.
An easy way to manage that would be to have the admins use Cloudera Manager to install the Spark (and Hive and Hadoop) client libraries on this standalone node. Then, I assume, they either give you SSH access or you tell them how to install Zeppelin.

Configure external jars with HDI Jupyter Spark (Scala) notebook

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case as I was able to successfully include a spark-package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!
In case people come here looking to add jars on EMR:
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
Contrary to the documentation, using the jars key directly won't work.
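If several jars are needed, the same cell can be generated rather than typed by hand. This is only a sketch: it relies on spark.jars accepting a comma-separated list, and the second jar path and the helper name are hypothetical.

```python
import json


def configure_cell(jar_uris, name="sparkTest"):
    """Return the text of a %%configure cell that sets spark.jars."""
    body = {
        "name": name,
        # spark.jars takes a comma-separated list of jar locations.
        "conf": {"spark.jars": ",".join(jar_uris)},
    }
    return "%%configure -f\n" + json.dumps(body)


cell = configure_cell([
    "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar",
    "s3://somebucket/artifacts/jars/my-custom-jar.jar",  # hypothetical second jar
])
print(cell)
```

The printed text is meant to be pasted into the first cell of the notebook, before the Spark session starts.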

Privileges for spark sql with sentry

I'm trying to set up privileges for Spark SQL access with Sentry. Spark SQL connects to HiveServer2 through the Thrift port (--hiveconf hive.server2.thrift.port). However, while I can limit users' privileges on Hive successfully, I cannot limit access through Spark SQL with Sentry.
Has anyone met the same problem?
Follow the docs to configure Hive in Spark.
Add the Sentry jars to the classpath; Spark will load them automatically.
It works for me.

How to connect to Spark EMR from the locally running Spark Shell

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or EMR cluster.
Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?
It looks like others have also failed at this and ended up running the Spark driver on EMR, but then making use of e.g. Zeppelin or Jupyter running on EMR.
Setting up our own machines as spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this was impossible to do and we forfeited after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: If you go down this route, I would consider using S3 for storing notebooks, then you don't have to manage EBS volumes.
One way of doing this is to add your Spark job as an EMR step to your EMR cluster. For this, you need the AWS CLI installed on your local computer (see here for the installation guide) and your jar file on S3.
Once you have the AWS CLI, assuming the Spark class to run is com.company.my.MySparkJob and your jar file is located on S3 at s3://hadi/my-project-0.1.jar, you can run the following command from your terminal:
aws emr add-steps --cluster-id j-************* --steps Type=Spark,Name=My_Spark_Job,Args=[--class,com.company.my.MySparkJob,s3://hadi/my-project-0.1.jar],ActionOnFailure=CONTINUE
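The same step can also be submitted from Python. The sketch below assumes boto3 is installed and AWS credentials are configured; it expresses the Spark step as a spark-submit invocation through command-runner.jar. The helper names are illustrative, and the cluster ID has to be supplied by you.

```python
def spark_step(name, main_class, jar_uri, action_on_failure="CONTINUE"):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", main_class, jar_uri],
        },
    }


def submit_step(cluster_id, step):
    """Submit the step to a running cluster and return the new step IDs."""
    import boto3  # third-party; imported here so the builder works without it

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


# Class name and jar location are the ones used in the CLI example above.
step = spark_step("My_Spark_Job", "com.company.my.MySparkJob", "s3://hadi/my-project-0.1.jar")
print(step)
```

Calling submit_step with your own cluster ID and the step above would queue the job; the driver then runs on the cluster rather than on your local machine.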
