Connect local jupyter notebook to HDInsight Cluster via sparkmagic - azure

I have deployed an HDInsight 3.5 Spark (2.0) cluster on Microsoft Azure with the standard configuration (Location = US East, Head Nodes = D12 v2 (x2), Worker Nodes = D4 v2 (x4)). Locally I have installed sparkmagic following the steps in https://github.com/jupyter-incubator/sparkmagic/blob/master/README.md#installation and https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-install-locally and changed the config.json file. When starting Jupyter Notebook I can choose the PySpark kernel. Even though I get the message that the kernel is ready, when I try to execute a simple statement (e.g. t = 4), the kernel runs indefinitely. Could you provide possible solution(s)?

Most probably config.json is configured with the wrong endpoint, username, or password. If you are using the base64_password field, make sure the password is base64 encoded.
Without more information about the errors (the log files should be in ~/.sparkmagic/logs), it's hard to say why you couldn't connect.
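If the session never starts, it can help to verify the Livy endpoint and credentials directly from the local machine before blaming the kernel. Here is a minimal sketch (the cluster name, username, and password are placeholders; HDInsight exposes Livy behind https://<clustername>.azurehdinsight.net/livy):
import base64
import requests

cluster = "<your-cluster-name>"   # placeholder
username = "admin"                # cluster login username
password = "<your-password>"      # cluster login password

# The Livy sessions endpoint that sparkmagic talks to through the HDInsight gateway
url = "https://{}.azurehdinsight.net/livy/sessions".format(cluster)
response = requests.get(url, auth=(username, password))
print(response.status_code)  # 200 means the endpoint and credentials work

# If you use the base64_password field in config.json, encode the password like this
print(base64.b64encode(password.encode("utf-8")).decode("utf-8"))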

Related

How to get root access to a pod in openshift 4.0

We have OpenShift v4.0 deployed and running. We are using the Open Data Hub framework within OpenShift, where we have JupyterHub along with Spark.
The goal is to read a bunch of CSV files with Spark and load them into MySQL. The error I was getting is described in this thread: How to set up JDBC driver for MySQL in Jupyter notebook for pyspark?.
One of the solutions is to copy the JAR file to the Spark master node, but I don't have access to the pod as the root user.
How can I get root access within a pod in OpenShift?
#roar S, your answer is correct; however, it is preferable to create your own SCC identical to the "anyuid" SCC (call it "my-anyuid") and link that new SCC to the service account.
(Also, your link points to OCP v3.2 documentation, while the question is about OCP v4.x.)
We had a bad experience with the approach you proposed: an upgrade from OCP v4.2 to v4.3 failed because of it. "add-scc-to-user" in fact modifies the target SCC, and the upgrade process didn't like that.
To create an SCC similar to anyuid, just extract the anyuid manifest (oc get scc anyuid -o yaml), save it, remove all linked service accounts from the manifest, change the name, and create the new one, as sketched below.
https://docs.okd.io/latest/authentication/managing-security-context-constraints.html
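A minimal sketch of that procedure, assuming the new SCC is called my-anyuid and your pods run under a hypothetical service account my-sa in project my-project:
# Save the anyuid manifest, then edit the copy: change metadata.name to my-anyuid
# and remove the users/groups entries before creating the new SCC
oc get scc anyuid -o yaml > my-anyuid.yaml
oc create -f my-anyuid.yaml

# Grant the new SCC only to your own service account, leaving anyuid itself untouched
oc adm policy add-scc-to-user my-anyuid -z my-sa -n my-project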

How do I restart a stopped Spark Context?

I'm running Spark with apache zeppelin and hadoop. My understanding is that Zeppelin is like a kube app that sends commands to a remote machine that's running Spark and accessing files with Hadoop.
I often run into a situation where the Spark Context gets stopped. In the past, I believed it was because I overloaded the system with a data pull that required too much data, but now I'm less enthusiastic about that theory. I've frequently had it happen after running totally reasonable and normal queries.
In order to restart the Spark Context, I've gone to the interpreter binding settings and restarted spark.
I've also run this command
%python
import requests
import json

# Session cookie copied from the browser, and the folder holding my notebooks
JSESSIONID = "09123q-23se-12ae-23e23-dwtl12312"
YOURFOLDERNAME = "[myname]"
cookies = {"JSESSIONID": JSESSIONID}

# Ask the Zeppelin job manager for all notebook jobs
notebook_response = requests.get('http://localhost:8890/api/notebook/jobmanager', cookies=cookies)
body = json.loads(notebook_response.text)["body"]["jobs"]

# Keep only my own Spark notebooks
notebook_ids = [note["noteId"] for note in body
                if note.get("interpreter") == "spark" and YOURFOLDERNAME in note.get("noteName", "")]

# Restart the Spark interpreter for each of those notebooks
for note_id in notebook_ids:
    requests.put("http://localhost:8890/api/interpreter/setting/restart/spark",
                 data=json.dumps({"noteId": note_id}), cookies=cookies)
I've also gone to the machine running spark and entered yarn top and I don't see my username listed within the list of running applications.
I know that I can get it working if I restart the machine, but that'll also restart the machine for everyone else using it.
What other ways can I restart a Spark Context?
I assume that you have configured your Spark interpreter to run in isolated mode. In that case you get a separate interpreter instance per user.
You can restart your own instance and get a new SparkContext from the interpreter binding menu of a notebook by pressing the refresh button (tested with Zeppelin 0.8.2).
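If you prefer not to click through the UI, the same restart can also be triggered through Zeppelin's REST API, mirroring the snippet from the question (the host, port, and session cookie below are assumptions taken from that snippet):
%python
import requests

# Ask Zeppelin to restart the 'spark' interpreter setting; the next paragraph
# you run will get a fresh SparkContext
cookies = {"JSESSIONID": "<your-session-cookie>"}
response = requests.put("http://localhost:8890/api/interpreter/setting/restart/spark", cookies=cookies)
print(response.status_code)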

Azure HDInsights Spark Cluster Install External Libraries

I have an HDInsight Spark cluster. I installed TensorFlow using a script action and the installation went fine (Success).
But now when I go and create a Jupyter notebook, I get:
import tensorflow
Starting Spark application
The code failed because of a fatal error:
Session 8 unexpectedly reached final status 'dead'. See logs:
YARN Diagnostics:
Application killed by user..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
I don't know how to fix this error... I tried some things like looking at logs but they are not helping.
I just want to connect to my data and train a model using tensorflow.
This looks like an error with Spark application resources. Check the resources available on your cluster and close any applications that you don't need. Please see more details here: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-resource-manager#kill-running-applications
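If freeing up resources is not enough, you can also request a smaller session from the notebook itself with sparkmagic's %%configure magic (the -f flag restarts the session; the numbers below are only illustrative, pick values your cluster can actually satisfy):
%%configure -f
{"numExecutors": 2, "executorMemory": "2g", "driverMemory": "2g"}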

Connecting IPython notebook to spark master running in different machines

I don't know if this is already answered on SO, but I couldn't find a solution to my problem.
I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image.
I also have a Spark cluster created with Google Cloud Dataproc.
The Spark master and the notebook run in different VMs but in the same region and zone.
My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).
What I found out there is about connecting a local browser over an SSH tunnel.
Has somebody already done this kind of setup?
Thank you in advance
Dataproc runs Spark on YARN, so you need to set master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent config, you could bake these into a local file 'core-site.xml' as described here, place that in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
It's also worth noting that while being in the same zone is important for performance, what actually lets your VMs communicate is being in the same network and allowing TCP traffic between internal IP addresses in that network. If you are using the default network, then the default-allow-internal firewall rule should be sufficient.
Hope that helps.

Remote Desktop Not Working on Hadoop on Azure

I am able to allocate a Hadoop cluster on Windows Azure by entering my Windows Live ID, but after that, I am unable to do Remote Desktop to the master node there.
Before the cluster creation, it shows a message that says "Microsoft has got overwhelmingly positive feedback from Hadoop on Azure users, hence it's giving a free trial for 5 days with 2 slave nodes."
[P.S. This preview version of HoA was working before.]
Any suggestions for this problem?
Thanks in advance..
When you created your Hadoop cluster, you were asked to enter a DNS name for the cluster, which would be something like your_hadoop_cluster.cloudapp.net.
So first, ping your Hadoop cluster name to see if it returns an IP address; this will show whether you really have a cluster configured at all. If you don't get an IP back, then you don't have a Hadoop cluster on Azure, so try creating one.
If you are sure you do have a Hadoop cluster on Windows Azure, try posting your question to the following Hadoop on Azure CTP forum and you will get the help you need:
http://tech.groups.yahoo.com/group/HadoopOnAzureCTP/
