Azure HDInsights Spark Cluster Install External Libraries - azure

I have a HDInsights Spark Cluster. I installed tensorflow using a script action. The installation went fine (Success).
But now when I go and create a Jupyter notebook, I get:
import tensorflow
Starting Spark application
The code failed because of a fatal error:
Session 8 unexpectedly reached final status 'dead'. See logs:
YARN Diagnostics:
Application killed by user..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
I don't know how to fix this error... I tried some things like looking at logs but they are not helping.
I just want to connect to my data and train a model using tensorflow.

This looks like error with Spark application resources. Check resources available on your cluster and close any applications that you don't need. Please see more details here: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-resource-manager#kill-running-applications

Related

Cannot get PySpark working in Kubernetes getting (Initial job has not accepted any resources)

I'm trying to use the following Helm Chart for Spark on Kubernetes
https://github.com/bitnami/charts/tree/main/bitnami/spark
The documentation is of course spotty but I've muddled along. So I have it installed with custom values that assign things like resource limits etc. I'm accessing the master through a NodePort and the WebUI through a port forward. I am NOT using spark-submit, I'm writing Python code to drive the Spark Cluster as follows:
import pyspark
sc = pyspark.SparkContext(appName="Testy", master="spark://<IP>:<PORT>")
This Python code is running locally on my Windows laptop, the Kubernetes cluster is on a separate set of servers. It connects and I can see the app appear in the WebUI but the second it tries to do something I get the following:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The master seems to be in a cycle of removing and launching executors and the 3 workers each just fail to run a launch command. Interestingly the command has the hostname of my laptop in here:
"--driver-url" "spark://CoarseGrainedScheduler#<laptop hostname>:60557"
Got to imagine that's not right. So in this setup where should I be actually running the python code? On the kubernetes cluster? Can I run it locally on my laptop? These details are of course missing from the docs. I'm new to Spark so just looking for the absolute basics. My preferred workflow would be to develop code locally on my laptop then run it on the Kubernetes cluster I have access to.

Azure databricks Job fails with error message

When node restart, job fails with the following message:
ImportError: No module named mlflow
I have installed mlflow from Databricks Cluster UI, still facing this issue.
Cluster Configuration: Runtime 10.4 LTS Scala 2.12, Spark 3.2.1
The Cluster Manager is part of the Azure Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R libraries when it restarts each node. Sometimes, library installation or downloading of artifacts from the internet can take more time than expected. This occurs due to network latency, or it occurs if the library that is being attached to the cluster has many dependent libraries.
Solution:
Use notebook-scoped library installation commands in the notebook. You can enter the following commands in one cell, which ensures that all of the specified libraries are installed.
dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()
Refer - https://learn.microsoft.com/en-us/azure/databricks/kb/libraries/library-install-latency

Configure external jars with HDI Jupyter Spark (Scala) notebook

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case as I was able to successfully include a spark-package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!
in case people come here for adding jars on EMR.
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
contrary to the document, use jars directly won't work.

Spark gives error(Initial Job has not been accepted)

I am trying spark on amazon EC2 but i am facing this issue.
When i tried to run with local spark worker it works fine but when i tried with only EC2 instance it give error of Initial Job has been not accepted.
How I resolve this error?
This error aises when your all resources are either held by another Job or the spark cluster has not enough resources to run your application. Try to tune the resources and check the configuration from Spark UI.

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in Virtual Box 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services, there is Spark but in standalone mode. So I click on "Add a new service", select "Spark". Then I have to select the set of dependencies for this service, I have no choices I must pick HDFS/YARN/zookeeper.
Next step I have to choose a History Server and a Gateway, I run the VM in local mode so I can only choose localhost.
I click on "Continue" and this error occures (+ 69 traces) :
A server error as occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know if an internet connection is needed but I precise that I can't connect to the internet with the VM. (EDIT : Even with an internet connection I get the same error)
I have no ideas how to add this service, I tried with or without gateway, many network options but it never worked. I checked the known issues; nothing...
Someone knows how I can solve this error or how I can work around ? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different formats: (1) local, (2) Spark's own stand-alone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out-of-the-box with CHD 5 for (1) and (2). You can initiate a local
interactive spark session in Scala using the spark-shell command
or pyspark for Python without passing any arguments. I find the interactive Scala and Python
interpreters help learning to program with Resilient Distributed
Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I didn't mean to take credit for the bug you discovered, but I posted to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, see if all of the Spark daemons are running using the following command (you can do this inside the Cloudera Manager (CM) UI):
[cloudera#quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set the HADOOP_CONF_DIR to the root of the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CHD5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify you are using Hadoop YARN.
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare to the Spark History server. Hue should show the job placed in a generic Yarn container and Spark History should not have a record of the submitted job.
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn

Resources