Azure Databricks job fails with error message

When a node restarts, the job fails with the following message:
ImportError: No module named mlflow
I installed mlflow from the Databricks cluster UI, but I am still facing this issue.
Cluster Configuration: Runtime 10.4 LTS Scala 2.12, Spark 3.2.1

The Cluster Manager is the part of the Azure Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R libraries when it restarts each node. Sometimes, library installation or the download of artifacts from the internet can take longer than expected, either due to network latency or because the library being attached to the cluster has many dependent libraries.
Solution:
Use notebook-scoped library installation commands in the notebook. Note that on Databricks Runtime 7.0 and above (which includes 10.4 LTS), the dbutils.library utilities are removed; use the %pip magic command at the top of the notebook instead:
%pip install mlflow
On Runtime 6.x and below, the equivalent commands, entered in a single cell, are:
dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()
Refer - https://learn.microsoft.com/en-us/azure/databricks/kb/libraries/library-install-latency
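Outside of Databricks-specific utilities, the same idea — make sure a library is importable before the job body runs — can be sketched in plain Python. This check-then-install helper is an illustrative pattern, not a Databricks API:

```python
import importlib
import subprocess
import sys

def ensure_installed(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing.

    A generic fallback pattern; on Databricks itself, prefer the %pip
    magic so the library is scoped to the notebook session.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # pip_name may differ from the import name (e.g. "Pillow" vs "PIL")
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        return importlib.import_module(module_name)

# Example: guarantee mlflow is importable before the job body runs.
# mlflow = ensure_installed("mlflow")
```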

Related

Kubernetes + Spark Job not Progressing/Stuck

I am trying to run pyspark code on Kubernetes cluster.
The application flow should be: read data -> cache -> perform multiple actions, but the job is not progressing at all. It is stuck on the log message:
WatchConnectionManager: The resource version -some number- no longer exists. Scheduling a reconnect.
What could be the problem?
This looks like a known issue within Spark itself; it should be fixed in versions 3.0.2 and 3.1.0:
https://issues.apache.org/jira/browse/SPARK-24266
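To check whether a given deployment already includes the SPARK-24266 fix, you can compare version tuples. A minimal sketch (the helper functions here are illustrative, not a Spark API, and assume plain numeric version strings):

```python
def parse_version(version):
    """Turn a version string like '3.0.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def has_watch_reconnect_fix(spark_version):
    """True if the running Spark includes the SPARK-24266 fix,
    which shipped in 3.0.2 and 3.1.0."""
    return parse_version(spark_version) >= (3, 0, 2)

# In a live session you would pass the reported version, e.g.:
# has_watch_reconnect_fix(spark.version)
```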

Spark Standalone Cluster: Configuring a Distributed File System

I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is only bundled with the Hadoop client libraries of the specified version; you still need to install and configure Hadoop separately if you want to use HDFS.
A2. Running with YARN means you are using YARN as Spark's cluster resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So in cases where you don't need a DFS, for example when you are only running Spark Streaming applications, you can still install Hadoop but run only the YARN processes, to use its resource-management functionality.
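To make the A1 point concrete: once Hadoop is running, a shared file is addressed by an hdfs:// URI instead of a local path. A small illustrative helper (the namenode host is a placeholder; 8020 is a common default NameNode port, but yours may differ):

```python
def to_hdfs_uri(path, namenode_host, port=8020):
    """Build an hdfs:// URI for an absolute path, as Spark expects when
    a standalone cluster reads from HDFS instead of local disk."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return "hdfs://%s:%d%s" % (namenode_host, port, path)

# e.g. df = spark.read.csv(to_hdfs_uri("/data/input.csv", "namenode"))
```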

Azure HDInsight Spark Cluster: Install External Libraries

I have an HDInsight Spark cluster. I installed TensorFlow using a script action. The installation went fine (Success).
But now when I go and create a Jupyter notebook, I get:
import tensorflow
Starting Spark application
The code failed because of a fatal error:
Session 8 unexpectedly reached final status 'dead'. See logs:
YARN Diagnostics:
Application killed by user..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
I don't know how to fix this error... I tried some things like looking at logs but they are not helping.
I just want to connect to my data and train a model using tensorflow.
This looks like an error with Spark application resources. Check the resources available on your cluster and close any applications that you don't need. See more details here: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-resource-manager#kill-running-applications
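Besides the resource manager UI described in the linked doc, YARN applications can also be killed through the ResourceManager REST API (PUT to /ws/v1/cluster/apps/{app-id}/state). A sketch of building that request — the host name and application id are placeholders, and 8088 is the usual ResourceManager web port:

```python
import json

def build_kill_request(rm_host, app_id, port=8088):
    """Return (url, body) for the YARN ResourceManager REST call that
    kills an application by setting its state to KILLED."""
    url = "http://%s:%d/ws/v1/cluster/apps/%s/state" % (rm_host, port, app_id)
    body = json.dumps({"state": "KILLED"})
    return url, body

# url, body = build_kill_request("headnode0", "application_1500000000000_0008")
# then send it with an HTTP PUT (Content-Type: application/json),
# e.g. via urllib.request or curl.
```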

What version of Apache spark is used in my IBM Analytics for Apache Spark for IBM Cloud service?

I saw an email indicating the sunset of support for Apache Spark 1.6 within IBM Cloud. I am pretty sure my version is 2.x, but I wanted to confirm. I couldn't find anywhere in the UI that indicated the version, and the bx CLI command that I thought would show it didn't.
[chrisr@oc5287453221 ~]$ bx service show "Apache Spark-bc"
Invoking 'cf service Apache Spark-bc'...
Service instance: Apache Spark-bc
Service: spark
Bound apps:
Tags:
Plan: ibm.SparkService.PayGoPersonal
Description: IBM Analytics for Apache Spark for IBM Cloud.
Documentation url: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
Dashboard: https://spark-dashboard.ng.bluemix.net/dashboard
Last Operation
Status: create succeeded
Message:
Started: 2018-01-22T16:08:46Z
Updated: 2018-01-22T16:08:46Z
How do I determine the version of spark that I am using? Also, I tried going to the "Dashboard" URL from above, and I got an "Internal Server Error" message after logging in.
The information found on How to check the Spark version doesn't seem to help, because it seems to be related to locally installed spark instances. I need to find out the information from the IBM Cloud (ie. Bluemix) using either the UI or the bluemix CLI. Other possibilities would be running some command from a Jupyter Notebook in iPython running in Data Science Experience (part of IBM Cloud).
The answer was given by ptitzler in the comments; I am just adding it as an answer, as requested by the email I was sent.
The Spark service itself is not version specific. To find out whether
or not you need to migrate you need to inspect the apps/tools that
utilize the service. For example if you've created notebooks in DSX
you associated them with a kernel that was bound to a specific Spark
version and you'd need to open each notebook to find out which Spark
version they are utilizing. – ptitzler Jan 31 at 16:32
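Given ptitzler's note that each notebook kernel is bound to a specific Spark version, a quick way to flag the notebooks affected by the 1.6 sunset is to compare major/minor versions. Inside a notebook, the version string itself comes from sc.version or spark.version; the helper below is an illustrative sketch, not an IBM Cloud API:

```python
def needs_migration(spark_version, sunset=(1, 6)):
    """True if a kernel's Spark version is at or below the sunset
    release (e.g. '1.6.1' -> True, '2.1.0' -> False)."""
    major_minor = tuple(int(p) for p in spark_version.split(".")[:2])
    return major_minor <= sunset

# e.g. run needs_migration(sc.version) inside each notebook
```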

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in Virtual Box 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services, there is Spark but in standalone mode. So I click on "Add a new service", select "Spark". Then I have to select the set of dependencies for this service, I have no choices I must pick HDFS/YARN/zookeeper.
Next step I have to choose a History Server and a Gateway, I run the VM in local mode so I can only choose localhost.
I click on "Continue" and this error occurs (plus 69 stack traces):
A server error has occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know if an internet connection is needed, but I should point out that I can't connect to the internet from the VM. (EDIT: even with an internet connection I get the same error.)
I have no idea how to add this service. I tried with and without a gateway, and with many network options, but it never worked. I checked the known issues; nothing...
Does anyone know how I can solve this error or work around it? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different modes: (1) local, (2) Spark's own stand-alone cluster manager, and (3) other cluster resource managers such as Hadoop YARN and Apache Mesos (it can also be deployed on Amazon EC2).
Spark works out of the box with CDH 5 for (1) and (2). You can initiate a local interactive Spark session in Scala using the spark-shell command, or pyspark for Python, without passing any arguments. I find the interactive Scala and Python interpreters helpful for learning to program with Resilient Distributed Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I don't mean to take credit for the bug you discovered, but I posted it to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, check whether all of the Spark daemons are running with the following command (you can also do this from the Cloudera Manager (CM) UI):
[cloudera@quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set HADOOP_CONF_DIR to the root of the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CDH5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify you are using Hadoop YARN.
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare it to the Spark History Server. Hue should show the job placed in a generic YARN container, and the Spark History Server should have no record of the submitted job.
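The submission step above can be sketched as a small helper that assembles the spark-submit argument list; the class name, jar path, and arguments below are placeholders for your own application:

```python
import os

def build_spark_submit(main_class, jar_path, app_args=(), master="yarn"):
    """Assemble the spark-submit command line for a YARN submission.
    Assumes HADOOP_CONF_DIR points at the directory holding
    yarn-site.xml (typically /etc/hadoop/conf on CDH5)."""
    os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    cmd = ["spark-submit", "--class", main_class, "--master", master, jar_path]
    cmd.extend(app_args)
    return cmd

# e.g. subprocess.check_call(
#     build_spark_submit("com.example.App", "app.jar", ["input", "output"]))
```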
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn
