How to run Oozie jobs in an HDInsight cluster?

I have an Oozie workflow that I'd like to run on an HDInsight cluster. My job has a jar file and a workflow.xml file that I store in Azure blob storage, but the only place I found to store the job.config file is on the local storage of the HDInsight headnode. My concern is what happens when the VM gets re-imaged: does that remove my job.config file?

In general, you can use Script Actions on HDInsight. Script Actions customize HDInsight clusters during provisioning, so every time the cluster is created the scripts are run. (You were smart to be concerned about what happens when the cluster is re-created!)
The advanced configuration options show how to customize an HDInsight cluster during provisioning using PowerShell. There is an Oozie section:
# oozie-site.xml configuration
$OozieConfigValues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightOozieConfiguration'
$OozieConfigValues.Configuration = @{ "oozie.service.coord.normal.default.timeout"="150" } # default 120
Does that help?
Other resources:
Customizing HDInsight Cluster provisioning
Oozie tutorial on HDInsight
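To make the workflow survive re-imaging without relying on headnode-local files, another option is to keep the job config file itself in blob storage next to the workflow, and fetch a local copy only at submit time (Oozie's -config expects a local file). A hedged sketch, where the storage account, container, and paths are hypothetical:

```shell
# Hedged sketch: keep every Oozie artifact (workflow.xml, the job jar, and the
# job config file) in blob storage so nothing is lost when the headnode is
# re-imaged. Storage account, container, and paths below are hypothetical.
STORAGE_ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"
APP_DIR="wasb://${CONTAINER}@${STORAGE_ACCOUNT}.blob.core.windows.net/oozie/myapp"

# Upload the config next to the workflow definition:
# hadoop fs -put job.config "${APP_DIR}/job.config"

# At submit time, fetch a local copy and point -config at it:
# hadoop fs -get "${APP_DIR}/job.config" /tmp/job.config
# oozie job -oozie http://headnodehost:11000/oozie -config /tmp/job.config -run

echo "${APP_DIR}"
```

This way a re-imaged headnode only costs you a `hadoop fs -get` before the next submission.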

Related

Extracting Spark logs (Spark UI contents) from Databricks

I am trying to save Apache Spark logs (the contents of Spark UI), not necessarily stderr, stdout and log4j files (although they might be useful too) to a file so that I can send it over to someone else to analyze.
I am following the manual described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere, and you can display them from the web UI but cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?
The simplest way of doing it is the following:
You create a separate storage account + container in it, or a separate container in an existing storage account, and give developers access to it
You mount that container to the Databricks workspace
You configure clusters/jobs to write logs into the mount location (you can enforce it for new objects using cluster policies). This will create sub-directories per cluster, containing logs of the driver & executors plus the results of init script execution
(optional) you can set up a retention policy on that container to automatically remove old logs.
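The cluster side of step 3 can be sketched via the Databricks Clusters API; the log destination just needs to point at the container mounted in step 2. Host, cluster sizing, and mount path below are assumptions:

```shell
# Hedged sketch: create a cluster whose driver/executor logs are delivered to
# a DBFS mount backed by your own storage container. Host, versions, node
# type, and mount path are hypothetical.
LOG_DEST="dbfs:/mnt/cluster-logs"   # the container mounted in step 2

PAYLOAD=$(cat <<EOF
{
  "cluster_name": "shared-etl",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "cluster_log_conf": { "dbfs": { "destination": "${LOG_DEST}" } }
}
EOF
)

# curl -X POST "https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/create" \
#      -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
#      -d "${PAYLOAD}"
echo "${PAYLOAD}"
```

Anyone with read access to the storage container can then pull the logs without ever entering the workspace.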

Spark cluster on Kubernetes without spark-submit

I have a spark application and want to deploy this on a Kubernetes cluster.
Following the documentation below, I have managed to create an empty Kubernetes cluster, generate a docker image using the Dockerfile provided under kubernetes/dockerfiles/spark/Dockerfile, and deploy it on the cluster using spark-submit in a Dev environment.
https://spark.apache.org/docs/latest/running-on-kubernetes.html
However, in a 'proper' environment we have a managed Kubernetes cluster (bespoke, unlike EKS etc.) and will have to provide pod configuration files to get deployed.
I believe you can supply a Pod template file as an argument to the spark-submit command.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template
How can I do this without spark-submit? And are there any example yaml files?
PS: we have limited access to this cluster, e.g. we can install Helm charts but not operator or controller.
You could try the Kubernetes Spark operator (CRD) at https://github.com/GoogleCloudPlatform/spark-on-k8s-operator and provide the pod configuration through it.
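With the operator installed (it is deployable via a Helm chart, which may fit the limited-access constraint mentioned in the question), applications are described declaratively instead of via spark-submit. A hedged sketch of a SparkApplication manifest, where the image, names, and versions are placeholders:

```yaml
# Hypothetical SparkApplication for the spark-on-k8s-operator; adjust image,
# versions, and resources to your environment.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi          # hypothetical application name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "my-registry/spark:3.5.0"   # hypothetical image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
```

You submit it with `kubectl apply -f`, so no spark-submit is needed on your side; the operator runs spark-submit internally.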

how to rename Databricks job cluster name during runtime

I have created an ADF pipeline with a Notebook activity. This notebook activity automatically creates Databricks job clusters with autogenerated job cluster names.
1. Rename Job Cluster during runtime from ADF
I'm trying to rename this job cluster with the process/other name during runtime from ADF/the ADF linked service.
Instead of job-59, I want it to be replaced with <process_name>_
2. Rename ClusterName Tag
Wanted to replace Default generated ClusterName Tag to required process name
Settings for the job can be updated using the Reset or Update endpoints.
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster, pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.
These tags propagate to detailed cost analysis reports that you can access in the Azure portal.
Check out an example of how billing works.
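Putting the two parts together: the autogenerated cluster name (e.g. job-59) isn't directly settable, but the Jobs Update endpoint can change the job's settings, including custom tags on the job cluster, and those tags flow into the cost reports. A hedged sketch (host, job_id, cluster spec, and tag are hypothetical; since Databricks applies the default ClusterName tag itself, a separate tag such as ProcessName is used here):

```shell
# Hedged sketch: attach a process-name tag to a job cluster via the Jobs API
# Update endpoint. job_id, cluster spec, and the tag value are hypothetical.
JOB_ID=59

PAYLOAD=$(cat <<EOF
{
  "job_id": ${JOB_ID},
  "new_settings": {
    "job_clusters": [
      {
        "job_cluster_key": "main",
        "new_cluster": {
          "spark_version": "13.3.x-scala2.12",
          "node_type_id": "Standard_DS3_v2",
          "num_workers": 2,
          "custom_tags": { "ProcessName": "my_process" }
        }
      }
    ]
  }
}
EOF
)

# curl -X POST "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/update" \
#      -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
#      -d "${PAYLOAD}"
echo "${PAYLOAD}"
```

The tag then appears on the VMs and in the Azure cost analysis reports, which usually achieves what the rename was for.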

How to Pass Variables into Azure Databricks Cluster Init Script

I'm trying to use workspace environment variables to pass access tokens into my custom cluster init scripts.
It appears that there are only a few supported environment variables that we can access in our custom cluster init scripts as described at https://docs.databricks.com/clusters/init-scripts.html#environment-variables
I've attempted to write to the base cluster configuration using
Microsoft.Azure.Databricks.Client.SparkEnvironmentVariables.Add("WORKSPACE_ID", workspaceId)
My init scripts are still failing to uptake this variable in the following line:
[[ -z "${WORKSPACE_ID}" ]] && LOG_ANALYTICS_WORKSPACE_ID='default' || LOG_ANALYTICS_WORKSPACE_ID="${WORKSPACE_ID}"
With the above lines of code, my init script causes the cluster to fail with the following error:
Spark Error: Spark encountered an error on startup. This issue can be caused by
invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark
driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark error: Driver down
The logs don't say that any part of my bash script is failing, so I'm assuming that it's just failing to pick up the variable from the environment variables.
Has anyone else dealt with a problem with this? I realize that I could write this information to dbfs, and then read it into the init script, but I'd like to avoid doing that since I'll be passing in access tokens. What other approaches can I try?
Thanks for any help!
This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
Steps to build the Azure monitoring library and configure an Azure Databricks cluster:
Step1: Build the Azure Databricks monitoring library
Step2: Create and configure the Azure Databricks cluster
For more details, refer "Monitoring Azure Databricks".
Hope this helps.
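On the environment-variable side of the question: cluster-scoped environment variables set via spark_env_vars in the cluster spec are exported to init scripts, which may be the missing piece here. A hedged sketch of a Clusters API payload (names are hypothetical; for access tokens, a secret path reference keeps the value out of plain text):

```shell
# Hedged sketch: define WORKSPACE_ID as a cluster-scoped environment variable
# so the init script's ${WORKSPACE_ID} check can see it. Cluster spec values
# and the secret scope/key are hypothetical.
PAYLOAD=$(cat <<EOF
{
  "cluster_name": "monitored-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "spark_env_vars": {
    "WORKSPACE_ID": "{{secrets/monitoring/log-analytics-workspace-id}}"
  }
}
EOF
)

# curl -X POST "https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/create" \
#      -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
#      -d "${PAYLOAD}"
echo "${PAYLOAD}"
```

The secret-reference form means the token is resolved at cluster start rather than stored in the cluster configuration, which addresses the concern about writing tokens to DBFS.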

How do I install Custom Jar in HDInsight

I am new to Hadoop/HDInsight.
I have followed the steps here to create the jar package of the SerDe. After the package json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar is created, the post says that I need to upload it to the head-node.
Does it mean that I have to RDP into the HDInsight VM and then manually upload the file?
If I don't have remote connection enabled to that VM what else can I do?
PS: The HDInsight Cluster is already provisioned.
You don't have to add it to the head-node for HDInsight. If you upload the jar to the storage account associated with your cluster, you can access it using the add jar command used in your example.
add jar wasb://<containername>@<storageaccount>.blob.core.windows.net/<jarfolder>/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
For example:
add jar wasb://datacontainer@andrewsstorage.blob.core.windows.net/myjars/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
This is a more scalable approach because the jar will remain after the HDI cluster is destroyed.
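The upload itself needs no RDP session either; a hedged sketch using the Azure CLI (account/container names reuse the example above, and the blob path is an assumption):

```shell
# Hedged sketch: upload the SerDe jar to the cluster's storage account with
# the Azure CLI, then reference it from Hive via its wasb:// URI. Account and
# container names follow the example above; the blob path is hypothetical.
STORAGE_ACCOUNT="andrewsstorage"
CONTAINER="datacontainer"
JAR="json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar"

# az storage blob upload \
#   --account-name "${STORAGE_ACCOUNT}" \
#   --container-name "${CONTAINER}" \
#   --name "myjars/${JAR}" \
#   --file "./${JAR}"

ADD_JAR="add jar wasb://${CONTAINER}@${STORAGE_ACCOUNT}.blob.core.windows.net/myjars/${JAR};"
echo "${ADD_JAR}"
```

Run the printed add jar statement in your Hive session and the SerDe is available without ever touching the head-node.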
