How do I install Custom Jar in HDInsight - azure-hdinsight

I am new to Hadoop/HDInsight.
I have followed the steps here to create the jar package of the SerDe. After the package json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar is created, the post says that I need to upload it to the head node.
Does it mean that I have to RDP into the HDInsight VM and then manually upload the file?
If I don't have remote connection enabled to that VM what else can I do?
PS: The HDInsight Cluster is already provisioned.

You don't have to add it to the head-node for HDInsight. If you upload the jar to the storage account associated with your cluster, you can access it using the add jar command used in your example.
add jar wasb://<containername>@<storageaccount>.blob.core.windows.net/<jarfolder>/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
For example:
add jar wasb://datacontainer@andrewsstorage.blob.core.windows.net/myjars/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar;
This is a more scalable approach, because the jar remains in storage even after the HDInsight cluster is destroyed.
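To get the jar into that container in the first place you also don't need RDP: any blob upload tool works (Azure Storage Explorer, Azure PowerShell, or the Azure CLI). A sketch with the Azure CLI, reusing the placeholder names from the example above and assuming you have the storage account key at hand:
az storage blob upload --account-name andrewsstorage --account-key <storage-key> --container-name datacontainer --name myjars/json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar --file json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar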

Related

Extracting Spark logs (Spark UI contents) from Databricks

I am trying to save Apache Spark logs (the contents of Spark UI), not necessarily stderr, stdout and log4j files (although they might be useful too) to a file so that I can send it over to someone else to analyze.
I am following the manual described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere; you can display them in the web UI but cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?
The simplest way of doing it is the following:
You create a separate storage account with a container in it (or a separate container in an existing storage account) and give developers access to it
You mount that container into the Databricks workspace
You configure clusters/jobs to write logs to the mount location (you can enforce this for new objects using cluster policies). This creates sub-directories with the cluster name, containing the driver and executor logs plus the output of any init scripts. A sketch of the mount and log configuration follows this list.
(Optional) You can set up a retention policy on that container to automatically remove old logs.
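As a minimal sketch of the mount and log configuration, assuming the container's account key is stored in a Databricks secret scope (the storage account, container, mount point, and secret names below are placeholders):
# Run once from a notebook: mount the log container at a fixed DBFS path
dbutils.fs.mount(
  source = "wasbs://cluster-logs@mystorageaccount.blob.core.windows.net",
  mount_point = "/mnt/cluster-logs",
  extra_configs = {
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
      dbutils.secrets.get(scope = "logs", key = "storage-account-key")
  }
)
The cluster's log destination can then be pointed at dbfs:/mnt/cluster-logs, either in the cluster UI (Advanced Options > Logging) or through the cluster_log_conf field of the Clusters/Jobs API, e.g. "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/cluster-logs"}}.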

Spark cluster on Kubernetes without spark-submit

I have a Spark application and want to deploy it on a Kubernetes cluster.
Following the documentation below, I have managed to create an empty Kubernetes cluster, generate a Docker image using the Dockerfile provided under kubernetes/dockerfiles/spark/Dockerfile, and deploy it to the cluster using spark-submit in a dev environment.
https://spark.apache.org/docs/latest/running-on-kubernetes.html
However, in a 'proper' environment we have a managed Kubernetes cluster (bespoke, unlike EKS etc.) and will have to provide pod configuration files to get it deployed.
I believe you can supply a pod template file as an argument to the spark-submit command.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template
How can I do this without spark-submit? And are there any example yaml files?
PS: we have limited access to this cluster, e.g. we can install Helm charts but not an operator or controller.
You could try the Kubernetes Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and provide the pod configuration through its SparkApplication CRD.
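Since the question asks for example YAML, a minimal SparkApplication manifest for that operator might look like the following (a sketch based on the schema documented in the operator's repository; image, namespace, versions, and the jar path are placeholders):
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-app
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 512m
Note that the operator is normally installed via its Helm chart and runs as a controller in the cluster, so check whether that fits the access constraints mentioned in the question.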

How to download an installed dbfs jar file from databricks cluster to local machine?

I am new to Databricks and I wish to download an installed library of a databricks cluster to my local machine. Could you please help me with that?
To elaborate: I already have a running cluster on which libraries are installed. I need to download some of those libraries (which are DBFS jar files) to my local machine. I have been trying to use the dbfs cp command through the databricks-cli, but it is not working: it gives no error, but it does not do anything either. I hope that clears things up a bit.
Note: When you install libraries via Jars, Maven, or PyPI, they are located under dbfs:/FileStore.
For interactive clusters, jars are located at dbfs:/FileStore/jars
For automated (job) clusters, jars are located at dbfs:/FileStore/job-jars
There are a couple of ways to download an installed DBFS jar file from a Databricks cluster to a local machine.
GUI Method: You can use DBFS Explorer
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks.
You will need to create a bearer token in the web interface in order to connect.
Step1: Download DBFS Explorer from https://datathirst.net/projects/dbfs-explorer and install it.
Step2: How to create a bearer token?
Click the user profile icon in the upper right corner of your Databricks workspace.
Click User Settings.
Go to the Access Tokens tab.
Click the Generate New Token button.
Note: Copy the generated token and store in a secure location.
Step3: Open DBFS Explorer, enter the host URL and bearer token, and continue.
Step4: Navigate to the DBFS folder FileStore => jars, select the jar you want to download, click download, and choose a folder on the local machine.
CLI Method: You can use Databricks CLI
Step1: Install the Databricks CLI and configure it with your Databricks credentials.
Step2: Use the dbfs cp command to copy files to and from DBFS.
Syntax: dbfs cp <SOURCE> <DESTINATION>
Example: dbfs cp "dbfs:/FileStore/azure.txt" "C:\Users\Name\Downloads\"
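If dbfs cp completes without errors but nothing shows up locally, it can help to first list the jar names and then copy a single file to an explicit destination filename, for example (assuming the legacy databricks-cli; the jar name is a placeholder):
dbfs ls dbfs:/FileStore/jars
dbfs cp dbfs:/FileStore/jars/my_library.jar ./my_library.jar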

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide, section 3.6.5.3 "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob Storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wits' end.
An Azure Blob Storage hook (or any hook, for that matter) tells Airflow how to write to Azure Blob Storage. One is already included in recent versions of Airflow: wasb_hook.
You will need to make sure that the hook's connection is able to write to Azure Blob Storage, and that REMOTE_BASE_LOG_FOLDER starts with wasb (e.g. a name like wasb-xxx) so Airflow picks the WASB log handler. Once you take care of these two things, the instructions work without a hitch.
I achieved writing logs to blob storage using the steps below:
Create folder named config inside airflow folder
Create empty __init__.py and log_config.py files inside config folder
Find airflow_local_settings.py on your machine; it is typically at
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
run
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the [core] section of airflow.cfg:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add the log_sync connection object (a sketch is given after these steps)
Install the airflow azure dependency
pip install apache-airflow[azure]
Restart webserver and scheduler
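The original answer does not show the log_sync connection itself. One way (an illustrative sketch, assuming the wasb hook's convention of login = storage account name and password = account key) is to create it in the Airflow UI under Admin > Connections with connection type wasb, or programmatically:
# Hypothetical one-off script, run inside the Airflow environment,
# to create the log_sync connection referenced by remote_log_conn_id.
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="log_sync",       # must match remote_log_conn_id in airflow.cfg
    conn_type="wasb",
    login="storage-account",  # storage account name (placeholder)
    password="<account-key>", # storage account key (placeholder)
)
session = settings.Session()
session.add(conn)
session.commit()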

How to run oozie jobs in HDInsight cluster?

I have an Oozie workflow that I'd like to run on an HDInsight cluster. My job has a jar file as well as a workflow.xml file that I store on Azure blob storage. However, the only way I found to store the job.config file is on the local storage of the HDInsight headnode. My concern is what happens when the VM gets re-imaged: does it remove my job.config file?
In general, you can use Script Actions on HDInsight. Script actions perform customization on the HDInsight clusters during provisioning. So every time the cluster is created, the scripts will be run. (You were smart to be concerned about what happens when the cluster is re-created!)
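For the job.config file in particular, a script action could copy it back from blob storage onto the headnode every time the cluster is (re)created. A minimal sketch, assuming a Linux-based cluster and placeholder paths:
#!/bin/bash
# Hypothetical script action: pull the Oozie job.config from the cluster's
# default storage container onto the headnode at provision time.
hadoop fs -copyToLocal wasb:///oozie/job.config /home/sshuser/job.config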
The advanced configuration options documentation shows how to customize an HDInsight cluster during provisioning using PowerShell. There is an oozie section:
# oozie-site.xml configuration
$OozieConfigValues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightOozieConfiguration'
$OozieConfigValues.Configuration = @{ "oozie.service.coord.normal.default.timeout"="150" } # default 120
Does that help?
Other resources:
Customizing HDInsight Cluster provisioning
Oozie tutorial on HDInsight
