Automate HDInsight Spark provisioning and submit jobs on a schedule? - azure

I want to make an automated Spark job submission system/program.
Of course, the system needs to provision HDInsight first before submitting Spark jobs.
Also, the system should submit Spark jobs on a schedule (e.g. submit job1 at 7 PM and job2 at 9 PM).
What is the best way to achieve this?
cf. What I can already do:
Provision HDInsight with PowerShell
Submit Spark jobs with Livy

It sounds like Azure Data Factory would fit your needs. From their website:
"Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically (hourly, daily, weekly etc.)."
Resources:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-faq

It sounds like you want to run your Spark jobs automatically on a schedule, so I think Oozie is a good fit for your scenario. Please refer to the official Azure tutorial for Windows or Linux to learn the concepts behind Oozie. The tutorial Use time-based Oozie coordinator with Hadoop in HDInsight to define workflows and coordinate jobs shows how to do this with a time trigger. As a reference, a Hortonworks thread shows the detailed steps for running a Spark job from an Oozie workflow on HDP (Azure HDInsight is based on HDP).
Hope it helps.
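If you end up scripting the coordinator submission rather than using the Oozie CLI, a rough Python sketch of posting the job configuration to Oozie's REST API could look like the following. The Oozie endpoint, storage path, and user name below are placeholders, not values from your setup:

    import requests

    # Sketch only: submit a time-based Oozie coordinator through Oozie's REST API.
    # The coordinator.xml in oozie.coord.application.path defines the time trigger.
    OOZIE_URL = "http://headnodehost:11000/oozie/v1/jobs"  # assumed Oozie endpoint

    config_xml = """<?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <property><name>user.name</name><value>sshuser</value></property>
      <property><name>oozie.use.system.libpath</name><value>true</value></property>
      <property>
        <name>oozie.coord.application.path</name>
        <value>wasb:///example/coordinator</value>
      </property>
    </configuration>"""

    resp = requests.post(OOZIE_URL, data=config_xml,
                         headers={"Content-Type": "application/xml;charset=UTF-8"})
    resp.raise_for_status()
    print("Submitted coordinator:", resp.json()["id"])

The coordinator.xml referenced by oozie.coord.application.path is where the schedule (frequency, start/end times) is defined, as the linked tutorial describes.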

You can use the .NET SDK or PowerShell to automate the provisioning of the HDInsight cluster.
I would use Livy to submit Spark jobs, as explained here.
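A minimal sketch of the Livy submission from Python, assuming the standard HDInsight Livy endpoint, the cluster login for basic auth, and placeholder jar/class names:

    import requests

    # Sketch: submit a Spark batch job to an HDInsight cluster through Livy.
    LIVY_URL = "https://mycluster.azurehdinsight.net/livy/batches"  # cluster name is a placeholder
    AUTH = ("admin", "<cluster-login-password>")                    # cluster login, basic auth

    payload = {
        "file": "wasb:///example/jars/job1.jar",   # jar in the cluster's default storage
        "className": "com.example.Job1",
        "args": ["2019-01-01"],
        "conf": {"spark.executor.instances": "2"},
    }

    resp = requests.post(LIVY_URL, json=payload, auth=AUTH)
    resp.raise_for_status()
    print("Livy batch id:", resp.json()["id"], "state:", resp.json()["state"])

Whatever triggers this on a schedule (ADF, an Oozie coordinator, a cron job, etc.) only needs to reach the cluster gateway and can poll GET /livy/batches/{id} for the job state.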

Related

How can I call a script on a Microsoft HDInsight cluster delete event

I want to execute a script on an HDInsight cluster before or after it is deleted.
This should get called whether I delete the cluster using the Azure web UI or the cluster Delete API.
Using a script action I can execute a script, but only at cluster creation time, not at cluster deletion time.
Have you thought about using the HDInsight REST API https://learn.microsoft.com/en-us/rest/api/hdinsight/scriptactions/clusters/executescriptactions#examples and then automating it with an ADF Web activity?
Or use PowerShell https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#apply-a-script-action-to-a-running-cluster-from-azure-powershell and automate it with an ADF Custom activity https://learn.microsoft.com/en-us/answers/questions/106584/executing-powershell-script-throuh-azure-data-fact.html
You can then delete the cluster using the REST API: https://learn.microsoft.com/en-us/rest/api/hdinsight/clusters/delete
Or PowerShell: https://learn.microsoft.com/en-us/powershell/module/azurerm.hdinsight/remove-azurermhdinsightcluster?view=azurermps-6.13.0
Note: you should have the cluster name handy (the above steps might get complex for an on-demand cluster).
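If you go the REST route, a rough Python sketch of the two calls. It assumes a service principal picked up by azure-identity; the subscription, resource group, cluster name, script URI, and api-version are placeholders and may need adjusting:

    import requests
    from azure.identity import DefaultAzureCredential

    SUB, RG, CLUSTER = "<subscription-id>", "<resource-group>", "<cluster-name>"
    BASE = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
            f"/providers/Microsoft.HDInsight/clusters/{CLUSTER}")
    API = "?api-version=2018-06-01-preview"  # check the current api-version for your environment

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    headers = {"Authorization": f"Bearer {token}"}

    # 1. Run the "pre-delete" script action on the head nodes of the running cluster.
    script_body = {
        "scriptActions": [{
            "name": "pre-delete-cleanup",
            "uri": "https://mystorage.blob.core.windows.net/scripts/cleanup.sh",
            "roles": ["headnode"],
        }],
        "persistOnSuccess": False,
    }
    requests.post(BASE + "/executeScriptActions" + API,
                  json=script_body, headers=headers).raise_for_status()

    # 2. Delete the cluster (a long-running operation; the API responds with 202 Accepted).
    requests.delete(BASE + API, headers=headers).raise_for_status()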

Azure Data Factory: how to get output from a Scala (jar) job?

We have an Azure Data Factory pipeline, and one step is a jar job that should return output used in the next steps.
It is possible to get output from a notebook with dbutils.notebook.exit(....)
I need a similar feature to retrieve output from the main class of the jar.
Thanks!
Image of my pipeline
Actually, as far as I know there is no built-in feature to execute a jar job directly. However, you could implement it easily with the Azure Databricks service.
Two ways in the Azure Databricks workspace:
If your jar is an executable jar, then just use Set JAR, which lets you set the main class and parameters.
Alternatively, you could use a notebook to execute dbutils.notebook.exit(....) or something else.
Back in ADF, there is a Databricks activity and you can get its output for the next steps. Any concerns, please let me know.
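For the notebook route, a minimal sketch (the activity name and keys below are made up for illustration): the notebook serializes what it wants to return and exits with it, and ADF surfaces that value as the Databricks Notebook activity's output.

    # Runs inside a Databricks notebook (Python) called by an ADF Databricks Notebook activity.
    import json

    result = {"rowsProcessed": 1234, "outputPath": "dbfs:/mnt/out/run1"}
    dbutils.notebook.exit(json.dumps(result))   # becomes the activity's runOutput in ADF

A downstream activity can then reference the value with an expression such as @activity('MyNotebookActivity').output.runOutput.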
Updates:
As far as I know, there is no feature similar to dbutils.notebook.exit(....) in the Jar activity. For now I can only offer a workaround: inside the jar execution, store the parameters in a specific file that resides in (for example) blob storage, then use a Lookup activity after the Jar activity to get the params for the next steps; see the sketch below.
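Shown here as Databricks Python for brevity (a Scala jar would write the same file through the Hadoop FileSystem API); the mount point and file name are placeholders:

    import json

    # Workaround sketch: persist the values as a small JSON file in blob storage
    # so that an ADF Lookup activity can pick them up after the Jar activity.
    params = {"status": "ok", "rowsWritten": 42}
    dbutils.fs.put("/mnt/output/jar-job/params.json", json.dumps(params), True)  # overwrite=True

In the pipeline, a Lookup activity over a JSON dataset pointing at params.json then exposes the values to later activities, e.g. via @activity('LookupParams').output.firstRow.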
Updates at 1.21.2020
Got some updates from MSFT in the github link: https://github.com/MicrosoftDocs/azure-docs/issues/46347
Sending output is a feature that only notebooks support for notebook workflows and not jar or python executions in databricks. This should be a feature ask for databricks and only then ADF can support it. I would recommend you to submit this as a product feedback on Azure Databricks feedback forum.
It seems that output from jar execution is not supported by Azure Databricks, and ADF naturally only supports the features Azure Databricks exposes. You could push for progress on this by contacting the Azure Databricks team. I have shared everything I know here.

How to submit a custom Spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (the fs could be HDFS, ADLS or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, since I was able to access the nodes. I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system, ADFS. I can add support for this file system, but will I then be able to deploy and run my application as I did on the HDInsight cluster? If not, is there a way to submit jobs from an edge node, my HDInsight cluster, or any other on-prem cluster to the Azure Databricks cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to spark-submit just like on HDInsight.
The Databricks file system is DBFS; ABFS is used for Azure Data Lake. You should not need to modify your application for these - the file paths will be handled by Databricks.
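For example, a one-off run of an existing jar can be submitted through the Jobs API (runs/submit). A rough Python sketch, where the workspace URL, token, runtime version, node type, and jar path are placeholders:

    import requests

    # Sketch: submit an existing jar as a one-off run via the Databricks Jobs API 2.0.
    WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
    TOKEN = "<databricks-personal-access-token>"

    run_spec = {
        "run_name": "analytics-report",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",   # pick a runtime available in your workspace
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        "libraries": [{"jar": "dbfs:/jars/analytics.jar"}],
        "spark_jar_task": {
            "main_class_name": "com.example.ReportJob",
            "parameters": ["--input", "abfss://data@mystorage.dfs.core.windows.net/raw"],
        },
    }

    resp = requests.post(f"{WORKSPACE}/api/2.0/jobs/runs/submit",
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         json=run_spec)
    resp.raise_for_status()
    print("run_id:", resp.json()["run_id"])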

CDAP with Azure Databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but is there a way to configure CDAP to point to a Databricks Spark cluster - is that even possible? Or does this kind of integration need a specific Databricks client connector jar? If anyone has any insights, that would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new cloud runtime that is capable of submitting jobs to a Databricks Spark cluster. Here is an example of how to write a runtime extension for Cloud Dataproc and EMR.

HDInsight query console Job History

I am new to Microsoft Azure. I created a trial account on Azure, installed Azure PowerShell, and submitted the default word-count MapReduce program; it works fine and I am able to see the results in PowerShell. Now, when I open the query console of my cluster in the HDInsight tab, the job history is empty. What am I missing here? Where can I view the job results in Azure?
The query console does not display MapReduce jobs, only Hive jobs. You can see the history of all jobs by using the PowerShell cmdlet Get-AzureHDInsightJob, which will return all of them.
The query console is designed to submit/monitor Hive jobs, so I think it only shows job history for Hive jobs. You can see the job result using the HDInsight PowerShell SDK, though.
