HDInsight Query Console job history - Azure

I am new to Microsoft Azure. I created a trial account on Azure, installed Azure PowerShell, and submitted the default WordCount MapReduce program; it works fine and I am able to see the results in PowerShell. But when I open the Query Console of my cluster in the HDInsight tab, the job history is empty. What am I missing here? Where can I view the job results in Azure?

The Query Console does not display MapReduce jobs, only Hive jobs. You can see the history of all jobs by using the PowerShell cmdlet Get-AzureHDInsightJob, which will return all of them.

The Query Console is designed to submit and monitor Hive jobs, so I think it only shows job history for Hive jobs. You can see the job result using the HDInsight PowerShell SDK, though.

Related

How to log custom Python application logs in Databricks and move them to Azure

I have a requirement to develop an application in Python. The Python application will interact with any database and execute SQL statements against it. It can also interact with a Databricks instance and query the tables in Databricks.
The requirement is that the Python application should be platform independent, so it is developed in such a way that it triggers the Spark-specific code only when it runs on Databricks; if it is run on a standalone node, it skips that code. The Python application interacts with Azure Blob Storage to access some files/folders, and it is deployed on the standalone node/Databricks as a wheel.
The issue here is with custom logging. I have implemented custom logging in the Python application. There are two scenarios, based on where the application is being run:
Standalone Node
Databricks Cluster.
If the code is run on the standalone node, the custom log is first written to a local OS folder, and after the application completes (successfully or not) it is moved to Azure Blob Storage. If for some reason the move to Azure Storage fails, the log file is still available in the local file system of the standalone node.
If the same approach is followed on Databricks and the application fails to upload the log file to blob storage, we cannot recover it, because the Databricks OS storage is volatile. I tried to write the log to DBFS, but it doesn't allow appending.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs? As I mentioned, the Python application is deployed as a wheel and contains very limited Spark code.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs?
I think you are able to do that now, but once the cluster is shut down (to minimize cost) the logs will be gone. Thanks for sharing that logs in DBFS cannot be appended; I was not aware of that.
Is your standalone application open to the internet? If yes, then you could explore the option of writing the logs to Azure Event Hubs. You can write to Event Hubs from both Azure Databricks and the standalone application, and then land the events in blob storage etc. for further visualization.
This tutorial should get you started: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send
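A minimal sketch of that idea, assuming the azure-eventhub Python package; the connection string, hub name and log lines below are illustrative placeholders:

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details; replace with your Event Hubs namespace values.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="<event-hub-name>",
)

def ship_log_lines(lines):
    """Send application log lines to Event Hubs; the same code runs on
    Databricks or on the standalone node."""
    batch = producer.create_batch()
    for line in lines:
        batch.add(EventData(line))
    producer.send_batch(batch)

ship_log_lines([
    "2020-01-21 19:00:01 INFO job started",
    "2020-01-21 19:05:42 INFO job finished",
])
producer.close()
```

Event Hubs Capture (or a small consumer) can then land those events in blob storage for later inspection.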
HTH

How can I call a script on a Microsoft HDInsight cluster delete event?

I want to execute a script on an HDInsight cluster before or after it is deleted.
The script should get called whether I delete the cluster using the Azure web UI or via the cluster delete API.
Using a script action I can execute a script, but only at the time of cluster creation, not at the time of cluster deletion.
Did you think about using the HDInsight REST API https://learn.microsoft.com/en-us/rest/api/hdinsight/scriptactions/clusters/executescriptactions#examples and then automating it with ADF using a Web activity?
Or using PowerShell https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#apply-a-script-action-to-a-running-cluster-from-azure-powershell and automating it using an ADF Custom activity https://learn.microsoft.com/en-us/answers/questions/106584/executing-powershell-script-throuh-azure-data-fact.html
And then you can delete the cluster using the REST API https://learn.microsoft.com/en-us/rest/api/hdinsight/clusters/delete
or PowerShell https://learn.microsoft.com/en-us/powershell/module/azurerm.hdinsight/remove-azurermhdinsightcluster?view=azurermps-6.13.0
Note: you should have the cluster name handy (the above steps might get complex for an on-demand cluster).
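A rough sketch of the REST route in Python; the names, token acquisition, script URI and api-version below are placeholders, so check the linked docs for the current values:

```python
import requests

# Illustrative placeholders.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
cluster_name = "<cluster-name>"
token = "<AAD bearer token for https://management.azure.com/>"

base = (
    "https://management.azure.com/subscriptions/"
    f"{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.HDInsight/clusters/{cluster_name}"
)
headers = {"Authorization": f"Bearer {token}"}
api_version = "2018-06-01-preview"  # verify against the current REST docs

# 1) Run the pre-delete script on the running cluster via executeScriptActions.
script_body = {
    "scriptActions": [{
        "name": "pre-delete-cleanup",  # hypothetical script action name
        "uri": "https://<account>.blob.core.windows.net/scripts/cleanup.sh",
        "roles": ["headnode", "workernode"],
    }],
    "persistOnSuccess": False,
}
requests.post(
    f"{base}/executeScriptActions?api-version={api_version}",
    json=script_body, headers=headers,
).raise_for_status()

# 2) Then delete the cluster through the same ARM endpoint.
requests.delete(f"{base}?api-version={api_version}", headers=headers).raise_for_status()
```

An ADF Web activity (or any scheduler that can obtain an AAD token) can issue the same two calls.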

Azure Data Factory: how to get output from a Scala (jar) job?

We have an Azure Data Factory pipeline, and one step is a jar job that should return output used in the next steps.
It is possible to get output from a notebook with dbutils.notebook.exit(....).
I need a similar feature to retrieve output from the main class of a jar.
Thanks!
Image of my pipeline
Actually, there is no built-in feature to execute a jar job directly as far as I know. However, you could implement it easily with the Azure Databricks service.
There are two ways in the Azure Databricks workspace:
If your jar is an executable jar, then just use Set JAR, which lets you set the main class and parameters.
Alternatively, you could try to use a Notebook to execute dbutils.notebook.exit(....) or something else.
Back to ADF: ADF has a Databricks activity, and you can get its output for the next steps. Any concerns, please let me know.
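For the notebook route, a minimal sketch of returning a value to ADF; the field names and values are hypothetical:

```python
# Runs inside a Databricks notebook invoked by an ADF Databricks Notebook activity.
# dbutils is only defined in the Databricks runtime.
import json

result = {"status": "ok", "rowsProcessed": 123}  # hypothetical output values

# dbutils.notebook.exit() accepts a string, so serialize structured output as JSON.
dbutils.notebook.exit(json.dumps(result))
```

In the pipeline, the returned string is then available to later activities as @activity('<notebook activity name>').output.runOutput.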
Updates:
There is no feature similar to dbutils.notebook.exit(....) in the Jar activity as far as I know. So far I can only offer a workaround: during the jar execution, store the parameters in a specific file that resides in, for example, blob storage. Then use a Lookup activity after the jar activity to get the parameters for the next steps.
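A rough sketch of that workaround, written in Python for brevity (the jar itself would do the equivalent in Scala); the storage account, container, blob name and payload are placeholders:

```python
import json
from azure.storage.blob import BlobClient

# Placeholder storage details; the job writes this small file at the end of its run.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="adf-exchange",      # hypothetical container
    blob_name="jar-job-output.json",    # hypothetical blob name
)

output_params = {"status": "succeeded", "outputPath": "/curated/2020/01/21"}
blob.upload_blob(json.dumps(output_params), overwrite=True)
```

A Lookup activity with a JSON dataset pointing at that blob then exposes the values to the next steps as @activity('Lookup1').output.firstRow.&lt;field&gt;.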
Update (1.21.2020):
Got some updates from MSFT in this GitHub issue: https://github.com/MicrosoftDocs/azure-docs/issues/46347
Sending output is a feature that only notebooks support for notebook workflows and not jar or python executions in databricks. This should be a feature ask for databricks and only then ADF can support it. I would recommend you to submit this as a product feedback on Azure Databricks feedback forum.
It seems that output from jar execution is not supported by Azure Databricks, and ADF naturally only supports the features Azure Databricks exposes. You could push for progress by contacting the Azure Databricks team; I have shared all I know here.

Submitting an HDInsight Spark job using IntelliJ IDEA fails

When I submit an HDInsight Spark job using IntelliJ IDEA Community, I get this error:
Failed to submit application to spark cluster.
Exception : Forbidden. Attached Azure DataLake Store is not supported in Automated login model.
Please logout first and try Interactive login model
The exception is shown when selecting the Automated option in the Azure Sign In dialog and submitting a Spark job to a cluster whose storage is Azure Data Lake Store. So please use the Interactive option for that cluster.
The Automated login mode is only used for clusters backed by Azure Blob Storage.
You could try the following steps:
Sign out from the Azure Explorer first
Sign in with the Interactive option
Select the Spark cluster with Azure Data Lake Store in the Spark job submission dialog and submit the job.
Refer to https://learn.microsoft.com/en-us/azure/azure-toolkit-for-intellij-sign-in-instructions for more instructions.
[Update]
If your account has no permission to access that Azure Data Lake Store, the same exception will be thrown.
Refer to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-security-overview
The compiled Spark job will be uploaded to the ADLS folder adl://<adls>.azuredatalakestore.net/<cluster attached folder>/SparkSubmission/**, so the user needs permission to write there. You had better ask an admin to check your role access.

Automate HDInsight Spark provisioning and submit jobs on a schedule?

I want to build an automated Spark job submission system/program.
Of course, the system needs to provision HDInsight first before submitting Spark jobs.
Also, the system should submit Spark jobs on a schedule (e.g. submit job1 at 7 PM, submit job2 at 9 PM).
What is the best way to achieve this?
For reference, what I can already do:
Provision HDInsight with PowerShell
Submit Spark jobs with Livy
It sounds like Azure Data Factory would fit your needs. From their website:
"Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically (hourly, daily, weekly etc.)."
Resources:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-faq
It sounds like you want to run your Spark jobs automatically on a schedule, so I think Oozie is very suitable for your current scenario; please refer to the Azure official tutorial for Windows or Linux to learn the concepts behind Oozie. Meanwhile, the tutorial Use time-based Oozie coordinator with Hadoop in HDInsight to define workflows and coordinate jobs introduces how to do it with a time trigger. As a reference, a Hortonworks thread shows the steps in detail for running a Spark job from an Oozie workflow on HDP (Azure HDInsight is based on HDP).
Hope it helps.
You can use the .NET SDK or PowerShell to automate the provisioning of the HDInsight instance.
I would use Livy to submit Spark jobs, as explained here.
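For the Livy part, a minimal sketch using Python and requests; the cluster name, credentials, jar path and class are placeholders:

```python
import requests

# HDInsight exposes Livy at https://<cluster>.azurehdinsight.net/livy
livy_url = "https://<cluster-name>.azurehdinsight.net/livy/batches"
auth = ("admin", "<cluster-login-password>")  # cluster HTTP user credentials

batch = {
    "file": "wasbs://<container>@<account>.blob.core.windows.net/jars/job1.jar",
    "className": "com.example.Job1",        # hypothetical main class
    "args": ["--run-date", "2020-01-21"],   # hypothetical arguments
}

resp = requests.post(
    livy_url,
    json=batch,
    auth=auth,
    headers={"X-Requested-By": "admin"},  # required when Livy CSRF protection is on
)
resp.raise_for_status()
print(resp.json())  # contains the batch id and state, which you can poll later
```

Whatever does the scheduling (ADF, Oozie, or a plain cron/timer job) just needs to issue this call at 7 PM and 9 PM with the right jar and arguments.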
