When I submit an HDInsight Spark job using IntelliJ IDEA Community, I get the following error:
Failed to submit application to spark cluster.
Exception : Forbidden. Attached Azure DataLake Store is not supported in Automated login model.
Please logout first and try Interactive login model
The exception is thrown when you select the Automated option in the Azure Sign In dialog and submit a Spark job to a cluster whose storage is Azure Data Lake Store. Please use the Interactive option for such clusters.
The Automated sign-in option is only supported for clusters backed by Azure Blob Storage.
You could try the following steps:
Sign out from Azure Explorer first
Sign in with the Interactive option
Select the Spark cluster backed by Azure Data Lake Store in the Spark job submission dialog and submit the job.
Refer to https://learn.microsoft.com/en-us/azure/azure-toolkit-for-intellij-sign-in-instructions for more instructions.
[Update]
If your account has no permission to access that Azure Data Lake Store, the same exception will be thrown.
Refer to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-security-overview
The compiled Spark job will be uploaded to the ADLS folder adl://<adls>.azuredatalakestore.net/<cluster attached folder>/SparkSubmission/**, so the user needs write permission on that folder. Ask your admin to check your role access.
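If you want to check this yourself, a minimal sketch using the azure-datalake-store package could look like the following; the store name, service principal credentials and probe path are placeholders, not values from the error above:

```python
# Sketch: probe whether the current identity can write under the cluster's
# SparkSubmission folder on ADLS Gen1. All credentials/paths are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<app-secret>")
adl = core.AzureDLFileSystem(token, store_name="<adls>")

probe = "<cluster attached folder>/SparkSubmission/_access_probe"
try:
    with adl.open(probe, "wb") as f:   # fails here if the account lacks write access
        f.write(b"ok")
    adl.rm(probe)
    print("Write access OK")
except Exception as err:               # typically a permission/authorization error
    print("No write access:", err)
```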
I have a requirement to develop an application in Python. The application will connect to any database and execute SQL statements against it. It can also connect to a Databricks instance and query the tables there.
The requirement is that the application should be platform independent, so it is written to trigger the Spark-specific code only when it runs on Databricks; when it runs on a standalone node, that code is skipped. The program also interacts with Azure Blob Storage to access some files/folders. The application is deployed to the standalone node/Databricks as a wheel.
The issue here is with custom logging. I have implemented custom logging in the application, and there are two scenarios depending on where it runs:
Standalone Node
Databricks Cluster.
If the code runs on the standalone node, the custom log is initially written to a local OS folder and, after the application completes successfully or fails, it is moved to Azure Blob Storage. If moving the log file to Azure Storage fails for some reason, it is still available in the standalone node's local file system.
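For context, the upload step is essentially this pattern (a simplified sketch; the connection string, container and blob names below are placeholders):

```python
# Sketch of the "write locally, then move to Blob Storage" step described above.
from azure.storage.blob import BlobServiceClient

def upload_log(local_path: str, blob_name: str) -> None:
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    blob = service.get_blob_client(container="app-logs", blob=blob_name)
    with open(local_path, "rb") as fh:
        blob.upload_blob(fh, overwrite=True)   # replace any earlier copy of this log

# Called after the application finishes (success or failure), e.g.:
# upload_log("/tmp/app.log", "run-2021-01-01/app.log")
```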
If the same approach is followed on Databricks and the application fails to upload the log file to Blob Storage, we cannot recover it because the Databricks OS storage is volatile. I tried writing the log to DBFS, but it doesn't allow appending.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs? As I mentioned, the application is deployed as a wheel and contains very limited Spark code.
Is there a way to get the application logs from Databricks? Is there a possibility that Databricks can record my job execution and store the logs?
I think you are able to do that now, but once the cluster is shut down (to minimize cost) the logs will be gone. Thanks for sharing that logs in DBFS cannot be appended to; I was not aware of that.
Is your standalone application open to the internet? If yes, then maybe you can explore writing the logs to Azure Event Hubs. You can write to Event Hubs from both Azure Databricks and the standalone application, and then land the events in Blob Storage etc. for further visualization.
This tutorial should get you started:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send
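As a starting point, something along these lines should work with the azure-eventhub package (the connection string and event hub name are placeholders):

```python
# Minimal sketch: send log lines to Azure Event Hubs so they survive cluster shutdown.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hub-namespace-connection-string>", eventhub_name="app-logs")

with producer:
    batch = producer.create_batch()
    batch.add(EventData("2021-01-01 10:00:00 INFO job started"))    # sample log line
    batch.add(EventData("2021-01-01 10:05:00 INFO job finished"))
    producer.send_batch(batch)
```

From there you can land the events in Blob Storage (for example with Event Hubs Capture) and inspect them later.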
HTH
I want to execute a script on an HDInsight cluster before or after it is deleted.
The script should be invoked whether I delete the cluster from the Azure web UI or through the cluster Delete API.
Using a script action I can execute a script, but only at cluster creation time, not at cluster deletion time.
Have you thought about using the HDInsight REST API https://learn.microsoft.com/en-us/rest/api/hdinsight/scriptactions/clusters/executescriptactions#examples and then automating it with ADF using a Web activity?
Or using PowerShell https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#apply-a-script-action-to-a-running-cluster-from-azure-powershell and automating it with an ADF Custom activity https://learn.microsoft.com/en-us/answers/questions/106584/executing-powershell-script-throuh-azure-data-fact.html
You can then delete the cluster using the REST API: https://learn.microsoft.com/en-us/rest/api/hdinsight/clusters/delete
Or PowerShell: https://learn.microsoft.com/en-us/powershell/module/azurerm.hdinsight/remove-azurermhdinsightcluster?view=azurermps-6.13.0
Note: You should have the cluster name handy (the above steps might get complex for an on-demand cluster).
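As an illustration, here is a rough Python sketch of calling those two REST endpoints; the subscription, resource group, cluster name, script URI and API version are placeholders you would adjust:

```python
# Sketch: run a script action on the cluster, then delete the cluster via the ARM REST API.
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, CLUSTER = "<subscription-id>", "<resource-group>", "<cluster-name>"
BASE = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
        f"/providers/Microsoft.HDInsight/clusters/{CLUSTER}")
PARAMS = {"api-version": "2018-06-01-preview"}   # use whichever API version you target

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# 1) Execute the "pre-delete" script action on the running cluster.
script_body = {
    "scriptActions": [{
        "name": "pre-delete-cleanup",                    # hypothetical name
        "uri": "https://<storage>/scripts/cleanup.sh",   # hypothetical script location
        "roles": ["headnode", "workernode"],
    }],
    "persistOnSuccess": False,
}
requests.post(f"{BASE}/executeScriptActions", params=PARAMS,
              headers=headers, json=script_body).raise_for_status()

# 2) Delete the cluster (in practice, poll the script execution status first,
#    since executeScriptActions runs asynchronously).
requests.delete(BASE, params=PARAMS, headers=headers).raise_for_status()
```

Both calls can be triggered from an ADF Web activity as suggested above.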
I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (the fs could be HDFS, ADLS or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster because I was able to access the nodes: I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system, ADFS. I could add support for this file system too, but will I then be able to deploy and run my application the way I did on the HDInsight cluster? If not, is there a way I can submit jobs from an edge node, my HDInsight cluster or any other on-prem cluster to an Azure Databricks cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to spark-submit just like on HDInsight.
The Databricks file system is DBFS; ABFS is used for Azure Data Lake. You should not need to modify your application for these: the file paths will be handled by Databricks.
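As an illustration of the Runs Submit call, here is a rough sketch against the Jobs API (2.0 shape shown); the workspace URL, token, cluster spec, jar path and class name are all placeholders:

```python
# Sketch: submit an existing jar through the Databricks Jobs "Runs Submit" API,
# the equivalent of: spark-submit --class com.example.ReportJob app.jar
import requests

WORKSPACE = "https://<workspace>.azuredatabricks.net"   # hypothetical workspace URL
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "analytics-report",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",             # any supported runtime
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": ["--class", "com.example.ReportJob",  # hypothetical main class
                       "dbfs:/jars/app.jar"],                # jar uploaded to DBFS
    },
}

resp = requests.post(f"{WORKSPACE}/api/2.0/jobs/runs/submit",
                     headers={"Authorization": f"Bearer {TOKEN}"}, json=payload)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])
```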
I want to build an automated Spark job submission system/program.
Of course, the system needs to provision HDInsight first, before submitting Spark jobs.
The system should also submit Spark jobs on a schedule (e.g. submit job1 at 7 PM, submit job2 at 9 PM).
What is the best way to achieve this?
cf.) What I can already do:
Provision HDInsight with PowerShell
Submit a Spark job with Livy
It sounds like Azure Data Factory would fit your needs. From their website:
"Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically (hourly, daily, weekly etc.)."
Resources:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-faq
It sounds like you want to run your Spark jobs automatically on a schedule, so I think Oozie is a good fit for your scenario; please refer to the official Azure tutorial for Windows or Linux to learn the concepts behind Oozie. Meanwhile, the tutorial Use time-based Oozie coordinator with Hadoop in HDInsight to define workflows and coordinate jobs shows how to do it with a time trigger. As a reference, a Hortonworks thread shows the steps in detail for running a Spark job from an Oozie workflow on HDP (Azure HDInsight is based on HDP).
Hope it helps.
You can use the .NET SDK or PowerShell to automate the provisioning of the HDInsight instance.
I would use Livy to submit Spark jobs, as explained here.
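For example, a minimal Livy batch submission could look like this; the cluster name, login, jar path and class name are placeholders:

```python
# Sketch: submit a Spark batch to an HDInsight cluster through its Livy endpoint.
import requests

CLUSTER = "https://<clustername>.azurehdinsight.net"
AUTH = ("admin", "<cluster-login-password>")              # cluster HTTP login

batch = {
    "file": "wasbs://<container>@<account>.blob.core.windows.net/jars/job1.jar",
    "className": "com.example.Job1",                      # hypothetical main class
    "args": ["2019-01-01"],
}

resp = requests.post(f"{CLUSTER}/livy/batches", auth=AUTH, json=batch,
                     headers={"X-Requested-By": "admin"})
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch until Livy reports success or failure.
state = requests.get(f"{CLUSTER}/livy/batches/{batch_id}", auth=AUTH).json()["state"]
print("batch", batch_id, "state:", state)
```

A scheduler (ADF, Oozie, or even cron) can then make this call at 7 PM for job1 and 9 PM for job2.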
I am new to Microsoft Azure. I created a trial account on Azure, installed Azure PowerShell, and submitted the default WordCount MapReduce program; it works fine and I am able to see the results in PowerShell. Now when I open the Query Console of my cluster in the HDInsight tab, the job history is empty. What am I missing here? Where can I view the job results in Azure?
The Query Console does not display MapReduce jobs, only Hive jobs. You can see the history of all jobs by using the PowerShell cmdlet Get-AzureHDInsightJob, which will return all of them.
The Query Console is designed to submit/monitor Hive jobs, so I think it only shows job history for Hive jobs. You can see the job result using the HDInsight PowerShell SDK, though.