I need to know how to configure an Azure Databricks cluster so that it stops automatically when a job runs indefinitely (without stopping it manually), and also how to create an email alert that fires when the job's running time exceeds its usual duration.
You can do this in the Jobs UI: select your job, then under Advanced, edit the Alerts and Timeout values.
This Databricks docs page may help you: https://docs.databricks.com/jobs.html
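If you prefer to configure the same thing programmatically, here is a minimal sketch against the Jobs API 2.0 endpoint jobs/reset; the workspace URL, token, job ID, and email address are placeholders, and note that jobs/reset overwrites all of a job's settings:

```python
import requests

# Placeholders: substitute your workspace URL, a personal access token,
# and the numeric ID of the job you want to configure.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
JOB_ID = 123

# timeout_seconds makes Databricks kill any run that exceeds the limit;
# email_notifications.on_failure sends the alert when that happens.
resp = requests.post(
    f"{HOST}/api/2.0/jobs/reset",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "name": "my-job",
            "timeout_seconds": 3600,  # e.g. fail runs after 1 hour
            "email_notifications": {"on_failure": ["alerts@example.com"]},
            # jobs/reset replaces ALL settings, so the job's existing
            # task/cluster configuration must be included here as well.
        },
    },
)
resp.raise_for_status()
```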
Does anyone have a way to monitor a group of job clusters in Azure Databricks?
We just want to make sure the job clusters are up and running, perhaps with a dashboard or workbook in Azure that shows red or green depending on the status of each job cluster.
We have NRT (near-real-time) interfaces pulling data from a source application via these job clusters and would like to see when they are down. We already get an alert when the service goes down, but having a panel where we can see these interfaces would be really useful. Perhaps something that makes use of an API call would be needed, unless there is something out of the box like those Ganglia reports, but I haven't seen anything close to what I'm looking for.
Thanks in advance for any answer you may provide.
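To illustrate the API-call idea mentioned above, here is a minimal polling sketch against the Databricks Clusters API (GET /api/2.0/clusters/list), which returns each cluster's lifecycle state; the workspace URL and token are placeholders, and the green/red mapping is just one possible convention:

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# List all clusters in the workspace and report their lifecycle state.
# A cluster's "state" is e.g. RUNNING, PENDING, TERMINATED, or ERROR,
# which maps naturally onto a green/red dashboard tile.
resp = requests.get(
    f"{HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    healthy = cluster["state"] == "RUNNING"
    print(f'{cluster["cluster_name"]}: {cluster["state"]} '
          f'({"green" if healthy else "red"})')
```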
I want to automatically start a job on an Azure Batch AI cluster once a week. The jobs are all identical except for their starting time. I thought of writing a PowerShell Azure Function to do this, but Azure Functions v2 doesn't support PowerShell, and I don't want to use v1 in case it gets phased out. I would prefer not to do this in C# or Java. How can I do this?
Currently, there's no built-in option to trigger a job on an Azure Batch AI cluster. You may want to run a shell script that sets up a recurring schedule using the system's task scheduler instead. Please see if this doc by Said Bleik helps:
https://github.com/saidbleik/batchai_mm_ad#scheduling-jobs
I assume this way you can add multiple schedules for the job.
The Azure Batch portal has a "Job schedules" tab. You can go there, add a job, and set a schedule for it, specifying the recurrence in the schedule.
Scheduled jobs
Job schedules enable you to create recurring jobs within the Batch service. A job schedule specifies when to run jobs and includes the specifications for the jobs to be run. You can specify the duration of the schedule (how long and when the schedule is in effect) and how frequently jobs are created during the scheduled period.
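As a hedged sketch of the same thing in code, using the azure-batch Python SDK (class and parameter names below are from the older azure-batch package and may differ between versions; the account details and pool ID are placeholders):

```python
from datetime import timedelta

from azure.batch import BatchServiceClient, models
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholders: substitute your Batch account name, key, and URL.
credentials = SharedKeyCredentials("<account-name>", "<account-key>")
client = BatchServiceClient(
    credentials,
    batch_url="https://<account>.<region>.batch.azure.com",  # "base_url" in older SDKs
)

# A job schedule that creates a new job once a week on the given pool.
schedule = models.JobScheduleAddParameter(
    id="weekly-training-job",
    schedule=models.Schedule(recurrence_interval=timedelta(weeks=1)),
    job_specification=models.JobSpecification(
        pool_info=models.PoolInformation(pool_id="<pool-id>"),
    ),
)
client.job_schedule.add(schedule)
```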
I have a pyspark job that I submit to a standalone Spark cluster - this is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up after a few minutes and become available.
We have a @timeout decorator on the main part of the Spark job that times out and raises an error when a certain time threshold is exceeded (put in place because some jobs were hanging). The issue is that sometimes a job may not have actually started yet because it is waiting on resources, yet the @timeout function is evaluated and the job errors out as a result.
So I'm wondering: is there any way to tell from within the application itself, with code, whether the job is waiting for resources?
To know the status of the application, you need to access the Spark job history server, from which you can get the current status of the job.
You can solve your problem as follows:
Get the application ID of your job through sc.applicationId.
Then use this application ID with the Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server REST APIs at link.
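For illustration, a minimal sketch of those two steps, assuming the driver UI (port 4040) or the history server (typically port 18080) is reachable; the host is a placeholder:

```python
import requests
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
app_id = sc.applicationId  # e.g. "app-20240101120000-0000"

# Query the monitoring REST API for this application's jobs. Against a
# running app this is the driver UI on port 4040; against a finished
# app it is the history server, typically on port 18080.
host = "<driver-or-history-host>"
resp = requests.get(f"http://{host}:4040/api/v1/applications/{app_id}/jobs")
resp.raise_for_status()

for job in resp.json():
    # "status" is one of RUNNING, SUCCEEDED, FAILED, UNKNOWN. An empty
    # job list suggests no job has started yet, e.g. the application is
    # still waiting on resources.
    print(job["jobId"], job["status"])
```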
I am running Spark jobs on an EC2 cluster, and I have a trigger that submits jobs periodically. I do not want to submit a job if one is already running on the cluster. Is there any API that can give me this information?
Spark, and by extension Spark Streaming, offers an operational REST API at http://<host>:4040/api/v1.
Consulting the status of the current application will give you the information you are looking for.
Check the documentation: https://spark.apache.org/docs/2.1.0/monitoring.html#rest-api
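As a hedged sketch, the /applications endpoint accepts a ?status=running filter, so the trigger could check it before submitting (the host is a placeholder):

```python
import requests

def job_already_running(host="<host>", port=4040):
    """Return True if the Spark monitoring API reports a running app."""
    try:
        # ?status=running filters the application list to active apps.
        resp = requests.get(f"http://{host}:{port}/api/v1/applications",
                            params={"status": "running"}, timeout=5)
        resp.raise_for_status()
        return len(resp.json()) > 0
    except requests.RequestException:
        # If the driver UI is not reachable at all, no application is
        # running there, so it is safe to submit.
        return False

if not job_already_running():
    print("No job running - safe to submit.")
```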
Alternatively, you can consult the UI to see the status. For example, if you run locally, take a look at localhost:4040.
For some reason, the cluster sometimes seems to misbehave, and I suddenly see a surge in the number of YARN jobs. We are using a Linux-based HDInsight Hadoop cluster, and we run Azure Data Factory jobs that execute Hive scripts pointing to this cluster. Generally, the average number of YARN apps at any given time is around 50 running and 40-50 pending. No one uses this cluster for ad-hoc query execution.

But once every few days we notice something weird. Suddenly the number of YARN apps starts increasing, both running and pending, but especially pending. So this number goes over 100 for running YARN apps, and for pending it is more than 400 or sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster.

It may be possible that for some time the cluster's response time is delayed (the Hive component especially), but in that case, if ADF keeps retrying a failing slice several times, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run them when it can? That's probably the only explanation for why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job that is the actual Hive script you are trying to run. The YARN queue can get deadlocked if there are so many parent jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, so no work actually gets done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run, slowly draining the queue.
As the correct long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
To create the queue, you can use Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced webhcat-site, and update the config value there.
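For reference, a sketch of what the resulting webhcat-site entry would look like; the queue name joblauncher is just an illustrative value and must match the queue you created in step (1):

```xml
<property>
  <name>templeton.hadoop.queue.name</name>
  <!-- Example queue name; must match the YARN queue created
       for the Templeton parent jobs -->
  <value>joblauncher</value>
</property>
```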
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure