Databricks Notebook Schedule - azure

I have set up an ADB notebook to run on a schedule. Will the notebook run if the cluster is down? Right now the cluster is busy, so I am unable to stop it and try this out. Will the notebook start the cluster and run, or will it wait for the cluster to be up?

If you schedule the notebook to run on an existing cluster, the cluster will be started if it is stopped. In practice, though, it is better to run the notebook on a new cluster: there is less chance of breaking things if you change a library version or something similar. If you need to speed up job execution, you can look into instance pools.
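As a rough illustration of the "new cluster plus instance pool" setup, here is a minimal sketch that builds a request body in the shape used by the Databricks Jobs API 2.1. The function name, job name, runtime version string, node type, and pool ID are all placeholders chosen for the example, not values from the question, and the sketch only builds the payload without sending it anywhere.

```python
import json


def build_job_payload(notebook_path, cron, pool_id=None):
    """Build a Jobs API 2.1 style body that runs a notebook on a new job cluster.

    All concrete values below (runtime version, node type, worker count)
    are placeholders for illustration only.
    """
    new_cluster = {
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "num_workers": 2,                     # placeholder size
    }
    if pool_id:
        # Draw nodes from an instance pool to shorten cluster start-up time.
        new_cluster["instance_pool_id"] = pool_id
    else:
        new_cluster["node_type_id"] = "Standard_DS3_v2"  # placeholder Azure VM type

    return {
        "name": "scheduled-notebook",  # hypothetical job name
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" instead of "existing_cluster_id": each run
                # gets a fresh cluster with a known library state.
                "new_cluster": new_cluster,
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        },
    }


if __name__ == "__main__":
    payload = build_job_payload("/Shared/demo", "0 0 6 * * ?", pool_id="pool-123")
    print(json.dumps(payload, indent=2))
```

The payload would still need to be submitted through the Jobs UI, CLI, or REST API of your workspace; that step is left out because it depends on your authentication setup.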

Related

Azure Databricks cell execution stuck on waiting to run state

I am using Azure Databricks to connect to an SAP system and ADLS. For the SAP connection I am installing the latest version of the JDBC library (ngdbc-2.10.14.jar). After installing the library, the notebook cells have stopped executing. When I try to run a cell, it gets stuck in a "waiting to run" state.
You cannot run any further commands in a notebook attached to a Databricks Runtime cluster after cancelling a running streaming cell. The commands are stuck in a "waiting to run" state, and you have to clear the notebook's state or detach and reattach the cluster before you can run commands on it.
This problem only happens when you cancel a single cell; it does not occur when you run all cells and cancel all of them.
To fix an affected notebook without having to restart the cluster, go to the Clear menu and choose Clear State.

Using Databricks for Twitter sentiment analysis - issue running the official tutorial

I am starting to use Databricks and tried to implement one of the official tutorials (https://learn.microsoft.com/en-gb/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services) from the website. However, I run into an issue (I am not even sure I can call it an issue): when I run the second notebook (analysetweetsfromeventhub), all subsequent commands (2nd, 3rd, 4th ...) are shown as waiting to run, but never run. Any idea what it might be? Thanks.
After you cancel a running streaming cell in a notebook attached to a Databricks Runtime cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook.
Note that this issue occurs only when you cancel a single cell; it does not apply when you run all and cancel all cells.
In the meantime, you can do either of the following:
To remediate an affected notebook without restarting the cluster, go to the notebook’s Clear menu and select Clear State.
If restarting the cluster is acceptable, you can solve the issue by turning off idle context tracking. Set the following Spark configuration value on the cluster:
spark.databricks.chauffeur.enableIdleContextTracking false
Then restart the cluster.

What's the most elegant/right way to stop a spark job running on a Kubernetes cluster?

I'm new to apache spark and I'm trying to run a spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!
If you delete an executor pod, it will be recreated and the Spark application will keep working. However, if you delete the driver pod, the application will stop.
So killing the driver pod is actually the way to stop a Spark application during execution.
As you are new to Spark and you want to run it on Kubernetes, you should check this tutorial.
At present, the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context that is able to manipulate it). Since all other job-related resources are linked to the Spark driver pod through so-called ownerReferences, they will be removed automatically by Kubernetes.
Resources should also be cleaned up automatically when the job completes.
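To make the "delete the driver pod" advice concrete, here is a small sketch that assembles the corresponding kubectl command. The function names and the example namespace and pod name are hypothetical. The label selector path assumes the driver pod carries the `spark-role=driver` label that Spark on Kubernetes normally sets; check your own pods before relying on it.

```python
import subprocess


def driver_delete_command(namespace, driver_pod=None, extra_labels=None):
    """Return the kubectl argument list that deletes a Spark driver pod.

    If no pod name is given, fall back to a label selector. The
    spark-role=driver label is an assumption about how the pods are labeled.
    """
    cmd = ["kubectl", "delete", "pod", "--namespace", namespace]
    if driver_pod:
        cmd.append(driver_pod)
    else:
        selector = "spark-role=driver"
        if extra_labels:
            # e.g. a label identifying one application, to avoid
            # deleting every driver in the namespace
            selector += "," + extra_labels
        cmd += ["--selector", selector]
    return cmd


def stop_spark_app(namespace, driver_pod=None, extra_labels=None):
    """Run the delete command; executors go away with the driver."""
    cmd = driver_delete_command(namespace, driver_pod, extra_labels)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call stop_spark_app(...) to actually run it.
    print(" ".join(driver_delete_command("spark-jobs", driver_pod="my-app-driver")))
```

Passing an explicit pod name is the safer choice on a shared namespace, since the bare selector would match every running driver.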

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but really the cluster is stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of spark / yarn / emr setting I could use.
Note: I've tried using spark speculation to unblock the stuck spark job, but that doesn't help.
EMR has a Bootstrap Actions feature where you can run scripts that start up when initializing the cluster. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates itself after a certain time.
I use a script based off this one for the bootstrap action. https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically make a script that checks /proc/uptime to see how long the EC2 machine has been online and after uptime surpasses your time limit you can send a shutdown command to the cluster.
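The uptime check described above could be sketched roughly as follows. This is not the linked script, just an illustration of the same idea in Python: the function names, the 5-hour limit, the polling interval, and the use of `sudo shutdown -h now` as the shutdown command are all assumptions to adapt to your cluster.

```python
import subprocess
import time


def parse_uptime_seconds(text):
    """Extract uptime in seconds from the contents of /proc/uptime.

    The file holds two numbers; the first is seconds since boot.
    """
    return float(text.split()[0])


def should_shut_down(uptime_seconds, limit_hours):
    """True once the machine has been up longer than the allowed limit."""
    return uptime_seconds > limit_hours * 3600


def watchdog(limit_hours=5.0, poll_seconds=300):
    """Poll /proc/uptime and shut the machine down after the time limit."""
    while True:
        with open("/proc/uptime") as f:
            uptime = parse_uptime_seconds(f.read())
        if should_shut_down(uptime, limit_hours):
            # Assumed shutdown command; replace with whatever terminates
            # your cluster (for example a call to the EMR API).
            subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
            return
        time.sleep(poll_seconds)


if __name__ == "__main__":
    watchdog()
```

Started in the background from a bootstrap action, a loop like this needs no external monitoring service, which matches the constraint in the question.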

Google Dataproc Jobs Never Cancel, Stop, or Terminate

I have been using Google Dataproc for a few weeks now and since I started I had a problem with canceling and stopping jobs.
It seems like there must be some server other than those created on cluster setup, that keeps track of and supervises jobs.
I have never had a process that does its job without error actually stop when I hit stop in the dev console. The spinner just keeps spinning and spinning.
Cluster restart or stop does nothing, even if stopped for hours.
Only when the cluster is entirely deleted will the jobs disappear... (But wait there's more!) If you create a new cluster with the same settings, before the previous cluster's jobs have been deleted, the old jobs will start on the new cluster!!!
I have seen jobs that terminate on their own due to OOM errors restart themselves after cluster restart! (with no coding for this sort of fault tolerance on my side)
How can I forcefully stop Dataproc jobs? (gcloud beta dataproc jobs kill does not work)
Does anyone know what is going on with these seemingly related issues?
Is there a special way to shutdown a Spark job to avoid these issues?
Jobs keep running
In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it appears to run forever even though it has probably failed on the back end. This should be fixed in a version of Dataproc to be released in the next 1-2 weeks.
Job starts after restart
This would be unintended and undesirable. We have tried to replicate this issue and cannot. If anyone can replicate this reliably, we'd like to know so we can fix it! This may be (and probably is) related to the issue above where the job has failed but appears to be running, even after the cluster restarts.
Best way to shut down
Ideally, the best way to shut down a Cloud Dataproc cluster is to terminate the cluster and start a new one. If that will be problematic, you can try a bulk restart of the Compute Engine VMs; it will be much easier to create a new cluster, however.
