Which of my Databricks notebook uses the cluster nodes? - apache-spark

I run several notebooks on Azure Databricks Spark cluster at the same time.
How can I see the cluster nodes usage rate of each notebook \ app over a period of time?
Both the "Spark Cluster UI - Master" and "Spark UI" tabs didn't provide such information

There is no automated/built in support today for isolating the usage of particular notebooks on Databricks.
That being said, one approach would be to use the Ganglia Metrics available for Databricks clusters.
If you run both notebooks at the same time it will be difficult to discern which is responsible for a specific quantity of usage. I would recommend running one notebook to completion and taking note of the usage on the cluster. Then, run the second notebook to completion and observe its usage. You can then compare the two and have a baseline for how each one is utilizing resources on the cluster.

Related

In which scenario should one prefer to create Spark cluster on EC2 machines instead of using Elastic Map Reduce?

Between processing realtime data using Spark cluster on EC2 machines and using Elastic map reduce, some of the differences are:
In Elastic Map Reduce, one would not have to manage the infrastructure and cluster as compared to Spark cluster on EC2 machines where one has to create the cluster and manage it.
In case of Spark cluster on EC2, one has more control over the cluster as compared to Elastic Map Reduce which is a PAAS component.
I went through the below related link:
Hadoop on EC2 vs Elastic Map Reduce
I understand that going with Elastic Map reduce would give the advantage of not having to manage the infrastructure and cluster. What I want to know is that when should one prefer the other option, that is to create Spark cluster on EC2 machines instead of using Elastic Map Reduce? Thanks.
You and the answer you shared have have summed pretty much the advantages and disadvantages for both. But i would like to mention few things
Someone mentioned in comment on the answer you share (and there is infact impression in people) that EMR adds some cost on top of ec2 nodes (which is underlying master/compute nodes of spark) and provides just the cluster, which isnt the case.
But what elastic map reduce is focused on is elastic and scalability part , meaning to provide scalability for your jobs, where scalability is not just number of node in cluster but different parameters like
Dynamically resizing the cluster with running jobs
Reduces and optimizes spin time , provides efficient resubmitting steps and option like automatic termination on step completion
Configuration, management and updation time. Just as an small you have things like release version that automatically handles spark/hadoop/other-application versions providing you way to easy update the version which you have to do manually with ec2.
the ecosystem availability. EMR ecosystem is growing,it doesnt reflect when you start but for example when your requirements grow, for example when you start to integrate other systems stream processing with flink for example) then it is more easier to just select at time of launching flink, pig , hive and moany more etc if you need to use other things in future.
There are already implementing libraries with AWS SDK like boto3 in python that help you to submit steps, poll for completion etc, which are very helpful when you need to scale. Also, you have integration of emr with orchestration frameworks like airflow where can can sense the state, resubmit, one command spin the cluster within the pipeline.
Expanding on previous point, EMR notebook for example provide you the quick and interactive way to submit spark jobs from Jupiter notebook and see the result, progress of jobs immediately which can boost your productivity.
This point is most important from my experience, Sometimes, scaling up the jobs with more nodes save you more money then long running jobs with low number of nodes. Because the adding node cost sometime cost you low than the normalized hours you will be spending with ec2 or small emr cluster. Just to share my experience, we had a job that used to run for 3 days, we satrted to run it with bigger EMR cluster that reduced it to 6-8 hours and it still was in the same cost and was infact a bit less.

Scale up the spark worker nodes using code

I want to scale up the spark cluster to make all the worker nodes up and running before I start my processing. The issue is because the autoscaling of worker nodes is not happening immediately on load and is leading to worker node crashes. The cluster has 32 nodes but is overloading only 4 nodes and crashing so what I am trying to do is write some lines of code in the start of the python notebook which will kick start the remaining nodes and have 24 nodes up and running and then do the actual data processing. Is this possible using code ? Please advise.
In general, autoscale is for interactive workloads. I've rarely seen it provide benefits in jobs, though marketing makes a good job of selling it as a cost saving feature.
You can use Databricks jobs to create an automated cluster. When you run a job on a new automated cluster and terminates the cluster when the job is complete.
If you know when scaling up should happen better than auto scale then you can use this resize API: https://docs.databricks.com/dev-tools/api/latest/clusters.html#resize

Is it possible to run Hive on Spark with YARN capacity scheduler?

I use Apache Hive 2.1.1-cdh6.2.1 (Cloudera distribution) with MR as execution engine and YARN's Resource Manager using Capacity scheduler.
I'd like to try Spark as an execution engine for Hive. While going through the docs, I found a strange limitation:
Instead of the capacity scheduler, the fair scheduler is required. This fairly distributes an equal share of resources for jobs in the YARN cluster.
Having all the queues set up properly, that's very undesirable for me.
Is it possible to run Hive on Spark with YARN capacity scheduler? If not, why?
I'm not sure you can execute Hive using spark Engines. I highly recommend you configure Hive to use Tez https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez which is faster than MR and it's pretty similar to Spark due to it uses DAG as the task execution engine.
We are running it at work using the command on Beeline as described https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started just writing it at the beginning of the sql file to run
set hive.execution.engine=spark;
select ... from table....
We are not using capacity scheduler because there are hundreds of jobs run per yarn queue, and when jobs are resource avid, we have other queues to let them run. That also allows designing a configuration based on job consumption per queue more realistic based on the actual need of the group of jobs
Hope this helps

Spark local mode: How to query the number of executor slots?

I'm following tutorial Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data where it's claimed that the "local mode" Spark cluster available in Databricks "Community Edition" provides you with 3 executor slots. (So 3 Tasks should be able to run concurrently.)
However, when I look at the "Event Timeline" visualization for job stages with multiple tasks in my own notebook on Databricks "Community Edition", it looks like up to 8 tasks were running concurrently:
Is there a way to query the number of executor slots from PySpark or from a Databricks notebook? Or can I directly see the number in the Spark UI somewhere?
Databricks "slots" = Spark "cores" = available threads
"Slots" is a term Databricks uses (or used?) for the threads available to do parallel work for Spark. The Spark documentation and Spark UI calls the same concept "cores", even though they are unrelated to physical CPU cores.
(See this answer on Hortonworks community, and this "Spark Tutorial: Learning Apache Spark" databricks notebook.)
View number of slots/cores/threads in Spark UI (on Databricks)
To see how many there are in your Databricks cluster, click "Clusters" in the navigation area to the left, then hover over the entry for your cluster and click the "Spark UI" link. In the Spark UI, click the "Executors" tab.
You can see the number of executor cores (=executor slots) in both the summary and for each individual executor1 in the "cores" column of the respective table there:
1There's only one executor in "local mode" clusters, which are the cluster available in Databricks community edition.
Query number of slots/cores/threads
How to query this number from within a notebook, I'm not sure.
spark.conf.get('spark.executor.cores')
results in java.util.NoSuchElementException: spark.executor.cores

Pausing Dataproc cluster - Google Compute engine

is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs ? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/
only show how to destroy a cluster but I have installed spark cassandra connector API for example. Is my only alternative to just creating an image that I'll need to install every time ?
In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization actions to easily automate doing the installation during cluster deployment.
This way, you can easily reproduce the customizations without requiring manual involvement if you ever want, for example, to do the same setup on multiple concurrent Dataproc clusters, or want to change machine types, or receive sub-minor-version bug fixes that Dataproc releases occasionally.
There's indeed no officially supported way of pausing a Dataproc cluster at the moment, in large part simply because being able to have reproducible cluster deployments along with several other considerations listed below means that 99% of the time it's better to use initialization-action customizations instead of pausing a cluster in-place. That said, there are possible short-term hacks, such as going into the Google Compute Engine page, selecting the instances that are part of the Dataproc cluster you want to pause, and clicking "stop" without deleting them.
The Compute Engine hourly charges and Dataproc's per-vCPU charges are only incurred when the underlying instance is running, so while you've "stopped" the instances manually, you won't incur Dataproc or Compute Engine's instance-hour charges despite Dataproc still listing the cluster as "RUNNING", albeit with warnings that you'll see if you go to the "VM Instances" tab of the Dataproc cluster summary page.
You should then be able to just click "start" from the Google Compute Engine page page to have the cluster running again, but it's important to consider the following caveats:
The cluster may occasionally fail to start up into a healthy state again; anything using local SSDs already can't be stopped and started again cleanly, but beyond that, Hadoop daemons may have failed for whatever reason to flush something important to disk if the shutdown wasn't orderly, or even user-installed settings may have broken the startup process in unknown ways.
Even when VMs are "stopped", they depend on the underlying Persistent Disks remaining, so you'll continue to incur charges for those even while "paused"; if we assume $0.04 per GB-month, and a default 500GB disk per Dataproc node, that comes out to continuing to pay ~$0.028/hour per instance; generally your data will be more accessible and also cheaper to just put in Google Cloud Storage for long term storage rather than trying to keep it long-term on the Dataproc cluster's HDFS.
If you come to depend on a manual cluster setup too much, then it'll become much more difficult to re-do if you need to size up your cluster, or change machine types, or change zones, etc. In contrast, with Dataproc's initialization actions, you can use Dataproc's cluster scaling feature to resize your cluster and automatically run the initialization actions for new workers created.
Update
Dataproc recently launched the ability to stop and start clusters: https://cloud.google.com/dataproc/docs/guides/dataproc-start-stop

Resources