I'm following the tutorial "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data", where it's claimed that the "local mode" Spark cluster available in Databricks "Community Edition" provides you with 3 executor slots. (So 3 tasks should be able to run concurrently.)
However, when I look at the "Event Timeline" visualization for job stages with multiple tasks in my own notebook on Databricks "Community Edition", it looks like up to 8 tasks were running concurrently.
Is there a way to query the number of executor slots from PySpark or from a Databricks notebook? Or can I directly see the number in the Spark UI somewhere?
Databricks "slots" = Spark "cores" = available threads
"Slots" is a term Databricks uses (or used?) for the threads available to do parallel work for Spark. The Spark documentation and Spark UI calls the same concept "cores", even though they are unrelated to physical CPU cores.
(See this answer on the Hortonworks community, and this "Spark Tutorial: Learning Apache Spark" Databricks notebook.)
View number of slots/cores/threads in Spark UI (on Databricks)
To see how many there are in your Databricks cluster, click "Clusters" in the navigation area to the left, then hover over the entry for your cluster and click the "Spark UI" link. In the Spark UI, click the "Executors" tab.
You can see the number of executor cores (= executor slots) in the "cores" column of the respective table there, both in the summary and for each individual executor.¹
¹ There's only one executor in "local mode" clusters, which is the cluster type available in Databricks Community Edition.
Query number of slots/cores/threads
I'm not sure how to query this number from within a notebook.
spark.conf.get('spark.executor.cores')
results in java.util.NoSuchElementException: spark.executor.cores
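One workaround that appears to work in local mode (an assumption on my part, not a documented way to read the slot count) is to ask the SparkContext for its default parallelism, which in local mode equals the number of worker threads:

# `spark` is the SparkSession that Databricks notebooks provide out of the box.
# Assumption: in local[N] mode, defaultParallelism equals N, i.e. the number
# of threads (slots/cores) the single executor can run tasks on.
print(spark.sparkContext.defaultParallelism)

# The master string itself also encodes the thread count, e.g. "local[8]":
print(spark.sparkContext.master)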
Related
I read the Cluster Mode Overview (link: https://spark.apache.org/docs/latest/cluster-overview.html) and I was wondering how components such as the Driver, Executors and Worker nodes map onto the components of the Spark ecosystem, such as Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and the scheduling/cluster managers. Which of these ecosystem components belong to the Driver, the Executors and the Worker nodes?
So basically my question is whether there is a link between these two figures: the components of Spark (figure 1) and the ecosystem of Spark (figure 2). If so, can somebody please explain to me what belongs to the drivers/executors/worker nodes?
Figure 1: Components of Spark
Figure 2: Spark Ecosystem
The cluster manager in figure 1 (as mentioned in the question) corresponds to the Standalone Scheduler, YARN and Mesos in figure 2 (as mentioned in the question).
The cluster manager can be any one of the cluster/resource managers such as YARN, Mesos, Kubernetes, etc.
Nodes, or worker nodes, are the machines that are part of the cluster on which you want to run your Spark application in a distributed manner. You cannot relate them to anything on the Spark ecosystem diagram.
Nodes/worker nodes are actual physical machines, like your computer/laptop.
Now, the driver and executors are the processes that run on machines that are part of the cluster.
One of the nodes from the cluster is selected as the master/driver node; this is where the driver process runs. It creates the SparkContext, runs your main method and splits up your code so that it can be executed in a distributed fashion, by creating jobs, stages and tasks.
The other nodes from the cluster are selected as worker nodes, and the executor processes on these nodes run the tasks assigned to them by the driver process.
Now coming to Spark Core: it is the component/framework that allows all of this communication, scheduling and data transfer to happen between the driver node and the worker nodes, so you don't have to worry about any of it and can just focus on your business logic to get the required work done.
Structured Streaming, Spark SQL, MLlib and GraphX are functionality implemented on top of Spark Core, so you get common functionality out of the box that makes your life easier. You would have Spark installed on all the nodes, i.e. the driver node and the worker nodes, and all of these components are available on those nodes by default.
You cannot compare the two figures exactly, because one shows how a Spark application is executed when you submit your code to a cluster, while the other just shows the various components that the Spark framework as a whole provides.
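As a rough illustration of the driver/executor split described above (the app name and numbers below are made up for the example), the code in your main program runs in the driver process, while the functions you pass to RDD/DataFrame operations are shipped to the executors as tasks:

from pyspark.sql import SparkSession

# This part runs in the driver process: it creates the SparkSession/SparkContext
# and builds up the job.
spark = SparkSession.builder.appName("driver-vs-executors").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)

# The lambdas below are serialized and executed on the executors as tasks,
# one task per partition; only the final result comes back to the driver.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)

print(total)  # runs in the driver again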
We have a Hadoop cluster (HDP 2.6.5 with Ambari, with 25 datanode machines).
We are using a Spark Streaming application (Spark 2.1 running over Hortonworks 2.6.x).
The current situation is that the Spark Streaming application runs on all datanode machines,
but now we want the Spark Streaming application to run only on the first 10 datanode machines,
so the other 15 datanode machines will be restricted, and the Spark application will run only on the first 10 datanode machines.
Can this scenario be achieved with Ambari features or another approach?
For example, we found https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html
and
http://crazyadmins.com/configure-node-labels-on-yarn/
but we are not sure whether Node Labels can help us.
@Jessica Yes, you are absolutely on the right path. YARN Node Labels and YARN Queues are how Ambari administrators control team-level access to portions of the entire YARN cluster. You can start very basic with just a non-default queue, or get very in-depth with many queues for many different teams. Node labels take it to another level, allowing you to map queues and teams to specific nodes.
Here is a post with the syntax for spark to use the yarn queue:
How to choose the queue for Spark job using spark-submit?
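In short, the queue can be set when the application is created or submitted. A minimal PySpark sketch (the queue name "team_a" is hypothetical; it must already exist in your YARN scheduler configuration):

from pyspark.sql import SparkSession

# spark.yarn.queue tells YARN which scheduler queue to place the application
# in; the equivalent spark-submit flag is --queue.
spark = (SparkSession.builder
         .appName("queued-app")
         .config("spark.yarn.queue", "team_a")  # hypothetical queue name
         .getOrCreate())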
I tried to find the 2.6 version of these docs, but was not able to... they have really mixed up the docs since the merger...
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ch_node_labels.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/data-operating-system/content/configuring_node_labels.html
The actual steps you have to take may be a combination of items from both. That is a typical experience for me when working with Ambari HDP/HDF.
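If you do go with node labels, Spark on YARN can also be pointed at a label from the application side. A sketch under the assumption that the first 10 datanodes have been given a label called "streaming" (the label name is made up here, and the label must already be created and assigned to nodes in YARN):

from pyspark.sql import SparkSession

# These properties restrict where YARN may place the application master and
# the executors, based on the node labels configured in YARN.
spark = (SparkSession.builder
         .appName("label-restricted-streaming")
         .config("spark.yarn.am.nodeLabelExpression", "streaming")
         .config("spark.yarn.executor.nodeLabelExpression", "streaming")
         .getOrCreate())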
I use Apache Hive 2.1.1-cdh6.2.1 (Cloudera distribution) with MR as the execution engine, and YARN's Resource Manager uses the Capacity Scheduler.
I'd like to try Spark as an execution engine for Hive. While going through the docs, I found a strange limitation:
Instead of the capacity scheduler, the fair scheduler is required. This fairly distributes an equal share of resources for jobs in the YARN cluster.
Since I have all the queues set up properly, that's very undesirable for me.
Is it possible to run Hive on Spark with YARN capacity scheduler? If not, why?
I'm not sure you can execute Hive using the Spark engine. I highly recommend you configure Hive to use Tez (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez), which is faster than MR and is pretty similar to Spark because it uses a DAG as the task execution model.
We are running it at work using the command on Beeline as described in https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started, just writing it at the beginning of the SQL file to run:
set hive.execution.engine=spark;
select ... from table....
We are not using the Capacity Scheduler because there are hundreds of jobs running per YARN queue, and when jobs are resource-hungry we have other queues to let them run. That also allows designing a configuration based on job consumption per queue that is more realistic with respect to the actual needs of each group of jobs.
Hope this helps
I run several notebooks on Azure Databricks Spark cluster at the same time.
How can I see the cluster nodes' usage rate for each notebook/app over a period of time?
Neither the "Spark Cluster UI - Master" tab nor the "Spark UI" tab provides such information.
There is no automated/built-in support today for isolating the usage of particular notebooks on Databricks.
That being said, one approach would be to use the Ganglia Metrics available for Databricks clusters.
If you run both notebooks at the same time it will be difficult to discern which is responsible for a specific quantity of usage. I would recommend running one notebook to completion and taking note of the usage on the cluster. Then, run the second notebook to completion and observe its usage. You can then compare the two and have a baseline for how each one is utilizing resources on the cluster.
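One thing that can make the comparison a bit easier, assuming you control the notebook code, is to tag each notebook's jobs so that they are at least attributable in the Spark UI (this only labels jobs, it does not measure resource usage; the group id and description below are hypothetical):

# Run near the top of each notebook; `spark` is the SparkSession that
# Databricks provides. The group id and description are arbitrary strings.
sc = spark.sparkContext
sc.setJobGroup("notebook_etl", "Nightly ETL notebook")

# All jobs triggered after this call show up under that group/description
# on the Spark UI's Jobs page, which helps attribute load per notebook.
spark.range(10**6).selectExpr("sum(id)").show()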
I have an analytics node running, with the Spark SQL Thrift Server running on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How can I configure the DSE node to be able to run both?
The Spark SQL Thrift Server is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
Allocate only part of your resources to each application.
This is done by setting spark.cores.max to a smaller value than the max resources in your cluster.
See Spark Docs
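A sketch of this first option (the value 4 is just an example; size it against the total cores available in your cluster):

from pyspark.sql import SparkSession

# Cap this application at 4 cores cluster-wide so the Thrift server
# (and anything else) can still get resources.
spark = (SparkSession.builder
         .appName("capped-app")
         .config("spark.cores.max", "4")
         .getOrCreate())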
Dynamic Allocation
This allows applications to change the amount of resources they use depending on how much work they are trying to do.
See Spark Docs
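A sketch of the second option (the executor counts are placeholders; classic dynamic allocation typically also requires the external shuffle service, or shuffle tracking on newer Spark versions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elastic-app")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")  # needed for classic dynamic allocation
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())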