Spark execution memory monitoring [closed] - apache-spark

What I want is to be able to monitor Spark execution memory as opposed to storage memory available in SparkUI. I mean, execution memory NOT executor memory.
By execution memory I mean:
This region is used for buffering intermediate data when performing shuffles, joins, sorts and aggregations. The size of this region is configured through spark.shuffle.memoryFraction (default 0.2).
According to: Unified Memory Management in Spark 1.6
After an intensive search I found nothing but unanswered Stack Overflow questions, answers that relate only to storage memory, or vague answers along the lines of "use Ganglia" or "use the Cloudera console".
There seems to be a demand for this information on Stack Overflow, and yet not a single satisfactory answer is available. Here are some of the top Stack Overflow posts when searching for "monitoring spark memory":
Monitor Spark execution and storage memory utilisation
Monitoring the Memory Usage of Spark Jobs
SPARK: How to monitor the memory consumption on Spark cluster?
Spark - monitor actual used executor memory
How can I monitor memory and CPU usage by spark application?
How to get memory and cpu usage by a Spark application?
Questions
Spark version > 2.0
Is it possible to monitor the execution memory of a Spark job? By monitoring I mean, at minimum, seeing used/available memory just like for storage memory per executor in the Executors tab of the Spark UI. Yes or no?
Could I do it with SparkListeners (#JacekLaskowski ?) How about the history server? Or is the only way through external tools such as Grafana, Ganglia or others? If external tools, could you please point to a tutorial or provide more detailed guidelines?
I saw SPARK-9103 (Tracking spark's memory usage), which suggests it is not yet possible to monitor execution memory. SPARK-23206 (Additional Memory Tuning Metrics) also seems relevant.
Is Peak Execution Memory a reliable estimate of the usage/occupation of execution memory in a task? If, for example, the Stage UI says that a task uses 1 GB at peak and I have 5 CPUs per executor, does that mean I need at least 5 GB of execution memory available on each executor to finish the stage?
Are there some other proxies we could use to get a glimpse of execution memory?
Is there a way to know when the execution memory starts to eat into storage memory? When my cached table disappears from Storage tab in SparkUI or only part of it remains, does it mean it was evicted by the execution memory?

Answering my own question for future reference:
We are using Mesos as the cluster manager. In the Mesos UI I found a page that lists all executors on a given worker, and it shows the memory usage of each executor. It seems to be the total memory usage (storage + execution). I can clearly see that when the memory fills up, the executor dies.
To access:
Go to Agents tab which lists all cluster workers
Choose worker
Choose Framework - the one with the name of your script
Inside you will have a list of executors for your job running on this particular worker.
For memory usage see: Mem (Used / Allocated)
The same can be done for the driver. For the framework, choose the one named Spark Cluster.
If you want to know how to extract this number programmatically, see my response to this question: How to get Mesos Agents Framework Executor Memory

I enabled Spark's internal metrics for executors, and I can get information about JVMHeapMemory, jvm.heap.usage, OnHeapExecutionMemory, OnHeapStorageMemory and OnHeapUnifiedMemory for my research. Please refer to the doc (https://spark.apache.org/docs/3.0.0-preview/monitoring.html) for more information.
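As a rough illustration of reading the executor memory metrics mentioned above, the sketch below pulls them from Spark's monitoring REST API rather than from a metrics sink. It assumes Spark 3.0 or later, where the /api/v1/applications/[app-id]/executors endpoint includes a peakMemoryMetrics object per executor; the function names and the ui_url/app_id parameters are hypothetical. Note that these are peak values, not a live used/available reading.

```python
import json
from urllib.request import urlopen

# Metric names as listed in the Spark monitoring docs (executor metrics).
EXECUTION_KEYS = ("OnHeapExecutionMemory", "OffHeapExecutionMemory")


def peak_execution_memory(executors):
    """Map executor id -> peak execution memory in bytes (on-heap + off-heap).

    `executors` is the parsed JSON list returned by the executors endpoint.
    Entries without peakMemoryMetrics are skipped.
    """
    result = {}
    for ex in executors:
        peaks = ex.get("peakMemoryMetrics")
        if not peaks:
            continue
        result[ex["id"]] = sum(peaks.get(key, 0) for key in EXECUTION_KEYS)
    return result


def fetch_executors(ui_url, app_id):
    """Fetch the executor list from a running UI or history server.

    ui_url is an assumption, e.g. "http://driver-host:4040".
    """
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/executors") as resp:
        return json.load(resp)
```

Usage would look like peak_execution_memory(fetch_executors("http://driver-host:4040", app_id)), with the host and application id replaced by your own.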


How to get Mesos Agents Framework Executor Memory

Inside Mesos Web UI I can see memory usage of my Spark executors in a table
Agents -> Framework -> Executors
There is a table listing all executors for my Spark driver and their memory usage is indicated in column Mem (Used / Allocated).
Is there a way to obtain this number directly via a link and if yes how?
For example I can obtain a bunch of Mesos metrics via http://IP/mesos/metrics/snapshot but memory usage of executors is not one of them.
The memory usage of executors is in fact tied to Mesos tasks, that is, how much memory the executors consume for each task.
If that is what you need, you can use the following REST API to get a JSON document and then parse the memory used from it.
http://mesos_ip:5050/master/tasks
FYI.
Found the answer myself. For each worker/agent on which executors may run, direct access to memory info is here:
http://IP_of_worker1:5051/slave(1)/monitor/statistics
http://IP_of_worker2:5051/slave(1)/monitor/statistics
etc
The content is JSON, and framework_id lets you find the related executors along with their memory consumption, CPU usage, etc., which is what the table shows.
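A minimal sketch of parsing that agent endpoint, assuming the response is a JSON list in which each entry carries executor_id, framework_id and a statistics object with mem_rss_bytes and mem_limit_bytes (field names may differ between Mesos versions, so check your own output first). The function names and the agent_ip parameter are hypothetical.

```python
import json
from urllib.request import urlopen


def executor_memory(stats, framework_id):
    """Return memory figures for every executor of one framework.

    `stats` is the parsed JSON list from /monitor/statistics on an agent.
    """
    rows = []
    for entry in stats:
        if entry.get("framework_id") != framework_id:
            continue
        s = entry.get("statistics", {})
        rows.append({
            "executor_id": entry.get("executor_id"),
            "mem_used_bytes": s.get("mem_rss_bytes"),    # assumed field name
            "mem_limit_bytes": s.get("mem_limit_bytes"),  # assumed field name
        })
    return rows


def fetch_agent_statistics(agent_ip, port=5051):
    """Fetch the statistics JSON from one agent, using the URL form given above."""
    with urlopen(f"http://{agent_ip}:{port}/slave(1)/monitor/statistics") as resp:
        return json.load(resp)
```

Calling executor_memory(fetch_agent_statistics(ip), framework_id) for each agent in turn should give roughly the same numbers as the Mem (Used / Allocated) column.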

What is pyspark driver? [closed]

I saw that a common setup to start pyspark is using pyspark --master yarn --deploy-mode client --num-executors 4 --executor-memory 2g --driver-memory 4g, but how does driver memory differ from executor memory? Could you please explain what a driver is and how setting it here affects the pyspark workflow/performance?
Thanks!
Spark uses a master/slave architecture. As you can see in the figure, it has one central coordinator (Driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.
DRIVER
The driver is the process where the main method runs. First it converts the user program into tasks and after that it schedules the tasks on the executors.
EXECUTORS
Executors are worker nodes' processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application. Once they have run the task they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs through Block Manager.
A Spark driver is the process that runs the main() function of the application and creates the SparkContext.
--driver-memory sets the memory used by this driver. If you run your application in client mode, this will most probably be the maximum memory used on the master node. The master node only coordinates jobs between the executors, so it does not really do any computation.
The driver's memory can fill up when you call an action such as collect, which returns a list containing all of the elements of the RDD. If the RDD is bigger than the driver memory, the Spark application will throw an OutOfMemory error.
Here you can find more information about Spark components: http://spark.apache.org/docs/latest/cluster-overview.html#components
This is a wonderful link describing the various parameters for tuning spark applications.
It includes description for driver memory, executor memory, etc.
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Spark Job Architecture
A Spark application consists of a single driver process and a set of executor processes scattered across nodes on the cluster.
The driver is the process that is in charge of the high-level control flow of work that needs to be done. The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache. Both the driver and the executors typically stick around for the entire time the application is running, although dynamic resource allocation changes that for the latter. A single executor has a number of slots for running tasks, and will run many concurrently throughout its lifetime. Deploying these processes on the cluster is up to the cluster manager in use (YARN, Mesos, or Spark Standalone), but the driver and executor themselves exist in every Spark application.

How can I monitor memory and CPU usage by spark application?

After running my spark application, I want to monitor its memory and cpu usage to evaluate its performance but couldn't find any option. Is it possible to monitor it? How can I monitor memory and CPU usage by spark application?
There are a few options:
Ganglia is one
If you're running on your own cluster, HDP or Cloudera both have real time CPU & memory consumption charts.
If you want specific JVM metrics, then I'd recommend FlameGraph, though it's not real time.
There's also Grafana, it's extremely powerful, you can track many metrics with it, and it's real time.

What is and how to control Memory Storage in Executors tab in web UI?

I use Spark 1.5.2 for a Spark Streaming application.
What is this Storage Memory in the Executors tab of the web UI? How did it arrive at 530 MB? How can I change that value?
CAUTION: You use the very, very old and currently unsupported Spark 1.5.2 (which I noticed after I had posted the answer) and my answer is about Spark 1.6+.
The tooltip of Storage Memory may say it all:
Memory used / total available memory for storage of data like RDD partitions cached in memory.
It is part of the Unified Memory Management feature that was introduced in SPARK-10000: Consolidate storage and execution memory management, which says (quoting verbatim):
Memory management in Spark is currently broken down into two disjoint regions: one for execution and one for storage. The sizes of these regions are statically configured and fixed for the duration of the application.
There are several limitations to this approach. It requires user expertise to avoid unnecessary spilling, and there are no sensible defaults that will work for all workloads. As a Spark user, I want Spark to manage the memory more intelligently so I do not need to worry about how to statically partition the execution (shuffle) memory fraction and cache memory fraction. More importantly, applications that do not use caching use only a small fraction of the heap space, resulting in suboptimal performance.
Instead, we should unify these two regions and let one borrow from another if possible.
Spark Properties
You can control the storage memory using spark.driver.memory or spark.executor.memory Spark properties that set up the entire memory space for a Spark application (the driver and executors) with the split between regions controlled by spark.memory.fraction and spark.memory.storageFraction.
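To make the split concrete, here is a back-of-the-envelope sketch of how those properties divide the heap under unified memory management. It assumes the 300 MB reserved region and the Spark 2.x+ defaults of spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 (Spark 1.6 used 0.75 for the former); the function name is hypothetical. The heap Spark actually sees is usually a little smaller than the configured value, so treat the result as an estimate.

```python
RESERVED_MB = 300  # reserved system memory under unified memory management


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the unified, storage and execution regions in MB.

    heap_mb          -- executor (or driver) heap, e.g. spark.executor.memory
    memory_fraction  -- spark.memory.fraction
    storage_fraction -- spark.memory.storageFraction
    """
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    return {
        "unified_mb": unified,
        "storage_mb": storage,              # protected from eviction by execution
        "execution_mb": unified - storage,  # initial share; can borrow free storage
    }


if __name__ == "__main__":
    print(unified_memory_regions(4096))
```

The boundary is soft: execution can borrow unused storage memory and evict cached blocks down to the storage share, so the "execution_mb" figure is a starting point rather than a ceiling.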
You should consider reading the slides Memory Management in Apache Spark by the author Andrew Or and watching the video Deep Dive: Apache Spark Memory Management by the author himself (again).
You may want to read how the Storage Memory values (in web UI and internally) are calculated in How does web UI calculate Storage Memory (in Executors tab)?

Limit Spark application from grabbing all the resources in a YARN cluster

We (an engineering team) are running an EMR cluster with YARN and Spark. What typically happens is that when one user submits a heavy, memory-intensive job, it grabs all the available YARN memory, and all jobs subsequently submitted by other users have to wait for that memory to clear (I know that autoscaling will solve this problem to a certain extent and we are looking into that, but we would like to avoid a single user occupying all the memory even when the cluster is autoscaled to its full limits).
Is there a way to configure YARN such that any application (Spark or otherwise) may not occupy more than, say 75% of available memory?
Thanks
According to the documentation, you can manage the amount of memory allocated to an executor using the parameter: spark.executor.memory
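Since spark.executor.memory only sizes a single executor, a rough sketch of the total an application can request from YARN may help when choosing limits. It assumes the default per-executor overhead of max(384 MB, 10% of executor memory) and that the executor count is capped, for example with --num-executors or spark.dynamicAllocation.maxExecutors. The function name is hypothetical, and the driver/application master container and YARN's rounding of container sizes are ignored.

```python
def yarn_memory_footprint_mb(executor_memory_mb, max_executors,
                             overhead_factor=0.10, min_overhead_mb=384):
    """Estimate the memory (MB) all executors of one application can claim on YARN.

    Each executor container is sized as heap plus overhead, where the overhead
    is assumed to be max(min_overhead_mb, overhead_factor * heap).
    """
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return max_executors * (executor_memory_mb + overhead_mb)


if __name__ == "__main__":
    # 10 executors with 4 GB heap each
    print(yarn_memory_footprint_mb(4096, 10))
```

Comparing this number with the cluster's total YARN memory shows whether a job configured this way can exceed a target such as 75%. This is only a per-application estimate; enforcing a cap across all applications is normally a matter of YARN scheduler queue limits, which the sketch does not cover.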
