The documentation on enabling general metrics in Apache Spark is kind of thin:
Within an instance, a "source" specifies a particular set of grouped metrics. There are two kinds of sources:
Spark internal sources, like MasterSource, WorkerSource, etc, which will collect a Spark component's internal state. Each instance is paired with a Spark source that is added automatically.
Common sources, like JvmSource, which will collect low level state. These can be added through configuration options and are then loaded using reflection.
All the examples are of the form:
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
However, none of the plausible-seeming variations on this allowed me to publish the metrics generated in ExecutorAllocationManagerSource.
The class isn't unit tested, and I can't locate any other documentation or examples.
In fact, these metrics are published without any special configuration for ExecutorAllocationManagerSource. However, they only manifest if the relevant code paths are active. In this case, that means enabling dynamic executor allocation.
My cluster had been mistakenly configured without dynamic executor allocation. When that is turned on, these metrics are published under driver metrics, without any special configuration.
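For reference, here is a minimal sketch of how dynamic allocation might be switched on at submit time. The helper name dynamic_allocation_args and the default executor bounds are hypothetical. The property names are standard Spark settings, but whether the external shuffle service is required depends on the cluster manager and Spark version, so treat that line as an assumption.

```python
def dynamic_allocation_args(min_executors=1, max_executors=10):
    """Build spark-submit --conf arguments that enable dynamic allocation.

    The executor bounds are illustrative defaults, not recommendations.
    """
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: the cluster runs the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a command line that could be pasted in front of the application jar.
    print(" ".join(["spark-submit"] + dynamic_allocation_args()))
```

With these settings in place, the ExecutorAllocationManagerSource metrics should appear under the driver instance, as described above.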
Spark exposes many metrics for monitoring the work of the driver and the executors.
Let's say I use Prometheus. Can the metrics be used to see information about a specific Spark run, for example to investigate the memory usage of one particular execution, rather than just building big-picture graphs in Grafana? I do not see how I can do this with Prometheus or Graphite.
Is there a tool that is better suited to what I need?
We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how that mapping is going. I know Spark has a built-in log manager. Is it as simple as just putting a log statement inside of each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
These look like two questions with lots of possible answers and detail. Assuming a non-Spark-Streaming context, and drawing on my own reading and research, here is a limited response.
For logging and checking the progress of stages, tasks, and jobs:
Global logging via log4j, tailored by copying the log4j.properties.template file in the SPARK_HOME/conf folder. The template serves as a basis for defining your own logging requirements at the Spark level.
Programmatically, by using Logger via import org.apache.log4j.{Level, Logger}.
The REST API, to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
A Spark Listener, which can also be used.
The Web UI at http://<host>:8080, to see progress.
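As a rough illustration of polling progress over HTTP: separately from the submission API covered in the linked blog, Spark documents a monitoring REST API under /api/v1 on the application UI port. The sketch below assumes that endpoint, the default port 4040, and the stage fields stageId, status, numCompleteTasks, and numTasks. The function names are hypothetical, and field names may differ between Spark versions.

```python
import json
from urllib.request import urlopen


def stages_url(driver_host, app_id, port=4040):
    """URL of the stage list for one application (assumed monitoring endpoint)."""
    return f"http://{driver_host}:{port}/api/v1/applications/{app_id}/stages"


def summarize_stages(stages):
    """Reduce the stage list to (stage id, status, completed tasks, total tasks)."""
    return [
        (
            stage.get("stageId"),
            stage.get("status"),
            stage.get("numCompleteTasks", 0),
            stage.get("numTasks", 0),
        )
        for stage in stages
    ]


def fetch_stage_progress(driver_host, app_id, port=4040):
    """Fetch and summarize stage progress from a running application."""
    with urlopen(stages_url(driver_host, app_id, port), timeout=10) as response:
        return summarize_stages(json.load(response))
```

Calling fetch_stage_progress in a loop gives coarse, per-stage progress without adding a log statement to every map call.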
For recovery, it depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects, memory usage issues, or things like serious database duplicate key errors, depending on the API used.
See "How does Apache Spark handles system failure when deployed in YARN?". Spark handles its own failures by looking at the DAG and attempting to reconstruct a lost partition by re-executing what is needed. This covers everything under fault tolerance, for which nothing needs to be done.
Failures outside Spark's domain and control mean the job is over. Examples are memory issues from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store that raises a duplicate key error, and JDBC connection outages. These require re-execution.
As an aside, some failures are not logged as such, e.g. duplicate key inserts on some Hadoop storage managers.
How can I ensure that an entire Spark DAG is highly available, i.e. not recomputed from scratch when the driver is restarted (default HA in YARN cluster mode)?
Currently, I use Spark to orchestrate multiple smaller jobs, i.e.
read table1
hash some columns
write to HDFS
This is performed for multiple tables.
Now when the driver is restarted, e.g. while working on the second table, the first one is reprocessed, even though it has already been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task orchestration tool. While it has a built-in scheduler and some fault tolerance mechanisms, it is about as suitable for granular task management as it is for, say, server orchestration (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery, choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application, and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking whether the expected output exists and skipping that part of the job if it does, but really don't: we have a variety of battle-tested tools which can do a way better job than this.
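Purely to make the "poor man's alternative" concrete (the answer recommends against relying on it), here is a minimal sketch. It assumes each table is written to its own output directory and that a completed write leaves a _SUCCESS marker file there, as Hadoop's default output committer does. The names run_pending and process_table are hypothetical, and the sketch uses local paths, so an HDFS deployment would need the Hadoop filesystem API instead.

```python
from pathlib import Path


def run_pending(tables, output_root, process_table):
    """Process only the tables whose output is not already complete.

    process_table(table, out_dir) is a caller-supplied stand-in for the
    read -> hash -> write step. Returns (processed, skipped) table names.
    """
    processed, skipped = [], []
    for table in tables:
        out_dir = Path(output_root) / table
        # Assumption: a finished write leaves a _SUCCESS marker behind.
        if (out_dir / "_SUCCESS").exists():
            skipped.append(table)
            continue
        process_table(table, out_dir)
        processed.append(table)
    return processed, skipped
```

After a driver restart, rerunning the same call skips every table that finished earlier. An external orchestrator gives the same behavior with retries, scheduling, and visibility on top.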
As a side note, Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also, independent jobs (read -> transform -> write) create independent DAGs. There is no global DAG, and a proper checkpoint of the application would require a full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark, and both claim to offer the ability to outlive a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).
I want to get information about the workers that are being used by an application in a Spark cluster. I need to get each worker's IP address, CPU cores, available memory, etc.
Is there any API in Spark for this purpose?
The image above shows this information in the Spark UI, but I can't figure out how to get it from Java code.
This is specific to Java.
I want all worker nodes information.
Thanks.
There are multiple ways to do this:
Parse the output log messages and see what workers are started on each machine in your cluster. You can get the names/IPs of all the hosts, when tasks are started and where, how much memory each worker gets, etc. If you want to see the exact HW configuration, you will then need to log in to the worker nodes or use different tools.
The same information as in the web frontend is contained in the eventLogs of the Spark applications (this is actually where the data you see comes from). I prefer to use the eventLog rather than the log messages, as it is very easy to parse in Python.
If you want real-time monitoring of the cluster, you can use either Ganglia (gives nice graphical displays of CPU/memory/network/disks) or colmux, which gives you the same data in a text format. I personally prefer colmux (easier to set up, you get immediate stats, etc).
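As a sketch of the eventLog approach: the log is written as one JSON object per line when spark.eventLog.enabled is set. The code below assumes the executor events are named SparkListenerExecutorAdded and carry "Executor ID" and an "Executor Info" object with "Host" and "Total Cores". The function name is hypothetical, and the field names should be checked against the logs your Spark version produces.

```python
import json


def executors_from_event_log(lines):
    """Collect host and core count for every executor added in an event log.

    lines is any iterable of JSON strings, for example an open file.
    Returns a dict keyed by executor id.
    """
    executors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerExecutorAdded":
            continue
        info = event.get("Executor Info", {})
        executors[event["Executor ID"]] = {
            "host": info.get("Host"),
            "cores": info.get("Total Cores"),
        }
    return executors
```

It can be used by opening an uncompressed event log file and passing the file object directly, since iterating over a file yields its lines.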
I'm using Apache Spark and the metrics UI (found on port 4040) is very useful.
I wonder if it's possible to add custom metrics to this UI: custom task metrics, and maybe custom RDD metrics too (like the execution time of a single RDD transformation).
It would be nice to have custom metrics grouped by stream batch jobs and tasks.
I have seen the TaskMetrics object, but it's marked as a developer API, and it looks useful only for input or output sources and does not support custom values.
Is there a Spark way to do that? Or an alternative?
You could use the shared variables support [1] built into Spark. I have often used them to implement something like that.
[1] http://spark.apache.org/docs/latest/programming-guide.html#shared-variables