Custom metrics in Apache Spark UI - apache-spark

I'm using Apache Spark and the metrics UI (found on port 4040) is very useful.
I wonder if it's possible to add custom metrics to this UI: custom task metrics, but maybe custom RDD metrics too (like the execution time of just one RDD transformation).
It would be nice to have custom metrics grouped by streaming batch, job, and task.
I have seen the TaskMetrics object, but it's marked as a developer API, it looks useful only for input and output sources, and it does not support custom values.
Is there a Spark way to do that? Or an alternative?

You could use the shared variables support [1] built into Spark. I have often used them to implement something like that.
[1] http://spark.apache.org/docs/latest/programming-guide.html#shared-variables
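As an illustration, here is a minimal sketch of that approach using named accumulators (Spark 2.x API; the input path and metric names are made up for the example). Named accumulators show up in the stage detail pages of the web UI on port 4040:

    import org.apache.spark.sql.SparkSession

    object CustomMetricSketch {
      def main(args: Array[String]): Unit = {
        // master is expected to be supplied by spark-submit
        val spark = SparkSession.builder().appName("custom-metric-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Named accumulators are displayed in the stage detail page of the web UI.
        val parsedRecords = sc.longAccumulator("parsedRecords")
        val parseTimeNs   = sc.longAccumulator("parseTimeNs")

        val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical input path

        val lengths = lines.map { line =>
          val start = System.nanoTime()
          val result = line.length // stand-in for the real transformation
          parseTimeNs.add(System.nanoTime() - start)
          parsedRecords.add(1)
          result
        }

        lengths.count() // trigger the job so the accumulators are populated

        println(s"records=${parsedRecords.value}, parseTimeNs=${parseTimeNs.value}")
        spark.stop()
      }
    }

Keep in mind that accumulators are only updated reliably inside actions; updates made in transformations can be applied more than once if tasks are re-executed.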

Related

Spark metrics - Disable all metrics

I'm building a monitoring system for our Spark applications. I send the metrics with Spark's Graphite sink. I want the ability to stop all the metrics dynamically, which means I need to set this with sc.set.
How can I just disable all metrics in the Spark configuration? I couldn't find anything like a spark.metrics.enable property.
I couldn't find a way to disable them globally. What I do instead is set the sink only when I want to monitor (per application):
sc.set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark streaming job with this? Are there any easy options to do it?
One way to send the metrics to OpenTSDB is to use its REST API. Simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it's in Java and can therefore be used from Scala). Example code can be found on GitHub.
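For reference, here is a rough sketch of that REST approach with Apache HttpClient 4.x, pushing one data point to OpenTSDB's /api/put endpoint (the host, metric name, and tags are made up for the example):

    import org.apache.http.client.methods.HttpPost
    import org.apache.http.entity.{ContentType, StringEntity}
    import org.apache.http.impl.client.HttpClients

    object OpenTsdbPushSketch {
      // Push a single data point to OpenTSDB's HTTP API (host and port are assumptions).
      def pushMetric(metric: String, value: Double, tags: Map[String, String]): Unit = {
        val tagsJson = tags.map { case (k, v) => s"\"$k\":\"$v\"" }.mkString(",")
        val json =
          s"{\"metric\":\"$metric\",\"timestamp\":${System.currentTimeMillis / 1000}," +
          s"\"value\":$value,\"tags\":{$tagsJson}}"

        val client = HttpClients.createDefault()
        try {
          val post = new HttpPost("http://opentsdb.example.com:4242/api/put")
          post.setEntity(new StringEntity(json, ContentType.APPLICATION_JSON))
          client.execute(post).close() // close the response to release the connection
        } finally {
          client.close()
        }
      }

      def main(args: Array[String]): Unit = {
        // Example: report the number of records processed in one micro-batch.
        pushMetric("spark.streaming.records", 1234.0, Map("app" -> "my-stream-job"))
      }
    }

In a streaming job you would typically call something like this from a foreachRDD or a StreamingListener callback, once per batch.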
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should still be usable. Unfortunately it targets Spark 1.4.1, but in the worst case it should at least give some indication of what is necessary to add.

How to display step-by-step execution of sequence of statements in Spark application?

I have an Apache Spark data loading and transformation application with pyspark.sql that runs for half an hour before throwing an AttributeError or other run-time exceptions.
I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?
It sounds like something that could easily be handled by a SparkListener. It gives you access to all the low-level details that the web UI of any Spark application could ever show you. All the events flying between the driver (namely the DAGScheduler and TaskScheduler with SchedulerBackend) and the executors are posted to registered SparkListeners, too.
A Spark listener is an implementation of the SparkListener developer API (that is an extension of SparkListenerInterface where all the callback methods are no-op/do-nothing).
Spark uses Spark listeners for web UI, event persistence (for Spark History Server), dynamic allocation of executors and other services.
You can develop your own custom Spark listeners and register them using the SparkContext.addSparkListener method or the spark.extraListeners setting.
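As a minimal sketch of such a listener (the printed fields are just examples of what the developer API exposes; the class name is made up):

    import org.apache.spark.scheduler._

    // A minimal custom listener that logs stage and task progress to stdout.
    class StepLogger extends SparkListener {
      override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit =
        println(s"Stage ${event.stageInfo.stageId} submitted: ${event.stageInfo.name}")

      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        val durationMs = for (s <- info.submissionTime; c <- info.completionTime) yield c - s
        println(s"Stage ${info.stageId} completed in ${durationMs.getOrElse(-1L)} ms")
      }

      override def onTaskEnd(event: SparkListenerTaskEnd): Unit =
        println(s"Task ${event.taskInfo.taskId} finished in ${event.taskInfo.duration} ms")
    }

    // Registration from the driver:
    // sc.addSparkListener(new StepLogger())
    // or via configuration: --conf spark.extraListeners=<fully.qualified.StepLogger>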
Go to the Spark UI of your job and you will find the DAG Visualization there. That's a graph representing your job.
To test your job on a sample, use a sample as the input first of all ;) You can also run Spark locally, not on a cluster, and then debug it in the IDE of your choice (like IntelliJ IDEA); a minimal local-mode sketch follows the links below.
More info:
This great answer explaining DAG
DAG introduction from DataBricks
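For the local-debugging suggestion above, a rough sketch (assuming Spark 2.x; the app name is made up) is simply to build the session with a local master and set breakpoints in the IDE:

    import org.apache.spark.sql.SparkSession

    // Run the whole application in a single JVM so it can be stepped through in an IDE.
    val spark = SparkSession.builder()
      .appName("local-debug-sketch") // hypothetical app name
      .master("local[*]")            // use all local cores, no cluster required
      .getOrCreate()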

How to enable ExecutorAllocationManagerSource metrics in Apache Spark?

The documentation on enabling general metrics in Apache Spark is kind of thin:
Within an instance, a "source" specifies a particular set of grouped metrics. There are two kinds of sources:
Spark internal sources, like MasterSource, WorkerSource, etc, which will collect a Spark component's internal state. Each instance is paired with a Spark source that is added automatically.
Common sources, like JvmSource, which will collect low level state. These can be added through configuration options and are then loaded using reflection.
All the examples are of the form:
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
However, none of the plausible-seeming variations on this allowed me to publish the metrics generated in ExecutorAllocationManagerSource.
The class isn't unit tested, and I can't locate any other documentation or examples.
In fact, these metrics are published without any special configuration for ExecutorAllocationManagerSource. However, they only manifest if the relevant code paths are active. In this case, that means enabling dynamic executor allocation.
My cluster had been mistakenly configured without dynamic executor allocation. When that is turned on, these metrics are published under driver metrics, without any special configuration.
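For completeness, a rough sketch of the configuration that makes these gauges appear (the executor bounds and the console sink are just examples; an external shuffle service is normally required for dynamic allocation):

    import org.apache.spark.SparkConf

    // Dynamic allocation must be enabled for the ExecutorAllocationManager
    // source to be registered on the driver.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-metrics-sketch")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")          // external shuffle service
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      // A sink so the driver metrics are visible somewhere, e.g. on the console:
      .set("spark.metrics.conf.driver.sink.console.class",
           "org.apache.spark.metrics.sink.ConsoleSink")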

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are a few pieces of JNI beneath my Java API as well.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive purposes?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, or do I need to code it myself?
I will appreciate any and all help coming my way with respect to the above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code/batch programs with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
Regards
Anup
