Grafana for Spark Structured Streaming - apache-spark

I followed these steps to set up Prometheus, Graphite Exporter, and Grafana to plot metrics for Spark 2.2.1 running Structured Streaming. The metrics collection described in that post is quite dated and does not include any metrics (I believe) that can be used to monitor Structured Streaming. I am especially interested in the resources and duration required to execute the streaming queries that perform various aggregations.
Is there any pre-configured dashboard for Spark? I was a little surprised not to find one on https://grafana.com/dashboards, which makes me suspect that Grafana is not widely used to monitor metrics for Spark. If that's the case, what works better?

It looks like there isn't any Spark dashboard among the official Grafana dashboards, but you can check the following Spark dashboard, which displays metrics collected from Spark applications:
https://github.com/hammerlab/grafana-spark-dashboards

Related

Is there a monitoring endpoint for Spark Structured streaming?

In Spark's official docs, we see that there are monitoring endpoints for DStream like
/streaming/statistics
However, there do not seem to be any for Structured Streaming mentioned there. I'm looking to monitor streaming statistics for a Structured Streaming job.
https://spark.apache.org/docs/latest/monitoring.html
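One programmatic alternative is a StreamingQueryListener, which receives a progress report for every completed micro-batch; a minimal sketch, with println standing in for whatever collector you use:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()

// Receives events for every Structured Streaming query in this session.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(event.progress.json) // rows/sec, batch durations, source offsets, ...

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
})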

Retrieve graphical information using Spark Structured Streaming

Spark Streaming provided a "Streaming" tab within the deployed Web UI (http://localhost:4040 for running applications or http://localhost:18080 for completed applications, both by default) for each executed application, where graphs representative of application performance could be obtained; this tab is no longer available when using Spark Structured Streaming. In my case, I am developing a streaming application with Spark Structured Streaming that reads from a Kafka broker, and I would like to obtain a graph of records processed per second, such as the one I could obtain with Spark Streaming, among other graphical information.
What is the best alternative to achieve this? I am using Spark 3.0.1 (via pyspark library), and deploying my application on a YARN cluster.
I've checked Monitoring Structured Streaming Applications Using Web UI by Jacek Laskowski, but it is still not very clear how to obtain this type of information in a graphic way.
Thank you in advance!
I managed to get what I wanted. For some reason I still don't know, the Spark History Server UI for completed apps (on http://localhost:18080 by default) did not show the new tab ("Structured Streaming" tab) that is available for Spark Structured Streaming applications that are executed on Spark 3.0.1. However, the web UI that I managed to access through the URL http://localhost:4040 does show me the information that I wanted to retrieve. You just need to click on the 'runId' link of the streaming query from which you want to get the statistics.
If you can't see this tab, based on my personal experience, I recommend the following:
Upgrade to the latest Spark version (currently 3.0.1)
Consult this information on the UI deployed at port 4040 while the application is running, instead of port 18080 when the application has finished.
I found the official Web UI documentation for the latest Apache Spark release very useful for achieving this.
Most of the metrics information you see in the Spark UI is exported by Spark itself.
If the Spark UI doesn't fit your requirements, you can retrieve these metrics and process them yourself.
You can use a sink to export the data, for example to CSV, Prometheus, etc., or via the REST API.
You should take a look at Spark monitoring: https://spark.apache.org/docs/latest/monitoring.html
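As a rough sketch of the sink route, the Graphite sink can be configured in metrics.properties or, equivalently, through Spark conf keys carrying the spark.metrics.conf. prefix; the host and port below are placeholders:

import org.apache.spark.sql.SparkSession

// Placeholder Graphite host/port; these keys mirror what would otherwise go into metrics.properties.
val spark = SparkSession.builder()
  .appName("graphite-sink-sketch")
  .config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .config("spark.sql.streaming.metricsEnabled", "true") // expose streaming query metrics
  .getOrCreate()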

How to make sure that spark structured streaming is processing all the data in kafka

I developed a spark structured streaming application that reads data from a Kafka topic, aggregates the data, and then outputs to S3.
Now, I'm trying to find the most appropriate hardware resources necessary for the application to run properly while also minimizing costs. Having found very little information on how to calculate the right size of a Spark cluster given the size of the input, I opted for a trial-and-error strategy: I deploy the application with minimal resources and add resources until the Spark application runs in a stable manner.
That being said, how can I make sure that the spark application is able to process all the data in its Kafka input, and that the application is not falling behind? Is there a specific metric to look for? Job duration time vs trigger processing time?
Thank you for your answers!
Track Kafka consumer lag. There should be a consumer group created for your Spark streaming job.
> bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test-foo 0 1 3 2 consumer-1-a5d61779-4d04-4c50-a6d6-fb35d942642d /127.0.0.1 consumer-1
If you have metric storage and plotting tools like Prometheus and Grafana:
Save all the Kafka metrics, including Kafka consumer lag, to Prometheus/Graphite.
Use Grafana to query Prometheus and plot them on a graph.
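If you also have a handle on the streaming query itself, a rough check (a sketch, not an official metric) is to compare input rate with processing rate over the recent progress reports; if input consistently outpaces processing, the job is falling behind its Kafka source:

import org.apache.spark.sql.streaming.StreamingQuery

// `query` is whatever writeStream...start() returned.
def isFallingBehind(query: StreamingQuery): Boolean = {
  val recent = query.recentProgress.filter(_.numInputRows > 0)
  if (recent.isEmpty) {
    false // no completed batches yet
  } else {
    val avgInput = recent.map(_.inputRowsPerSecond).sum / recent.length
    val avgProcessed = recent.map(_.processedRowsPerSecond).sum / recent.length
    avgInput > avgProcessed
  }
}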

Structured streaming - Metrics in Grafana

I am using structured streaming to read data from Kafka and create various aggregate metrics. I have enabled Graphite sink using metrics.properties. I have seen applications in older Spark version have streaming related metrics. I don't see streaming related metrics with Structured streaming. What am I doing wrong?
For example - Not able to find Unprocessed Batches or running batches or last completed batch total delay.
I have enabled streaming metrics by setting:
SparkSession.builder().config("spark.sql.streaming.metricsEnabled", true)
Even then I am getting only 3 metrics:
driver.spark.streaming.inputrate
driver.spark.streaming.latency
driver.spark.streaming.processingrate
These metrics have gaps between them, and they start showing up quite late after the application has started. How do I get extensive streaming-related metrics into Grafana?
I checked StreamingQueryProgress. It only lets us create custom metrics programmatically. Is there a way I can consume the metrics that Spark Streaming already sends to the sink I mentioned?
You can still find some of those metrics. The query object that actually starts the streaming harness has two methods: lastProgress and recentProgress.
They expose details like the number of rows processed, the duration of the batch, and the number of input rows in the batch, among other things. There is also a json method that returns all this information in a single go, which can probably be used for sending it to some metrics collector.
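As a rough sketch of that approach, assuming query is the handle returned by writeStream...start() and with println standing in for a real metrics collector:

import org.apache.spark.sql.streaming.StreamingQuery

def reportProgress(query: StreamingQuery): Unit = {
  val p = query.lastProgress                 // most recent completed micro-batch, or null
  if (p != null) {
    println(s"batchId=${p.batchId} inputRows=${p.numInputRows}")
    println(s"durationsMs=${p.durationMs}")  // per-phase batch durations in milliseconds
    println(p.json)                          // the whole progress report as JSON
  }
  // recentProgress keeps a sliding window of the last few progress reports
  query.recentProgress.foreach(r => println(r.json))
}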

spark streaming visualization

I am using Spark Streaming to stream data from a Kafka broker and performing transformations on the data with Spark Streaming. Can someone suggest a visualization tool that I can use to show real-time graphs and charts which update as data streams in?
You could store your results in ElasticSearch and then use Kibana to perform visualizations.
Apart from looking at Spark's own streaming UI tab, I highly recommend using Graphite sinks. Spark Streaming is a long-running application, so this can be really handy for monitoring purposes.
Using Graphite dashboards, you will be able to start monitoring your Spark Streaming application in no time.
The best literature I know is here, in the section on monitoring, and here too: https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/
They provide configuration and other details. You will find some ready-made dashboards in JSON format in various GitHub repositories, but I found these two posts the most useful for my production application.
I hope this helps you visualize and monitor your application's internals in a Spark Streaming application.
You can use WebSockets for building real-time streaming graphs.
As such, there are no BI tools for this, but there are JS libraries which can help in building real-time graphs: http://www.pubnub.com/blog/tag/d3-js/
Check out Lightning: A Data Visualization Server
http://lightning-viz.org/
The server is designed for making web-based interactive visualizations using D3. It is designed for large data sets and continuously updating data streams.
You can use pro BI tools like Tableau or Power BI, or even MS Excel. For testing, I use MS Excel with a 1-minute auto refresh.
You can also write Python code for this.
