spark streaming visualization - apache-spark

I am using spark streaming to stream data from kafka broker. I am performing transformations on the data using spark streaming. Can someone suggest a visualization tool which I can use to show real-time graphs and charts which update as data streams in?

You could store your results in ElasticSearch and then use Kibana to perform visualizations.

Apart from looking at Spark's own Streaming UI tab, I highly recommend using a Graphite sink. Spark Streaming applications are long-running, so this can be really handy for monitoring purposes.
With Graphite dashboards you can start monitoring your Spark Streaming application in no time.
The best literature I know is here, in the monitoring section, and [here too](https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/).
It provides configuration and other details. You will find some ready-made dashboards in JSON format in various GitHub repositories, but again, I found these two posts the most useful in my production application.
I hope this helps you visualize and monitor your application's internals in Spark Streaming.
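For reference, the Graphite sink is enabled through Spark's metrics configuration file; a minimal sketch of `$SPARK_HOME/conf/metrics.properties` (the host, port and prefix are placeholders for your own Graphite setup):

```properties
# Send all Spark metrics to a Graphite/Carbon endpoint
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark

# Also export JVM metrics from the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```

With this in place, batch duration, processing delay and similar streaming metrics show up under the `spark` prefix in Graphite, ready to be put on a dashboard.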

You can use WebSockets for building real-time streaming graphs.
As such there are no BI tools, but there are JS libraries which can help in building real-time graphs: http://www.pubnub.com/blog/tag/d3-js/

Check out Lightning: A Data Visualization Server
http://lightning-viz.org/
The server is designed for making web-based interactive visualizations using D3, and is built for large data sets and continuously updating data streams.

You can use professional BI tools like Tableau or Power BI, or even MS Excel. For testing, I use MS Excel with a 1-minute auto refresh.
You can also write Python code for this.
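As a rough illustration of the Python route: a minimal sketch that re-aggregates results your streaming job has written out (the `key,count` CSV layout is an assumption for the example) into totals you could hand to any charting library on a refresh timer:

```python
import csv
import io
from collections import Counter

def aggregate_counts(csv_text):
    """Sum counts per key from CSV rows of the form: key,count."""
    totals = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 2:
            key, count = row
            totals[key] += int(count)
    return dict(totals)

# In a real setup you would re-read the output file on a timer and redraw.
sample = "clicks,3\nviews,10\nclicks,2\n"
print(aggregate_counts(sample))  # {'clicks': 5, 'views': 10}
```

This is only the aggregation half; pairing it with matplotlib, Plotly or a WebSocket push gives you the live chart.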

Related

Connecting Kafka to SCADA system

Our team wants to build a solution for real-time series data coming from sensors. We need to stream that data; what are the viable options for streaming? We also need to do some transformation before the data is stored alongside the raw data.
I have come across Apache Kafka as a solution, and Kafka Streams can help us transform the data as well.
Please let me know other viable options that can be integrated with Microsoft Azure, as our machine learning models are built there.

Retrieve graphical information using Spark Structured Streaming

Spark Streaming provided a "Streaming" tab within the deployed Web UI (http://localhost:4040 for running applications or http://localhost:18080 for completed applications, both by default) for each executed application, where graphs of application performance could be obtained; this tab is no longer available in Spark Structured Streaming. In my case, I am developing a streaming application with Spark Structured Streaming that reads from a Kafka broker, and I would like to obtain a graph of records processed per second, such as the one I could obtain with Spark Streaming, among other graphical information.
What is the best alternative to achieve this? I am using Spark 3.0.1 (via pyspark library), and deploying my application on a YARN cluster.
I've checked Monitoring Structured Streaming Applications Using Web UI by Jacek Laskowski, but it is still not very clear how to obtain this type of information in a graphic way.
Thank you in advance!
I managed to get what I wanted. For a reason I still don't know, the Spark History Server UI for completed apps (at http://localhost:18080 by default) did not show the new "Structured Streaming" tab that is available for Structured Streaming applications executed on Spark 3.0.1. However, the web UI at http://localhost:4040 does show the information I wanted to retrieve. You just need to click on the 'runId' link of the streaming query whose statistics you want.
If you can't see this tab, based on my personal experience, I recommend the following:
Upgrade to the latest Spark version (currently 3.0.1)
Consult this information in the UI deployed at port 4040 while the application is running, instead of at port 18080 after the application has finished.
I found the official Web UI documentation for the latest Apache Spark very useful for this.
Most of the metrics information you see in the Spark UI is exported by Spark itself.
If the Spark UI doesn't fit your requirements, you can retrieve these metrics and process them yourself.
You can use a sink to export the data, for example to CSV or Prometheus, or read it via the REST API.
You should take a look at Spark monitoring: https://spark.apache.org/docs/latest/monitoring.html
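For the specific "records processed per second" question, Structured Streaming also exposes per-query progress programmatically: in PySpark, `query.lastProgress` / `query.recentProgress` return the same JSON the "Structured Streaming" tab plots. A small sketch that pulls the throughput series out of such progress events (the sample values are made up; the field names are the real ones from the progress JSON):

```python
import json

def rows_per_second(progress_events):
    """Extract (batchId, processedRowsPerSecond) pairs from Structured
    Streaming progress events, as dicts or JSON strings."""
    points = []
    for event in progress_events:
        if isinstance(event, str):
            event = json.loads(event)
        points.append((event["batchId"], event["processedRowsPerSecond"]))
    return points

# Shape of a real progress event (values invented for the example):
sample = {"batchId": 7, "inputRowsPerSecond": 120.0,
          "processedRowsPerSecond": 98.5,
          "durationMs": {"triggerExecution": 510}}
print(rows_per_second([sample]))  # [(7, 98.5)]
```

Feeding these pairs into any plotting library gives you the records-per-second graph outside the UI.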

Grafana for Spark Structured Streaming

I followed these steps to set up Prometheus, Graphite Exporter and Grafana to plot metrics for Spark 2.2.1 running Structured Streaming. The metrics collection described in that post is quite dated and (I believe) does not include any metrics that can be used to monitor Structured Streaming. I am especially interested in the resources used and the duration of the streaming queries that perform various aggregations.
Is there any pre-configured dashboard for Spark? I was a little surprised not to find one on https://grafana.com/dashboards.
This makes me suspect that Grafana is not widely used to monitor Spark metrics. If that's the case, what works better?
It looks like there is no such dashboard among the official Grafana dashboards, but you can check the following Spark dashboard, which displays metrics collected from Spark applications:
https://github.com/hammerlab/grafana-spark-dashboards

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark streaming job with this? Are there any easy options to do it?
One way to send the metrics to OpenTSDB is to use its REST API. Simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it is written in Java and can therefore be used from Scala). Example code can be found on GitHub.
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be usable. Unfortunately, the code is only compatible with Spark 1.4.1, but in the worst case it should still give some indication of what is necessary to add.
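To make the REST-API route concrete, here is a minimal Python sketch (stdlib `urllib` instead of the Apache HttpClient mentioned above; the metric name and tags are invented for the example) that builds datapoints in the JSON shape OpenTSDB's `/api/put` endpoint expects and posts them:

```python
import json
import time
import urllib.request

def make_datapoint(metric, value, tags, timestamp=None):
    """Build one datapoint in the JSON shape OpenTSDB's /api/put expects."""
    return {
        "metric": metric,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "value": value,
        "tags": tags,  # OpenTSDB requires at least one tag
    }

def send_to_opentsdb(points, host="localhost", port=4242):
    """POST a batch of datapoints to OpenTSDB's REST API."""
    body = json.dumps(points).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: one processed-records datapoint from a (hypothetical) batch listener.
point = make_datapoint("spark.streaming.processedRecords", 100000,
                       {"app": "my-stream", "host": "worker-1"},
                       timestamp=1700000000)
print(json.dumps(point))
```

Calling `send_to_opentsdb([point])` from a batch-completion hook in your job would push one datapoint per batch, which Grafana can then read through its OpenTSDB data source.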

how to benchmark the kafka spark-streaming?

I have to benchmark Spark streaming processing. My job pulls messages from Kafka, processes them, and loads them into ElasticSearch. The upstream generates 100k records per second. I would like to calculate how many messages are processed per second, and the latency. Are there any tools available to monitor this, or any process to calculate it?
The Spark UI can help you, providing the necessary details you need.
By default, the Spark UI is available at http://<driver-node>:4040 in a web browser (for a single SparkContext).
For help, you can use this link: http://spark.apache.org/docs/latest/monitoring.html
Beyond the Spark UI, which is useful for determining the rate of processing of your data, you can also use third-party tools like spark-perf to perform load testing on your cluster and obtain benchmark data in that way as well.
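If you record the per-batch record counts and processing times yourself (for example from a streaming listener, or from the batch details shown in the UI), the throughput and latency arithmetic is straightforward; a sketch with made-up numbers at roughly the question's 100k records/s:

```python
def benchmark(batches):
    """Compute overall throughput (records/sec) and mean batch processing
    time (sec) from (record_count, processing_time_sec) pairs."""
    total_records = sum(count for count, _ in batches)
    total_time = sum(t for _, t in batches)
    throughput = total_records / total_time
    mean_latency = total_time / len(batches)
    return throughput, mean_latency

# Three batches of 100k records each:
batches = [(100_000, 0.9), (100_000, 1.1), (100_000, 1.0)]
throughput, latency = benchmark(batches)
print(f"{throughput:.0f} records/s, {latency:.2f} s mean batch time")
```

Note this measures processing time only; end-to-end latency also includes the time records wait in Kafka before being picked up, which you can estimate by comparing record timestamps against processing time.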
Maybe you should try Yahoo's streaming-benchmarks; I found that Databricks used that tool to benchmark Spark Streaming against Flink.
https://github.com/yahoo/streaming-benchmarks
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
