Retrieve graphical information using Spark Structured Streaming - apache-spark

Spark Streaming provided a "Streaming" tab within the deployed web UI (http://localhost:4040 for running applications or http://localhost:18080 for completed applications, both by default) for each executed application, where graphs representing the application's performance could be viewed. This tab is no longer available when using Spark Structured Streaming. In my case, I am developing a streaming application with Spark Structured Streaming that reads from a Kafka broker, and I would like to obtain a graph of records processed per second, such as the one I could obtain when using Spark Streaming, among other graphical information.
What is the best alternative to achieve this? I am using Spark 3.0.1 (via the pyspark library) and deploying my application on a YARN cluster.
I've checked Monitoring Structured Streaming Applications Using Web UI by Jacek Laskowski, but it is still not clear to me how to obtain this kind of information graphically.
Thank you in advance!

I managed to get what I wanted. For some reason I still don't know, the Spark History Server UI for completed applications (http://localhost:18080 by default) did not show the new "Structured Streaming" tab that is available for Spark Structured Streaming applications executed on Spark 3.0.1. However, the web UI that I accessed through http://localhost:4040 does show the information I wanted to retrieve. You just need to click on the 'runId' link of the streaming query whose statistics you want to see.
If you can't see this tab, based on my personal experience, I recommend the following:
Upgrade to the latest Spark version (currently 3.0.1)
Consult this information on the UI served at port 4040 while the application is running, rather than on port 18080 after the application has finished.
I found the official Web UI documentation for the latest Apache Spark release very useful for this.
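If you also want these numbers programmatically (for example to log them or build your own plots), a minimal PySpark sketch follows, assuming a Kafka source with placeholder broker and topic names. Each StreamingQuery exposes lastProgress/recentProgress, which carry the same per-trigger statistics (inputRowsPerSecond, processedRowsPerSecond, batch durations) that the "Structured Streaming" tab plots:

```python
import time
from pyspark.sql import SparkSession

# Requires the Kafka connector (spark-sql-kafka-0-10) on the classpath.
spark = SparkSession.builder.appName("structured-streaming-progress").getOrCreate()

# Placeholder Kafka source; replace the bootstrap servers and topic.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load())

query = (df.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/progress-demo")
         .start())

time.sleep(30)  # let a few micro-batches run

# recentProgress returns one dict per recent micro-batch, with the same
# fields that the "Structured Streaming" tab charts.
for progress in query.recentProgress:
    print(progress["batchId"],
          progress.get("inputRowsPerSecond"),
          progress.get("processedRowsPerSecond"))
```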

Most of the metrics you see in the Spark UI are exported by Spark itself.
If the Spark UI doesn't fit your requirements, you can retrieve these metrics and process them yourself.
You can use a sink to export the data, for example to CSV or Prometheus, or retrieve it via the REST API.
You should take a look at Spark monitoring: https://spark.apache.org/docs/latest/monitoring.html
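If you prefer pulling the numbers yourself rather than (or in addition to) configuring a sink, the monitoring REST API described on that page can be polled directly. A minimal sketch, assuming the driver UI is reachable on localhost:4040:

```python
import requests

# The monitoring REST API lives under /api/v1 on the driver UI (port 4040 by
# default) and on the History Server (port 18080 by default).
base_url = "http://localhost:4040/api/v1"

for app in requests.get(f"{base_url}/applications").json():
    app_id = app["id"]
    # Per-job metrics for the application; stages, executors, and streaming
    # statistics have analogous endpoints.
    for job in requests.get(f"{base_url}/applications/{app_id}/jobs").json():
        print(job["jobId"], job["status"], job["numCompletedTasks"])
```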

Related

Spark Application as a REST Service

I have a question regarding a specific spark application usage.
I want our Spark application to run as a REST API server, like a Spring Boot application. It would not be a batch process; instead, we would load the application, keep it alive (no call to spark.close()), and use it as a real-time query engine via some API that we define. I am targeting a deployment on Databricks.
I have checked Apache Livy, but I am not sure whether it would be a good option.
Any suggestions would be helpful.
Spark isn't designed to run like this; it has no built-in REST API server framework other than the History Server and worker UI.
If you want a long-running Spark action, you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc., rather than HTTP methods, as sketched below.
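To make that concrete, here is a hedged PySpark sketch (broker and topic names are placeholders, and the per-request logic is left as a stub) of a long-running Structured Streaming job that consumes "requests" from Kafka and writes "responses" back, instead of exposing an HTTP endpoint:

```python
from pyspark.sql import SparkSession

# Requires the Kafka connector (spark-sql-kafka-0-10) on the classpath.
spark = SparkSession.builder.appName("long-running-query-engine").getOrCreate()

# Clients publish their queries/commands to a "requests" topic instead of
# calling an HTTP endpoint on the driver.
requests_df = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "requests")
               .load()
               .selectExpr("CAST(value AS STRING) AS value"))

# Whatever per-request processing you need would go here; this sketch just
# echoes each request back unchanged.
responses = requests_df

query = (responses.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "responses")
         .option("checkpointLocation", "/tmp/checkpoints/query-engine")
         .start())

query.awaitTermination()  # keeps the application alive; spark.stop() is never called
```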
Good question; let's discuss it step by step.
You can build this and it works fine; the following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you are thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (Redis, Gemfire, etc.), but here is the catch:
i) If you only have a few hundred thousand records, you don't really need Apache Spark's power; a plain Java app is good enough to return responses really fast.
ii) If you have petabytes of data, loading them into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing; Spark is not a data management system but a fast batch data processing engine. With Gemfire, by contrast, you have the flexibility to add indexes for fast retrieval of data.
Workarounds:
Use Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to "Fast Apache Spark SQL Queries").
Use data formats that support indexing, like ORC or Parquet (see the sketch below).
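As an illustration of the second workaround, here is a small sketch (paths and column names are placeholders) of how partitioned Parquet lets Spark prune files at read time, which is the closest built-in substitute for an index:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pruning").getOrCreate()

# Placeholder dataset and path.
df = spark.read.json("/data/events.json")

# Partitioning (plus Parquet's built-in min/max column statistics) lets Spark
# skip files that cannot match a filter.
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("/data/events_parquet"))

# A query filtering on the partition column only reads the matching files.
count = (spark.read.parquet("/data/events_parquet")
         .filter("event_date = '2020-12-01'")
         .count())
print(count)
```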
So why not use a Spring application with Apache Spark and simply never call spark.close()?
A Spring application as a microservice needs other services, either in containers or on PCF/Bluemix/AWS/Azure/GCP, etc., whereas Apache Spark is its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and holds it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted, you have to wait for its results; you cannot fetch data in between.
How to use Spark with a Spring application via REST API calls:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
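For a feel of what the interaction looks like from a client such as a Spring service, here is a hedged sketch against Livy's session API, shown in Python for brevity (the Livy host and the submitted snippet are placeholders):

```python
import json
import time

import requests

livy = "http://livy-host:8998"  # placeholder Livy endpoint
headers = {"Content-Type": "application/json"}

# 1. Create a long-lived interactive session.
session = requests.post(f"{livy}/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = f"{livy}/sessions/{session['id']}"

# Wait until the session is idle before submitting statements.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(2)

# 2. Submit a snippet of Spark code (the snippet itself is just an example).
stmt = requests.post(f"{session_url}/statements",
                     data=json.dumps({"code": "spark.range(100).count()"}),
                     headers=headers).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"

# 3. Poll for the result.
while True:
    result = requests.get(stmt_url, headers=headers).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)
```

The Spark context stays alive between statements, so the session behaves like the long-lived "real-time query engine" described in the question.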

How to get the Apache Spark job state and push it to a front-end web page in real time?

I want to get the Apache Spark job state or application state in real time and push it to a front-end web page for demonstration (the page will show this state to users). How can I do this, other than fetching the information as JSON from Spark's REST API? Are there any books that could help me with this?
In the past, I have used Apache Livy. It provides REST APIs that allow you to submit Spark jobs, monitor their status, and report errors/completion.
You can read more at: https://livy.incubator.apache.org/
Please check the Spark UI. I hope this will help you see the Spark state in the web UI.
Getting the Spark job state depends on the way you are running your Spark application.
If you run your Spark application on YARN, you can use both the YARN UI and the Spark UI.
If you are running Spark in standalone mode, for example, then, as #Kumar Immanuel said, you can use the Spark UI.
You can use SparkLauncher and then a SparkAppHandle to get the status of the job.
You can also explore SparkListeners.
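If you end up polling rather than listening, a minimal sketch of reading job state from Spark's monitoring REST API (the driver host/port and the push mechanism towards the front end are placeholders; on YARN, the ResourceManager's /ws/v1/cluster/apps endpoint can be polled the same way for application-level state):

```python
import time

import requests

base_url = "http://driver-host:4040/api/v1"  # placeholder driver UI address
app_id = requests.get(f"{base_url}/applications").json()[0]["id"]

while True:
    jobs = requests.get(f"{base_url}/applications/{app_id}/jobs").json()
    for job in jobs:
        # Push this wherever your front end reads it from
        # (WebSocket, message queue, database, ...).
        print(job["jobId"], job["status"])
    time.sleep(5)
```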

Grafana for Spark Structured Streaming

I followed these steps to set up Prometheus, Graphite Exporter and Grafana to plot metrics for Spark 2.2.1 running Structured Streaming. The metrics collected in that post are quite dated and do not (I believe) include any metrics that can be used to monitor Structured Streaming. I am especially interested in the resources used and the time taken to execute the streaming queries that perform various aggregations.
Is there any pre-configured dashboard for Spark? I was a little surprised not to find one on https://grafana.com/dashboards.
This makes me suspect that Grafana is not widely used to monitor Spark metrics. If that's the case, what works better?
It looks like there is no such dashboard in the official Grafana dashboard repository, but you can check the following Spark dashboards, which display metrics collected from Spark applications:
https://github.com/hammerlab/grafana-spark-dashboards
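Until a ready-made Structured Streaming dashboard shows up, one pragmatic option is to push a few fields from each StreamingQueryProgress into Graphite yourself and graph them in Grafana. A sketch under the assumption that your Graphite (or graphite_exporter) endpoint speaks the plaintext protocol; the host, port, and metric path are placeholders:

```python
import socket
import time

def push_to_graphite(progress, host="graphite-host", port=2003):
    """Send a few StreamingQueryProgress fields to Graphite using its
    plaintext protocol ("<metric.path> <value> <timestamp>", newline-terminated)."""
    now = int(time.time())
    lines = []
    for field in ("inputRowsPerSecond", "processedRowsPerSecond"):
        value = progress.get(field)
        if value is not None:
            lines.append(f"spark.structured_streaming.{field} {value} {now}")
    if lines:
        with socket.create_connection((host, port)) as sock:
            sock.sendall(("\n".join(lines) + "\n").encode("utf-8"))

# Example: call this periodically on the driver for a running query.
# push_to_graphite(query.lastProgress)
```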

Sending Spark Streaming metrics to OpenTSDB

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark Streaming job with it? Are there any easy options for doing this?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it is written in Java and can therefore be used from Scala). Example code can be found on GitHub.
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be usable. Unfortunately it targets Spark 1.4.1, but in the worst case it should still give some indication of what is necessary to add.
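For the REST API route, here is a minimal sketch of OpenTSDB's /api/put endpoint, shown with Python's requests library rather than the Apache HttpClient mentioned above (the OpenTSDB host, metric name, and tags are placeholders):

```python
import time

import requests

def send_to_opentsdb(metric, value, tags, host="http://opentsdb-host:4242"):
    """POST a single data point to OpenTSDB's HTTP API (/api/put)."""
    datapoint = {
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "tags": tags,  # OpenTSDB requires at least one tag
    }
    response = requests.post(f"{host}/api/put", json=[datapoint])
    response.raise_for_status()

# Example: push a records-per-second figure computed in a streaming batch.
send_to_opentsdb("spark.streaming.records_per_second", 1234.5,
                 {"app": "my-streaming-job"})
```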

How to display step-by-step execution of sequence of statements in Spark application?

I have an Apache Spark data loading and transformation application built with pyspark.sql that runs for half an hour before throwing an AttributeError or other run-time exceptions.
I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?
It sounds like an idea that could easily be handled by a SparkListener. It gives you access to all the low-level details that the web UI of any Spark application could ever show you. All the events flying between the driver (namely DAGScheduler and TaskScheduler with SchedulerBackend) and the executors are also posted to registered SparkListeners.
A Spark listener is an implementation of the SparkListener developer API (that is an extension of SparkListenerInterface where all the callback methods are no-op/do-nothing).
Spark uses Spark listeners for web UI, event persistence (for Spark History Server), dynamic allocation of executors and other services.
You can develop your own custom Spark listeners and register them using the SparkContext.addSparkListener method or the spark.extraListeners setting, as sketched below.
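Since the question uses pyspark, note that the listener itself has to live on the JVM side; the sketch below only shows how a hypothetical compiled listener class (com.example.IllustrateListener, assumed to be packaged in a jar on the driver classpath, e.g. via --jars) would be registered from Python:

```python
from pyspark.sql import SparkSession

# com.example.IllustrateListener is a placeholder: a Scala/Java class extending
# org.apache.spark.scheduler.SparkListener, compiled and shipped with the app.
spark = (SparkSession.builder
         .appName("listener-demo")
         .config("spark.extraListeners", "com.example.IllustrateListener")
         .getOrCreate())

# From here on, every job/stage/task event of this application is also
# delivered to the registered listener.
spark.range(10).count()
```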
Go to the Spark UI of your job and you will find a DAG Visualization there. That's a graph representing your job.
To test your job on a sample, first of all use a sample as the input ;) You can also run Spark locally rather than on a cluster and then debug it in the IDE of your choice (such as IntelliJ IDEA); a minimal sketch follows below.
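For example (the input path and sample fraction are placeholders):

```python
from pyspark.sql import SparkSession

# Run locally so the whole pipeline can be stepped through in an IDE debugger.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("sample-run")
         .getOrCreate())

full_df = spark.read.parquet("/data/input")  # placeholder input

# Take a small, reproducible sample and run the same transformations on it.
sample_df = full_df.sample(fraction=0.001, seed=42).cache()

# ... apply the same pyspark.sql transformations here ...
sample_df.show(20, truncate=False)
```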
More info:
This great answer explaining DAG
DAG introduction from DataBricks
