I am exploring the Spark REST API for Structured Streaming.
I have looked at all the REST API endpoints listed at the link below:
https://spark.apache.org/docs/latest/monitoring.html
However, I could not figure out how to get the list of "Active Streaming Queries" that is displayed on the Spark UI (port 4040) under the Structured Streaming tab.
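For context, the generic endpoints documented on that page work fine for me; this is roughly how I am probing them (a sketch against my local application UI on port 4040):

```scala
import scala.io.Source

object SparkRestProbe {
  def main(args: Array[String]): Unit = {
    val base = "http://localhost:4040/api/v1"
    // Lists the running application(s) served by this UI...
    println(Source.fromURL(s"$base/applications").mkString)
    // ...but none of the documented endpoints seem to expose the
    // "Active Streaming Queries" list from the Structured Streaming tab.
  }
}
```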
Extending the POC I asked about before:
sql in spark structure streaming
I understand that Spark Structured Streaming provides an API to manage streaming queries, but I need some help understanding how to use it.
For example, once I submit my Spark application, if some query needs to be managed (stopped, re-run), is it possible to do that?
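To make the question concrete, this is the management API I am referring to (a minimal sketch based on the StreamingQueryManager docs; the query name and the restart logic are just illustrative, and note this runs inside the driver, which is part of what I am unsure about):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object ManageQueriesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("poc").getOrCreate()

    // List every streaming query currently running in this session.
    spark.streams.active.foreach { q: StreamingQuery =>
      println(s"name=${q.name} id=${q.id} active=${q.isActive}")
    }

    // Stop a specific query by name...
    spark.streams.active.find(_.name == "my_query").foreach(_.stop())

    // ...and "re-run" it by starting the same writeStream again:
    // Structured Streaming has no restart call; you invoke start() anew.
  }
}
```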
In Spark's official docs, we see that there are monitoring endpoints for DStreams, like
/streaming/statistics
However, there do not seem to be equivalent ones for Structured Streaming mentioned there. I'm looking to monitor streaming statistics for a Structured Streaming job.
https://spark.apache.org/docs/latest/monitoring.html
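So far the closest thing I have found is the programmatic progress API rather than a REST endpoint (a sketch; it assumes a running query bound to the variable query):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Print throughput numbers from the most recent micro-batch of a
// running query; these fields come from StreamingQueryProgress.
def printProgress(query: StreamingQuery): Unit = {
  Option(query.lastProgress).foreach { p =>
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} " +
      s"in rows/s=${p.inputRowsPerSecond} processed rows/s=${p.processedRowsPerSecond}")
  }
  // query.recentProgress keeps a short history of these updates.
}
```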
Spark Streaming provided a "Streaming" tab within the deployed web UI (http://localhost:4040 for running applications or http://localhost:18080 for completed applications, both by default) for each application executed, where graphs of the application's performance could be viewed; this tab is no longer available when using Spark Structured Streaming.

In my case, I am developing a streaming application with Spark Structured Streaming that reads from a Kafka broker, and I would like to obtain a graph of records processed per second, such as the one I could obtain when using Spark Streaming, among other graphical information.
What is the best alternative to achieve this? I am using Spark 3.0.1 (via the pyspark library) and deploying my application on a YARN cluster.
I've checked "Monitoring Structured Streaming Applications Using Web UI" by Jacek Laskowski, but it is still not clear to me how to obtain this kind of information graphically.
Thank you in advance!
I managed to get what I wanted. For some reason I still don't know, the Spark History Server UI for completed apps (at http://localhost:18080 by default) did not show the new "Structured Streaming" tab that is available for Structured Streaming applications executed on Spark 3.0.1. However, the web UI I accessed at http://localhost:4040 does show the information I wanted to retrieve. You just need to click on the 'runId' link of the streaming query whose statistics you want.
If you can't see this tab, based on my personal experience, I recommend the following:
Upgrade to the latest Spark version (currently 3.0.1).
Consult this information on the UI deployed at port 4040 while the application is running, instead of at port 18080 after the application has finished.
I found the official Web UI documentation for the latest Apache Spark release very useful for this.
Most of the metrics information you see in the Spark UI is exported by Spark itself.
If the Spark UI doesn't fit your requirements, you can retrieve these metrics and process them yourself.
You can use a sink to export the data, for example to CSV, Prometheus, etc., or via the REST API.
You should take a look at Spark monitoring: https://spark.apache.org/docs/latest/monitoring.html
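As a sketch of what the sink configuration looks like, something along these lines goes into conf/metrics.properties (the directory and path values are placeholders; see the monitoring page for the full option list):

```properties
# Dump all metrics to CSV files every 10 seconds.
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Spark 3.0+ also ships a Prometheus servlet sink.
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```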
I followed these steps to set up Prometheus, Graphite Exporter and Grafana to plot metrics for Spark 2.2.1 running Structured Streaming. The metrics collection described in that post is quite dated and does not include any metrics (I believe) that can be used to monitor Structured Streaming. I am especially interested in the resources and duration needed to execute the streaming queries that perform the various aggregations.
Is there any pre-configured dashboard for Spark? I was a little surprised not to find one on https://grafana.com/dashboards
That makes me suspect that Grafana is not widely used to monitor Spark metrics. If that's the case, what works better?
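For reference, the Graphite sink configuration I am feeding the Graphite Exporter with looks roughly like this (host, port and prefix are placeholders for my setup):

```properties
# conf/metrics.properties: push all metrics to a Graphite endpoint.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=localhost
*.sink.graphite.port=9109
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
```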
It looks like there is no Spark dashboard among the official Grafana dashboards, but you can check the following Spark dashboard, which displays metrics collected from Spark applications:
https://github.com/hammerlab/grafana-spark-dashboards
I am creating a streaming analytics application using Spark, Flink and Kafka. Each analytics function will be implemented as a microservice so that it can be reused in different projects later.
I can run my Spark/Flink jobs perfectly from a simple Scala application and submit them to the Spark and Flink clusters respectively. But I need to start/run such a job when a REST POST startJob() request is invoked on my web service.
How can I integrate my Spark and Flink data-processing functionality into a service-oriented web application?
So far I have tried Lagom microservices, but I ran into many issues; you can check:
Best approach to ingest Streaming Data in Lagom Microservice
java.io.NotSerializableException using Apache Flink with Lagom
I think I am not taking the right direction for a stream-processing microservice application. I am looking for the right direction to implement these analytics behind a REST service.
Flink has a REST API you can use to submit and control jobs -- it's used by the Flink Web UI. See the docs here. See also this previous question.
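As a minimal sketch of what driving that API looks like (the endpoints are from the Flink REST docs; the localhost address and the jar-upload flow are assumptions about your deployment):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object FlinkRestSketch {
  // Default address of the Flink JobManager's REST endpoint;
  // adjust for your cluster.
  private val base = "http://localhost:8081"
  private val client = HttpClient.newHttpClient()

  private def get(path: String): String = {
    val req = HttpRequest.newBuilder(URI.create(base + path)).GET().build()
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // List all jobs and their statuses: GET /jobs
    println(get("/jobs"))

    // A job jar uploaded beforehand (POST /jars/upload) can be started
    // with POST /jars/<jarId>/run, and a running job cancelled with
    // PATCH /jobs/<jobId>; your startJob() endpoint would wrap calls
    // like these.
  }
}
```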
I think the REST API only provides details about running jobs. Is there any Flink API that would allow, say, a Spring Boot REST endpoint to connect to the Kafka streaming data and return that Kafka data?