Structured streaming - Metrics in Grafana - apache-spark

I am using Structured Streaming to read data from Kafka and compute various aggregate metrics. I have enabled the Graphite sink via metrics.properties. Applications on older Spark versions exposed streaming-related metrics, but I don't see those metrics with Structured Streaming. What am I doing wrong?
For example, I am not able to find unprocessed batches, running batches, or the last completed batch's total delay.
I have enabled streaming metrics by setting:
SparkSession.builder().config("spark.sql.streaming.metricsEnabled",true)
Even then I am getting only 3 metrics:
driver.spark.streaming.inputrate
driver.spark.streaming.latency
driver.spark.streaming.processingrate
These metrics also have gaps between data points, and they only start showing up long after the application has started. How do I get more extensive streaming-related metrics into Grafana?
I looked at StreamingQueryProgress, but it only lets us create custom metrics programmatically. Is there a way to consume the metrics that Spark Streaming already sends to the sink I configured?
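For reference, a minimal Graphite sink section in metrics.properties looks roughly like the following (host, port, and prefix here are placeholders; the property names are the standard Spark GraphiteSink settings):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark-app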

You can still get some of those metrics. The StreamingQuery returned when you start the stream has two methods, lastProgress and recentProgress.
They expose details such as the number of rows processed, the duration of the batch, and the number of input rows in the batch, among other things. There is also a json method on the progress object that returns all of this information in a single go, which can be used for sending it to a metrics collector.
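A rough sketch of what that could look like (the gauge function is a stand-in for whatever Graphite/StatsD client you use; the "triggerExecution" key in durationMs is the usual batch-duration entry, not something specific to your setup):

import org.apache.spark.sql.streaming.StreamingQuery

// `query` is the StreamingQuery returned by writeStream.start().
// lastProgress is null until the first trigger has completed.
def reportProgress(query: StreamingQuery, gauge: (String, Double) => Unit): Unit = {
  val p = query.lastProgress
  if (p != null) {
    gauge("streaming.numInputRows", p.numInputRows.toDouble)
    gauge("streaming.inputRowsPerSecond", p.inputRowsPerSecond)
    gauge("streaming.processedRowsPerSecond", p.processedRowsPerSecond)
    gauge("streaming.triggerExecutionMs",
      Option(p.durationMs.get("triggerExecution")).map(_.doubleValue).getOrElse(0.0))
    // p.json returns the whole progress report as one JSON string,
    // which can also be shipped to a collector as-is.
  }
}

Call it from a small scheduled thread, or hook the same logic into a StreamingQueryListener, on whatever interval your sink expects.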

Related

How to make sure that Spark Structured Streaming is processing all the data in Kafka

I developed a Spark Structured Streaming application that reads data from a Kafka topic, aggregates the data, and then outputs to S3.
Now I'm trying to find the most appropriate hardware resources for the application to run properly while also minimizing cost. Since there is very little information on how to right-size a Spark cluster given the size of the input, I opted for a trial-and-error strategy: I deploy the application with minimal resources and keep adding resources until it runs in a stable manner.
That being said, how can I make sure that the application is able to process all the data in its Kafka input and is not falling behind? Is there a specific metric to look for? Job duration time vs trigger processing time?
Thank you for your answers!
Track Kafka consumer lag. There should be a consumer group created for your Spark Streaming job.
> bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                      HOST        CLIENT-ID
test-foo  0          1               3               2    consumer-1-a5d61779-4d04-4c50-a6d6-fb35d942642d  /127.0.0.1  consumer-1
If you have metric storage and plotting tools like Prometheus and Grafana:
Save all Kafka metrics, including consumer lag, to Prometheus/Graphite.
Use Grafana to query Prometheus and plot them on a graph.
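If you would rather push the lag numbers yourself instead of scraping them from Kafka, here is a rough sketch with the plain Kafka consumer API (broker, group id, and topic are placeholders; the single-partition committed() call is the older client variant, so adjust for your client version):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "test-consumer-group")   // the group your streaming job commits with
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val partitions = consumer.partitionsFor("test-foo").asScala
  .map(p => new TopicPartition(p.topic, p.partition))

// LAG = log-end offset minus last committed offset, exactly what the CLI prints.
val endOffsets = consumer.endOffsets(partitions.asJava).asScala
val lagByPartition = partitions.map { tp =>
  val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
  tp -> (endOffsets(tp).longValue - committed)
}.toMap
consumer.close()

// Push each entry of lagByPartition to Prometheus/Graphite as a gauge.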

Kafka Spark Streaming ingestion for multiple topics

We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we spawn a whole Spark job for each topic.
Since messages are produced pretty rarely for some topics (average of 1 per day), we're thinking about organising the ingestion in pools.
The idea is to avoid spinning up a whole container (and the related resources) for these "infrequent" topics. Since Spark Streaming accepts a list of topics as input, we're thinking about using this feature to have a single job consume all of them.
Do you think the strategy described above is a good one? We also thought about batch ingestion, but we want to keep the real-time behavior, so we excluded that option. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics as a source well, in terms of offset consistency in case of failures, etc.?
Thanks!
I think Spark should be able to handle multiple topics fine, as it has supported this for a long time. And yes, Kafka Connect is not a Confluent-only API. Confluent provides connectors for its platform, but you can use Connect too; Apache Kafka has documentation for the Connect API as well.
It is a little more difficult with the Apache distribution of Kafka, but you can use it.
https://kafka.apache.org/documentation/#connectapi
Also, if you opt for multiple Kafka topics in a single Spark Streaming job, you may need to think about avoiding small files, since your message frequency seems very low. See the sketch below.
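As a rough sketch of the pooled approach with the spark-streaming-kafka-0-10 integration (broker, group id, topic names, batch interval, and output path are all placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("multi-topic-ingestion")
val ssc = new StreamingContext(conf, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "ingestion-pool",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One direct stream subscribed to all of the low-volume topics at once.
val topics = Seq("infrequent-topic-a", "infrequent-topic-b", "infrequent-topic-c")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

// record.topic tells you which topic each message came from, so the HDFS
// layout can still be split per topic. Writing one directory per batch is
// exactly where the small-files concern comes from with rare messages.
stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    rdd.map(record => (record.topic, record.value))
       .saveAsTextFile(s"hdfs:///ingestion/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()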

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured Streaming flow from an on-premise system into a containerized cloud Spark cluster, where Kafka receives the data and Structured Streaming maintains a number of queries that flush to disk every ten seconds. A query's console sink is not accessible to external sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs Spark SQL from another context to get its metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate all data ever received together with the incoming streaming data to provide (near) real-time views. This will mean running a bunch of GROUP BYs over billions of records, maintaining several million aggregate rows in memory.
My question is about how Spark streaming queries can scale like this: how efficient is the memory usage (I'll probably use 32 worker containers), and is this the correct way to maintain a (near) real-time view of incoming data using Kafka and Structured Streaming?

Grafana for Spark Structured Streaming

I followed these steps to set up Prometheus, the Graphite Exporter, and Grafana to plot metrics for Spark 2.2.1 running Structured Streaming. The metrics collected in that post are quite dated and (I believe) do not include anything that can be used to monitor Structured Streaming. I am especially interested in the resources used and the duration of the streaming queries that perform the various aggregations.
Is there any pre-configured dashboard for Spark? I was a little surprised not to find one on https://grafana.com/dashboards
which makes me suspect that Grafana is not widely used to monitor Spark metrics. If that's the case, what works better?
It looks like there isn't a dashboard in the official Grafana dashboards repository, but you can check the following Spark dashboard, which displays metrics collected from Spark applications.
https://github.com/hammerlab/grafana-spark-dashboards

How to benchmark Kafka Spark Streaming?

I have to benchmark my Spark Streaming processing. My process pulls messages from Kafka, processes them, and loads them into Elasticsearch. The upstream generates 100k records per second, so I would like to calculate how many messages are processed per second and the latency. Are there any tools available to monitor this, or is there a process to calculate it?
The Spark UI can help you, providing the necessary details you need.
By default, the Spark UI is available at http://<driver-node>:4040 in a web browser (for a single SparkContext).
For help, you can use this link: http://spark.apache.org/docs/latest/monitoring.html
Beyond the Spark UI, which is useful for determining the rate of processing of your data, you can also use third-party tools like spark-perf to perform load testing on your cluster and obtain benchmark data in that way as well.
You might want to try Yahoo's streaming-benchmarks; I found that Databricks used that tool to benchmark Spark Streaming against Flink.
https://github.com/yahoo/streaming-benchmarks
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
