Kafka - get lag - node.js

I am using the "node-rdkafka" npm module in our distributed service architecture written in Node.js. We have a metering use case where we allow only a certain number of messages to be consumed and processed every n seconds. For example, a "main" topic has 100 messages pushed by a producer, and a "worker" consumes from the main topic every 30 seconds. There is a lot more to the story of the use case.
The problem I am having is that I need to programmatically get the lag of a given topic (all partitions).
Is there a way for me to do that?
I know that I can use "bin/kafka-consumer-groups.sh" to access some of the data I need, but is there another way?
Thank you in advance

You can retrieve that information directly from your node-rdkafka client via several methods:
Client metrics:
The client can emit metrics at a defined interval that contain the current and committed offsets as well as the end offset, so you can easily calculate the lag.
You first need to enable the metrics events by setting, for example, 'statistics.interval.ms': 5000 in your client configuration. Then add a listener for the event.stats event:
consumer.on('event.stats', function(stats) {
  // the per-partition stats include a consumer_lag field (see the librdkafka Statistics docs)
  console.log(stats);
});
The full stats are documented at https://github.com/edenhill/librdkafka/wiki/Statistics, but you are probably most interested in the partition stats: https://github.com/edenhill/librdkafka/wiki/Statistics#partitions
Query the cluster for offsets:
You can use queryWatermarkOffsets() to retrieve the low and high watermark offsets for a partition.
consumer.queryWatermarkOffsets(topicName, partition, timeout, function(err, offsets) {
  var high = offsets.highOffset;
  var low = offsets.lowOffset;
  // lag for this partition = high watermark minus the consumer's committed (or current) offset
});
Then use the consumer's current position (position()) or committed (committed()) offsets to calculate the lag: for each partition, the lag is the high watermark minus the committed (or current) offset.

Kafka exposes the "records-lag-max" MBean via JMX, which is the maximum record lag over the consumer's partitions, so you can get the lag by querying this MBean.
Refer to the doc below for details on the exposed JMX MBeans:
https://docs.confluent.io/current/kafka/monitoring.html#consumer-group-metrics
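Note that these JMX metrics are exposed by the JVM Kafka clients, not by librdkafka/node-rdkafka. As a hedged sketch, reading the metric over JMX from a JVM consumer could look like the following; the host, JMX port and client-id (and therefore the exact ObjectName) are assumptions:
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// connect to the consumer JVM's JMX endpoint (host/port are placeholders)
val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://consumer-host:9999/jmxrmi")
val connector = JMXConnectorFactory.connect(url)
val mbeans = connector.getMBeanServerConnection

// MBean name pattern for the Java consumer's fetch metrics; client-id is an assumption
val name = new ObjectName("kafka.consumer:type=consumer-fetch-manager-metrics,client-id=my-consumer")
val lagMax = mbeans.getAttribute(name, "records-lag-max")
println(s"records-lag-max = $lagMax")

connector.close()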

Related

How to set spark Kusto connector write polling interval?

I'm writing data to Kusto using azure-kusto-spark.
I see this write has high latency. Looking at the debug logs from the Spark cluster, I see that the KustoConnector polls on write. I believe there is a long default polling interval. Is there a way to configure it to a lower value?
In the azure-kusto-spark codebase I see this piece of code, which I think is responsible for the polling:
def finalizeIngestionWhenWorkersSucceeded(
    ...
    DelayPeriodBetweenCalls,
    (writeOptions.timeout.toMillis / DelayPeriodBetweenCalls + 5).toInt,
    res => res.isDefined && res.get.status == OperationStatus.Pending,
    res => finalRes = res,
    maxWaitTimeBetweenCalls = KDSU.WriteMaxWaitTime.toMillis.toInt)
  .await(writeOptions.timeout.toMillis, TimeUnit.MILLISECONDS)
....
I'm not sure I understand it.
The polling operation just checks whether the data was ingested into Kusto. It has a max timeout, but it is not what drives the latency.
I believe the latency comes from the batching ingestion policy of your Kusto DB - see the details here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy
By default it's 5 minutes, so you might want to reduce this time if it won't have any negative impact (see the doc). Note that the policy should be changed on the DB, not the table, for the following reason: the connector creates a temporary table in the same Kusto DB and first inserts into this temporary table. So even if you change the policy of your destination table, it will still take at least 5 minutes to write to the temporary table.

Does the Spark config spark.streaming.receiver.maxRate have any effect in a Kafka Beam pipeline?

I was wondering if somebody has any experience with rate limiting in the Beam KafkaIO component when the runner is the SparkRunner. The versions I am using are: Beam 2.29, Spark 3.2.0 and Kafka client 2.5.0.
I have the Beam parameter maxRecordsPerBatch set to a large number, 100000000, but even when the pipeline stops for 45 minutes this value is never hit. When there is a high burst of data above the normal, the Kafka lag increases until it eventually catches up. In the Spark UI I see that the parameter batchIntervalMillis=300000 (5 min) is not reached; batches take a maximum of 3 min. It looks like KafkaIO stops reading at some point, even when the lag is very large. My Kafka parameters --fetchMaxWaitMs=1000 --maxPollRecords=5000 should be able to bring plenty of data, especially because KafkaIO creates one consumer per partition. In my system there are multiple topics with a total of 992 partitions and my spark.default.parallelism=600. Some partitions have very little data, while others have a large amount. Topics are per region, and when a region goes down the data is sent through another region/topic. That is when the lag happens.
Do the configuration values spark.streaming.receiver.maxRate and spark.streaming.receiver.maxRatePerPartition plus spark.streaming.backpressure.enabled play any role at all?
From what I have seen, it looks like Beam controls the whole reading from Kafka with the KafkaIO operator. This component creates its own consumers, so the rate of the consumer can only be set using consumer configs, which include fetchMaxWaitMs and maxPollRecords.
The only way those Spark parameters could have any effect is in the rest of the pipeline after the IO source, but I am not sure.
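As a hedged illustration (Beam's Java KafkaIO called from Scala), consumer tuning would be passed through withConsumerConfigUpdates; the broker address, topic name and values below are assumptions, not taken from the question:
import org.apache.beam.sdk.io.kafka.KafkaIO
import org.apache.kafka.common.serialization.StringDeserializer

// Kafka consumer properties that bound how much each poll can return
val consumerUpdates = new java.util.HashMap[String, AnyRef]()
consumerUpdates.put("fetch.max.wait.ms", "1000")
consumerUpdates.put("max.poll.records", "5000")

val read = KafkaIO.read[String, String]()
  .withBootstrapServers("broker:9092")              // assumed broker address
  .withTopic("my-topic")                            // assumed topic name
  .withKeyDeserializer(classOf[StringDeserializer])
  .withValueDeserializer(classOf[StringDeserializer])
  .withConsumerConfigUpdates(consumerUpdates)       // applied to KafkaIO's own consumers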
So I finally figured out how it all works. First I want to state that the Spark configuration values spark.streaming.receiver.maxRate, spark.streaming.receiver.maxRatePerPartition and spark.streaming.backpressure.enabled do not play a factor in Beam, because they only work if you are using Spark's own source operators to read from Kafka. Since Beam has its own KafkaIO operator, they do not play a role.
Beam has a set of parameters defined in the class SparkPipelineOptions that are used by the SparkRunner to set up reading from Kafka. Those parameters are:
@Description("Minimum time to spend on read, for each micro-batch.")
@Default.Long(200)
Long getMinReadTimeMillis();

@Description(
    "A value between 0-1 to describe the percentage of a micro-batch dedicated to reading from UnboundedSource.")
@Default.Double(0.1)
Double getReadTimePercentage();
Beam creates a SourceDStream object that it passes to Spark to use as the source to read from Kafka. In this class, the method boundReadDuration returns the larger of two values: proportionalDuration and lowerBoundDuration. The first is calculated by multiplying batchIntervalMillis by readTimePercentage. The second is just the value in millis of minReadTimeMillis. Below is the code from SourceDStream. The duration returned by this function is used for reading from Kafka alone; the rest of the time is allocated to the other tasks in the pipeline.
Last but not least, the parameter maxRecordsPerBatch also controls how many records are processed during a batch; the pipeline will not process more than that many records in a single batch. (A sketch of setting these options follows the code below.)
private Duration boundReadDuration(double readTimePercentage, long minReadTimeMillis) {
  long batchDurationMillis = ssc().graph().batchDuration().milliseconds();
  Duration proportionalDuration = new Duration(Math.round(batchDurationMillis * readTimePercentage));
  Duration lowerBoundDuration = new Duration(minReadTimeMillis);
  Duration readDuration = proportionalDuration.isLongerThan(lowerBoundDuration) ? proportionalDuration : lowerBoundDuration;
  LOG.info("Read duration set to: " + readDuration);
  return readDuration;
}
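For completeness, a hedged sketch of setting these SparkRunner options in code (they are more commonly passed as pipeline arguments); the values are illustrative only, and the setters are assumed to mirror the getters shown above plus maxRecordsPerBatch:
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.runners.spark.SparkPipelineOptions

val options = PipelineOptionsFactory.create().as(classOf[SparkPipelineOptions])
options.setBatchIntervalMillis(300000L)    // 5-minute micro-batches, as in the question
options.setReadTimePercentage(0.1)         // fraction of each batch spent reading from the source
options.setMinReadTimeMillis(200L)         // lower bound on the read duration
options.setMaxRecordsPerBatch(100000000L)  // cap on records processed per batch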

Azure Function with Event Hub trigger receives a weird number of events

I have an Event Hub and an Azure Function connected to it. With small amounts of data all works well, but when I tested it with 10,000 events I got very peculiar results.
For test purposes I send the numbers from 0 to 9999 into the Event Hub and log the data in Application Insights and in Service Bus. In the first test I see in Azure that the hub got exactly 10,000 events, but Service Bus and AI got all messages between 0 and 4500 and only every second message after 4500 (so about 30% were lost). In the second test I got all messages from 0 to 9999, but every second message between 3200 and 3500 was duplicated. I would like to get all messages exactly once; what did I do wrong?
public async Task Run([EventHubTrigger("%EventHubName%", Connection = "AzureEventHubConnectionString")] EventData[] events, ILogger log)
{
    int id = _random.Next(1, 100000);
    _context.Log.TraceInfo("Started. Count: " + events.Length + ". " + id); //AI log
    foreach (var message in events)
    {
        //log with ASB
        var mess = new Message();
        mess.Body = message.EventBody.ToArray();
        await queueClient.SendAsync(mess);
    }
    _context.Log.TraceInfo("Completed. " + id); //AI log
}
By using EventData[] events, you are reading events from the hub in batch mode; that's why you see X events being processed at a time and the next batch processed a few seconds later.
Instead of EventData[], simply use EventData.
When you send events to the hub, check that all events are sent with the same partition key if you want to try batch processing; otherwise they can be split across several partitions depending on TUs (throughput units), PUs (processing units) and CUs (capacity units).
Throughput limits for the Basic, Standard and Premium tiers include, for example, egress of up to 2 MB per second or 4096 events per second. Refer to the Event Hubs quotas documentation for details.
There are a couple of things likely happening, though I can only speculate with the limited context that we have. Knowing more about the testing methodology, tier of your Event Hubs namespace, and the number of partitions in your Event Hub would help.
The first thing to be aware of is that the timing between when an event is published and when it is available in a partition to be read is non-deterministic. When a publish operation completes, the Event Hubs broker has acknowledged receipt of the events and taken responsibility for ensuring they are persisted to multiple replicas and made available in a specific partition. However, it is not a guarantee that the event can immediately be read.
Depending on how you sent the events, the broker may also need to route events from a gateway by performing a round-robin or applying a hash algorithm. If you're looking to optimize the time from publish to availability, taking ownership of partition distribution and publishing directly to a partition can help, as can ensuring that you're publishing with the right degree of concurrency for your host environment and scenario.
With respect to duplication, it's important to be aware that Event Hubs offers an "at least once" guarantee; your consuming application should expect some duplicates and needs to be able to handle them in the way that is appropriate for your application scenario.
Azure Functions uses a set of event processors in its infrastructure to read events. The processors collaborate with one another to share work and distribute the responsibility for partitions between them. Because collaboration takes place using storage as an intermediary to synchronize, there is an overlap of partition ownership when instances are scaled up or scaled down, during which time the potential for duplication is increased.
Functions makes the decision to scale based on the number of events that it sees waiting in partitions to be read. In the case of your test, if your publication pattern increases rapidly and Functions sees "the event backlog" grow to the point that it feels the need to scale by multiple instances, you'll see more duplication than you otherwise would for a period of 10-30 seconds until partition ownership normalizes. To mitigate this, using an approach of gradually increasing speed of publishing over a 1-2 minute period can help to smooth out the scaling and reduce (but not eliminate) duplication.

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: I thought of reading the directory where the data is stored in Parquet format using Spark and counting the ".parquet" files for a particular day. This returns a count, but I am not really sure if that's the correct way.
Is this way correct? Are there any other ways to count the number of messages fetched from a Kafka topic on a particular day (or over a duration)?
Messages consumed from a topic not only have a key and value but also other information, like a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, based on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the timestamp is overwritten by the broker with its local time while appending the record.
So if you keep the timestamp when you store the messages, you can easily track the per-day or per-hour message rate.
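For example, if the record timestamp is kept as a column when the messages are written out (Spark's Kafka source exposes one), the per-day count can be computed directly from the stored table. A minimal Scala sketch, assuming a hypothetical Delta path and a column named timestamp:
import org.apache.spark.sql.functions._

// read the stored table (use format("parquet") if it is plain Parquet)
val df = spark.read.format("delta").load("/tmp/delta/table")

// count messages per day based on the kept Kafka timestamp column
df.groupBy(to_date(col("timestamp")).as("day"))
  .count()
  .orderBy("day")
  .show()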
Another way is to use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing messages, we also route the message metadata to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting the rows between two versions, is to use the Delta table history. There are several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (that's not possible by comparing .count() between versions, because an update replaces the actual value and a delete removes the row).
For example, for just appends, the following code will count all inserted rows written by normal append operations (for other operations, like MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
.filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
.selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
.groupBy().sum()

Best approach to check if Spark streaming jobs are hanging

I have a Spark streaming application which basically gets a trigger message from Kafka that kicks off batch processing which could potentially take up to 2 hours.
There have been incidents where some of the jobs hung indefinitely and didn't complete within the usual time, and currently there is no way to figure out the status of a job without checking the Spark UI manually. I want a way to tell whether the currently running Spark jobs are hanging or not. So basically, if a job hangs for more than 30 minutes, I want to notify the users so they can take action. What options do I have?
I see I can use metrics from the driver and executors. If I were to choose the most important one, it would be the last received batch's record count. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means the Spark streaming job has stopped or failed.
But in my scenario, I will receive only one streaming trigger event, which then kicks off processing that may take up to 2 hours, so I won't be able to rely on the records received.
Is there a better way? TIA
YARN provides a REST API to check the status of applications as well as cluster resource utilization.
An API call will give a list of running applications, their start times and other details. You can have a simple REST client that runs, say, once every 30 minutes, checks whether a job has been running for more than 2 hours, and if so sends a simple mail alert.
Here is the API documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
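A minimal Scala sketch of such a client, assuming a hypothetical ResourceManager address and using a regex instead of a proper JSON parser; the elapsedTime field (in milliseconds) comes from the Cluster Applications API response:
import scala.io.Source

// RM host/port and the threshold are assumptions
val rmUrl = "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING"
val maxMillis = 2L * 60 * 60 * 1000 // 2 hours

val json = Source.fromURL(rmUrl).mkString
val elapsed = """"elapsedTime"\s*:\s*(\d+)""".r
  .findAllMatchIn(json)
  .map(_.group(1).toLong)

if (elapsed.exists(_ > maxMillis)) {
  // send a mail alert here (e.g. via JavaMail) -- left out of this sketch
  println("At least one application has been running for more than 2 hours")
}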
Maybe a simple solution like this:
At the start of the processing, launch a waiting thread.
val TWO_HOURS = 2 * 60 * 60 * 1000

val t = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      Thread.sleep(TWO_HOURS)
      // send an email that the job didn't end
    } catch {
      case _: InterruptedException => // processing finished in time, do nothing
    }
  }
})
t.start()
And at the point where you can say that the batch processing has ended:
t.interrupt()
If processing is done within 2 hours, the waiter thread is interrupted and the e-mail is not sent. If processing is not done, the e-mail will be sent.
Let me draw your attention to streaming query listeners. These are lightweight components that can monitor your streaming query's progress.
In an application that has multiple queries, you can figure out which queries are lagging or have stopped due to some exception.
Please find sample code below to understand the implementation. I hope you can adapt this piece to better suit your needs. Thanks!
var recordsReadCount = 0L // running total of input rows, declared outside the listener

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason for termination
    }
  }
})
I'm using Kubernetes currently with the Google Spark Operator. [1]
Some of my streaming jobs hung while using Spark 2.4.3: a few tasks fail, and then the current batch job never progresses.
I have set a timeout using a StreamingProgressListener so that a thread signals when no new batch is submitted for a long time. The signal is then forwarded to a Pushover client that sends a notification to an Android device. Then System.exit(1) is called. The Spark Operator will eventually restart the job.
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
One way is to monitor the output of the Spark job that was kicked off. For example:
If it writes to HDFS, monitor the HDFS output directory for the last-modified file timestamp or the number of files generated (see the sketch after this list)
If it writes to a database, you could have a query that checks the timestamp of the last record inserted into your job's output table
If it writes to Kafka, you could use Kafka's GetOffsetShell to get the output topic's current offsets
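For the HDFS case, a minimal Scala sketch using the Hadoop FileSystem API; the output path and the 30-minute threshold are assumptions:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val outputDir = new Path("/data/job-output") // assumed output directory

// find the most recent modification time among the output files
val statuses = fs.listStatus(outputDir)
val lastModified = if (statuses.isEmpty) 0L else statuses.map(_.getModificationTime).max
val staleForMillis = System.currentTimeMillis() - lastModified

if (staleForMillis > 30L * 60 * 1000) {
  // no new output for 30 minutes -- alert the users here
  println(s"Job output has not been updated for ${staleForMillis / 60000} minutes")
}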
Utilize TaskContext.
This provides contextual information for a task, and supports adding listeners for task completion/failure (see addTaskCompletionListener).
More detailed information, such as the task 'attemptNumber' or 'taskMetrics', is available as well.
This information can be used by your application at runtime to determine if there is a 'hang' (depending on the problem).
More information about what is 'hanging' would be useful in providing a more specific solution.
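A minimal Scala sketch of registering such a listener inside a stage; the RDD and the log message are illustrative stand-ins for the real workload:
import org.apache.spark.TaskContext

spark.sparkContext.parallelize(1 to 100, 4).foreachPartition { iter =>
  val ctx = TaskContext.get()
  ctx.addTaskCompletionListener[Unit] { c =>
    // runs when this task finishes, whether it succeeded or failed
    println(s"Task for partition ${c.partitionId()} (attempt ${c.attemptNumber()}) completed")
  }
  iter.foreach(_ => ()) // the real per-partition processing would go here
}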
I had a similar scenario to deal with about a year ago and this is what I did -
As soon as Kafka receives the message, the Spark streaming job picks up the event and starts processing.
The Spark streaming job sends an alert email to the support group saying "Event received and Spark transformation STARTED". The start timestamp is stored.
After the Spark processing/transformations are done, it sends an alert email to the support group saying "Spark transformation ENDED successfully". The end timestamp is stored.
The above two steps will help the support group notice if the success email is not received after processing has started, and they can investigate by looking at the Spark UI for a job failure or delayed processing (maybe the job is hung because resources have been unavailable for a long time).
Finally, store the event id or details in an HDFS file along with the start and end timestamps, and save this file to the HDFS path that some Hive log_table points to. This will be helpful as a future reference for how the Spark code performs over time, and it can be fine-tuned if required.
Hope this is helpful.
