Monitoring pub/sub services - node.js

For every service which read/write from/to topic in Kafka/Redis, there are some basic metrics which we want to have in Prometheus:
How "fast" the writes are for every topic
How "fast" the reads are for every topic
In Kafka, I may want to determine how "fast" each group-id reads.
To determine the "speed" of reading from a topic, one can think of a mechanism where someone publish the same message in intervals of 10 seconds and the consumer sends to Prometheus when it fully processed that message. If the graph show that the message was read every 12 seconds, it means that we have a lag of 2 seconds when reading any messages.
It looks like a lot of repeated manual work on every topic there is on the system.
Question
Is my proposal makes any sense? Are there any best-practice/tools on how to determine "lags"/"speed" of reading/writing from every topic in redis/kafka/... in Prometheus?

I had the exact same issue once.
Maintaining the each topic metrics manually is very tiring and not at all scalable.
I switched over to using kafka_consumergroup_lag metric from the kafka_exporter
This along with the consumergroup,topic labels were enough to let us know to know which topic was not being read/lagging behind and by which consumer group.
Also has other metrics like the rate of meassages being read.
As for converting this lag in terms of time, either attach an produce time to kafka message and read it at the other end of the kafka pipeline and export the difference in time via micrometer from the application to Prometheus.
Or better still:- use tracing to track each message in the piepline using OpenTracing tools like Jaeger
Use this for Redis monitoring.
All these exporters send the data in the Prometheus format and can be directly integrated.

Related

Get kafka message processing time

I have 2 services: producer and consumer.
As far as I understand, message.ts is the time the producer produced the message (not the time the kafka-broker received the message).
Questions
When the consumer consume the message, how can I know how much time it was inside the kafka-broker (without the network latency: from producer to kafka-broker and from kafka-broker to consumer)?
I did a ping from my consumer vm to the kafka broker. the ping result was 0.7ms (millisecond). Does the network latency from each side to the kafka broker is 0.3ms? I assume kafka transport is TCP so there is a "ACK" message for everything. And I assume that each side won't do nothing without "ACK" so I conclude that the network latency on each size is the same as the ping result: 0.7ms (millisecond). Am I correct?
It's a little more complicated than that. Many variables go into how long it takes to process a message. I suggest you look into Distributed Tracing. Something like Zipkin works like magic and is very easy to setup and use. Here's a tutorial on how to setup Zipkin tracing with Spring Boot. You can even use it with Kafka Connect with an interceptor, here's the one I use: brave-kafka-interceptor.
Zipkin produces a trace for every message including all producers and consumers that processed it. Those traces end up lookin something like this:
You can see how much time a message took to be processed, and how much time it took to be consumed afte being produced, which is what you're looking for.
I tested manually this by producing and consuming from the same vm to a kafka (which was inside my cluster). The result was 1.3-1.5 ms.
It means that the procssing time took 0.1 ms on average.
I produced a new message every 1 second to avoid delay while consuming.
This is not the best solution, but it is suffient to my research.

Replaying Messages in Order

I am implementing a consumer which does processing of messages from a queue where order of messages is of importance. I would like to implement a mechanism using NodeJS where:
the consumer function is consuming messages m1, m2, ..., mN from the queue
doing an IO intensive operation and process the messages. m -> m'
Storing the result m' in a redis cache.
acknowledging the queue after each message process (2)
In a different function, I am listening to the message from the cache
sending the processed messages m' to an external system
if the external system was able to process the external system, then delete the processed message from the cache
If the external system rejects the processed message, then stop sending messages, discard the unsent processed messages in the cache and reset the offset to the last accepted message in the queue. For example if m12' was the last message accepted by the system, and I have acknowledged m23 from the queue, then I have to discard m13' to m23' and reset the offset so that the consumer can read and start processing from m13 again.
Few assumptions:
The processing m to m' is intensive and I am processing them optimistically, knowing that most of the times there won't be a failure
With the current assumptions and goals, is there any way I can achieve this with RabbitMQ or any Azure equivalent? My client doesn't prefer Kafka or any Azure equivalent of Kafka (Azure Event Hub).
In scenarios where the messages will always be generated in sequence then a simple queue is probably all you need.
Azure Queues are pretty simple to get into, but the general mode of operation for queues is to remove the messages as they are processed successfully.
If you can avoid the scenario where you must "roll back" or re-process from an earlier time, so if you can avoid the orchestration aspect then this would be a much simpler option.
It's the "go back and replay" that you will struggle with. If you can implement two queues in a sequential pattern, where processing messages from one queue successfully pushes the message into the next queue, then we never need to go back, because the secondary consumer can never process ahead of the primary.
With Azure Event Hubs it is much easier to reset the offset for processing, because the messages stay in the bucket regardless of their read state, (in fact any given message does not have such a state) and the consumer maintains the offset pointer itself. It also has support for multiple consumer groups, which will make a copy of the message available to each consumer.
You can up your plan to maintain the data for up to 7 days without blowing the budget.
There are two problems with Large scale telemetry ingestion services like Azure Event Hubs for your use case
The order of receipt of the message is less reliable for messages that are extremely close together, the Hub is designed to receive many messages from many sources concurrently, so its internal architecture cares a lot less about trying to preserve the precise order, it records the precise receipt timestamp on the message, but it does not guarantee that the overall sequence of records will match exactly to a scenario where you were to sort by the receipt timestamp. (its a subtle but important distinction)
Event Hubs (and many client processing code examples) are designed to guarantee Exactly Once delivery across multiple concurrent consuming threads. Again the Consumers are encouraged to be asynchronous and the serice will try to ensure that failed processing attempts are retried by the next available thread.
So you could use Event Hubs, but you would have to bypass or disable a lot of its features which is generally a strong message that it is not the correct fit for your purpose, if you want to explore it though, you would want to limit the concurrency aspects:
minimise the partition count
You probably want 1 partition for each message producer, or atleast for each sequential set, maintaining sequence is simpler inside a single partition
make sure your message sender (producer) only sends to a specific partition
Each producer MUST use a unique partition key
create a consumer group for each of your consumers
process messages one at a time, not in batches
process with a single thread
I have a lot of experience in designing MS Azure based solutions for Industrial IoT (Telemetry from PLCs) and Agricultural IoT (Raspberry Pi) device implementations. In almost all cases we think that the order of messaging is important, but unless you are maintaining real-time 2 way command and control, you can usually get away with an optimisitic approach where each message and any derivatives are or were correct at the time of transmission.
If there is the remote possibility that a device can be offline for any period of time, then dealing with the stale data flushing through the system when a device comes back online can really play havok with sequential logic programming.
Take a step back to analyse your solution, EventHubs does offer a convient way to rollback the processing to a previous offset, as long as that record is still in the bucket, but can you re-design your logic flow so that you do not have to re-process old data?
What is the requirement that drives this sequence? If it is so important to maintain the sequence, then you should probably process the data with a single consumer that does everything, or look at chaining the queues in a sequential manner.

Spark Streaming job fails with ReceiverDisconnectedException Class

I've Spark Streaming job which captures the near real time data from Azure Eventhub and runs 24/7.
More interestingly, my job fails at least 2 times a day with the below error. if I google the error, Microsoft docs gives me 'This exception is thrown if two or more PartitionReceiver instances connect to the same partition with different epoch values'. I'm not worried about data loss because spark Checkpointing will automatically take care of data when i restart the job, but my question is why the spark streaming job fails 2-3 times a day with the same error.
Has anybody faced the same issue, is there any solution/workaround available of this. Any help would be much appreciated.
error:
This exception is thrown if two or more Partitions Receiver instances connect to the same partition with different epoch values.
What is Partition Receiver?
This is a logical representation of receiving from a EventHub partition.
A PartitionReceiver is tied to a ConsumerGroup + Partition combination. If you are creating an epoch based PartitionReceiver (i.e. PartitionReceiver.Epoch != 0) you cannot have more than one active receiver per ConsumerGroup + Partition combo. You can have multiple receivers per ConsumerGroup + Partition combination with non-epoch receivers.
It sounds like you are running two instances of the application, two concurrent classes, or two applications that use the same event hub consumer group. Event hub consumer groups are effectively pointers to a point in time on the event stream. If you try and use one consumer group pointing with two instances of code, then you get a conflict like the one you are seeing.
Either:
Ensure you only have a single instance reading the consumer group at a time.
Use two consumer groups when you need two separate programs or sets of functionality to process the event hub at the same time.
If you are looking to parallelize for performance, look in to event hub Partitioning and how to take advantage of processing each partition independently.
There is also an alternative scenario where an event hub partition is switched over to another host as part of the event hub's internal load balancing. In this case you may see the error you are receiving. In this case, just log it and continue on.
For more details, refer "Features and terminology in Azure Event Hubs" and "Event Hubs Receiver Epoch".
Hope this helps.

what is best practice to consume messages from multiple kafka topics?

I need to consumer messages from different kafka topics,
Should i create different consumer instance per topic and then start a new processing thread as per the number of partition.
or
I should subscribe all topics from a single consumer instance and the should start different processing threads
Thanks & regards,
Megha
The only rule is that you have to account for what Kafka does and doesn't not guarantee:
Kafka only guarantees message order for a single topic/partition. edit: this also means you can get messages out of order if your single topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images at one image per message. Since all the topics always had new images, each poll() was only returning images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer/topic is the simplest, but it does add some overhead to have the additional Consumers. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations.
This really depends on logic of your application - does it need to see all messages together in one place, or not. Sometimes, consumption from single topic could be easier to implement in terms of business logic of your application.

Can a date and time be specified when sending data to Azure event hub?

Here's the scenario. I'm not working with real-time data. Instead, I get data from my electric company for the past day's electric usage. Specifically, each day I can get # of kwhs for each hour on the clock on the past day.
So, I'd like to load this past information into event hub each following day. Is this doable? Does event hub support loading past information, or is it only and forever about realtime streaming data, with no ability to load past data in?
I'm afraid this is the case, as I've not seen any date specification in what limited api documentation I could find for it. I'd like to confirm, though...
Thanks,
John
An Azure Event Hub is really meant for short-term storage. By default you may only retain data up to 7 days. After which the data will be deleted based upon an append timestamp that was created when the message first entered the Event Hub. Therefore it is not practical to use an Azure Event Hub for data that's older than 7 days.
An Azure Event Hub is meant for message/event management, not long term storage. A possible solution would be to write the Event Hub data to an Azure SQL server or blob storage for long term storage. Then use Azure Stream Analytics (an event processor) to join the active stream with the legacy data that has accumulated on the SQL server. Also note, you can call this appended attribute. It's called "EventEnqueuedUtcTime". Keep in mind that it will be on the server time, whose clock may be different from the date/time of actual measurement.
As for appending a date time. If you are sending it in as a JSON, just simply append it as a key and message value. Example Message with Time: { "Time": "My UTC Time here" }
A streaming system of this type doesn't care about times a particular application may wish to apply to the items. There simply isn't any processing that happens based on a time field unless your code does it.
Each message sent is an EventData which contains a message with an arbitrary set of bytes. You can easily include a date/time in that serialized data structure, but EventHubs won't care about it. There is no sorting performed or fixed ordering other than insertion order within a partition which is defined by the sequence number. While the enqueued time is available it's mostly useful for monitoring how far behind in processing you are.
As to the rest of your problem, I'd agree with the comment that EventHubs may not really be the best choice. You can certainly load data into it once per day, but if it's really only 24 data points/day, it's not really the appropriate technology choice unless it's a prototype/tech demo for a system that's eventually supposed to have a whole load of smart meters reporting to it with fair frequency. (Note also that EventHubs cost $11/month minimum, Service Bus Queue $10/Month min, and AWS SQS $0 min)

Resources