how to aggregate events received within specific time? - apache-spark

I'm consuming messages with readStream, but the messages don't have any time-based column. In my scenario, I want to aggregate all the messages I received in each 10-minute interval.
How can I achieve this in Spark? I saw the tumbling window function, but it only works on a timestamp column that already arrives in the message.
Any suggestions, please?
Thanks

Related

spark streaming understanding timeout setup in mapGroupsWithState

I am trying very hard to understand the timeout setup when using mapGroupsWithState in Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option, meaning setting up the state expiry so that it is related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the event time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp using GroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never occur before the watermark has exceeded the set timeout.
Similar to processing time timeouts, there is no strict upper bound on the delay when the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
question 1:
What is this timestamp in the sentence "and the timeout would occur when the watermark advances beyond the set timestamp"? Is it an absolute time, or is it a time duration relative to the current event time in the state? I know I could expire it by removing the state.
E.g. say I have some data state like the one below; when will it expire, and with what value in which setting?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, is this correct?
question reason
Without understanding these, I cannot really apply them to use cases, meaning when to use GroupState.setTimeoutDuration() and when to use GroupState.setTimeoutTimestamp().
Thanks a lot.
ps. I also tried to read below
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(confused me, did not understand)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(did not say much about the part I am interested in)
What is this timestamp in the sentence and the timeout would occur when the watermark advances beyond the set timestamp?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
is it an absolute time or is it a relative time duration to the current event time in the state?
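It is an absolute point in time (in milliseconds since the epoch), not a duration. In practice you usually compute it relative to the current watermark or to an event time in the batch being processed.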
say I have some data state (column timestamp=2020-08-02 22:02:00), when will it expire by setting up what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have set a watermark before applying groupByKey and mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times), so your query will look like:
ds.withWatermark("timestamp", "10 minutes")
  .groupByKey(...) // declare your key
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(
    ...) // your custom update logic
Now, it depends on how you set the timeout timestamp within your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures, and it is worth scanning through their documentation. As we have set a watermark (with withWatermark), we can actually make use of that time. As a general rule, it is important to set the timeout timestamp (set by state.setTimeoutTimestamp()) to a value larger than the current watermark. To continue with our example, we add one hour as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude, your message can arrive in your stream between 22:00:00 and 22:15:00, and if that message was the last one for the key, its state will time out by 23:15:00 in your GroupState.
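The arithmetic behind this example can be sketched outside Spark. The snippet below is a minimal plain-Python illustration, not Spark code: the helper names (to_ms, timeout_timestamp_ms, has_timed_out) are hypothetical, the watermark value of 22:15:00 is taken from the example above, and it assumes the semantics quoted from the documentation, namely that the timeout fires only once the watermark has advanced beyond the set timestamp.

```python
from datetime import datetime, timedelta, timezone


def to_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)


def timeout_timestamp_ms(watermark_ms: int, additional: timedelta) -> int:
    """Hypothetical stand-in for setTimeoutTimestamp(watermarkMs, "1 hour")."""
    return watermark_ms + int(additional.total_seconds() * 1000)


def has_timed_out(current_watermark_ms: int, timeout_ms: int) -> bool:
    """The timeout fires only after the watermark moves beyond the timestamp."""
    return current_watermark_ms > timeout_ms


# Assumed watermark at the moment the state is updated (from the example above).
watermark = datetime(2020, 8, 2, 22, 15, tzinfo=timezone.utc)
timeout_ms = timeout_timestamp_ms(to_ms(watermark), timedelta(hours=1))

timeout_at = datetime.fromtimestamp(timeout_ms / 1000, tz=timezone.utc)
print(timeout_at.isoformat())
print(has_timed_out(to_ms(watermark), timeout_ms))
```

Because the comparison is strict, a watermark that is exactly equal to the timeout timestamp does not yet expire the state in this sketch; the state expires only once later data pushes the watermark past it.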
question 2: Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, is this correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00, all messages whose event time (defined by the column timestamp) arrives later than the declared watermark of 10 minutes (meaning later than 22:15:00) will be ignored by your query and will not be processed by your "custom update logic".

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I want to find the number of messages fetched on a particular day.
My thought process: I thought I could use Spark to read the directory where the data is stored in Parquet format and apply a count on the ".parquet" files for a particular day. This returns a count, but I am not really sure that's the correct way.
Is this approach correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
The messages we consume from a topic contain not only a key and value but also other information, such as a timestamp,
which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, depending on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the broker uses the timestamp in the producer record, whereas if the topic is configured with LOG_APPEND_TIME, the broker overwrites the timestamp with its local time while appending the record.
So if you keep the timestamp wherever you store the messages, you can easily track the message rate per day or per hour.
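As a rough illustration of tracking the per-day rate from stored timestamps, the sketch below groups epoch-millisecond timestamps by UTC calendar day in plain Python. The function name messages_per_day and the sample values are made up for the example; in a real pipeline the timestamps would come from the stored Kafka records, and the choice of UTC day boundaries is an assumption.

```python
from collections import Counter
from datetime import datetime, timezone


def messages_per_day(timestamps_ms):
    """Count records per UTC calendar day from epoch-millisecond timestamps."""
    days = (
        datetime.fromtimestamp(ts / 1000, tz=timezone.utc).date().isoformat()
        for ts in timestamps_ms
    )
    return Counter(days)


# Made-up sample timestamps: two on 2021-04-19 and one on 2021-04-20 (UTC).
sample = [
    1618790400000,  # 2021-04-19T00:00:00Z
    1618833600000,  # 2021-04-19T12:00:00Z
    1618876800000,  # 2021-04-20T00:00:00Z
]
counts = messages_per_day(sample)
print(counts["2021-04-19"])
```

The same grouping can be done per hour by formatting the datetime down to the hour instead of the date.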
Alternatively, you can use a Kafka dashboard such as Confluent Control Center (requires a license), Grafana (free), or any other tool to track the message flow.
In our case, while consuming and storing or processing messages, we also route the metadata of each message to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information without counting the rows of two versions is to use the Delta table history. This has several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing MERGE operations (this is not possible by comparing .count on different versions, because an update replaces the actual value and a delete removes the row).
For example, for plain appends, the following code counts all rows written by normal append operations (for other operations, such as MERGE/UPDATE/DELETE, we may need to look into other metrics):
from delta.tables import DeltaTable

# 'begin_of_day' and 'end_of_day' are placeholders for the actual timestamps
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history() \
    .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'") \
    .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows") \
    .groupBy().sum()

Check messages in each 10 min intervals kafka - Nodejs

The consumer should check for messages at 10-minute intervals, and each time the response should contain the messages starting from the uncommitted offset.
Currently, messages are received as soon as the producer sends them.
That's not really how a Kafka Consumer works. Usually, you have an infinite loop and just take whatever messages are given to you. Unless you're changing the group.id and not committing offsets between requests, you'll always get the next batch of messages.
If you want to add some maximum consumption limit, followed by a 10-minute sleep within that loop, then that's an implementation detail of your application and not specific to Kafka.
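The loop described here can be sketched in plain Python, without a real Kafka client. Everything below is hypothetical: poll_batch stands in for whatever poll call the actual consumer library offers, the rounds parameter exists only so the example terminates, and offset committing is left to the real client.

```python
import time
from typing import Callable, List


def consume_in_intervals(
    poll_batch: Callable[[int], List[str]],
    handle: Callable[[str], None],
    max_messages: int = 100,
    interval_seconds: float = 600.0,
    rounds: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll at most max_messages per round, then pause for interval_seconds.

    poll_batch is a hypothetical stand-in for a real consumer's poll call.
    """
    processed = 0
    for round_index in range(rounds):
        for message in poll_batch(max_messages):
            handle(message)
            processed += 1
        if round_index < rounds - 1:
            sleep(interval_seconds)  # wait before the next round
    return processed


# Usage with an in-memory list standing in for a topic.
queue = [f"msg-{i}" for i in range(5)]


def fake_poll(limit: int) -> List[str]:
    batch = queue[:limit]
    del queue[:limit]
    return batch


seen: List[str] = []
total = consume_in_intervals(
    fake_poll, seen.append, max_messages=2, rounds=3, sleep=lambda s: None
)
print(total, seen)
```

Passing the sleep function as a parameter keeps the waiting policy in the application, which matches the point that the interval is not a Kafka concern.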

Kibana - add a listener

I have ELK installed, and all works fine. I have one index that always receives logs from Logstash.
Sometimes, Logstash stops working (every second month or so), and nothing comes to the index.
I was wondering whether there is a way to query the index at some interval and, if it does not have any new entries, produce some kind of event, which I will handle.
For example, query that index every 10 mins, and if there are no logs, then create an event.
I assume you are looking for ELK's internal tools. There is the Elasticsearch X-Pack plugin, which provides watchers and notifications. But if that's not a requirement, you can write a Node.js server that queries the last 5 minutes or so, and you can write the exact notification you need.
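As a minimal sketch of the do-it-yourself variant, the Python snippet below builds a range query for the last 10 minutes and decides whether to raise an event from a count-style response. It makes no network calls; the field name "@timestamp" is an assumption (adjust it to whatever time field the index uses), the function names are hypothetical, and the responses are simulated.

```python
import json
from typing import Any, Dict


def recent_logs_query(minutes: int = 10, time_field: str = "@timestamp") -> Dict[str, Any]:
    """Build a query body matching documents newer than `minutes` ago.

    The default field name "@timestamp" is an assumption about the index.
    """
    return {"query": {"range": {time_field: {"gte": f"now-{minutes}m"}}}}


def should_raise_event(count_response: Dict[str, Any]) -> bool:
    """Return True when a count-style response reports zero matching documents."""
    return count_response.get("count", 0) == 0


body = recent_logs_query(10)
print(json.dumps(body))

# Simulated responses; a real check would send `body` to the index's _count endpoint.
print(should_raise_event({"count": 0}))
print(should_raise_event({"count": 42}))
```

A scheduler (cron, or a timer in the server process) would run this check every 10 minutes and trigger the notification when it returns True.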
I hope I could help.

Basic query with TIMESTAMP BY not producing output

I have a very basic setup, in which I never get any output if I use the TIMESTAMP BY statement.
I have a stream analytics job which is reading from Event Hub and writing to the table storage.
The query is the following:
SELECT
*
INTO
MyOutput
FROM
MyInput TIMESTAMP BY myDateTime;
If the query uses the TIMESTAMP BY statement, I never get any output events. I do see incoming events in the monitoring, and there are no errors in either the monitoring or the maintenance logs. I am pretty sure that the source data has the right column in the right format.
If I remove the TIMESTAMP BY statement, everything works fine. The reason I need the statement in the first place is that I need to write a number of queries in the same job, writing various aggregations to different outputs. And if I use TIMESTAMP BY in one query, I am required to use it in all the other queries as well.
Am I doing something wrong? Perhaps SELECT * does not play well with TIMESTAMP BY? I just did not find any documentation explaining that...
{"myDateTime":"2015-08-02T10:59:02.0000000Z", "EventEnqueuedUtcTime":"2015-08-07T10:59:07.6980000Z"}
Late tolerance window: 00.00:00:05
All of your events are considered late arriving because myDateTime is 5 days before EventEnqueuedUtcTime. Can you try sending new events where myDateTime is in UTC and is "now" so it matches within a couple of seconds?
Also, when you started the job, what did you pick as the job start date time? Can you make sure you pick a date before the myDateTime values? You might try this first.
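The late-arrival reasoning in this answer can be checked with a few lines of Python. This is a simplified sketch, not the service's actual implementation: it assumes an event counts as late when its application timestamp lags the enqueue time by more than the tolerance window, and the helper names parse_utc and is_late are hypothetical. The sample event and the 5-second tolerance are the ones shown in the question.

```python
from datetime import datetime, timedelta


def parse_utc(value: str) -> datetime:
    """Parse timestamps such as 2015-08-02T10:59:02.0000000Z (7 fractional digits)."""
    head, _, fraction = value.rstrip("Z").partition(".")
    micros = (fraction + "000000")[:6]  # pad or truncate to microseconds
    return datetime.strptime(f"{head}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")


def is_late(event: dict, tolerance: timedelta) -> bool:
    """Simplified check: event time lags the enqueue time by more than the tolerance."""
    lag = parse_utc(event["EventEnqueuedUtcTime"]) - parse_utc(event["myDateTime"])
    return lag > tolerance


event = {
    "myDateTime": "2015-08-02T10:59:02.0000000Z",
    "EventEnqueuedUtcTime": "2015-08-07T10:59:07.6980000Z",
}
lag = parse_utc(event["EventEnqueuedUtcTime"]) - parse_utc(event["myDateTime"])
print(lag)
print(is_late(event, timedelta(seconds=5)))
```

For the sample event the lag is a little over five days, far beyond the 5-second window, which is consistent with every event being treated as late.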
