Basic query with TIMESTAMP by not producing output - azure

I have a very basic setup, in which I never get any output if I use the TIMESTAMP BY statement.
I have a stream analytics job which is reading from Event Hub and writing to the table storage.
The query is the following:
SELECT
*
INTO
MyOutput
FROM
MyInput TIMESTAMP BY myDateTime;
If the query uses timestamp statement, I never get any output events. I do see incoming events in the monitoring, there are no errors neither in monitoring nor in the maintenance logs. I am pretty sure that the source data has the right column in the right format.
If I remove the timestamp statement, then everything is working fine. The reason why I need the timestamp statement in the first place is because I need to write a number of queries in the same job, writing various aggregations to different outputs. And if I use timestamp in one query, I am required to use it in all other queries itself.
Am I doing something wrong? Perhaps SELECT * does not play well with TIMESTAMP BY? I just did not find any documentation explaining that...

{"myDateTime":"2015-08-02T10:59:02.0000000Z", "EventEnqueuedUtcTime":"2015-08-07T10:59:07.6980000Z"}
Late tolerance window: 00.00:00:05
All of your events are considered late arriving because myDateTime is 5 days before EventEnqueuedUtcTime. Can you try sending new events where myDateTime is in UTC and is "now" so it matches within a couple of seconds?
Also, when you started the job, what did you pick as the job start date time? Can you make sure you pick a date before the myDateTime values? You might try this first.

Related

spark streaming understanding timeout setup in mapGroupsWithState

I am trying very hard to understand the timeout setup when using the mapGroupsWithState for spark structured streaming.
below link has very detailed specification, but I am not sure i understood it properly, especially the GroupState.setTimeoutTimeStamp() option. Meaning when setting up the state expiry to be sort of related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the the the event time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp usingGroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never be occur before watermark has exceeded the set timeout.
Similar to processing time timeouts, there is a no strict upper bound on the delay when the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
question 1:
What is this timestamp in this sentence and the timeout would occur when the watermark advances beyond the set timestamp? is it an absolute time or is it a relative time duration to the current event time in the state? I know I could expire it by removing the state by ```
e.g. say I have some data state like below, when will it exprire by setting up what value in what settings?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, is this correct?
question reason
Without understanding these, i can not really apply them to use cases. Meaning when to use GroupState.setTimeoutDuration(), when to use GroupState.setTimeoutTimestamp()
Thanks a lot.
ps. I also tried to read below
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(confused me, did not understand)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(did not say a lot of it for my interest)
What is this timestamp in the sentence and the timeout would occur when the watermark advances beyond the set timestamp?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
is it an absolute time or is it a relative time duration to the current event time in the state?
This is a relative time (not duration) based on the current batch window.
say I have some data state (column timestamp=2020-08-02 22:02:00), when will it expire by setting up what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying the groupByKey and the mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times, so your query will be like:
ds.withWatermark("timestamp", "10 minutes")
.groupByKey(...) // declare your key
.mapGroupsWithState(
GroupStateTimeout.EventTimeTimeout)(
...) // your custom update logic
Now, it depends on how you set the TimeoutTimestamp withing your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures and it is worth scanning through their documentation. As we have set a watermark in (withWatermark) we can actually make use of that time. As a general rule: It is important to set the timeout timestamp (set by state.setTimeoutTimestamp()) to a value larger then the current watermark. To continue with our example we add one hour as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude, your message can arrive into your stream between 22:00:00 and 22:15:00 and if that message was the last for the key it will timeout by 23:15:00 in your GroupState.
question 2: Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, this is correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00 all messages that have an event time (defined by column timestamp) arrive later then the declared watermark of 10 minutes (meaning later then 22:15:00) will be ignored anyway in your query and are not going to be processed within your "custom update logic".

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing them in Deltalake(parquet) format. I wish to find the number of messages fetched in particular day.
My thought process: I thought to read the directory where the data is stored in parquet format using spark and apply count on the files with ".parquet" for a particular day. This returns a count but I am not really sure if that's the correct way.
Is this way correct ? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day(or duration) ?
Message we consume from topic not only have key-value but also have other information like timestamp
Which can be used to track the consumer flow.
Timestamp
Timestamp get updated by either Broker or Producer based on Topic configuration. If Topic configured time stamp type is CREATE_TIME, the timestamp in the producer record will be used by the broker whereas if Topic configured to LOG_APPEND_TIME , timestamp will be overwritten by the broker with the broker local time while appending the record.
So if you are storing any where if you keep timestamp you can very well track per day, or per hour message rate.
Other way you can use some Kafka dashboard like Confluent Control Center (License price) or Grafana (free) or any other tool to track the message flow.
In our case while consuming message and storing or processing along with that we also route meta details of message to Elastic Search and we can visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information without counting the rows between two versions is to use Delta table history. There are several advantages of that - you don't read the whole dataset, you can take into account updates & deletes as well, for example if you're doing MERGE operation (it's not possible to do with comparing .count on different versions, because update is replacing the actual value, or delete the row).
For example, for just appends, following code will count all inserted rows written by normal append operations (for other things, like, MERGE/UPDATE/DELETE we may need to look into other metrics):
from delta.tables import *
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
.filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
.selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
.groupBy().sum()

ASA Lag is returning a result from outside given duration

I try to use Azure stream analytics to filter results that are too far from the last 2 reads.
However, if last read is more than 720 minutes back (by reading time) I don't want to discard current read because of this difference.
I noticed the following is returning a read from 900 minutes back, which is unexpected as far as I can understand:
LAG(Reading,2)
OVER (PARTITION BY RegisterNumber LIMIT DURATION(minute, 720))
[BeforeLastReading]
I can ignore this read in my select query but I prefer to understand the reason before give up using the duration feature...
Have you tried using the TIMESTAMP BY clause? You can find more documentation on this here: https://learn.microsoft.com/en-us/stream-analytics-query/timestamp-by-azure-stream-analytics

Azure stream analytics timestamp by format

When running stream analytics, I get an error message:
"Dropping events due to improper timestamps. Stream Analytics only
supports ISO8601 format for DateTime values"
I have tried the following formats:
2017-09-19T13:17:29.0111070Z
2017-09-19T13:17:29.123456
2017-09-19 13:17:29.123456
2017-09-19T13:17:29.123
2017-09-19 13:17:29.123
However, when I use the Test button on the query in Stream Analytics, the output comes out fine. Also, when I comment out the timestamp by clause, the query works, but the System.timestamp in the select statment will not return the correct time.
Is this a formatting issue or something else?
Firstly, as Vignesh Chandramohan mentioned, you can try to use CAST to convert the expression to DateTime, and check if it returns data conversion error that indicates any input data/value cannot cast to type 'datetime'.
Secondly, many factors can cause no output issue, for example: A where clause in the query filtered out their events that prevented outputs from being generated; Timestamp for events is before the job start time and therefore events are being dropped etc.
For detailed steps to debug with Azure Stream Analytics jobs, please check the Diagnose and solve problems on Azure portal or this article: Troubleshooting guide for Azure Stream Analytics.

Stream Analytics Query working but no output to table

I got a problem with my Stream Analytics job. I'm pulling events from an IoT Hub and grouping them in timewindows based on their custom timestamps; I've already written a query that does this correctly. But the problem is that it just doesn't write anything into my output table (being a NoSQL table on my Storage Account).
The query runs without problems in the query editor (when testing with a sample input file) and produces the correct output, but when running 'for real', it doesn't output anything (the output table remains empty). I've even tried renaming the table and outputting to a blob storage, but no dice. Here's the query:
SELECT
'general' AS partitionKey,
MIN(ID_frame) AS rowKey,
DATEADD(second, 1, DATEADD(hour, -3, System.TimeStamp)) AS window_start,
System.TimeStamp AS window_end,
COUNT(ID_frame) AS device_count
INTO
[IoT-Hub-output-table]
FROM
[IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(hour, 3), Offset(second, -1))
The interesting part is that, if I omit any windowing in my query, then the table output works just fine.
I've been beating my head against the wall about this for a few days now, so I think I've already tried most of the obvious things.
As you are using a TumblingWindow of 3 hours, it means you will get a single output every 3 hours which contains an aggregate of all the events within that period.
So did you already wait for 3 hours for the first output to be generated?
I would try and set the window smaller, and try again to see if the output works correctly.
Turns out the query did output into my table, but with an amount of delay I didn't expect; I was waiting for 20-30 minutes at max. while the first insertions would began after a little later than half an hour. Thus I was cancelling the Analytics job before any output was produced and falsely assuming it just wouldn't output anything.
I found this to be the case afer I noted that 'sometimes' (when the job was running for long enough) there appeared to be some output. And in those output records I noticed the big delay between my custom timestamp field and the general timestamp field (which the engine uses to remember when the entity was updated for the last time)

Resources