Data being overwritten when outputting data from Stream Analytics to Power BI - Azure

Lately I've been playing around with Stream Analytics queries with Power BI as the output sink. I made a simple query which retrieves the total count of HTTP response codes of our website requests over time and groups them by date and response code.
The input data is retrieved from a storage account holding blob storage. This is my query:
SELECT
    DATETIMEFROMPARTS(DATEPART(year, R.context.data.eventTime), DATEPART(month, R.context.data.eventTime), DATEPART(day, R.context.data.eventTime), 0, 0, 0, 0) AS datum,
    request.ArrayValue.responseCode,
    COUNT(request.ArrayValue.responseCode)
INTO
    [requests-httpresponsecode]
FROM
    [cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) AS request
GROUP BY
    DATETIMEFROMPARTS(DATEPART(year, R.context.data.eventTime), DATEPART(month, R.context.data.eventTime), DATEPART(day, R.context.data.eventTime), 0, 0, 0, 0),
    request.ArrayValue.responseCode,
    System.TimeStamp
Since continuous export became active on 3 September 2018, I chose a job start time of 3 September 2018. Since I am interested in the statistics up until today, I did not include a date interval, so I am expecting to see data from 3 September 2018 until now (20 December 2018). The job is running fine without errors and I chose Power BI as the output sink. Immediately I saw the chart being populated, starting from 3 September, grouped by day and counting. So far, so good. A few days later I noticed the output dataset no longer started from 3 September but from 2 December until now. Apparently the data is being overwritten.
The following link says:
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-power-bi-dashboard
"defaultRetentionPolicy: BasicFIFO: Data is FIFO, with a maximum of 200,000 rows."
But my output table does not have anywhere close to 200,000 rows:
datum,count,responsecode
2018-12-02 00:00:00,332348,527387
2018-12-03 00:00:00,3178250,3282791
2018-12-04 00:00:00,3170981,4236046
2018-12-05 00:00:00,2943513,3911390
2018-12-06 00:00:00,2966448,3914963
2018-12-07 00:00:00,2825741,3999027
2018-12-08 00:00:00,1621555,3353481
2018-12-09 00:00:00,2278784,3706966
2018-12-10 00:00:00,3160370,3911582
2018-12-11 00:00:00,3806272,3681742
2018-12-12 00:00:00,4402169,3751960
2018-12-13 00:00:00,2924212,3733805
2018-12-14 00:00:00,2815931,3618851
2018-12-15 00:00:00,1954330,3240276
2018-12-16 00:00:00,2327456,3375378
2018-12-17 00:00:00,3321780,3794147
2018-12-18 00:00:00,3229474,4335080
2018-12-19 00:00:00,3329212,4269236
2018-12-20 00:00:00,651642,1195501
EDIT: I have created the STREAM input source according to
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-quick-create-portal. I can create a REFERENCE input as well, but this invalidates my query since APPLY and GROUP BY are not supported and I also think STREAM input is what I want according to https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-add-inputs.
What am I missing? Is it my query?

It looks like you are streaming to a Streaming dataset. Streaming datasets don't store the data in a database; they keep only the last hour of data. If you want to keep the data pushed to them, you must enable the Historic data analysis option when you create the dataset:
This will create a PushStreaming dataset (a.k.a. Hybrid) with the basicFIFO retention policy (i.e. about 200k-210k records kept).

You're correct that Azure Stream Analytics should be creating a "PushStreaming" or "Hybrid" dataset. Can you confirm that your dataset is correctly configured as "Hybrid" (you can check this attribute even after creation as shown here)?
If it is the correct type, can you please clarify the following:
Does the schema of your data change? If, for example, you send the datum {a: 1, b: 2} and then {c: 3, d: 4}, Azure Stream Analytics will attempt to change the schema of your table, which can invalidate older data.
How are you confirming the number of rows in the dataset?

Looks like my query was the problem. I had to use TUMBLINGWINDOW(day,1) instead of System.TimeStamp.
TUMBLINGWINDOW and System.TimeStamp produce exactly the same chart output on the frontend, but they seem to be processed differently in the backend. This was not reflected in the frontend in any way, which made it confusing. I suspect that, because of the way the query is processed when not using TUMBLINGWINDOW, you hit the 200k-rows-per-dataset limit sooner than expected. The query below is the one producing the expected result.
SELECT
    request.ArrayValue.responseCode,
    COUNT(request.ArrayValue.responseCode),
    DATETIMEFROMPARTS(DATEPART(year, R.context.data.eventTime), DATEPART(month, R.context.data.eventTime), DATEPART(day, R.context.data.eventTime), 0, 0, 0, 0) AS date
INTO
    [requests-httpstatuscode]
FROM
    [cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) AS request
GROUP BY
    DATETIMEFROMPARTS(DATEPART(year, R.context.data.eventTime), DATEPART(month, R.context.data.eventTime), DATEPART(day, R.context.data.eventTime), 0, 0, 0, 0),
    TUMBLINGWINDOW(day, 1),
    request.ArrayValue.responseCode
As we speak, my Stream Analytics job is running smoothly and producing the expected output from 3 September until now without data being overwritten.

Related

Event Processing (17 Million Events/day) using Azure Stream Analytics HoppingWindow Function - 24H Window, 1 Minute Hops

We have a business problem that needs solving and would like some guidance from the community on the combination of products in Azure we could use to solve it.
The Problem:
I work for a business that produces online games. We would like to display the number of users playing a specific game over a 24-hour window, but we want the value to update every minute; essentially the output that the HoppingWindow(Duration(hour, 24), Hop(minute, 1)) function in Azure Stream Analytics provides.
Currently, the number of events is around 17 million a day, and the Stream Analytics job seems to be struggling with the load. We have tried the following so far:
Tests Done:
17 Million Events -> Event Hub (32 Partitions) -> ASA (42 Streaming Units) -> Table Storage
Failed: Stream Analytics Job never outputs on large timeframes (Stopped test at 8 Hours)
17 Million Events -> Event Hub (32 Partitions) -> FUNC -> Table Storage (With Proper Partition/Row Key)
Failed: Table storage does not support distinct count
17 Million Events -> Event Hub -> FUNC -> Cosmos DB
Tentative: Cosmos DB doesn't support distinct count, not natively anyway. There seem to be some hacks going around, but I'm not sure that's the way to go.
Are there any known designs geared toward processing 17 million events a day over a 24-hour window, with output every minute?
Edit: As per the comments, here is the code.
SELECT
    GameId,
    COUNT(DISTINCT UserId) AS ActiveCount,
    DATEADD(hour, -24, System.TimeStamp()) AS StartWindowUtc,
    System.TimeStamp() AS EndWindowUtc
INTO
    [out]
FROM
    [in] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    HoppingWindow(Duration(hour, 24), Hop(minute, 1)),
    GameId,
    UserId
As for the expected output: note that in reality there will be 1,440 records per GameId, one for each minute.
To be clear, the problem is that on the larger timeframes, i.e. 24 hours, the job doesn't output anything, or at the very least takes 8+ hours to do so. The smaller window sizes work, for example changing the above code to use HoppingWindow(Duration(minute, 10), Hop(minute, 5)).
The tests that followed assumed that ASA is not the answer to the problem, so we tried different approaches, which seems to have caused a bit of confusion; sorry about that.
The way ASA scales at the moment is vertically within one node from 1 to 6 SU, then horizontally with multiple 6 SU nodes above that threshold.
Obviously, to be able to scale horizontally, a job needs to be parallelizable, which means the stream will be distributed across nodes according to the partition scheme.
Now if the input stream, the query and the output destination are aligned in partitions, then the job is called embarrassingly parallel and that's where you'll be able to reach maximum scale. Each pipeline, from entry to output, will be independent, and each node will only have to maintain in memory the data pertaining to its own state (those 24h of data). That's what we're looking for here: minimizing the local data store at each node.
With EH supporting 32 partitions on most SKUs, the maximum scale publicly available on ASA is 192 SU (6*32). If partitions are balanced, that means each node will have the least amount of data to maintain in its own state store.
Then we need to minimize the payload itself (the size of an individual message), but from the look of the query that's already the case.
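To illustrate that alignment, here is a sketch of a partitioned version of the query. It assumes the Event Hub partition key is GameId (so all events for a given game land in the same partition and the distinct count stays local to one node) and that the output can be partitioned the same way; neither assumption comes from the original post.
-- Sketch only: assumes the Event Hub is partitioned on GameId and the output is partitioned too.
-- TIMESTAMP BY is omitted here, so arrival time (EventEnqueuedUtcTime) is used by default.
SELECT
    GameId,
    COUNT(DISTINCT UserId) AS ActiveCount,
    System.Timestamp() AS EndWindowUtc
INTO
    [out]
FROM
    [in] PARTITION BY PartitionId
GROUP BY
    PartitionId,
    GameId,
    HoppingWindow(Duration(hour, 24), Hop(minute, 1))
With that shape, each node only has to keep the 24-hour state for the games in its own partitions.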
Could you try scaling up to 192SU and see what happens?
We are also working on multiple other features that could help on that scenario. Let me know if that could be of interest to you.

Is there a recent or known issue with the #flurry Data Download?

Our #flurry App Data Download appears bugged.
We recently (2 October 2020) requested raw data for analytics, but the result contained far less data than we expected; there is only a small amount of raw data. For example, we compared exports for the same arbitrary period: an older one obtained around 11 September and a newer one obtained after 5 October.
The export obtained around 11 September is 16 MB.
The export obtained after 5 October is 18.6 kB.
Both exports cover the same period and the same data selection.
Very little raw data is reported in the export, yet there are plenty of event counts in Flurry Analytics; every data graph looks normal.
Flurry Analytics website --> about 30,000 records
Exported data --> about 60 records
It is not related to the export file format (CSV, XML, JSON).
The result is the same.
Additional information, 7 October 2020:
I performed the data download as follows:
Log in to the Flurry Analytics console.
Click Data Download under Sessions.
Select the application SmartSync (iOS) or SmartSync (Android).
Set Event for any period, and choose CSV or another format.
Is this a known issue or a recent bug?
If anyone knows any tips or the correct settings, could you please advise?
This is now fixed. Please email support if you have further difficulties.

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: read the directory where the data is stored in parquet format using Spark, and apply a count on the files ending in ".parquet" for a particular day. This returns a count, but I am not really sure if that's the correct way.
Is this approach correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
Messages consumed from a topic not only have a key and value, but also other information, like a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, based on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the timestamp is overwritten by the broker with its local time while appending the record.
So if you keep that timestamp when storing the data, you can easily track the per-day or per-hour message rate, as sketched below.
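For example, Spark's Kafka source exposes the record timestamp as a timestamp column; if that column was kept when writing the Delta table, a per-day count is a simple aggregation. A sketch in Spark SQL (the table path is illustrative, and keeping the column is an assumption, not something stated in the question):
-- Sketch: assumes the Kafka `timestamp` column was preserved when writing the Delta table;
-- the table path is illustrative.
SELECT to_date(`timestamp`) AS day, count(*) AS messages
FROM delta.`/tmp/delta/table`
GROUP BY to_date(`timestamp`)
ORDER BY day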
Another way is to use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing the messages, we also route the message metadata to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting the rows between two versions, is to use the Delta table history. There are several advantages to that: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (which isn't possible by comparing .count on different versions, because an update replaces the actual value and a delete removes the row).
For example, for just appends, the following code will count all rows inserted by normal append operations (for other things, like MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
.filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
.selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
.groupBy().sum()

Stream Analytics Query working but no output to table

I've got a problem with my Stream Analytics job. I'm pulling events from an IoT Hub and grouping them into time windows based on their custom timestamps; I've already written a query that does this correctly. The problem is that it just doesn't write anything into my output table (a NoSQL table on my storage account).
The query runs without problems in the query editor (when testing with a sample input file) and produces the correct output, but when running 'for real' it doesn't output anything (the output table remains empty). I've even tried renaming the table and outputting to blob storage, but no dice. Here's the query:
SELECT
    'general' AS partitionKey,
    MIN(ID_frame) AS rowKey,
    DATEADD(second, 1, DATEADD(hour, -3, System.TimeStamp)) AS window_start,
    System.TimeStamp AS window_end,
    COUNT(ID_frame) AS device_count
INTO
    [IoT-Hub-output-table]
FROM
    [IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(hour, 3), Offset(second, -1))
The interesting part is that, if I omit any windowing in my query, then the table output works just fine.
I've been beating my head against the wall about this for a few days now, so I think I've already tried most of the obvious things.
As you are using a TumblingWindow of 3 hours, you will get a single output every 3 hours containing an aggregate of all the events within that period.
So did you already wait for 3 hours for the first output to be generated?
I would try setting a smaller window and check again whether the output appears correctly; see the sketch below.
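For example, a quick test could keep the same query shape but shrink the window; a sketch (the 5-minute size is just an arbitrary test value, not part of the original setup):
-- Test sketch: same query shape, but with a 5-minute tumbling window so output shows up quickly.
SELECT
    'general' AS partitionKey,
    MIN(ID_frame) AS rowKey,
    DATEADD(minute, -5, System.TimeStamp) AS window_start,
    System.TimeStamp AS window_end,
    COUNT(ID_frame) AS device_count
INTO
    [IoT-Hub-output-table]
FROM
    [IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(minute, 5)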
Turns out the query did output into my table, but with more delay than I expected; I was waiting 20-30 minutes at most, while the first insertions only began a little later than half an hour in. Thus I was cancelling the Analytics job before any output was produced and falsely assuming it just wouldn't output anything.
I found this to be the case after I noticed that 'sometimes' (when the job had been running for long enough) there appeared to be some output. In those output records I noticed the big delay between my custom timestamp field and the general Timestamp field (which the engine uses to record when the entity was last updated).

Basic query with TIMESTAMP by not producing output

I have a very basic setup, in which I never get any output if I use the TIMESTAMP BY statement.
I have a Stream Analytics job which is reading from an Event Hub and writing to table storage.
The query is the following:
SELECT
*
INTO
MyOutput
FROM
MyInput TIMESTAMP BY myDateTime;
If the query uses the TIMESTAMP BY statement, I never get any output events. I do see incoming events in the monitoring, and there are no errors, either in monitoring or in the maintenance logs. I am pretty sure that the source data has the right column in the right format.
If I remove the TIMESTAMP BY statement, then everything works fine. The reason I need the TIMESTAMP BY statement in the first place is that I need to write a number of queries in the same job, writing various aggregations to different outputs, and if I use TIMESTAMP BY in one query, I am required to use it in all the other queries as well.
Am I doing something wrong? Perhaps SELECT * does not play well with TIMESTAMP BY? I just did not find any documentation explaining that...
{"myDateTime":"2015-08-02T10:59:02.0000000Z", "EventEnqueuedUtcTime":"2015-08-07T10:59:07.6980000Z"}
Late tolerance window: 00.00:00:05
All of your events are considered late arriving because myDateTime is 5 days before EventEnqueuedUtcTime. Can you try sending new events where myDateTime is in UTC and is "now" so it matches within a couple of seconds?
Also, when you started the job, what did you pick as the job start date time? Can you make sure you pick a date before the myDateTime values? You might try this first.
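As a further diagnostic, a sketch that temporarily drops TIMESTAMP BY (so arrival time is used and nothing is discarded as late) and emits the gap between the two timestamps could confirm how late the events really are; this is my suggestion, not part of the original answer:
-- Diagnostic sketch: no TIMESTAMP BY, so arrival time is used and no events are dropped as late.
SELECT
    myDateTime,
    EventEnqueuedUtcTime,
    DATEDIFF(second, myDateTime, EventEnqueuedUtcTime) AS lateness_seconds
INTO
    MyOutput
FROM
    MyInput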
