Azure Stream Analytics array_agg equivalent?

Is there a way to do the PostgreSQL equivalent of array_agg or string_agg in Stream Analytics? I have data that comes in every few seconds, and I would like to get counts of the values within a time frame.
Data:
{time:12:01:01,name:A,location:X,value:10}
{time:12:01:01,name:B,location:X,value:9}
{time:12:01:02,name:C,location:Y,value:5}
{time:12:01:02,name:B,location:Y,value:4}
{time:12:01:03,name:B,location:Z,value:2}
{time:12:01:03,name:A,location:Z,value:3}
{time:12:01:06,name:B,location:Z,value:4}
{time:12:01:06,name:C,location:Z,value:7}
{time:12:01:08,name:B,location:Y,value:1}
{time:12:01:13,name:B,location:X,value:8}
With a sliding window of 2 seconds, I want to group the data to see the following:
12:01:01, 2 events, 9.5 avg, 2 distinct names, 1 distinct location, nameA:1, nameB:1, locationX:1
12:01:02, 4 events, 7 avg, 3 distinct names, 2 distinct location, nameA:1, nameB:2,nameC:1,locationX:1,locationY:1
12:01:03...
12:01:06...
...
I can get the number of events, average, and distinct counts without issue. I use a window as well as a with statement to join on the timestamp to get the aggregated counts for that timestamp. I am having trouble figuring out how to get the total counts by name and location, mostly because I do not know how to aggregate strings in Azure.
with agg1 as (
select system.timestamp as start,
avg(value) as avg,
count(1) as events,
count(distinct name) as distinct_names,
count(distinct location) as distinct_locations
from input timestamp by created
group by slidingwindow(second,2)
),
agg2 as (
select agg2_inner.start,
array_agg(name,'|',ct_name) as countbyname (????)
from (
select system.timestamp as start,
name, count(1) as ct_name
from input timestamp by created
group by slidingwindow(second,2), name
) as agg2_inner
group by agg2_inner.start, slidingwindow(second,2)
)
select * from agg1 join agg2 on (datediff(second,agg1,agg2) between 0 and 2
and agg1.start = agg2.start)
There is no fixed list of names or locations, so the query needs to be somewhat dynamic. It is fine if the counts come back as an object within a single query; a later process can parse it to get the individual counts.

As far as I know, Azure Stream Analytics does not provide an array_agg function. It does provide the COLLECT() aggregate function, which returns an array of all the record values in the window.
I suggest first using COLLECT() to return the array of records, grouped by time and window.
You can then use an Azure Stream Analytics JavaScript user-defined function (UDF) to apply your own logic that converts the array into the result.
For more detail, see the sample below.
The query looks like this:
SELECT
time, udf.yourudfname(COLLECT()) as Result
INTO
[YourOutputAlias]
FROM
[YourInputAlias]
Group by time, TumblingWindow(minute, 10)
The UDF looks like this. It returns only the average and the number of events.
function main(InputJSON) {
    // InputJSON is the array of records produced by COLLECT()
    var sum = 0;
    for (var i = 0; i < InputJSON.length; i++) {
        sum += InputJSON[i].value;
    }
    var result = { events: InputJSON.length, avg: sum / InputJSON.length };
    return result;
}
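To get the per-name and per-location counts the question asks for, the same UDF could probably be extended along the following lines. This is only a sketch: it assumes every collected record carries `name`, `location` and a numeric `value` field, and the output property names (`countByName`, `countByLocation`, and so on) are made up for illustration.

```javascript
// Sketch of a UDF that also tallies events per name and per location.
// Assumes each record has `name`, `location` and a numeric `value`.
function main(records) {
    var sum = 0;
    var countByName = {};
    var countByLocation = {};
    for (var i = 0; i < records.length; i++) {
        var r = records[i];
        sum += r.value;
        // Start each tally at 0 the first time a key is seen
        countByName[r.name] = (countByName[r.name] || 0) + 1;
        countByLocation[r.location] = (countByLocation[r.location] || 0) + 1;
    }
    return {
        events: records.length,
        avg: records.length > 0 ? sum / records.length : null,
        distinctNames: Object.keys(countByName).length,
        distinctLocations: Object.keys(countByLocation).length,
        countByName: countByName,
        countByLocation: countByLocation
    };
}
```

Because the tallies are plain objects keyed by whatever names and locations appear in the window, no fixed list of names or locations is needed; a downstream process can read the individual counts from the returned object.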
Data:
{"name": "A", "time":"12:01:01","value":10}
{"name": "B", "time":"12:01:01","value":9}
{"name": "C", "time":"12:01:02","value":10}
Result:

Related

Cosmos Db Sql Query produces drastically different results when using order by

I have a Cosmos Db instance with > 1 Million JSON Documents stored in it.
I am trying to pull data of a certain time frame as to when the document was created based on the _ts variable which is auto-generated when the document is inserted. It represents the UNIX timestamp of that moment.
I am unable to understand why these two queries produce drastically different results:
Query 1:
Select *
from c
where c._ts > TimeStamp1
AND c._ts < TimeStamp2
Produces 0 results
Query 2
Select *
from c
where c._ts > TimeStamp1
AND c._ts < TimeStamp2
order by c._ts desc
Produces the correct number of results.
What I have tried?
I suspected that might be because of the default CosmosDb index on the data. So, I rewrote the index policy to index only that variable. Still the same problem.
Since my end goal is to group the data returned by the query, I then tried using GROUP BY with ORDER BY, either directly or in a subquery. Surprisingly, according to the docs, Cosmos DB does not yet support using GROUP BY with ORDER BY.
What I need help on?
Why am I observing such a behavior?
Is there a way to index the database so that the rows are returned?
Beyond this, is there a way to still use group by and order by together (Please don't link the question to another one because of this point, I have gone through them and their answers are not valid in my case).
@Andy and @Tiny-wa, thanks for replying.
I was able to track down the unintended behavior: it was caused by using GetCurrentTimestamp() to calculate the timestamps. The documentation states:
This system function will not utilize the index. If you need to
compare values to the current time, obtain the current time before
query execution and use that constant string value in the WHERE
clause.
Although I don't fully understand what this means, I was able to solve the problem by creating a stored procedure in which the timestamp is fetched before the SQL API query is formed and executed, and I then got the rows as expected.
Stored Procedure Pseudocode for that is like:
function FetchData(){
..
..
..
var Current_TimeStamp = Date.now();
var CDbQuery =
`Select *
FROM c
where (c._ts * 10000000) > DateTimeToTicks(DateTimeAdd("day", -1, TicksToDateTime(` + Current_TimeStamp + ` * 10000)))
AND (c._ts * 10000000) < (` + Current_TimeStamp + ` * 10000)`
var isAccepted = collection.queryDocuments(
collection.getSelfLink(),
CDbQuery,
function (err, feed, options) {
..
..
..
});
}
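The same idea (obtain the current time before query execution and pass it in as a constant) can also be sketched without the tick conversions, since `_ts` is stored in whole seconds since the Unix epoch. The helper name `buildLastDaysQuery` and the parameter names `@start` and `@end` are made up for this illustration; the returned object follows the usual query-plus-parameters shape for Cosmos DB SQL queries.

```javascript
// Builds a parameterized query spec for documents written in the last `days` days.
// The bounds are computed once, in seconds, before the query is sent,
// so the WHERE clause compares c._ts against constants.
function buildLastDaysQuery(nowMs, days) {
    var endSeconds = Math.floor(nowMs / 1000);          // Date.now() is in milliseconds
    var startSeconds = endSeconds - days * 24 * 60 * 60;
    return {
        query: "SELECT * FROM c WHERE c._ts > @start AND c._ts < @end",
        parameters: [
            { name: "@start", value: startSeconds },
            { name: "@end", value: endSeconds }
        ]
    };
}

// Example: everything written in the last day
var spec = buildLastDaysQuery(Date.now(), 1);
```

Keeping the comparison directly on `c._ts`, rather than on an expression such as `c._ts * 10000000`, should also leave the filter in a form the index on `_ts` can serve, though that is worth confirming against the query metrics.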

Getting AutoQuery pagination to work with left join

In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use the standard AutoQuery approach, like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
and 100 results are requested, then often fewer than 100 are returned, because the Take() is applied to the rows produced by the left join.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null; //throws error if orderby exists
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); //returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
Offset = q1.Offset.GetValueOrDefault(0),
Total = total,
Results = results
};
The group by appears to return correct number of results so paging works but the Total returns 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy(), it returns the number of rows including the left join, not the number of results of the actual query.
How can I get the true total of the query?
(Getting some official docs on how to do paging and totals with autoquery & left join would be very helpful as right now it's a bit confusing)
Your primary issue stems from trying to return a different total than that of the actual query AutoQuery executes. If you have multiple left joins, the total is the total number of results of the query it executes, not the number of rows in your source table.
So you're not looking for the "true total"; rather, you're looking to execute a different query to get a different total than the query that's executed, while still deriving it from the original query. First consider using normal INNER JOINs (IJoin<>) instead of LEFT JOINs, so the query only returns results for related rows in the joined tables and the total reflects that accordingly.
Your total query that returns 0 is likely returning no results, so I'd inspect it in an SQL profiler to see the query that's actually executed. You can also log OrmLite queries by enabling Debug logging and adding this to your AppHost:
OrmLiteUtils.PrintSql();
Also note that a GroupBy() on the entire table is unusual; you would normally group by one or more explicitly selected columns, e.g.:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });
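To see why the page size and the total drift apart once a left join is involved, here is a small stand-alone illustration in plain JavaScript. It is not OrmLite or AutoQuery code, and the rows and field names (`bookId`, `author`) are invented for the example.

```javascript
// Hypothetical rows as a LEFT JOIN would produce them:
// one row per (book, author) pair, plus one row for a book with no author.
var joinedRows = [
    { bookId: 1, author: "Ann" },
    { bookId: 1, author: "Bob" },
    { bookId: 2, author: "Cid" },
    { bookId: 3, author: null }
];

// Returns the distinct book ids, in first-seen order.
function distinctBookIds(rows) {
    var seen = {};
    var ids = [];
    rows.forEach(function (row) {
        if (!seen[row.bookId]) {
            seen[row.bookId] = true;
            ids.push(row.bookId);
        }
    });
    return ids;
}

var page = joinedRows.slice(0, 2);                    // "Take(2)" applied to joined rows
var booksOnPage = distinctBookIds(page);              // only book 1 survives de-duplication
var totalRows = joinedRows.length;                    // what a plain row count reports
var totalBooks = distinctBookIds(joinedRows).length;  // what the asker wants as Total
```

Here a page of 2 joined rows yields a single book, and the row count (4) differs from the distinct book count (3), which is the same mismatch described above.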

To Find Distinct values after doing a UNION operation in Azure Stream Analytics

I am running a query in Stream Analytics that uses the UNION operator.
I would like to get the distinct values from the query results. Since UNION allows duplicates in Azure Stream Analytics, I am getting results with duplicate values. I have also tried the DISTINCT keyword, but that does not work either.
Below is the query I tried.
WITH
ABCINNERQUERY AS (
SELECT
event.ID as ID,
ABCArrayElement.ArrayValue.E as TIME,
ABCArrayElement.ArrayValue.V as ABC
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.ABC) AS ABCArrayElement
),
XYZINNERQUERY AS (
SELECT
event.ID as ID,
XYZArrayElement.ArrayValue.E as TIME,
XYZArrayElement.ArrayValue.V as XYZ
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.XYZ) AS XYZArrayElement
),
KEYS AS
(
SELECT DISTINCT
ABCINNERQUERY.ID AS ID,
ABCINNERQUERY.TIME as TIME
FROM ABCINNERQUERY
UNION
SELECT DISTINCT
XYZINNERQUERY.ID AS ID,
XYZINNERQUERY.TIME as TIME
FROM XYZINNERQUERY
)
SELECT
KEYS.ID as ID,
KEYS.TIME as TIME
INTO [YourOutputAlias]
FROM KEYS
In the above query, ID is unique, and ABC and XYZ are each an array of entries holding a time (E) and a value (V).
The input JSON file is as below.
[
{"ID":"006XXXXX",
"ABC":
[{"E":1557302231320,"V":54.799999237060547}],
"XYZ":
[{"E":1557302191899,"V":31.0},{"E":1557302231320,"V":55}]},
{"ID":"007XXXXX",
"ABC":
[{"E":1557302195483,"V":805.375},{"E":1557302219803,"V":0}],
"XYZ":
[{"E":1557302219803,"V":-179.0},{"E":1557302195483,"V":88}]}
]
Expected result without duplicates.

Cassandra Modelling for Date Range

Cassandra newbie here, using Cassandra v3.9.
I'm modelling travellers' flight check-in data.
My main query is a search for travellers within a date range (a window of at most 7 days).
Here is what I've come up with, given my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting up to a million records, so each partition would hold about a million records.
My users want a search in which the date window is mandatory (at most one week). In this case, should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think about remodelling the data? Alternatively, I'm also wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but it will scale better if you can add another field to the partition key. You should also use a separate query per partition and run them with executeAsync.
If you are using in clause, this means that you’re waiting on this single coordinator node to give you a response, it’s keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
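Instead of using the IN clause, use a separate query for each day and execute each with executeAsync.
Java example:
PreparedStatement statement = session.prepare(
    "SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// checkinDays, fromTimestamp and toTimestamp are placeholders for the requested range:
// one checkinDay value per day in the window, plus the lower and upper timestamp bounds
for (String checkinDay : checkinDays) {
    // Bind all three placeholders; each day is a separate single-partition query
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(checkinDay, fromTimestamp, toTimestamp));
    futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here
}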
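The per-day queries need the list of partition keys for the requested window. A minimal sketch of that step is shown below in JavaScript; it assumes `checkinDay` is stored as a `YYYY-MM-DD` string, which the question does not state, and the helper name `checkinDaysInRange` is hypothetical.

```javascript
// Lists the checkinDay partition keys for an inclusive date range.
// Assumes keys are formatted as YYYY-MM-DD and interpreted in UTC.
function checkinDaysInRange(startDay, endDay) {
    var MS_PER_DAY = 24 * 60 * 60 * 1000;
    var start = Date.parse(startDay + "T00:00:00Z");
    var end = Date.parse(endDay + "T00:00:00Z");
    if (isNaN(start) || isNaN(end) || end < start) {
        throw new Error("invalid date range");
    }
    var count = Math.round((end - start) / MS_PER_DAY) + 1;
    // The question limits searches to a window of at most 7 days
    if (count > 7) {
        throw new Error("date range exceeds 7 days");
    }
    var days = [];
    for (var i = 0; i < count; i++) {
        days.push(new Date(start + i * MS_PER_DAY).toISOString().slice(0, 10));
    }
    return days;
}
```

Each returned key would then be bound to its own single-partition query, so a full week costs at most seven asynchronous requests.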

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, since the query contains a subquery. So I'll describe the goal of the query and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group, calling the result sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
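The steps above can be checked against a small worked example. The sketch below is plain JavaScript over made-up, already-joined rows (each position carries its currency code); the function name and sample values are illustrative only and are not part of the Access database.

```javascript
// Hypothetical rows: Positions already joined to Currency on currency_id.
var positions = [
    { code: "A", portfolio: 1, position: 100 },
    { code: "A", portfolio: 1, position: -250 },
    { code: "B", portfolio: 1, position: 40 },
    { code: "C", portfolio: 1, position: 999 },  // filtered out: code not in ("A", "B")
    { code: "B", portfolio: 2, position: 500 }   // filtered out: portfolio is not 1
];

function sumOfAbsoluteGroupSums(rows, portfolio, codes) {
    // Filter, then sum positions per currency code
    var sums = {};
    rows.forEach(function (row) {
        if (row.portfolio === portfolio && codes.indexOf(row.code) !== -1) {
            sums[row.code] = (sums[row.code] || 0) + row.position;
        }
    });
    // Take the absolute value of each group sum and add them up
    return Object.keys(sums).reduce(function (total, code) {
        return total + Math.abs(sums[code]);
    }, 0);
}

// Group sums are A: -150 and B: 40, so the result is 150 + 40 = 190
var sumAbs = sumOfAbsoluteGroupSums(positions, 1, ["A", "B"]);
```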
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
In order to calculate the final sum, I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way to structure the query so that the export works?
I can't see much wrong with it, but I would take out some of the junk Access puts in and scale the query down to this, which should hopefully run OK:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I solved my problem by taking advantage of the fact that the outer query only does a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
You can click on Create a PivotTable report... there. If you define the PivotTable properly, Excel will display only the outer sum.
