KQL join two tables with different TimeStamp - azure

I'm working with KQL and trying to join two tables on a timestamp field. The problem is that they have different values when it comes to seconds.
The table "TableToJoin" ingests a record every minute (so the seconds are always 00), while the MeasureTime column I generate has different seconds depending on when I hit the run button (it starts counting 36 hours back from now).
Do you know a method I could use to fix this?
My code is below:
range MeasureTime from ago(36h) to now() step 10m
| join kind=rightouter (
    TableToJoin
    | where TagName == 'TagName'
) on $left.MeasureTime == $right.Timestamp
| take 10
TableToJoin TimeStamp:
2021-11-01T14:09:00Z
2021-11-01T14:08:00Z
2021-11-01T14:06:00Z
2021-11-01T14:05:00Z
2021-11-01T14:04:00Z
2021-11-01T14:03:00Z
2021-11-01T14:02:00Z
2021-11-01T14:01:00Z
2021-11-01T14:00:00Z
MeasureTime TimeStamp:
2021-11-01T13:59:20.5230363Z
2021-11-01T14:00:20.5230363Z
2021-11-01T14:01:20.5230363Z
2021-11-01T14:02:20.5230363Z
2021-11-01T14:03:20.5230363Z
2021-11-01T14:04:20.5230363Z
2021-11-01T14:05:20.5230363Z
2021-11-01T14:06:20.5230363Z
Thanks in advance

You can use the bin() function to "round" the timestamp so both sides land on the same minute boundary. For example:
range MeasureTime from ago(36h) to now() step 10m
| extend MeasureTime = bin(MeasureTime, 1m)
| join kind=rightouter (
    TableToJoin
    | where TagName == 'TagName'
) on $left.MeasureTime == $right.Timestamp
| take 10

Related

How can I do a "GROUP BY WITH ROLLUP" in Kusto?

In T-SQL, when grouping results, you can also get a running total row when specifying "WITH ROLLUP".
How can I achieve this in Kusto? So, consider the following query:
customEvents | summarize counter = count() by name
The query above gives me a list of event names and how often they occurred. This is what I need, but I also want a row with the running total (the count of all events).
It feels like there should be an easy way to achieve this, but I haven't found anything in the docs...
You can write two queries: the first counts the occurrences of each event, and the second counts all events. Then combine them with the union operator.
The query looks like this:
customEvents
| count
| extend name = "total", counter = Count
| project name, counter
| union (
    customEvents
    | summarize counter = count() by name
)
The test result shows each event name with its count, plus a "total" row containing the overall count.

Optimizing Theta Joins in Spark SQL

I have just 2 tables. I need to get the records from the first table (a big table, ~10 M rows) whose transaction date is less than or equal to the effective date in the second table (a small table with 1 row); this result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt |
30/07/2020 |
Hence, per the logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date (in the REF table).
So I have used a non-equi Cartesian join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL is taking forever to complete due to the cross join with the transact table, even when using the broadcast hint.
So is there a smarter way to implement the same logic that would be more efficient? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
So I would try something like this, referring to https://databricks.com/session/optimizing-apache-spark-sql-joins:
Try bucketing on tran_dt (bucketed on year/month only) and split the work into two queries.
First query: tran_dt(year/month) < eff_dt(year/month). This lets you pick up whole buckets (rather than checking each and every record's tran_dt) that are earlier than 2020/07.
Second query: tran_dt(year/month) = eff_dt(year/month) and tran_dt(day) <= eff_dt(day).
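Here is a rough PySpark sketch of that idea, under a few assumptions not in the original post: a derived tran_ym year/month column, a bucketed copy of the table named transact_bucketed, dates stored as dd/MM/yyyy strings, and an arbitrary bucket count of 64. It also pulls the single eff_dt value to the driver so the join disappears entirely:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("theta-join-sketch").enableHiveSupport().getOrCreate()

# One-off: rewrite the big table bucketed (and sorted) by a coarse year/month key.
(spark.table("transact")
      .withColumn("tran_ym", F.date_format(F.to_date("tran_dt", "dd/MM/yyyy"), "yyyyMM"))
      .write.bucketBy(64, "tran_ym").sortBy("tran_ym")
      .mode("overwrite").saveAsTable("transact_bucketed"))

# REF holds a single row, so fetch the effective date to the driver once.
eff_dt = spark.table("ref").select(F.to_date("eff_dt", "dd/MM/yyyy").alias("d")).first()["d"]
eff_ym = eff_dt.strftime("%Y%m")

tx = spark.table("transact_bucketed").withColumn("tran_d", F.to_date("tran_dt", "dd/MM/yyyy"))

# First query: whole year/month buckets strictly before the effective month.
q1 = tx.where(F.col("tran_ym") < eff_ym)

# Second query: the effective month itself, filtered down to the day.
q2 = tx.where((F.col("tran_ym") == eff_ym) &
              (F.col("tran_d") <= F.to_date(F.lit(eff_dt.isoformat()))))

result = q1.unionByName(q2).select("tran_id", "cust_id", "tran_amt", "tran_dt")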

Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all tasks that fall within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here, because to add a range query on Ends_On I would have to have an equality check on Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
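For example, with the DataStax Python driver the client-side filtering could look roughly like this; the partition column p, the keyspace name, and the exact column names are assumptions for illustration (you would need some partition key you can restrict on, as the other answers discuss):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # keyspace name is an assumption

# CQL only constrains starts_on; ends_on is checked in the application.
query = session.prepare(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = ? AND starts_on <= ?")

def overlapping_tasks(partition, t1, t2):
    # A task overlaps [t1, t2] iff starts_on <= t2 (enforced by the query)
    # and ends_on >= t1 (filtered client-side below).
    rows = session.execute(query, (partition, t2))
    return [row for row in rows if row.ends_on >= t1]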
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
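Since the null rows have to be discarded client-side, a tiny sketch of that last step with the Python driver might look like this (the keyspace name test is taken from the output above; the connection details are assumptions):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("test")

# Run the UDF query from above and drop the rows where the UDF returned null.
rows = session.execute(
    "SELECT start, end, my_end_range(task_id, end, end_range) AS task_id "
    "FROM tasks WHERE p = 1 AND start >= 2")
matching_ids = [row.task_id for row in rows if row.task_id is not None]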
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On<? where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, which again could hurt performance by transferring more data than necessary. In practice you can probably find a good compromise.
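A rough sketch of this approach with the Python driver, assuming events don't overlap (as noted above) and that the table has some bucketing partition column p; the names and the page size are assumptions:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")   # keyspace is an assumption

def tasks_in_period(partition, period_start, period_end, page_size=100):
    # Table clustered on starts_on DESC; only the start of each task is constrained.
    stmt = SimpleStatement(
        "SELECT task_id, starts_on, ends_on FROM tasks "
        "WHERE p = %s AND starts_on < %s",
        fetch_size=page_size)
    hits = []
    for row in session.execute(stmt, (partition, period_end)):
        if row.ends_on < period_start:
            # Rows arrive in descending starts_on order and events don't overlap,
            # so nothing further back can still reach into the period: stop here.
            break
        hits.append(row)
    return hits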

Cassandra : Making an appropriate Data Model

I have a table called Price in MySQL which looks like this:
+---------+-------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+-------------------+-----------------------------+
| Current | float(20,3) | YES | | NULL | |
| Time | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+---------+-------------+------+-----+-------------------+-----------------------------+
My application requires me to sum and retrieve results from the last 1 hour, 2 hours, and so on up to the last week, counting back from now. I am trying to move to Cassandra and want to make a suitable model for my data. Currently I have built a table in Cassandra which looks something like this:
CREATE TABLE IF NOT EXISTS HAS.Price (
ID INT,
Current float,
Time timestamp,
Time_uuid timeuuid,
PRIMARY KEY (ID, Time_uuid)
);
This is not logical, as it just creates one big table, and I don't think this will distribute data to other nodes. I am using a fixed ID of 1 here. I believe in my case the logical partition key to choose would be the hour, so that, for example, I can sum all the Current values from the last hour, the last 2 hours, and so on. Here I am referring to this post. If I make the hour the partition key then, for example, all the data for the 15th hour of the day will go into the row
2015-08-06 15:00:00
and the data for the next hour will go into 2015-08-06 16:00:00. However, let's say the current time is 2015-08-06 16:12:43 and I want to select records from the last hour: what will my query look like, given that part of the data is in 2015-08-06 15:00:00, which has a different primary key?
Try the following option (I have corrected the answer below).
Design for your queries. Here are the possible queries I can see, other than up-to-the-minute:
Get the sum for a day
Get the sum for an hour
Get the sum for the last hour (starting at any minute)
CREATE TABLE mykeyspace.price (
day text,
hour text,
inserttime timeuuid,
current float,
PRIMARY KEY ((day, hour), inserttime)
) WITH CLUSTERING ORDER BY (inserttime DESC)
Make two inserts for every transaction, like below:
insert into price (day, hour, inserttime, current) VALUES ('20150813', '', now(), 2.00);
insert into price (day, hour, inserttime, current) VALUES ('', '2015081317', now(), 2.00);
where
day is YYYYMMDD (e.g. 20150813)
hour is YYYYMMDDhh (e.g. 2015081317)
Select query to get the last hour at any minute, using minTimeuuid and maxTimeuuid:
select day, hour, dateOf(inserttime) from price where day = '' and hour IN ('2015081317', '2015081316') and inserttime > maxTimeuuid('2015-08-13 16:20:00-0500') and inserttime < minTimeuuid('2015-08-13 17:20:00-0500');
Note: a range query is not allowed on a partition key; the documentation says you could use the token() function, but the results are not predictable.
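To make the "last hour at any minute" case concrete, here is a small Python-driver sketch that sums over the (at most) two hour buckets the window spans, building the timeuuid bounds client-side with the driver's helpers; the contact point is an assumption, while the keyspace and column names come from the schema above:
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster
from cassandra.util import max_uuid_from_time, min_uuid_from_time

session = Cluster(["127.0.0.1"]).connect("mykeyspace")

select = session.prepare(
    "SELECT current FROM price "
    "WHERE day = '' AND hour = ? AND inserttime > ? AND inserttime < ?")

def sum_last_hour(now=None):
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=1)
    # The window covers at most two hour buckets, e.g. 2015081316 and 2015081317.
    buckets = sorted({t.strftime("%Y%m%d%H") for t in (start, now)})
    lower = max_uuid_from_time(start)   # plays the role of maxTimeuuid(start)
    upper = min_uuid_from_time(now)     # plays the role of minTimeuuid(now)
    return sum(row.current
               for bucket in buckets
               for row in session.execute(select, (bucket, lower, upper)))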
This is not logical as it just creates one big table and i dont think this will distribute data to other nodes.
Yes, this won't distribute data across your nodes.
Here is what I think the solution should be:
CREATE TABLE IF NOT EXISTS HAS.Price (
Time_uuid timeuuid,
Current float,
PRIMARY KEY (Time_uuid)
);
Then simply find the time_uuid for the start of the hour and the end of the hour, and write a query like:
SELECT * FROM HAS.Price WHERE time_uuid >= cdb36860-4444-11e5-8080-808080808080 AND time_uuid <= f784b8ef-450d-11e5-7f7f-7f7f7f7f7f7f

Data scheme Cassandra using various data types

Currently I am developing a solution in the field of time-series data. Within these data we have: an ID, a value and a timestamp.
So here it comes: the value might be of type boolean, float or string. I am considering three approaches:
a) A distinct table for every data type: all sensor values of type boolean go into one table, all values of type string into another, and so on. The obvious disadvantage is that you have to know where to look for a certain sensor.
b) A meta-column describing the data type, plus all values stored as strings. The obvious disadvantage is the data conversion, e.g. for calculating MAX, AVG and so on.
c) Three columns of different types, of which only one holds a value per record. The disadvantage: with 500,000 sensors firing every 100 ms, that is plenty of unused space.
As my knowledge is limited any help is appreciated.
500000 sensors firing every 100ms
The first thing is to make sure that you partition properly, so that you don't exceed the limit of 2 billion columns per partition.
CREATE TABLE sensorData (
stationID uuid,
datebucket text,
recorded timeuuid,
intValue bigint,
strValue text,
blnValue boolean,
PRIMARY KEY ((stationID,datebucket),recorded));
With half a million readings every 100 ms, that's 5 million a second. So you'll want to make your datebucket very granular, down to the second. Next I'll insert some data:
stationid | datebucket | recorded | blnvalue | intvalue | strvalue
--------------------------------------+---------------------+--------------------------------------+----------+----------+----------
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6338df40-e929-11e4-88c8-21b264d4c94d | null | 59 | null
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 633e0f60-e929-11e4-88c8-21b264d4c94d | null | null | CD
8b466f1d-8d6b-46fa-9f5b-8c4eb51aa40c | 2015-04-22T14:54:29 | 6342f160-e929-11e4-88c8-21b264d4c94d | True | null | null
3221b1d7-13b4-40d4-b41c-8d885c63494f | 2015-04-22T14:56:19 | a48bbdf0-e929-11e4-88c8-21b264d4c94d | False | null | null
...plenty of unused space.
You might be surprised. With the CQL output of SELECT * above, it appears that there are null values all over the place. But watch what happens when we use the cassandra-cli tool to view how the data is stored "under the hood":
RowKey: 3221b1d7-13b4-40d4-b41c-8d885c63494f:2015-04-22T14\:56\:19
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:, value=, timestamp=1429733297352000)
=> (name=a48bbdf0-e929-11e4-88c8-21b264d4c94d:blnvalue, value=00, timestamp=1429733297352000)
As you can see, the data (above) stored for the CQL row where stationid=3221b1d7-13b4-40d4-b41c-8d885c63494f AND datebucket='2015-04-22T14:56:19' shows that blnValue has a value of 00 (false). But also notice that intValue and strValue are not present. Cassandra doesn't force a null value like an RDBMS does.
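To make the sparse-storage point concrete, here is a small Python-driver sketch that writes one row per reading and populates only the column matching the value's type; the keyspace, contact point, and helper function are assumptions, while the table follows the schema above:
from datetime import datetime, timezone
from uuid import UUID
from cassandra.cluster import Cluster
from cassandra.util import uuid_from_time

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

insert_int = session.prepare(
    "INSERT INTO sensorData (stationID, datebucket, recorded, intValue) VALUES (?, ?, ?, ?)")
insert_str = session.prepare(
    "INSERT INTO sensorData (stationID, datebucket, recorded, strValue) VALUES (?, ?, ?, ?)")
insert_bln = session.prepare(
    "INSERT INTO sensorData (stationID, datebucket, recorded, blnValue) VALUES (?, ?, ?, ?)")

def record(station_id: UUID, value):
    now = datetime.now(timezone.utc)
    bucket = now.strftime("%Y-%m-%dT%H:%M:%S")   # per-second bucket, as discussed above
    recorded = uuid_from_time(now)               # timeuuid clustering key
    if isinstance(value, bool):                  # check bool before int (bool is an int subclass)
        session.execute(insert_bln, (station_id, bucket, recorded, value))
    elif isinstance(value, int):
        session.execute(insert_int, (station_id, bucket, recorded, value))
    else:
        session.execute(insert_str, (station_id, bucket, recorded, str(value)))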
The obvious disadvantage is the data conversion e.g. for calculating the MAX, AVG and so on.
Perhaps you already know this, but I did want to mention that Cassandra CQL does not contain definitions for MAX, AVG or any other data aggregation function. You'll either need to do that client-side, or use Apache Spark to perform OLAP-type queries.
Be sure to read through Patrick McFadin's Getting Started With Time Series Data Modeling. It contains good suggestions on how to solve time series problems like this.
