In T-SQL, when grouping results, you can also get a grand-total row by specifying WITH ROLLUP.
How can I achieve this in Kusto? Consider the following query:
customEvents | summarize counter = count() by name
The query above gives me a list of event names and how often they occurred. This is what I need, but I also want a row with the grand total (the count of all events).
It feels like there should be an easy way to achieve this, but I haven't found anything in the docs...
You can write two queries: the first counts the number of each event, the second counts the total number of events. Then use the union operator to combine them.
The query looks like below:
customEvents
| count
| extend name = "total", counter = Count
| project name, counter
| union
(customEvents
| summarize counter = count() by name)
I am very new to KQL, and I am stuck on this query. I am looking to have the query display which users have had sign-ins from different states. I created this query, but I do not know how to count the results in the column "names".
SigninLogs
| project tostring(LocationDetails.state), UserDisplayName
| extend p = pack('Locations', LocationDetails_state)
| summarize names = make_set(p) by UserDisplayName
This generates a column "names" with a row like so:
[{"Locations":"Arkansas"},{"Locations":"Iowa"},{"Locations":""}]
Here is a simple query that grabs all sign-ins from users and another column with the locations.
SigninLogs
| where ResultType == "0"
| summarize by UserDisplayName, tostring(LocationDetails.state)
Is there a way to combine the duplicates in the users column, and then display each location in the second column? If so, could I count each location in order to filter to where the location count is > 1?
I am looking to have query display which users have had sign-ins from different states
Assuming I understood your question correctly, this could work (using array_length()):
SigninLogs
| project State = tostring(LocationDetails.state), UserDisplayName
| summarize States = make_set(State) by UserDisplayName
| where array_length(States) > 1 // filter users who had sign-ins from *more than 1 state*
I have a statement where I try to concatenate logs (strings) together into a single string.
Something like this:
ContainerLog
| where conditions
| summarize strcat(LogEntry)
However I can't figure out how to concatenate strings like this, since strcat is not an aggregation function. I need something else but don't know what.
How can I do this?
For example, if I have log entries "1", "2", "3", the final result should be "123".
This gives you a distinct combination. In concrete terms:
requests
| where url contains "prod"
| summarize count(), code=make_set(resultCode) by name
| sort by count_ desc
Reference: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/summarizeoperator, https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/makeset-aggfunction
I'm trying to create a query in Application Insights that can show me the absolute and average number of messages in conversations over a particular time period. I'm using the LUIS trace example to get the context+LUIS information, which is where I'm pulling the conversationID from. I can get a table showing the number of messages per conversation, but I would also like to have an average number of messages for the data set. Either a static average or a rolling average (by pulling in timestamp) would be fine. I can get this value by doing a second summarize statement, but then I lose the granularity from the first. Here is my query.
requests
| where url endswith "messages"
| where timestamp > ago(30d)
| project timestamp, url, id
| parse kind = regex url with *"(?i)http://"botName".azurewebsites.net/api/messages"
| join kind= inner (
traces | extend id = operation_ParentId
) on id
| where message == "LUIS"
| extend convID = tostring(customDimensions.LUIS_botContext_conversation_id)
| order by timestamp desc nulls last
| project timestamp, botName, convID
| summarize messages=count() by conversation=convID
This gives me a table of conversation IDs with the message count for each conversation. I would also like to see the average number of messages per conversation. For example, if I have 4 conversations with 100 messages total, I want to see that the average is 25. I can get this result by doing a second summarize statement | summarize messages=sum(messages), avgMessages=avg(messages), but then of course I can no longer see the individual conversations. Is there any way to see both in the same table?
You can write 2 queries: one for "gives me a table of conversation IDs with the message count for each conversation", and another for "the average number of messages per conversation". And consider using a let statement for your query.
The trick here is that in both of the 2 queries, after the summarize statement, you add a line of code at the end, like | extend myidentifier="aaa".
Then you can join the 2 queries on myidentifier.
I couldn't figure out how to do this without losing granularity from the first list (i.e. I couldn't figure out how to calculate the average per period, e.g. per day), but the following query does at least get me the average across whatever timestamp filter I set, which ultimately gets me the data I was looking for.
requests
| where url endswith "messages"
| where timestamp > ago(30d)
| project timestamp, url, id
| parse kind = regex url with *"(?i)http://"botName".azurewebsites.net/api/messages"
| join kind= inner (
traces | extend id = operation_ParentId
) on id
| where message == "LUIS"
| extend convID = tostring(customDimensions.LUIS_botContext_conversation_id)
| order by timestamp desc nulls last
| project timestamp, botName, convID
| summarize messages=count() by conversation=convID
| summarize conversations=count(), messageAverage=avg(messages)
I'm looking for examples of the code using Python 3, no links to the documentation. I haven't found examples in the documentation.
I'm looking to query 2 elements with the category "red", starting at ID 1.
This is my table:
| ID | category | description |
| 0 | red | .... |
| 1 | red | .... |
| 2 | blue | .... |
| 3 | red | .... |
| 4 | red | .... |
The query should return the elements with IDs 1 and 3.
Looking forward to reading your examples. Thanks in advance.
In DynamoDB you query over your partition key, LSIs, or GSIs.
In your case I would create a GSI with its partition key (gsiID) as your category and its sort key (gsiSK) as your ID.
In that case you can do a query like this: query all elements with gsiID = red and gsiSK = *.
This will give you all the reds sorted by their ID in ascending order (you can also specify descending order).
Now DynamoDB queries have an option to limit your result. Since you need two, you can do a limit = 2.
I hope this will help you!
You need to define a Global Secondary Index in which the partition key is category and the sort key is ID.
Once you have that index defined, you can query it as follows (I am using the JS notation, sorry):
{
  TableName: 'your_table_name',
  IndexName: 'your_index_name',
  KeyConditionExpression: 'category = :x and ID >= :y',
  ExpressionAttributeValues: {
    ':x': 'red',
    ':y': 1
  }
}
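Since the question asks for Python 3, a rough boto3 equivalent of the same query might look like this (the table and index names are placeholders, as in the JS example above):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("your_table_name")   # placeholder table name

# Query the GSI whose partition key is "category" and sort key is "ID".
response = table.query(
    IndexName="your_index_name",            # placeholder index name
    KeyConditionExpression=Key("category").eq("red") & Key("ID").gte(1),
)
items = response["Items"]                   # matching items from this page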
Note that this is a query. In DynamoDB, queries work on "chunks" of items (aka "pages"). Specifically, when executing a query, DDB takes a chunk, finds all matching items in that chunk and returns them. If there are other matching items in other chunks they will not be returned. However, the response will provide you with details of the next chunk so that you can issue a subsequent query on the next chunk. These "details" are encapsulated in the LastEvaluatedKey field of the response and they should be copied into the ExclusiveStartKey of the subsequent request.
You can check this guide to see an example of using LastEvaluatedKey. Look for the following line:
while 'LastEvaluatedKey' in response:
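In boto3 the paging loop typically looks roughly like this (reusing the table object and the Key conditions from the sketch above):
# Keep querying while DynamoDB reports another page via LastEvaluatedKey.
query_kwargs = {
    "IndexName": "your_index_name",
    "KeyConditionExpression": Key("category").eq("red") & Key("ID").gte(1),
}
items = []
response = table.query(**query_kwargs)
items.extend(response["Items"])

while "LastEvaluatedKey" in response:
    # Resume the query exactly where the previous page stopped.
    query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    response = table.query(**query_kwargs)
    items.extend(response["Items"])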
Important!
Although you want to get just two items, you do not want to set the Limit field to 2. Setting it to 2 means that DynamoDB will use very small chunks when looking for items that match your query (in fact, it will use chunks of just two items): this means you will need to do numerous repeated queries (by using LastEvaluatedKey/ExclusiveStartKey as explained above) until you actually find two matching items. This will considerably slow down the entire process. For most practical scenarios, the best thing to do is not to set the Limit field at all, and just use its default value.
I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all tasks that are applicable within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here, because to add a range query to Ends_On, I have to have an equality check for Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
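For illustration, a minimal sketch of that client-side filtering with the Python driver; the contact point, keyspace, and a coarser partition key p that groups the tasks are all assumptions:
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])             # assumed contact point
session = cluster.connect("your_keyspace")   # assumed keyspace

t1 = datetime(2015, 8, 24)                   # start of the period of interest
t2 = datetime(2015, 8, 25)                   # end of the period of interest

# Server side: range query on the clustering column starts_on only.
rows = session.execute(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on >= %s",
    (1, t1),
)

# Client side: discard rows whose ends_on falls outside the range.
overlapping = [row for row in rows if row.ends_on <= t2]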
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_ids and you have to ignore the null rows (I haven't figured out a way to drop rows using UDFs). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On<? where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.
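As a rough sketch of that approach with the Python driver (the keyspace, a partition key p that groups the tasks, and the fetch_size value are assumptions; fetch_size is the paging size discussed above):
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])             # assumed contact point
session = cluster.connect("your_keyspace")   # assumed keyspace

period_start = datetime(2015, 8, 24)
period_end = datetime(2015, 8, 25)

# Starts_On is the clustering column, ordered DESC; only its upper bound is constrained.
query = SimpleStatement(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on < %s",
    fetch_size=100,   # tune to roughly the number of relevant rows you expect back
)

overlapping = []
for row in session.execute(query, (1, period_end)):
    if row.ends_on < period_start:
        break   # assuming non-overlapping events, no later row can still overlap the period
    overlapping.append(row)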