I have recently started working with Kusto. I am stuck on a use case and need to confirm that the approach I am taking is right.
I have data in the following format (TIME and STATUS columns; see the reproduction script below).
In this example, if the status is 1 and it stays 1 for a time frame of 15 seconds, I need to count that as one occurrence.
So in this case there are 2 occurrences of the status.
My approach was:
if the current and the next row's STATUS are both equal to 1, take the time difference between the rows and accumulate it with row_cumsum, restarting the sum once the status is no longer 1.
Even though this approach gives me the correct output, I assume performance can degrade once the data size increases.
I am looking for an alternative approach, if there is one. I am also adding the complete scenario to reproduce this with sample data.
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeTrue() {
    range LoopTime from ago(365d) to now() step 6s
    | project TIME=LoopTime, STATUS=toint(1)
}
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeFalse() {
    range LoopTime from ago(365d) to now() step 29s
    | project TIME=LoopTime, STATUS=toint(0)
}
.set-or-append FAKEDATA <| InsertFakeTrue();
.set-or-append FAKEDATA <| InsertFakeFalse();
FAKEDATA
| order by TIME asc
| serialize
| extend cstatus = STATUS
| extend nstatus = next(STATUS)
// Accumulate the seconds between consecutive rows while both rows have status 1;
// the running sum restarts whenever the current row's status is not 1.
| extend WindowRowSum = row_cumsum(iff(nstatus == 1 and cstatus == 1, datetime_diff('second', next(TIME), TIME), 0), cstatus != 1)
// At the last row of each status-1 window, convert the accumulated seconds into 15-second occurrences.
| extend windowCount = iff(nstatus != 1 or isnull(next(TIME)), iff(WindowRowSum == 15, 1, iff(WindowRowSum > 15, (WindowRowSum / 15) + ((WindowRowSum % 15) / 15), 0)), 0)
| summarize IDLE_COUNT = sum(windowCount)
The approach in the question is the way to achieve such calculations in Kusto, and given that the logic requires sorting, it is also efficient (as long as the sorted data can reside on a single machine).
Regarding the union operator: it runs in parallel by default, and you can control the concurrency and spread using hints; see the union operator documentation.
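As an illustration of those hints, here is a minimal sketch; the hint names come from the union documentation, the hint values are arbitrary examples, and the two functions are the helpers defined above:

union hint.concurrency=4 hint.spread=2 (InsertFakeTrue()), (InsertFakeFalse())
| summarize count() by STATUS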
I want to calculate the statistical mode of a column while summarizing a table.
The CalculateMode function I tried looks like this:
.create function CalculateMode(Action:int, Device:string, Start:long, End:long) {
    Event
    | where Time between (Start .. End) and IdAction == Action and IdDevice == Device
    | summarize Count = countif(isnotnull(Result) and isnotempty(Result)) by ActionResult = tostring(Result)
    | top 1 by Count desc
    | project ActionResult
}
OR
.create function CalculateMode(T:(data:dynamic)) {
    T
    | summarize Count = countif(isnotnull(data) and isnotempty(data)) by data = tostring(data)
    | top 1 by Count desc
    | project data
}
When I use the first function in a summarize:
Event
| summarize Result = CalculateMode(toint(IdAction), tostring(IdDevice), Start, End) by Category
I get this error: No tabular expression statement found, and
when I use the second function in a summarize:
Event
| summarize Result = CalculateMode(Result) by Category
I get this error
CalculateMode(): argument #1 must be a tabular expression
What can I do? Where am I going wrong?
Thanks
You can't just do summarize Result = CalculateMode(Result). You have to decide which aggregation function you want to summarize by (see the full list of aggregation functions in the documentation).
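For the specific case of a mode, one common pattern needs only built-in aggregates: count the value frequencies, then keep the most frequent value per group with arg_max. A minimal sketch, assuming the Event table with the Category and Result columns from the question:

Event
| where isnotnull(Result) and isnotempty(Result)
| summarize Count = count() by Category, Result = tostring(Result)
| summarize arg_max(Count, Result) by Category
| project Category, Mode = Result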
While writing a Kusto query to create a custom chart on my Azure dashboard, I want to be able to calculate the time grain based on the period the user selected on the dashboard.
For example: last 4h => time grain 2 mins; last 24h => 10 mins.
I tried the following to calculate the period myself, because the selected period still cannot be accessed directly (as far as I could find on the internet).
let timeGrain = traces
| summarize min_time = min(timestamp), max_time = max(timestamp)
| extend timeWindow = max_time - min_time // days / hrs/ min / seconds
| project timeWindow
| extend timeGrain = case(timeWindow <= 4h, "2m",
timeWindow <= 12h, "5m",
timeWindow <= 24h, "10m",
"2h")
| project timeGrain;
The query returns the time grain I want, but I am unable to use this variable inside my other query.
traces
...
| summarize percentile(DurationInMs, 50) by bin(timestamp, timeGrain), CommandType
| render areachart with (ytitle = "Duration In Ms", xtitle = "Timestamp");
(I know traces isn't the best place to store duration data; we are going to change this to metrics, but that's outside the scope of this question.)
This gives me the following error: 'summarize' operator: Failed to resolve scalar expression named 'timeGrain'
Is there a way to fix this error or is there a better way to create a dynamic time grain?
Obviously I do not have the same fields in my traces, but the point is: you should use a timespan instead of a string to define timeGrain.
Also, to use the query result timeGrain as a variable, use toscalar (docs):
let timeGrain = toscalar(traces
| summarize min_time = min(timestamp), max_time = max(timestamp)
| extend timeWindow = max_time - min_time // days / hrs/ min / seconds
| project timeWindow
| extend timeGrain = case(timeWindow <= 4h, 2m,
timeWindow <= 12h, 5m,
timeWindow <= 24h, 10m,
2h)
| project timeGrain);
traces
| summarize count() by bin(timestamp, timeGrain)
| order by timestamp desc
This works just fine.
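Applied to the original percentile query from the question (with the same let timeGrain = toscalar(...) statement preceding it), the binning then resolves correctly:

traces
| summarize percentile(DurationInMs, 50) by bin(timestamp, timeGrain), CommandType
| render areachart with (ytitle = "Duration In Ms", xtitle = "Timestamp")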
This may not be a direct answer to the question, but it may be useful for others who do not want to create logic to infer the time grain from the time range.
Use a workbook to create the chart from an Application Insights query. Add a time range parameter and refer to the parameter in the query: {TimeRange:grain} gives you the granularity corresponding to the selected time range. Now pin the query part to the dashboard and voila! Your chart uses the time range selected on the dashboard and honors its auto-refresh setting.
Create workbook and pin parts to dashboard: https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/workbooks-overview
Time range parameter: https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/workbooks-time
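As a sketch, a workbook query referencing the parameter could look like this (assuming the time range parameter is named TimeRange and the traces schema from the question above):

traces
| summarize percentile(DurationInMs, 50) by bin(timestamp, {TimeRange:grain}), CommandType
| render areachart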
My Dataset looks like the one below; I want to fetch the value in the first row, first column (A1 in this case).
+-------+---+--------------+----------+
|account|ccy|count(account)|sum_amount|
+-------+---+--------------+----------+
| A1|USD| 2| 500.24|
| A2|SGD| 1| 200.24|
| A2|USD| 1| 300.36|
+-------+---+--------------+----------+
I can do this as below:
Dataset finalDS = dataset.groupBy("account", "ccy")
        .agg(count("account"), sum("amount").alias("sum_amount"))
        .orderBy("account", "ccy");

Object[] items = (Object[]) finalDS
        .filter(functions.col("sum_amount").equalTo(300.36))
        .collect();

String accountNo = (String) ((GenericRowWithSchema) items[0]).get(0);
2 questions:
Is there any other or more efficient way to do this? I am aware of Dataframe/JavaRDD queries.
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
Is there any other or more efficient way to do this? I am aware of Dataframe/JavaRDD queries.
You'd be better off using the Dataset.head (javadocs) function, to avoid shipping all the data to the driver process. This loads only the first row into driver RAM instead of the entire dataset; here that would look roughly like ((Row) finalDS.head()).getString(0). You can also consider the take function to obtain the first N rows.
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
It depends on how your dataset is typed. In the case of a DataFrame (which is Dataset[Row], proof), you'll get an Array[Row] on a call to collect. It's worth mentioning the signature of the collect function:
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
I have the following two queries, which I am running on the Azure Kusto online query terminal
(available at this link - https://dataexplorer.azure.com/clusters/help/databases/Samples).
// This is lag-related code,
// but we need to serialize first in some way, in other words sort it.
StormEvents
| order by StartTime
| extend LaggedOutput = next(State, 2, "NOTHING FOUND")
| project State, LaggedOutput;
// Let's try coalesce:
// next inside the coalesce returns an empty string, and that is replaced with our replacement.
// Note: I think we can forgo coalesce completely, because next
// already has a default value.
StormEvents
| order by StartTime
| project coalesce(next(State, 2, ""), "COALSESCE");
So my question is: why bother with coalesce at all, when next() already provides a default value that I can apply in this scenario?
In the scenario you provided, the answer is yes: you can remove the coalesce from the 2nd query and just use the next() function with a default value, like below:
StormEvents
| order by StartTime
| project next(State,2,"COALSESCE")
The output is the same as using project coalesce(next(State,2,""),"COALSESCE").
But there are other scenarios, such as wanting the first non-null value among several values, as in the sample below:
print result=coalesce(tolong("not a number"), tolong("42"), 33)
Here, only coalesce can get the first non-null value => 42. This is a scenario the next() function cannot cover.
When grabbing search results using the Azure Log Analytics Search REST API,
I'm able to receive only the first 5000 results (per the specs, at the top of the document), but I know there are many more (from the "total" attribute in the response metadata).
Is there a way to paginate so as to get the entire result set?
One hacky way would be to iteratively break the desired time range down until the "total" is under 5000 for each sub-range, repeating across the entire desired time range - but that is guesswork that will cost many redundant requests.
While there doesn't appear to be a way to paginate using the REST API itself, you can use your query to perform the pagination. The two key operators here are top and skip.
Suppose you want page n with page size x (starting at page 1); then append to your query:
query | skip (n-1) * x | top x
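For example, page 3 with a page size of 100 (n = 3, x = 100) becomes:

query | skip 200 | top 100

i.e. skip the first two pages' 200 results and return the next 100.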
For a full reference list, see https://learn.microsoft.com/en-us/azure/log-analytics/log-analytics-search-reference
Yes, the skip operator is not available anymore, but if you want to create pagination there is still an option. You need to count the total number of entries, apply some simple math, and use two opposite sortings: sort ascending and take everything up to the end of the requested page, then sort descending and keep only that page's rows.
Prerequisites for this query are the values ContainerName, Namespace, Page, and PageSize.
I'm using it in a workbook where these values are set by fields.
let containers = KubePodInventory
| where ContainerName matches regex '^.*{ContainerName}$' and Namespace == '{Namespace}'
| distinct ContainerID
| project ContainerID;
let TotalCount = toscalar(ContainerLog
| where ContainerID in (containers)
| where LogEntry contains '{SearchText}'
| summarize CountOfLogs = count()
| project CountOfLogs);
ContainerLog
| where ContainerID in (containers)
| where LogEntry contains '{SearchText}'
// Strip ANSI escape sequences from the log lines; this '| extend' is not needed if the logs do not contain those annoying special characters.
| extend Log = replace(@'(\x1b\[[0-9]*m|\x1b\[0 [0-9]*m)', '', LogEntry)
| project TimeGenerated, Log
| sort by TimeGenerated asc
// Ascending pass: take every row up to the end of the requested page...
| take {PageSize} * {Page}
// ...then the descending pass keeps only the rows that belong to the requested page.
| top iff({PageSize} * {Page} > TotalCount, TotalCount - ({PageSize} * ({Page} - 1)), {PageSize}) by TimeGenerated desc;