Counting operations in an Azure Monitor Log Analytics query

I want to query operations like Add-MailboxPermission with FullAccess and deleted emails/calendar events to find compromised accounts (in 30-minute intervals).
1. How should I modify my code to show only operations that fulfil both conditions at the same time (if I change "or" to "and", will it check both conditions within a single log entry)?
2. How can I modify the "count" so that the results are reduced to only those rows that show at least 10? Maybe there should be another function?
OfficeActivity
| where tostring(parse_json(Parameters)[2].Value) contains "FullAccess" or Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId

Welcome to Stack Overflow!
Yes, the logical and operator returns true only if both conditions are true. Check this doc for the query language reference.
As for the second part: to keep only groups with at least 10 events, filter on the aggregated column after the summarize (the top operator, by contrast, returns the first N records sorted by the specified columns), used as follows:
OfficeActivity
| where tostring(parse_json(Parameters)[2].Value) contains "FullAccess" and Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
| where Events >= 10
Additional tip:
There are limit and take operators as well that return a result set of up to the specified number of rows, but with a caveat: there is no guarantee as to which records are returned unless the source data is sorted.
Hope this helps!
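The summarize-then-filter pattern above (count per group, then keep only groups with at least N hits) can be sketched outside Kusto too. A minimal plain-Python equivalent, with made-up user/operation pairs standing in for OfficeActivity rows:

```python
from collections import Counter

# Hypothetical sample events: (user_id, operation) pairs.
events = [
    ("alice", "Add-MailboxPermission"),
    ("alice", "SoftDelete"),
    ("bob", "SoftDelete"),
] + [("mallory", "SoftDelete")] * 12

# Equivalent of: | summarize Events=count() by UserId
counts = Counter(user for user, _ in events)

# Equivalent of: | where Events >= 10
suspicious = {user: n for user, n in counts.items() if n >= 10}
print(suspicious)  # only mallory crosses the threshold
```

The same two-step shape (aggregate, then filter on the aggregate) is what the KQL pipeline expresses.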

Related

How to monitor consecutive exceptions in Azure? (Kusto)

I want to monitor consecutive exceptions.
For example, if I get X '500' exceptions in a row, I want that to trigger an action group.
How do I write this in Kusto?
I know how to monitor the number of exceptions over a 1-minute period, but I'm a bit stuck on how to monitor consecutive exceptions.
You are looking to set up a custom log alert on Application Insights.
Here is the step-by-step guide on how to set it up.
You can use the following query with the summarize operator:
exceptions
| where timestamp >= datetime('2019-01-01')
| summarize min(timestamp) by operation_Id
Or use a query like the one below:
exceptions
| summarize count() by xxx
For more details about summarize operator, refer to this article.
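Note that "X in a row" is a run-length condition, not a plain count. As a language-agnostic sketch of that logic (plain Python, with a hypothetical list of response codes):

```python
def max_consecutive(codes, bad=500):
    """Return the length of the longest run of `bad` codes."""
    longest = current = 0
    for code in codes:
        # Extend the current run on a bad code, reset otherwise.
        current = current + 1 if code == bad else 0
        longest = max(longest, current)
    return longest

codes = [200, 500, 500, 500, 200, 500]
print(max_consecutive(codes))  # -> 3
```

An alert condition of the form max_consecutive(codes) >= X is what the question is after; the summarize-based queries above approximate it by counting within a window instead.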

Best way to filter to a specific row in pyspark dataframe

I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 were in the first partition and you'd explicitly set the limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is larger than the number of values that satisfy the predicate, or the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it work faster you'll need the data bucketed (on disk) or partitioned (in memory) on the field of interest, or you can use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
But really, if you need fast lookups, use a proper database - this is what they are designed for.
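The early-exit behavior described above can be mimicked with a generator: scanning stops as soon as enough matching rows are found, but a selective predicate whose matches sit in a later partition forces more data to be read first. A plain-Python sketch (the partitions and ids are made up):

```python
from itertools import islice

# Hypothetical partitions of (id, age) rows.
partitions = [
    [(1112, 54), (1123, 23)],
    [(1234, 37), (1251, 58)],
]

def scan(pred):
    """Lazily yield matching rows, recording partitions touched."""
    scan.touched = 0
    for part in partitions:
        scan.touched += 1
        for row in part:
            if pred(row):
                yield row

# 'age > 50' finds a match in the first partition: early exit.
list(islice(scan(lambda r: r[1] > 50), 1))
print(scan.touched)  # -> 1

# 'id = 1234' lives in a later partition: more partitions scanned.
list(islice(scan(lambda r: r[0] == 1234), 1))
print(scan.touched)  # -> 2
```

With show()'s default of 20 rows and only one matching id, the real query degenerates to a full scan, which is why the two filters behave so differently.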

How can I consume more than the reserved number of request units with Azure Cosmos DB?

We have reserved various number of RUs per second for our various collections. I'm trying to optimize this to save money. For each response from Cosmos, we're logging the request charge property to Application Insights. I have one analytics query that returns the average number of request units per second and one that returns the maximum.
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start < timestamp and timestamp < end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| make-series sum(value) default=0 on timestamp in range(start, end, 1s) by Database, Collection
| mvexpand sum_value to typeof(double), timestamp limit 36000
| summarize avg(sum_value) by Database, Collection
| order by Database asc, Collection asc
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start <= timestamp and timestamp <= end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| summarize sum(value) by Database, Collection, bin(timestamp, 1s)
| summarize arg_max(sum_value, *) by Database, Collection
| order by Database asc, Collection asc
The averages are fairly low but the maxima can be unbelievably high in some cases. An extreme example is a collection with a reservation of 1,000, an average usage of 15.59, and a maximum usage of 63,341 RUs/s.
My question is: How can this be? Are my queries wrong? Is throttling not working? Or does throttling only work on a longer period of time than a single second? I have checked for request throttling on the Azure Cosmos DB overview dashboard (response code 429), and there was none.
I have to answer myself. I found two problems:
Application Insights logs an inaccurate timestamp. I added a timestamp as a custom dimension, and within a given minute I get different seconds in my custom timestamp, but the built-in timestamp is one second past the minute for many of these. That is why I got (false) peaks in request charge.
We did have throttling. When viewing request throttling in the portal, I have to select a specific database. If I try to view request throttling for all databases, it looks like there is none.
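The aggregation the two queries perform (sum charges per second over a fixed window, then take the average or maximum, with missing seconds filled with 0 as make-series does) is easy to cross-check offline. A minimal Python sketch with made-up request charges:

```python
from collections import defaultdict

# Hypothetical (second, request_charge) samples for one collection.
samples = [(0, 5.0), (0, 10.0), (1, 3.5), (3, 60.0)]
window_seconds = 4  # seconds 0..3

# Equivalent of: summarize sum(value) by bin(timestamp, 1s)
per_second = defaultdict(float)
for second, charge in samples:
    per_second[second] += charge

# make-series fills missing seconds with default=0 before averaging.
series = [per_second.get(s, 0.0) for s in range(window_seconds)]
avg_rus = sum(series) / window_seconds
max_rus = max(series)
print(avg_rus, max_rus)  # -> 19.625 60.0
```

A single mistimed sample landing in the wrong second (the inaccurate-timestamp problem above) inflates max_rus while barely moving avg_rus, which matches the observed pattern.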

makeset operation does not preserve ordering?

The following command does not produce a consistent ordering of items:
KubePodInventory
| where ClusterName == "mycluster"
| distinct Computer
| order by Computer asc
| summarize makeset(Computer)
But upon reading the documentation (see here) it states the following:
Like makelist, makeset also works with ordered data and will generate
the arrays based on the order of the rows that are passed into it.
Is this a bug or am I doing something wrong?
As per the issue @MohitVerma mentioned, makeset() does not support ordering, and they are planning to correct the doc, which currently reads: "Like makelist, makeset also works with ordered data and will generate the arrays based on the order of the rows that are passed into it."
You can use makelist() as a workaround, which does support ordering as per my testing.
Please check this answer for the similar type of operation.
How to order item in Makeset?
The code below worked for me:
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc
You can follow this thread for the code snippet; I marked the answer to close out this thread.
https://github.com/MicrosoftDocs/azure-docs/issues/24135#issuecomment-460185491
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc | summarize makelist(client_City)
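The underlying gotcha (sets drop order, lists keep it) exists outside Kusto as well; the makelist-over-distinct workaround corresponds to an order-preserving dedup. In Python terms, with a hypothetical list of cities:

```python
cities = ["Tokyo", "Oslo", "Tokyo", "Lima", "Oslo"]

# A set deduplicates but makes no ordering promise (like makeset).
unordered = set(cities)

# dict.fromkeys deduplicates while keeping first-seen order
# (like distinct + makelist over pre-sorted input).
ordered = list(dict.fromkeys(cities))
print(ordered)  # -> ['Tokyo', 'Oslo', 'Lima']
```

This is why the answer routes the sorted, distinct values through makelist() rather than relying on makeset() to preserve anything.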

Traversing the optimum path between nodes

In a graph where there are multiple paths from node (:A) to (:B) through node (:C), I'd like to extract the paths from (:A) to (:B) through nodes of type (c:C) where c.Value is maximal. For instance, connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may cost much?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
For those who are willing to do similar things, consider reading this post from @_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).
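The head(collect()) trick after an order by is just a per-group argmax, which can be done in a single pass without a full sort. A Python sketch with hypothetical (movie1, movie2, actor, age) rows:

```python
rows = [
    ("Heat", "Ronin", "Robert De Niro", 80),
    ("Heat", "Ronin", "Jean Reno", 75),
    ("Alien", "Prometheus", "Noomi Rapace", 44),
]

oldest = {}
for m1, m2, actor, age in rows:
    key = (m1, m2)
    # Keep the actor with the maximum age per movie pair (argmax).
    if key not in oldest or age > oldest[key][1]:
        oldest[key] = (actor, age)

print(oldest[("Heat", "Ronin")])  # -> ('Robert De Niro', 80)
```

This keeps name and age together per group, which is exactly what the plain max(a.Age) aggregation in the first Cypher query fails to guarantee.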
