I have a simple query like the one below:
union isfuzzy=true
availabilityResults,
requests,
exceptions,
pageViews,
traces,
customEvents,
dependencies
| order by timestamp desc
| take 100
This returns all available columns, which is fine. Then I use the following:
union isfuzzy=true
availabilityResults,
requests,
exceptions,
pageViews,
traces,
customEvents,
dependencies
| order by timestamp desc
| take 100
| project customDimensions.ApplicationName
This only returns the ApplicationName column, which is also fine.
But what I want is to get an additional column on top of the existing ones, similar to:
union isfuzzy=true
availabilityResults,
requests,
exceptions,
pageViews,
traces,
customEvents,
dependencies
| order by timestamp desc
| take 100
| project *, customDimensions.ApplicationName
But the * wildcard does not work here. Is there any way to achieve this?
If I understand correctly, you want the result table to include all existing columns, plus another calculated column on top of those.
If that's correct, you can use the extend operator.
e.g.:
union isfuzzy=true
availabilityResults,
requests,
exceptions,
pageViews,
traces,
customEvents,
dependencies
| top 100 by timestamp desc
| extend ApplicationName = customDimensions.ApplicationName
Related
So I've got an HTTP function that writes logs to App Insights when it's invoked.
I want to know when a period of time elapses during which the HTTP function isn't called.
traces | where message contains "function invoked" | summarize count() by bin(timestamp, 10m)
This works, but it only pulls out the periods where logs are present.
What I want is to know how many requests have hit this endpoint up to now. Rather than showing "no results", it should produce a table with the datetime and a value of 0.
That way I can show a flat line.
Use the make-series operator.
This works if you have at least one data point:
traces
| where message has "function invoked"
| make-series count() on timestamp from ago(1h) to now() step 10m
| render timechart
If you might have no data points at all, you will need a small trick:
union traces, (print timestamp = now(1ms))
| make-series countif(timestamp <= now()) on timestamp from ago(1h) step 10m
| render timechart
I want to query operations like Add-MailboxPermission with FullAccess and deleted emails/calendar events to find compromised accounts (in 30m intervals).
1. How should I modify my code to show operations which fulfil both conditions at the same time (if I change "or" to "and", then it checks both conditions within a single log entry)?
2. How can I modify the count to reduce the results to only those showing at least 10 events? Maybe there is another function for that?
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" or Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
Welcome to Stack Overflow!
Yes, the logical and operator returns true only if both conditions are true. Check this doc for the query language reference.
Yes again: there is the top operator, which returns the first N records sorted by the specified columns, used as follows:
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" and Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
| top 10 by Events asc
Additional tip:
There are also limit and take operators that return a result set of up to the specified number of rows, with the caveat that there is no guarantee as to which records are returned unless the source data is sorted.
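If it helps, the distinction can be sketched in plain Python (a toy model of the semantics, not KQL itself):

```python
# Toy model: `top N by col asc` sorts first and then takes N rows,
# while `take`/`limit` just return some N rows with no ordering guarantee.

rows = [{"Operation": "Delete", "Events": e} for e in (7, 42, 3, 19, 11)]

# top 3 by Events asc: deterministic, always the three smallest counts.
top3 = sorted(rows, key=lambda r: r["Events"])[:3]
print([r["Events"] for r in top3])  # [3, 7, 11]

# take 3: engine-dependent; here modeled as "first three encountered".
take3 = rows[:3]
print([r["Events"] for r in take3])  # [7, 42, 3]
```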
Hope this helps!
I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds and the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (see Spark count vs take and length for a detailed description of the LIMIT behavior).
If 1234 were in the first partition and you explicitly set the limit to 1,
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is larger than the number of values that satisfy the predicate, or the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it faster, you'll need the data bucketed (on disk) or partitioned (in memory) on the field of interest, or you can use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
But really, if you need fast lookups, use a proper database; that is what they are designed for.
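The scan behavior above can be sketched with a toy model (plain Python; `scan_with_limit` and the data layout are illustrative assumptions, not Spark's actual code):

```python
# Toy illustration of why a broad predicate can satisfy show() quickly
# while a selective one forces a full scan: rows are pulled partition by
# partition, and scanning stops as soon as `limit` matching rows exist.

def scan_with_limit(partitions, predicate, limit=20):
    """Return up to `limit` matching rows plus the number of partitions read."""
    out, scanned = [], 0
    for part in partitions:
        scanned += 1
        for row in part:
            if predicate(row):
                out.append(row)
                if len(out) == limit:
                    return out, scanned
    return out, scanned

# 10 partitions of 1000 rows each; 'age' values repeat, 'id' is unique.
partitions = [[{"id": p * 1000 + i, "age": i % 100} for i in range(1000)]
              for p in range(10)]

# Broad predicate: 20 matches are found inside the first partition.
rows_age, scanned_age = scan_with_limit(partitions, lambda r: r["age"] > 50)
print(scanned_age)  # 1 -> early termination

# Selective predicate: only one row matches, fewer than the limit,
# so every partition has to be read.
rows_id, scanned_id = scan_with_limit(partitions, lambda r: r["id"] == 9500)
print(scanned_id)  # 10 -> full scan
```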
The following command does not produce a consistent ordering of items:
KubePodInventory
| where ClusterName == "mycluster"
| distinct Computer
| order by Computer asc
| summarize makeset(Computer)
But the documentation (see here) states the following:
Like makelist, makeset also works with ordered data and will generate
the arrays based on the order of the rows that are passed into it.
Is this a bug, or am I doing something wrong?
As per the issue that #MohitVerma mentioned, makeset() is not supposed to honor ordering, and they are planning to correct this sentence in the doc: "Like makelist, makeset also works with ordered data and will generate the arrays based on the order of the rows that are passed into it."
You can use makelist() as a workaround, which does honor ordering as per my testing.
Please check this answer for the similar type of operation.
How to order item in Makeset?
The code below worked for me:
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc
You can follow this thread for the code snippet; I've marked the answer to close out this thread.
https://github.com/MicrosoftDocs/azure-docs/issues/24135#issuecomment-460185491
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc | summarize makelist(client_City)
I have a column family in Cassandra 1.2 as below:
time | class_name | level_log | message | thread_name
-----------------+-----------------------------+-----------+---------------+-------------
121118135945759 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135947310 | ir.apk.tm.test.LoggerSimple | ERROR | This is ERROR | main
121118135947855 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
121118135946221 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135951461 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
When I use this query:
SELECT * FROM LogTM WHERE token(time) > token(0);
I get nothing! But as you can see, all of the time values are greater than zero.
This is the CF schema:
CREATE TABLE logtm(
time bigint PRIMARY KEY ,
level_log text ,
thread_name text ,
class_name text ,
msg text
);
Can anybody help?
Thanks :)
If you're not using an ordered partitioner (and if you don't know what that means, you aren't), that query doesn't do what you think it does. Just because two timestamps sort one way doesn't mean that their tokens do. The token is the (Murmur3) hash of the cell value (unless you've changed the partitioner).
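The point about hash order versus value order can be sketched in a few lines (Python here, with md5 standing in for Murmur3 purely for illustration; `fake_token` is a made-up name, not a driver API):

```python
import hashlib

def fake_token(value: int) -> int:
    """Stand-in for Cassandra's token(): a signed 64-bit hash of the key."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Consecutive timestamps, in ascending order...
times = [121118135945759 + i for i in range(12)]
tokens = [fake_token(t) for t in times]

# ...but their hashes come out in an essentially arbitrary order, so a
# predicate like token(time) > token(0) selects rows by hash order,
# not by time order.
print(times == sorted(times))    # True
print(tokens == sorted(tokens))
```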
If you need to do range queries you can't do it on the partition key, only on clustering keys. One way you can do it is to use a schema like this:
CREATE TABLE LogTM (
shard INT,
time INT,
class_name ASCII,
level_log ASCII,
thread_name ASCII,
message TEXT,
PRIMARY KEY (shard, time, class_name, level_log, thread_name)
)
If you set shard to zero, the schema will be roughly equivalent to what you have now, but the query SELECT * FROM LogTM WHERE shard = 0 AND time > 0 will give you the results you expect.
However, the performance will be awful: with a single value of shard only a single partition/row will be created, and you will only use a single node of your cluster (and that node will be very busy trying to compact that single row).
So you need to figure out a way to spread the load across more nodes. One way is to pick a random shard between something like 0 and 359 (or 0 and 255 if you prefer powers of two; the exact range isn't important, it just needs to be an order of magnitude or so larger than the number of nodes) for each insert, and to read from all shards when you read back: SELECT * FROM LogTM WHERE shard IN (0,1,2,...) (you need to include every shard in the list, in place of ...).
You can also pick the shard by hashing the message, that way you don't have to worry about duplicates.
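A minimal sketch of that hash-based shard choice (Python; the names, the shard count of 360, and the use of md5 are illustrative assumptions, not part of any Cassandra driver):

```python
import hashlib

NUM_SHARDS = 360  # well above the node count, as suggested above

def shard_for(message: str) -> int:
    """Deterministically map a log message to a shard in [0, NUM_SHARDS)."""
    digest = hashlib.md5(message.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Identical messages always land on the same shard, which is what makes
# this approach safe against duplicates from retried inserts.
print(shard_for("This is DEBUG") == shard_for("This is DEBUG"))  # True

# Reading back still means querying every shard:
all_shards = ", ".join(str(s) for s in range(NUM_SHARDS))
query = f"SELECT * FROM LogTM WHERE shard IN ({all_shards})"
```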
You need to tell us more about what exactly you're trying to do, especially how you intend to query the data. Don't go and do exactly what I described above; it may be completely wrong for your use case. I just wanted to give you an example so that I could explain what is going on inside Cassandra.