makeset operation does not preserve ordering? - Azure

The following command does not produce a consistent ordering of items:
KubePodInventory
| where ClusterName == "mycluster"
| distinct Computer
| order by Computer asc
| summarize makeset(Computer)
But the documentation (see here) states the following:
Like makelist, makeset also works with ordered data and will generate
the arrays based on the order of the rows that are passed into it.
Is this a bug or am I doing something wrong?

As per the issue #MohitVerma mentioned, makeset() is not supposed to support ordering, and they are planning to correct this sentence in the doc: "Like makelist, makeset also works with ordered data and will generate the arrays based on the order of the rows that are passed into it."
You can use makelist() as a workaround, which does support ordering as per my testing.

Please check this answer for a similar type of operation.
How to order item in Makeset?
The below code worked for me:
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc
You can follow this thread for the code snippet; the answer has been marked to close this thread.
https://github.com/MicrosoftDocs/azure-docs/issues/24135#issuecomment-460185491
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc | summarize makelist(client_City)

Related

Counting operations in Azure Monitor Log Analytics query

I want to query operations like Add-MailboxPermission with FullAccess and deleted emails/calendar events to find compromised accounts (in 30m intervals).
1. How should I modify my code to show operations which fulfil both conditions at the same time (if I change "or" to "and", then it will check both conditions within one log)?
2. How can I modify the "count" to reduce the number of logs to only those which show a minimum of 10 in the result? Maybe another function should be used?
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" or Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
Welcome to Stack Overflow!
Yes, the logical and operator returns true only if both conditions are true. Check this doc for the query language reference.
Yes again, there is the top operator, which returns the first N records sorted by the specified columns; it is used as follows:
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" and Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
| top 10 by Events asc
Additional tip:
There are limit and take operators as well, which return a result set of up to the specified number of rows, with the caveat that there is no guarantee as to which records are returned unless the source data is sorted.
Hope this helps!

Best way to filter to a specific row in pyspark dataframe

I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 was on the first partition and you've explicitly set limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if fewer rows satisfy the predicate than the limit asks for, or the values of interest reside in later partitions, Spark will have to scan much more of the data, possibly all of it.
If you want to make it work faster you'll need the data bucketed (on disk) or partitioned (in memory) on the field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the, unfortunately inactive, Succinct).
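For the bucketing suggestion, a minimal PySpark sketch might look like the following (the source path, table name, and bucket count are hypothetical, and it assumes a Spark version whose planner can prune buckets for equality filters on the bucketing column):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path.
df = spark.read.parquet("s3://my-bucket/events")

# One-off cost: persist the data bucketed (and sorted within buckets) by the lookup key.
(df.write
   .mode("overwrite")
   .bucketBy(256, "id")
   .sortBy("id")
   .saveAsTable("events_bucketed"))

# With bucket pruning, a point lookup on the bucketing column reads far fewer files.
spark.table("events_bucketed").filter("id = 1234").show()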
But really, if you need fast lookups, use a proper database - this is what they are designed for.

How to select max timestamp in a partition using Cassandra

I have a problem modeling my data using Cassandra. I would like to use it as an event store. My events have a creation timestamp. These events belong to a partition which is identified by an id.
Now I'd like to see the most recent event for each id and then filter these ids according to the timestamp.
So I have something like this:
ID | CREATION_TIMESTAMP | CONTENT
---+---------------------------------+----------------
1 | 2018-11-09 12:15:45.841000+0000 | {SOME_CONTENT}
1 | 2018-11-09 12:15:55.654656+0000 | {SOME_CONTENT}
2 | 2018-11-09 12:15:35.982354+0000 | {SOME_CONTENT}
2 | 2018-11-09 12:35:25.321655+0000 | {SOME_CONTENT}
2 | 2018-11-09 13:15:15.068498+0000 | {SOME_CONTENT}
I tried grouping by partition id and querying for the max of creation_timestamp, but that is not allowed and I am told I should specify the partition id using EQ or IN. Additional reading led me to believe that this is entirely the wrong way of approaching the problem, but I don't know whether NoSQL is not a suitable tool for the job or whether I am simply approaching the problem from the wrong angle.
You can easily achieve this by having CREATION_TIMESTAMP as a clustering column ordered DESC (i.e. WITH CLUSTERING ORDER BY (creation_timestamp DESC)). Then you would query by your id using LIMIT 1, which will return the most recent event, since the data is ordered DESC within that partition.
Can you please share your table definition?
By looking at your data, you can use ID as the partition key and CREATION_TIMESTAMP as the clustering column.
You can then use: SELECT MAX(CREATION_TIMESTAMP) FROM keyspace.table WHERE ID='value';

Retain last row for given key in spark structured streaming

Similar to Kafka's log compaction, there are quite a few use cases where it is required to keep only the last update on a given key and use the result, for example, for joining data.
How can this be achieved in Spark structured streaming (preferably using PySpark)?
For example suppose I have table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need a kind of dropDuplicates() function with reverse ordering.
Please note that this question is explicitly about structured streaming and how to solve the problem without constructs that waste the aggregation step (hence, everything with window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in structured streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
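For completeness: flatMapGroupsWithState and mapGroupsWithState are only exposed in the Scala/Java API; PySpark only gained an analogous stateful operator later (applyInPandasWithState, Spark 3.4+). Below is a hedged sketch of that idea, assuming a streaming dataframe named events with columns key, time and value; the function name and schemas are illustrative only:
import pandas as pd

def keep_latest(key, pdf_iter, state):
    # State holds the (time, value) of the newest row seen so far for this key.
    latest_time, latest_value = state.get if state.exists else (-1, None)
    for pdf in pdf_iter:
        for t, v in zip(pdf["time"], pdf["value"]):
            if t > latest_time:
                latest_time, latest_value = int(t), v
    state.update((latest_time, latest_value))
    # Emit the current "last row" for this key.
    yield pd.DataFrame({"key": [key[0]], "time": [latest_time], "value": [latest_value]})

latest = (events
          .groupBy("key")
          .applyInPandasWithState(
              keep_latest,
              "key string, time long, value string",  # output schema
              "time long, value string",              # state schema
              "update",                               # output mode
              "NoTimeout"))                           # state timeout configuration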

Why is there no generic method to diff consecutive rows in pyspark dataframes/rdds?

I frequently come across the use case where I have a (time-ordered) Spark dataframe with values for which I would like to know the differences between consecutive rows:
>>> df.show()
+-----+----------+----------+
|index| c1| c2|
+-----+----------+----------+
| 0.0|0.35735932|0.39612636|
| 1.0| 0.7279809|0.54678476|
| 2.0|0.68788993|0.25862947|
| 3.0| 0.645063| 0.7470685|
+-----+----------+----------+
The question on how to do this has been asked before in a narrower context:
pyspark, Compare two rows in dataframe
Date difference between consecutive rows - Pyspark Dataframe
However, I find the answers rather involved:
a separate module "Window" must be imported
for some data types (datetimes) a cast must be done
then using "lag" finally the rows can be compared
It strikes me as odd, that this cannot be done with a single API call like, for example, so:
>>> import pyspark.sql.functions as f
>>> df.select(f.diffs(df.c1)).show()
+----------+
| diffs(c1)|
+----------+
| 0.3706 |
| -0.0400 |
| -0.0428 |
| null |
+----------+
What is the reason for this?
There are a few basic reasons:
In general, distributed data structures used in Spark are not ordered. In particular, any lineage containing a shuffle phase / exchange can output a structure with non-deterministic order.
As a result, when we talk about a Spark DataFrame we actually mean a relation, not a DataFrame as known from local libraries like Pandas, and without explicit ordering, comparing consecutive rows is simply not meaningful.
Things are even more fuzzy when you realize that the sorting methods used in Spark use shuffles and are not stable.
Even if you ignore possible non-determinism, handling partition boundaries is a bit involved and typically breaks lazy execution.
In other words, you cannot access an element to the left of the first element of a given partition, or to the right of the last element of a given partition, without a shuffle, an additional action, or a separate data scan.
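To make that concrete, the "involved" approach the linked answers describe boils down to an explicit ordering column plus a window function; a minimal PySpark sketch, assuming the index column shown above supplies the ordering:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# The explicit orderBy is what makes "consecutive rows" well defined;
# an unpartitioned window also pulls all rows into a single partition.
w = Window.orderBy("index")

df.select(
    "index",
    (f.lead("c1").over(w) - f.col("c1")).alias("diff_c1"),
).show()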
