Kusto Query Language: set column name of summarize by evaluated expression - azure

Me again, asking another Kusto-related question (I really wish there were a thorough video tutorial on this somewhere).
I have a summarize statement that produces two columns for the y-axis and one for the x-axis.
Now I want to relabel those columns to show a string that I also fetched from the database and already put into a variable with let.
This basically looks like this:
let android_col = strcat("Android: ", toscalar(customEvents
| where application_Version contains secondLatestVersionAndroid));
let iOS_col = strcat("iOS: ", toscalar(customEvents
| where application_Version contains secondLatestVersionIOS));
... some Kusto magic ...
| summarize
    Android = 100 - round((countif(hasUnhandledErrorAndroid == 1) * 100.0) / countif(isAndroid == 1), 2),
    iOS = 100 - round((countif(hasUnhandledErroriOS == 1) * 100.0) / countif(isIOS == 1), 2)
    by Time
| render timechart with (ytitle="crashfree users in %", xtitle="date", legend=visible)
Now I want the summarize to display not Android and iOS, but the values of android_col and iOS_col.
Is that possible?
Best regards
Maverick

Generally, it's suggested to have predefined column names; otherwise various features don't work. For example, IntelliSense won't know the names of the columns, as they would be determined only at run time. Also, if you create a function that returns a dynamic schema, you won't be able to run that function from other clusters.
However, if you do want to set column names dynamically, there are ways to do it using various plugins, for example bag_unpack, pivot, and others.
As for courses on Kusto, there are actually several excellent courses on Pluralsight (all are free):
How to start with Microsoft Azure Data Explorer
Basic KQL
Azure Data Explorer – Advanced KQL

The usage of toscalar in this query looks wrong; it seems to me that you should use the extend operator with the same logic to create the additional columns.

Related

R DiagrammeR package Mermaid text using actual calculation results

I would like to use the DiagrammeR package for a simple flow chart in my R Markdown document. However, I couldn't figure out a way to insert actual output from a data table into the node text. Suppose I have a simple query of a database with total records, patient count, and year info for three different cohorts.
I wanted to create a diagram using Mermaid. The code looks like this:
Total = paste0('Records:',b1$records,' Patients:',b1$patients,' Year:',b1$year)
# (Records:1000 Patients:822 Year:5)
Sub1 = paste0('Records:',b2$records,' Patients:',b2$patients,' Year:',b2$year)
Sub2 = paste0('Records:',b3$records,' Patients:',b3$patients,' Year:',b3$year)
mermaid("
graph TB
A[Total] --> B{Sub1} --> C{Sub2}
")
Instead of printing the diagram with Records:1000 Patients:822 Year:5 in node A, it shows the verbatim word "Total".
Any suggestions on how to do this correctly?
Thanks!
You are one step away from what you'd like to achieve. Please try this simple example below to see the logic:
library(DiagrammeR)
Structure:
DiagrammeR(
"
graph TB
A[Question] -->B[Answer]
"
)
1. Define answer node:
B <- paste0("There are ", nrow(iris), " records")
2. Combine it with other components, using ; to separate statements:
results <- paste0("graph TB; A[How many rows does iris have?]-->", "B[", B, "]")
3. Call 'results' in DiagrammeR:
DiagrammeR(diagram = results)
The final plot should refresh when your calculation updates.

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as each iteration scans the whole data frame; it is manageable to run it once but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as the | rather than the & operator is used).
Note also that I am aware of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators; however, I only want this comparison to be inclusive at the lower end.
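For reference, newer pandas releases (1.3 and later, if available here) can express exactly that half-open interval through between's inclusive argument; a minimal sketch:
# Sketch assuming pandas >= 1.3: inclusive="left" keeps the lower bound and
# drops the upper one, i.e. point['c'] <= C < point['d'].
in_range = df['C'].between(point['c'], point['d'], inclusive='left')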
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(
    f'((A == {point["a"]}) & (B == {point["b"]})'
    f' & (C >= {point["c"]}) & (C < {point["d"]}))'
    for point in points
)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
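For completeness, here is one possible vectorised sketch, untested and hedged: it assumes points carries the keys 'a' to 'd' used in the loop above, that the point values share the dtypes of columns A, B and C, and that df has a default RangeIndex. It pushes the exact matches on A and B into a single inner merge, then applies the half-open range check on C.
import pandas as pd

# Hypothetical sketch: match A and B exactly via a merge, then filter the C range.
points_df = pd.DataFrame(points)[['a', 'b', 'c', 'd']].rename(columns={'a': 'A', 'b': 'B'})
candidates = df[['A', 'B', 'C']].reset_index().merge(points_df, on=['A', 'B'], how='inner')
keep = (candidates['C'] >= candidates['c']) & (candidates['C'] < candidates['d'])
filtered = df.loc[candidates.loc[keep, 'index'].unique(), list_of_column_names]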

Query date range and product size from xlsx file

I'm using Python 3.6 to do this. Below are just a few of the important columns that I'm interested in querying.
Auto-Gen Index : Product Container : Ship Date  : ...
0              : Large Box         : 2017-01-09 : ...
1              : Large Box         : 2012-07-15 : ...
2              : Small Box         : 2012-07-18 : ...
3              : Large Box         : 2012-07-31 : ...
I would like to query the rows that have Large Box as their product container and whose ship date falls within July 2012.
from pandas import read_excel

file_name = r'''Sample-Superstore-Subset-Excel.xlsx'''
df = read_excel(file_name, sheet_name=my_sheet)
lb = df.loc[df['Product Container'] == 'Large Box']  # get the Large Box rows
july = lb[(lb['Ship Date'] >= '2012-07-01') & (lb['Ship Date'] <= '2012-07-31')]
I just wonder how to do this using query and a where condition in Python (pd.DataFrame.query())?
If your question is when to use loc vs where, see my answer here:
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
If you are asking when to use query, AFAIK there is no real reason to do so besides performance. If you have a very large dataset, query is expected to be faster. More on high-level performance here.
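To make the query route concrete, here is a minimal sketch of the equivalent of the loc code above. It assumes my_sheet is defined as in the question, that read_excel parses Ship Date as a datetime, and a pandas version (0.25+) that supports backtick-quoted column names:
import pandas as pd

df = pd.read_excel('Sample-Superstore-Subset-Excel.xlsx', sheet_name=my_sheet)
# Backticks are needed because the column names contain spaces; the date strings
# are compared against the parsed datetime column.
july_large = df.query(
    "`Product Container` == 'Large Box'"
    " and `Ship Date` >= '2012-07-01' and `Ship Date` <= '2012-07-31'"
)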

PowerShell on CSV file - looking for string depending on string

I need your help regarding PowerShell programming on a CSV file.
I've done some searching but cannot find what I'm looking for (or perhaps I don't know the technical terms). Basically, I have an Excel workbook with a large amount of data (more or less 38 columns x 350,000 rows), and there are a couple of formulas that take hours to calculate.
I was first wondering if PowerShell could speed up the calculation a bit compared to Excel. The calculations taking most of my time are in fact not that complex (at least at first glance). My data is constructed more or less like this:
Ref Title
----- --------------------------
A/001 "free_text"
A/002 "free_text A/001 free_text"
... ...
A/005 "free_text A/004 free_text"
A/006 "free_text"
B/001 "free_text"
B/002 "free_text"
C/001 "free_text"
C/002 "free_text"
...
C/050 "free_text C/047 free_text"
... ...
C/103 "free_text"
D/001 "free_text"
D/002 "free_text D/001 free_text"
... ....
Basically the data is as follows:
The Ref field contains unique values, in {letter}/{incremental value} format.
In some rows, the Title field may reference one of the Ref values. For example, in line 2, the Title calls for the A/001 Ref. In the last row, the Title calls for the D/001 Ref, etc.
There is no logical pattern defining when a Ref could be called up in a Title. It is random.
However, what I'm 100% sure of is the following:
The Ref called in a Title always belongs to the same {letter} block. For example: the string 'C/047' in the Title field can only be found in the block where the Ref {letter} is C.
The row whose Title calls a Ref is always located 'after' (i.e. in a lower row than) the row of the Ref it refers to. In other words, I cannot have a line with the following pattern:
Ref          Title
------------ -----------------------------------------
{letter/i}   {free_text {letter/j} free_text}   with j > i
→ This is not possible.
→ j is always < i.
I've used these characteristics in Excel to minimize my lookup arrays. But it still takes an hour to calculate everything.
I've therefore looked into PowerShell and started to 'play' a bit with the CSV, looping with ForEach-Object and hoping I would get quicker results. Up to now I have basically ended up looping twice over my CSV file.
$CSV1 = Import-Csv myfile.csv
$CSV2 = Import-Csv myfile.csv
$CSV1 | ForEach-Object {
    # find Title
    $TitSearch = $_.$Ref
    $CSV2 | ForEach-Object {
        if ($_.$Title -eq $TitSearch) {
            myinstructions
        }
    }
}
It works but it's really, really long. So I then tried the following instead of using the $CSV2 | ForEach-Object loop:
$CSV1 | where { $_.$Title -eq $TitSearch } | % { $_.$Ref }
In either case, it's too long and not efficient at all. Additionally, with these 2 solutions I'm not using the above characteristics, which could reduce the lookup array, and as already stated, I end up looping twice over the CSV file from beginning to end.
Questions:
Is there a leaner way to do this?
Am I wasting my time with PowerShell?
I thought about creating 1 file per Ref {letter} block (1 file for block A, 1 for B, etc.). However, I have about 50,000 blocks to create. Or I could create them one by one, carry out the analysis, put the results in a new file, and delete them. Would that be quicker?
Note: this is for work, to be used by other colleagues, and Excel and PowerShell are really the only software we may use. I know VBA, but OK... In the end, I'm curious about whether and how this can be solved in a simple manner using PowerShell.
As far as I can see, your base algorithm does N^2 iterations (~120 billion). There is a standard way to make it efficient: build a hashtable first. A hashtable is a key/value store, and lookup is pretty much instantaneous, so the algorithm's time complexity becomes ~N.
PowerShell has a built-in data type for that. In your case the key would be ref, and the value an array of cell data (assuming your table is something like: ref, title, col1, ..., colN):
$hash = @{}
foreach ($row in $table) { $hash.Add($row.ref, @($row.title, $row.col1, ...)) }
# it will take 350K steps to generate it
# then you can iterate over it again
foreach ($key in $hash.Keys) {
    $key                              # access the current ref
    $rowData = $hash.$key             # access the current row's elements (by index)
    $refRowData = $hash[$rowData[$j]] # look up another row, assuming the lookup reference is in some column
}
So that's the general idea of how to solve the time issue. To be honest, I don't believe you need to reinvent the wheel and code it yourself. What you need is a relational database. Since you have Excel, you should have MS Access too. Just import your data there, make ref and title an index, and then all you need is a self join. MS Access sucks, but I'm sure it will handle 350K rows just fine.
Ideally you'd want to get a database on some corporate MSSQL server (open a ticket, talk to your manager, etc.). It will calculate all that in seconds, and then you can link the output to a spreadsheet as well.

passing Jan of selected year by default in prompt

I have 2 year-month prompts. If I don't select any year-month in the 1st prompt, the report should by default run from January of the same year selected in the 2nd prompt. My prompts are value prompts and have string values. Please help me implement this requirement. I have already tried the # prompt macro, ?prompt?, case when, etc. I am not sure if JavaScript would help.
I'm going to assume your underlying date fields are not stored as DATE value types, since you're using strings. This may be easier to split into 4 prompts: from month, from year, to month, to year.
The filter would then be an implied if:
(
(?FROM_YEAR? = '' or ?FROM_MONTH? = '') and
[database_from_month] = '01' and
[database_from_year] = ?TO_YEAR? and
[database_to_month] = ?TO_MONTH? and
[database_to_year] = ?TO_YEAR?
)
OR
(
(?FROM_YEAR? <> '' or ?FROM_MONTH? <> '') and
[database_from_month] = ?FROM_MONTH? and
[database_from_year] = ?FROM_YEAR? and
[database_to_month] = ?TO_MONTH? and
[database_to_year] = ?TO_YEAR?
)
The above style filter is superior for many reasons:
More likely to be sargable
Easy to understand
Uses simple built-in Cognos functions; more likely to be cross-version compliant
No issues with cross-browser support you would get with Javascript
Code snippet would work in other Cognos studios (Business Insight, etc)
You've likely seen that CASE statements in filters throw an error. The CASE statement is passed through to SQL rather than being compiled into a SQL statement by Cognos, hence it's not seen as proper syntax.
