I have a table that consists of one row and a number of columns. One of the columns is named EventProperties and is a JSON of properties in this format:
{
    "Success": true,
    "Counters": {
        "Counter1": 1,
        "Counter2": -1,
        "Counter3": 5,
        "Counter4": 4
    }
}
I want to convert the Counters from this JSON to a two-column table of keys and values, where the first column is the name of the counter (e.g. Counter3) and the second column is the value of the counter (e.g. 5).
I've tried this:
let eventPropertiesCell = materialize(MyTable
| project EventProperties
);
let countersStr = extractjson("$.Counters", tostring(toscalar(eventPropertiesCell)), typeof(string));
let countersJson = parse_json(countersStr);
let result =
print mydynamicvalue = todynamic(countersJson)
| mvexpand mydynamicvalue
| evaluate bag_unpack(mydynamicvalue);
result
But I get a table with a column for each counter from the JSON and as many rows as there are counters, with each counter value appearing in only one (seemingly random) row.
But I want something like this:

key        value
Counter1   1
Counter2   -1
Counter3   5
Counter4   4
Any help will be appreciated!
You could try using mv-apply as follows:
datatable(event_properties:dynamic)
[
dynamic({
"Success":true,
"Counters":{
"Counter1":1,
"Counter2":-1,
"Counter3":5,
"Counter4":4
}
}),
dynamic({
"Success":false,
"Counters":{
"Counter1":1,
"Counter2":2,
"Counter3":3,
"Counter4":4
}
})
]
| mv-apply event_properties.Counters on (
extend key = tostring(bag_keys(event_properties_Counters)[0])
| project key, value = event_properties_Counters[key]
)
| project-away event_properties
I want to create a function that allows me to pass the tabular result of a query as a parameter without specifying the table column names.
This is what I want as a result:
let Func = (T) {
T
| where Source has_any ("value")
};
let EventVar = Event | where TimeGenerated > ago(30d);
Func (EventVar);
You do not need to specify all columns in the tabular parameter schema, only those columns that you need to use inside the function.
For example, this is how your query could look:
let CustomFunc = (T:(Source:string)) {
T | where Source has_any ("value")
};
let EventVar = Event | where TimeGenerated > ago(30d);
CustomFunc(EventVar);
The query above will output all columns of the table EventVar for the rows that match the condition in your function. The only requirement is that EventVar has a column named Source of type string; it can have any number of other columns.
It is also possible to accept any tabular schema by defining the input tabular parameter like T:(*), but in this case you will not be able to reference any column names inside the function. See example 4 on the documentation page for reference.
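For illustration, a minimal sketch of such a wildcard tabular parameter could look like this (the function name and the count aggregation are just placeholders, not from the question):

let CountRows = (T:(*)) {
    // with T:(*) no specific column can be referenced inside the function,
    // so only schema-agnostic operators such as count can be used
    T | count
};
let EventVar = Event | where TimeGenerated > ago(30d);
CountRows(EventVar)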
I would like to filter an unpivoted LazyDataFrame by using nodejs-polars without having to collect (and lose the LazyDataFrame) in between.
Consider the following example CSVs:
1.csv:
asset_key
abc
2.csv:
id;asset_key_1;asset_key_2;asset_key_3
1;123;456;abc
id  asset_key_1  asset_key_2  asset_key_3
1   123          456          abc
I would first like to unpivot 2.csv, to have all asset_keys available in a new column. Then, I want to filter that column on the value available in 1.csv ("abc"), such that the remaining result after filtering would be:
id  variable     value
1   asset_key_3  abc
Instead, I am getting an error
"Error: Not found: value"
If I collect the LazyDataFrame into a DataFrame after melting and before filtering, it does work. But I would like to know if there is a way to do this without having to give up the LazyDataFrame.
This is the code I use:
import * as pl from 'nodejs-polars';
const df_1: LazyDataFrame = pl.scanCSV('1.csv', { sep: ';' });
const df_2: LazyDataFrame = pl.scanCSV('2.csv', { sep: ';' });
const isInFilter: LazyDataFrame = df_1.select('asset_key');
const df: DataFrame = await df_2
.melt('id', ['asset_key_1', 'asset_key_2', 'asset_key_3'])
.dropNulls()
.filter(pl.col('value').isIn(isInFilter['asset_key']))
.collect();
This does look like a bug not only in nodejs-polars, but in polars as well. I opened up an issue for you! https://github.com/pola-rs/polars/issues/4368
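In the meantime, one possible way to keep everything lazy might be to express the membership test as a join instead of isIn(). This is only an untested sketch and assumes that LazyDataFrame.join() with leftOn/rightOn (and ideally how: 'semi') is available in your nodejs-polars version:

import * as pl from 'nodejs-polars';

// Sketch of a lazy workaround: join against the keys instead of isIn().
// A 'semi' join keeps only rows of the left side that have a match on the right;
// if 'semi' is not available, an 'inner' join followed by dropping asset_key
// would be a more conservative fallback.
const keys = pl.scanCSV('1.csv', { sep: ';' }).select('asset_key').unique();

const df = pl
  .scanCSV('2.csv', { sep: ';' })
  .melt('id', ['asset_key_1', 'asset_key_2', 'asset_key_3'])
  .dropNulls()
  .join(keys, { leftOn: 'value', rightOn: 'asset_key', how: 'semi' })
  .collectSync();

console.log(df);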
I need to pass the column name dynamically into the query. While the following is syntactically correct, it does not actually run the aggregation against the specified column.
Is this even possible in Kusto?
let foo = (duration: timespan, column:string) {
SigninLogs
| where TimeGenerated >= ago(duration)
| summarize totalPerCol = count() by ['column']
};
//foo('requests')<-- does not work
//foo('user') <-- does not work
//foo('ip')<-- does not work
You can try using the column_ifexists() function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/columnifexists.
For example:
let foo = (column_name:string) {
datatable(col_a:string)["hello","world"]
| summarize totalPerCol = count() by column_ifexists(column_name, "")
};
foo('col_a')
col_a  totalPerCol
hello  1
world  1
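Applied to the function from the question, this could look roughly as follows (the duration parameter is kept; the column name passed at the call site is just an illustrative guess at a SigninLogs column, not something from the question):

let foo = (duration: timespan, column_name: string) {
    SigninLogs
    | where TimeGenerated >= ago(duration)
    // column_ifexists resolves the column by name at query time;
    // the second argument is the fallback used if the column does not exist
    | summarize totalPerCol = count() by column_ifexists(column_name, "")
};
foo(1d, 'AppDisplayName')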
Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be any number of rows per id, and the goal is to reduce them to one.
Basically, aggregating to make id distinct.
w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that I have some_dict with values either as test1 or test2 and based on the value, I either take the first or last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows to 5.
Let's assume your DataFrame is named df and contains duplicates; you can use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10,False)
Change the orderBy condition if there are specific criteria, so that the desired record ends up at the top of each partition.
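Alternatively, to stay closer to the dictionary-of-aggregations idea from the question, something like the following sketch might work (assuming some_dict maps column names to either 'test1' or 'test2'). Note that without an explicit ordering, first()/last() within a group are not deterministic:

from pyspark.sql import functions as F

# Build one aggregation expression per column, based on some_dict:
# 'test1' columns take the first non-null value, 'test2' columns the last.
agg_exprs = [
    F.first(col, ignorenulls=True).alias(col) if kind == 'test1'
    else F.last(col, ignorenulls=True).alias(col)
    for col, kind in some_dict.items()
    if col != 'id'
]

# One row per id, no window functions needed.
result = df.groupBy('id').agg(*agg_exprs)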
I'm trying to implement some kind of pagination feature for my app, which uses Cassandra in the backend.
CREATE TABLE sample (
    sample_pk int,
    some_id int,
    name1 text,
    name2 text,
    value text,
    PRIMARY KEY (sample_pk, some_id, name1, name2)
)
WITH CLUSTERING ORDER BY (some_id DESC)
I want to query 100 records, then store the last records keys in memory to use them later.
+---------+---------+-------+-------+-------+
|sample_pk| some_id | name1 | name2 | value |
+---------+---------+-------+-------+-------+
| 1 | 125 | x | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | a | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | b | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 123 | y | '' | '' |
+---------+---------+-------+-------+-------+
(For simplicity, I left some columns empty; the partition key (sample_pk) is not important.)
Let's assume my page size is 2.
select * from sample where sample_pk=1 limit 2;
returns the first 2 rows. Now I store the last record of the result and run the query again to get the next 2 rows.
This is the query that does not work, because of the restriction to a single non-EQ relation on clustering columns:
select * from sample where sample_pk=1 and some_id <= 124 and name1 >= 'a' and name2 >= '' limit 2;
And this one returns wrong results, because some_id is in descending order while the name columns are in ascending order:
select * from sample where sample_pk=1 and (some_id, name1, name2) <= (124, 'a', '') limit 2;
So I'm stuck. How can I implement pagination?
You can run your second query like this:
select * from sample where sample_pk=1 and some_id <= 124 limit x;
Now, after fetching the records, ignore the record(s) you have already read (you can do this because you stored the last record from the previous select query).
If, after ignoring those records, you end up with an empty list of rows, you have iterated over all the records; otherwise, continue this process for your pagination task.
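A rough sketch of this manual skip-ahead approach could look like the following (using the same DataStax 3.x driver classes as the answer below; the helper method and the stored lastSomeId/lastName1/lastName2 values are assumptions for illustration only):

// Hypothetical helper: lastSomeId/lastName1/lastName2 are the clustering-key
// values of the last row returned on the previous page.
List<Row> fetchNextPage(Session session, int lastSomeId, String lastName1,
                        String lastName2, int pageSize) {
    // no LIMIT here: we break out early once the page is full and let the
    // driver's own paging handle the fetching under the hood
    Statement stmt = new SimpleStatement(
            "select * from sample where sample_pk = 1 and some_id <= ?", lastSomeId);
    List<Row> page = new ArrayList<>();
    boolean passedLastRow = false;
    for (Row row : session.execute(stmt)) {
        if (!passedLastRow) {
            // skip everything up to and including the last row of the previous page
            if (row.getInt("some_id") == lastSomeId
                    && row.getString("name1").equals(lastName1)
                    && row.getString("name2").equals(lastName2)) {
                passedLastRow = true;
            }
            continue;
        }
        page.add(row);
        if (page.size() == pageSize) {
            break; // next page is full
        }
    }
    return page; // an empty list means all records have been read
}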
You don't have to store any keys in memory, and you also don't need to use limit in your CQL query. Just use the paging capabilities of the DataStax driver in your application code, like the following:
public Response getFromCassandra(Integer itemsPerPage, String pageIndex) {
    Response response = new Response();
    String query = "select * from sample where sample_pk=1";
    Statement statement = new SimpleStatement(query).setFetchSize(itemsPerPage); // set the number of items we want per page (fetch size)
    // imagine page '0' indicates the first page, so if pageIndex = '0' then there is no paging state
    if (!pageIndex.equals("0")) {
        statement.setPagingState(PagingState.fromString(pageIndex));
    }
    ResultSet rows = session.execute(statement); // execute the query
    Integer numberOfRows = rows.getAvailableWithoutFetching(); // this should get only number of rows = fetchSize (itemsPerPage)
    Iterator<Row> iterator = rows.iterator();
    while (numberOfRows-- != 0) {
        response.getRows().add(iterator.next());
    }
    PagingState pagingState = rows.getExecutionInfo().getPagingState();
    if (pagingState != null) { // there are still remaining pages
        response.setNextPageIndex(pagingState.toString());
    }
    return response;
}
Note that if you write the while loop like the following:
while (iterator.hasNext()) {
    response.getRows().add(iterator.next());
}
it will first fetch a number of rows equal to the fetch size we set, and then, as long as the query still matches rows in Cassandra, it will keep going back to fetch more until all matching rows have been retrieved, which is probably not what you want for a pagination feature.
source: https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/