How to implement pagination for Cassandra by using keys?

I'm trying to implement a pagination feature for my app, which uses Cassandra in the backend.
CREATE TABLE sample (
    some_pk int,
    some_id int,
    name1 text,
    name2 text,
    value text,
    PRIMARY KEY (some_pk, some_id, name1, name2)
)
WITH CLUSTERING ORDER BY (some_id DESC);
I want to query 100 records, then store the last record's keys in memory to use them later.
+---------+---------+-------+-------+-------+
| some_pk | some_id | name1 | name2 | value |
+---------+---------+-------+-------+-------+
|       1 |     125 | x     | ''    | ''    |
|       1 |     124 | a     | ''    | ''    |
|       1 |     124 | b     | ''    | ''    |
|       1 |     123 | y     | ''    | ''    |
+---------+---------+-------+-------+-------+
(For simplicity, I left some columns empty; the partition key (some_pk) is not important here.)
Let's assume my page size is 2.
select * from sample where some_pk=1 limit 2;
returns the first 2 rows. Now I store the keys of the last record in my query result and run the query again to get the next 2 rows.
This query does not work because of the restriction to a single non-EQ relation:
select * from sample where some_pk=1 and some_id <= 124 and name1 >= 'a' and name2 >= '' limit 2;
And this one returns wrong results, because some_id is in descending order while the name columns are in ascending order:
select * from sample where some_pk=1 and (some_id, name1, name2) <= (124, 'a', '') limit 2;
So I'm stuck. How can I implement pagination?

You can run your second query like:
select * from sample where some_pk=1 and some_id <= 124 limit x;
After fetching the records, ignore the record(s) you have already read (you can do this because you stored the last record from the previous select query).
If, after ignoring those records, you end up with an empty list of rows, you have iterated over all the records; otherwise, keep repeating this for your pagination task.
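For the sample data above, the flow would look like this (a sketch; the next-page limit is over-fetched by one row to cover the record already read — in general you would over-fetch by as many rows as you have already seen for that some_id):
-- first page
select * from sample where some_pk=1 limit 2;
-- returns (125, x) and (124, a); remember the last key: (124, 'a', '')
-- next page: restart from the last seen some_id
select * from sample where some_pk=1 and some_id <= 124 limit 3;
-- returns (124, a), (124, b), (123, y);
-- client side: drop everything up to and including (124, 'a', ''),
-- keep (124, b) and (123, y) as the second page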

You don't have to store any keys in memory, and you don't need to use limit in your CQL query. Just use the paging capabilities of the DataStax driver in your application code, like the following:
// assumes the DataStax Java driver 3.x classes: Session, Statement,
// SimpleStatement, ResultSet, Row, PagingState (com.datastax.driver.core.*)
public Response getFromCassandra(Integer itemsPerPage, String pageIndex) {
    Response response = new Response();
    String query = "select * from sample where some_pk=1";
    Statement statement = new SimpleStatement(query).setFetchSize(itemsPerPage); // set the number of items we want per page (fetch size)
    // imagine page '0' indicates the first page, so if pageIndex = '0' there is no paging state
    if (!pageIndex.equals("0")) {
        statement.setPagingState(PagingState.fromString(pageIndex));
    }
    ResultSet rows = session.execute(statement); // execute the query
    int numberOfRows = rows.getAvailableWithoutFetching(); // only the rows of the current page (fetch size)
    Iterator<Row> iterator = rows.iterator();
    while (numberOfRows-- != 0) {
        response.getRows().add(iterator.next());
    }
    PagingState pagingState = rows.getExecutionInfo().getPagingState();
    if (pagingState != null) { // there are still remaining pages
        response.setNextPageIndex(pagingState.toString());
    }
    return response;
}
Note that if you write the while loop like this:
while (iterator.hasNext()) {
    response.getRows().add(iterator.next());
}
it will first fetch a number of rows equal to the fetch size we set; then, as long as the query still matches rows in Cassandra, it will keep fetching until it has retrieved every matching row. That may not be what you want for a pagination feature.
source: https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/

Related

KQL: How to parse a log and pull proper substrings?

I have a table that consists of one row and a number of columns. One of the columns, named EventProperties, is a JSON of properties in this format:
{
    "Success": true,
    "Counters": {
        "Counter1": 1,
        "Counter2": -1,
        "Counter3": 5,
        "Counter4": 4
    }
}
I want to convert the Counters from this JSON to a two-column table of keys and values, where the first column is the name of the counter (e.g. Counter3) and the second column is the value of the counter (e.g. 5).
I've tried this:
let eventPropertiesCell = materialize(MyTable
| project EventProperties
);
let countersStr = extractjson("$.Counters", tostring(toscalar(eventPropertiesCell)), typeof(string));
let countersJson = parse_json(countersStr);
let result =
print mydynamicvalue = todynamic(countersJson)
| mvexpand mydynamicvalue
| evaluate bag_unpack(mydynamicvalue);
result
But I get a table with a column for each counter and as many rows as there are counters, where each row has only one counter's value filled in. For example, with the JSON from the example above, I get:

Counter1 | Counter2 | Counter3 | Counter4
1        |          |          |
         | -1       |          |
         |          | 5        |
         |          |          | 4

But I want something like this:

key      | value
Counter1 | 1
Counter2 | -1
Counter3 | 5
Counter4 | 4
Any help will be appreciated!
You could try using mv-apply as follows:
datatable(event_properties:dynamic)
[
dynamic({
"Success":true,
"Counters":{
"Counter1":1,
"Counter2":-1,
"Counter3":5,
"Counter4":4
}
}),
dynamic({
"Success":false,
"Counters":{
"Counter1":1,
"Counter2":2,
"Counter3":3,
"Counter4":4
}
})
]
| mv-apply event_properties.Counters on (
extend key = tostring(bag_keys(event_properties_Counters)[0])
| project key, value = event_properties_Counters[key]
)
| project-away event_properties
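This should emit one key/value row per counter, across both input rows:

key      | value
Counter1 | 1
Counter2 | -1
Counter3 | 5
Counter4 | 4
Counter1 | 1
Counter2 | 2
Counter3 | 3
Counter4 | 4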

Passing column name as parameter in Kusto query

I need to pass the column name dynamically into the query. While the following is syntactically correct, it does not actually execute the condition on that column.
Is this even possible in Kusto?
let foo = (duration: timespan, column:string) {
SigninLogs
| where TimeGenerated >= ago(duration)
| summarize totalPerCol = count() by ['column']
};
//foo('requests')<-- does not work
//foo('user') <-- does not work
//foo('ip')<-- does not work
You can try using the column_ifexists() function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/columnifexists.
For example:
let foo = (column_name:string) {
datatable(col_a:string)["hello","world"]
| summarize totalPerCol = count() by column_ifexists(column_name, "")
};
foo('col_a')
col_a | totalPerCol
hello | 1
world | 1
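Applied to the function from the question, a sketch would look like this ('AppDisplayName' is only an illustrative column name; substitute one that actually exists in your SigninLogs table):
let foo = (duration: timespan, column_name: string) {
    SigninLogs
    | where TimeGenerated >= ago(duration)
    | summarize totalPerCol = count() by column_ifexists(column_name, "")
};
foo(1d, 'AppDisplayName')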

How to use a wildcard (*) for a join parameter in KQL?

I'm racking my brain with this and would like some help. :)
I want to know how to use a wildcard (*) in a join condition.
I need to join two tables that have the same field names; however, some fields may contain a wildcard (*), in which case any value should be considered a match for that field.
My exceptions table:
let table_excep= datatable (Computer:string,Event_id:string, logon_type:string)
[
"Pc_01","*","4",
"Pc_02","4648","*",
"*","*","60"
];
My data table:
let table_windows= datatable (Computer:string,Event_id:string, logon_type:string)
[
"Pc_01","5059","4",
"Pc_02","4648","1",
"Pc_03","61","60"
];
When I run my join, it returns nothing.
I want all three fields to be considered: based on the exceptions table, if Computer is Pc_01 and logon_type is 4, the log should be displayed no matter what Event_id is, since the Event_id field in the exception list is a wildcard (*).
I can't find a way to solve this, since the join condition only allows "==" and "and".
You can emulate a cross join (an inner join on a constant) and then filter with where:
let table_excep= datatable (Computer:string,Event_id:string, logon_type:string)
[
"Pc_01","*","4",
"Pc_02","4648","*",
"*","*","60"
];
let table_windows= datatable (Computer:string,Event_id:string, logon_type:string)
[
"Pc_01","5059","4",
"Pc_02","4648","1",
"Pc_03","61","60"
];
table_excep | extend dummy = 1
| join kind=inner (table_windows | extend dummy = 1) on dummy
| where (Computer == Computer1 or Computer == '*')
and (Event_id == Event_id1 or Event_id == '*')
and (logon_type == logon_type1 or logon_type == '*')
Computer | Event_id | logon_type | dummy | Computer1 | Event_id1 | logon_type1 | dummy1
Pc_01    | *        | 4          | 1     | Pc_01     | 5059      | 4           | 1
Pc_02    | 4648     | *          | 1     | Pc_02     | 4648      | 1           | 1
*        | *        | 60         | 1     | Pc_03     | 61        | 60          | 1
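If you only need the matching rows from table_windows, you can finish with a projection of the right-hand columns; a sketch, reusing the two datatable definitions above:
table_excep | extend dummy = 1
| join kind=inner (table_windows | extend dummy = 1) on dummy
| where (Computer == Computer1 or Computer == '*')
    and (Event_id == Event_id1 or Event_id == '*')
    and (logon_type == logon_type1 or logon_type == '*')
| project Computer = Computer1, Event_id = Event_id1, logon_type = logon_type1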

Custom aggregate function on summarize

I want to calculate the statistical mode of a column while summarizing a table.
The CalculateMode function I tried looks like this:
.create function CalculateMode(Action:int, Device:string, Start:long, End:long) {
    Event
    | where Time between (Start .. End) and IdAction == Action and IdDevice == Device
    | summarize Count = countif(isnotnull(Result) and isnotempty(Result)) by tostring(Result)
    | top 1 by Count desc
    | project Result
}
OR
.create function CalculateMode(T:(data:dynamic)) {
T
| summarize Count = countif(isnotnull(data) and isnotempty(data)) by tostring(data)
| top 1 by Count desc
| project data
}
When I use the first function in a summarize:
Event
| summarize Result = CalculateMode(toint(IdAction), tostring(IdDevice), Start, End) by Category
I obtain the error No tabular expression statement found, and when I use the second function in a summarize:
Event
| summarize Result = CalculateMode(Result) by Category
I get this error:
CalculateMode(): argument #1 must be a tabular expression
What am I doing wrong?
Thanks
You can't just do summarize Result = CalculateMode(Result); summarize only accepts built-in aggregation functions, so you have to decide which aggregation function you want to summarize by (see the full list of aggregation functions in the documentation).
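That said, a mode can be computed with built-in aggregates alone: count the occurrences of each value per group, then keep the most frequent value with arg_max(). A minimal sketch, reusing the Event, Result, and Category names from the question:
Event
| where isnotempty(Result)
| summarize Count = count() by Category, Result
| summarize arg_max(Count, Result) by Category // Result now holds the mode for each Category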

How to use a filter in subselect

I want to perform a subselect on a related set of data. That subquery needs to be filtered using data from the main query:
customEvents
| extend envId = tostring(customDimensions.EnvironmentId)
| extend organisation = tostring(customDimensions.OrganisationName)
| extend version = tostring(customDimensions.Version)
| extend app = tostring(customDimensions.Appname)
| where customDimensions.EventName contains "ApiSessionStartStart"
| extend dbInfo = toscalar(
customEvents
| extend dbInfo = tostring(customDimensions.dbInfo)
| extend serverEnvId = tostring(customDimensions.EnvironmentId)
| where customDimensions.EventName == "ServiceSessionStart" or customDimensions.EventName == "ServiceSessionContinuation"
| where serverEnvId == envId // this line gives an error
| project dbInfo
| take 1)
| order by timestamp desc
| project timestamp, customDimensions.OrganisationName, customDimensions.Version, customDimensions.onBehalfOf, customDimensions.userId, customDimensions.Appname, customDimensions.apiKey, customDimensions.remoteIp, session_Id , dbInfo, envId
The above query results in an error:
Failed to resolve entity 'envId'
How can I filter the data in the subselect based on the field envId in the main query?
I believe you'd need to use a join instead, where you join to get that value from the second query.
Docs for join: https://docs.loganalytics.io/docs/Language-Reference/Tabular-operators/join-operator
The left-hand side of the join is your "outer" query, and the right-hand side is that "inner" query, though instead of doing take 1 you'd probably use a simpler query that just gets the distinct values of serverEnvId, dbInfo.
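A sketch of that rewrite, keeping the question's table and column names (kind=leftouter is an assumption; use inner if you only want events that have a matching dbInfo):
customEvents
| extend envId = tostring(customDimensions.EnvironmentId)
| where customDimensions.EventName contains "ApiSessionStartStart"
| join kind=leftouter (
    customEvents
    | where customDimensions.EventName in ("ServiceSessionStart", "ServiceSessionContinuation")
    | extend envId = tostring(customDimensions.EnvironmentId),
             dbInfo = tostring(customDimensions.dbInfo)
    | distinct envId, dbInfo
) on envId
| order by timestamp desc
| project timestamp, customDimensions.OrganisationName, customDimensions.Version, session_Id, dbInfo, envId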
