Cassandra: Query with where clause containing greather- or lesser-than (< and >) - cassandra

I'm using Cassandra 1.1.2 I'm trying to convert a RDBMS application to Cassandra. In my RDBMS application I have following table called table1:
| Col1 | Col2 | Col3 | Col4 |
Col1: String (primary key)
Col2: String (primary key)
Col3: Bigint (index)
Col4: Bigint
This table counts over 200 million records. Mostly used query is something like:
Select * from table where col3 < 100 and col3 > 50;
In Cassandra I used following statement to create the table:
create table table1 (primary_key varchar, col1 varchar,
col2 varchar, col3 bigint, col4 bigint, primary key (primary_key));
create index on table1(col3);
I changed the primary key to an extra column (I calculate the key inside my application).
After importing a few records I tried to execute following cql:
select * from table1 where col3 < 100 and col3 > 50;
This result is:
Bad Request: No indexed columns present in by-columns clause with Equal operator
The Query select col1,col2,col3,col4 from table1 where col3 = 67 works
Google said there is no way to execute that kind of queries. Is that right? Any advice how to create such a query?

Cassandra indexes don't actually support sequential access; see http://www.datastax.com/docs/1.1/ddl/indexes for a good quick explanation of where they are useful. But don't despair; the more classical way of using Cassandra (and many other NoSQL systems) is to denormalize, denormalize, denormalize.
It may be a good idea in your case to use the classic bucket-range pattern, which lets you use the recommended RandomPartitioner and keep your rows well distributed around your cluster, while still allowing sequential access to your values. The idea in this case is that you would make a second dynamic columnfamily mapping (bucketed and ordered) col3 values back to the related primary_key values. As an example, if your col3 values range from 0 to 10^9 and are fairly evenly distributed, you might want to put them in 1000 buckets of range 10^6 each (the best level of granularity will depend on the sort of queries you need, the sort of data you have, query round-trip time, etc). Example schema for cql3:
CREATE TABLE indexotron (
rangestart int,
col3val int,
table1key varchar,
PRIMARY KEY (rangestart, col3val, table1key)
);
When inserting into table1, you should insert a corresponding row in indexotron, with rangestart = int(col3val / 1000000). Then when you need to enumerate all rows in table1 with col3 > X, you need to query up to 1000 buckets of indexotron, but all the col3vals within will be sorted. Example query to find all table1.primary_key values for which table1.col3 < 4021:
SELECT * FROM indexotron WHERE rangestart = 0 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 1000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 2000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 3000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 4000 AND col3val < 4021 ORDER BY col3val;

If col3 is always known small values/ranges, you may be able to get away with a simpler table that also maps back to the initial table, ex:
create table table2 (col3val int, table1key varchar,
primary key (col3val, table1key));
and use
insert into table2 (col3val, table1key) values (55, 'foreign_key');
insert into table2 (col3val, table1key) values (55, 'foreign_key3');
select * from table2 where col3val = 51;
select * from table2 where col3val = 52;
...
Or
select * from table2 where col3val in (51, 52, ...);
Maybe OK if you don't have too large of ranges. (you could get the same effect with your secondary index as well, but secondary indexes aren't highly recommended?). Could theoretically parallelize it "locally on the client side" as well.
It seems the "Cassandra way" is to have some key like "userid" and you use that as the first part of "all your queries" so you may need to rethink your data model, then you can have queries like select * from table1 where userid='X' and col3val > 3 and it can work (assuming a clustering key on col3val).

Related

How to get top 3 columns and their values across multiple columns (dynamically) in BigQuery

I have a table that looks like this
select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4
I would like to get the top 3 absolute values for each record across the four columns and then format the output as a dictionary or STRUCT like below
select
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)]
output
output
I would like it to be dynamic, so avoid writing out columns individually. It could go up to 100 columns that change
For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI
https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions
I have looked up some examples, would like something similar to the second answer here How to get max value of column values in a record ? (BigQuery)
EDIT: the data is actually structured like this. If this can be worked with more easily, this would be a better option to work from
select 'Alice' AS ID, STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT 'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions
Consider below query.
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
Query results
Dynamic Query
I would like it to be dynamic, so avoid writing out columns individually
You need to consider a dynamic SQL for this. By refering to the answer from #Mikhail you linked in the post, you can write a dynamic query like below.
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
For updated sample table
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);

COUNT(*) vs. COUNT(1) vs COUNT(column) with WHERE condition performance in Cassandra

I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
Table is :
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2 , pk3 , pk4)
The query is executed often and takes up to 2s to execute.
I am wondering if there will be any performance gain if refactoring to:
select count(1) from tableA where pk = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
Count(1) is a condition that always evaluates to true.
Also, Count(Column_name) only returns the Non-Null values.
Since in your case because of where condition the "Non-null" is a non-factor, I don't think there will be any difference in performance in using one over the other. This answer tried confirming the same using some performance tests.
In general COUNT is not at all recommended in Cassandra . As it’s going to scan through multiple nodes and get your answer back . And I’m not sure the count you get is really consistent.

Set primary key for range query in Cassandra

I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent query is:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1= ... and type = ... and time > ... and time < ...
I don't know how to set the primary key for the fast query?
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1,id2),type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
select * from table_name where id1= X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
One final adjustment that I would make though, would be around judging just how many type/time combinations you expect to have over the life of the application. If this data will grow over time, then the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
version TEXT,
month_bucket TEXT,
PRIMARY KEY ((id1,month_bucket),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
Hope this helps.
how do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
INSERT INTO table_name_by_ids (
id1, id2, type, time, data, version
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0'
) ;
INSERT INTO table_name_by_id1_time (
id1, id2, type, time, data, version, month_bucket
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0','201910'
);
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html

Cassandra queries performance, ranges

I'm quite new with Cassandra, and I was wondering if there would be any impact in performance if a query is asked with "date = '2015-01-01'" or "date >= '2015-01-01' AND date <= '2015-01-01'".
The only reason I want to use the ranges like that is because I need to make multiple queries and I want to have them prepared (as in prepared statements). This way the prepared statements number is cut by half.
The keys used are ((key1, key2), date) and (key1, date, key2) in the two tables I want to use this. The query for the first table is similar to:
SELECT * FROM table1
WHERE key1 = val1
AND key2 = val2
AND date >= date1 AND date <= date2
For a PRIMARY KEY (key1, date, key2) that type of query just isn't possible. If you do, you'll see an error like this:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column
"key2" cannot be restricted (preceding column "date" is either not
restricted or by a non-EQ relation)"
Cassandra won't allow you to filter by a PRIMARY KEY component if the preceding column(s) are filtered by anything other than the equals operator.
On the other hand, your queries for PRIMARY KEY ((key1, key2), date) will work and perform well. The reason, is that Cassandra uses the clustering key(s) (date in this case) to specify the on-disk sort order of data within a partition. As you are specifying partition keys (key1 and key2) your result set will be sorted by date, allowing Cassandra to satisfy your query by performing a continuous read from the disk.
Just to test that out, I'll even run two queries on a table with a similar key, and turn tracing on:
SELECT * FROM log_date2 WHERe userid=1001
AND time > 32671010-f588-11e4-ade7-21b264d4c94d
AND time < a3e1f750-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4068 microseconds.
SELECT * FROM log_date2 WHERe userid=1001
AND time=74ad4f70-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4001 microseconds.

Cassandra wide row with every column as composite key

I have a wide row table with column
page_id int, user_id int, session_tid timeuuid and end_time timestamp
with partition key = user_id
I need to do multiple queries on the table, some based on one column and some based on another - and it turns out that I have cases with where clause on every column
As Cassandra doesnt allow me to use where clause on non-indexed, non-key column, is it ok if I make all of the columns my composite key? (currently all but end_time column are already composite key, with user_id as the partition key)
Making all columns as part of the primary key will not allow you to perform where conditions to each column in the way you're thinking.
To make an easy example if you create such a primary key
PK(key1, key2, key3, key4)
you won't be able to perform a query like
select * from mytable where key2 = 'xyz';
Because the rule is that you have to follow the order of keys to create a "multiple-where" condition.
So valid queries with multiple where are the following:
select * from mytable where key1 = 'xyz' and key2 = 'abc';
select * from mytable where key1 = 'xyz' and key2 = 'abc' and key3 = 11;
select * from mytable where key2 = 'xyz' and key2 = 'abc' and key3 = 11 and key4 = 2014;
You can ask for keyN only providing keyN-1
HTH,
Carlo

Resources