Ranges (intervals) request in Cassandra DB - CQL - cassandra

Excuse, if it is a duplicate, I've found a few questions about times ranges here, but my case seems a little bit different and not yet discussed.
I would like to store quite big chunks (bins) of data (blobs - 2-4Mb, this is the “black-box data”, I can't change its layout) to access with interval keys:
...
primary key ( bin_id int, from_item_id int, to_item_id int )
...
with ability to select with items ranges, like in this pseudo-code to select all chunks that contains interval of items [110, 200]:
select chunk from tb1 where chunk_id = 100500 and from_item_id >= 110 and to_item_id <= 200;
Attempt to run such a query directly ended with error:
code=2200 [Invalid query] message="PRIMARY KEY column "to_item_id" cannot be restricted (preceding column "from_item_id" is restricted by a non-EQ relation)"
Currently only solution I've found is to implement additional table (tb_map) with reverse mapping from item_id to bin_id and use select to make a query looks something like this:
...
– in tb_map
primary key (dummy_id, item_id)
...
select bin_id from tb_map where dummy_id = SOME_MAGIK and item_id >= 110 and item_id <= 200;
And then use bin_id to retrieve chunks from tb1 with EQ or IN operator like here:
select * from tb1 where bin_id in (...);
But I can't use this model due insert performance issues (application should avoid many inserts to additional table and should avoid maintaining additional data structures, but should be "as simple as nail").
Is it any simple solution to stay within one table (or several simple tables)? I'm stuck with no ideas how to model such behaviour in C* (may be slices should be used?), could local C* gurus provide any hints?
I'm using CQL 3.1

From CQL3 reference:
Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows.
In your case the query doesn't select a contiguous set of rows, so Cassandra refuses to process it.

Related

Spanner SQL Filtering by Less Than on Tuples

Is there any way to achive <, >, etc comparisons in the WHERE clause of a Spanner SQL query where the values compared are not scalar but tuples/structs?
For example, say we have a table users (intentionally unrealistic schema)
CREATE TABLE users (
is_special BOOL NOT NULL,
registered_on TIMESTAMP NOT NULL,
) PRIMARY KEY (is_special DESC, registered_on DESC)
The natural sort order of the PK index is then is_special DESC, registered_on DESC.
I want select a range of rows starting with a specific row in that PK index (i.e. from a cursor):
SELECT * FROM users
WHERE (is_special, registered_on) < (#cursor.is_special, #cursor.registered_on)
LIMIT 100
That's not allowed by Spanner SQL because the tuple is treated as a STRUCT type and STRUCT types do not allow the < comparison. Is there any other way to achieve this?
With the Read API, I can query a range by using a KeyRange and providing the PK of the row I want to start the query from, but I'd like to achieve the same in SQL.
Here is how to write the query using individual fields. This relies on the fact that column is_special is not nullable.
SELECT * FROM users
WHERE (is_special < #cursor.is_special) OR (is_special = #cursor.is_special AND registered_on < #cursor.registered_on)
LIMIT 100
Just for completeness; if column is_special is nullable then it gets a uglier.
SELECT * FROM users
WHERE (is_special < #cursor.is_special) OR ((is_special = #cursor.is_special OR (is_special IS NULL AND #cursor.is_special IS NULL)) AND registered_on < #cursor.registered_on)
LIMIT 100
Additional comment. The query has a LIMIT clause but no ORDER BY clause. This is legal but unusual and it looks like a bug given that the query is paging results.
I think the query should have the following clause:
ORDER BY is_special, registered_on
The reason is as follows:
If a SQL query does not have an ORDER BY clause then it does not provide any row ordering guarantee. In practice you will observe ordering in Spanner results even without an ORDER BY clause but no order is guaranteed and you should not rely on it. However, if a query has an ORDER BY and Spanner uses an index that provides the required order then Spanner will not explicitly sort the data. Therefore you need not worry about the performance or memory impact of including ORDER BY.

Cassandra slow SELECT MAX(x) query

I have a dev machine with Cassandra 3.9 and 2 tables, one has ~~ 400,000 records, another about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In cassandra aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifing a partition key is a bad practice. instead, you can have an other table that store the max value of x field for both tables.
However, you have to add some client side logic to maintain this max value in the other table when you run INSERT or UPDATE statements.
Tables structures :
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
So with this structure you can have the max value for a table :
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
Hope this can help you.

Cassandra queries performance, ranges

I'm quite new with Cassandra, and I was wondering if there would be any impact in performance if a query is asked with "date = '2015-01-01'" or "date >= '2015-01-01' AND date <= '2015-01-01'".
The only reason I want to use the ranges like that is because I need to make multiple queries and I want to have them prepared (as in prepared statements). This way the prepared statements number is cut by half.
The keys used are ((key1, key2), date) and (key1, date, key2) in the two tables I want to use this. The query for the first table is similar to:
SELECT * FROM table1
WHERE key1 = val1
AND key2 = val2
AND date >= date1 AND date <= date2
For a PRIMARY KEY (key1, date, key2) that type of query just isn't possible. If you do, you'll see an error like this:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column
"key2" cannot be restricted (preceding column "date" is either not
restricted or by a non-EQ relation)"
Cassandra won't allow you to filter by a PRIMARY KEY component if the preceding column(s) are filtered by anything other than the equals operator.
On the other hand, your queries for PRIMARY KEY ((key1, key2), date) will work and perform well. The reason, is that Cassandra uses the clustering key(s) (date in this case) to specify the on-disk sort order of data within a partition. As you are specifying partition keys (key1 and key2) your result set will be sorted by date, allowing Cassandra to satisfy your query by performing a continuous read from the disk.
Just to test that out, I'll even run two queries on a table with a similar key, and turn tracing on:
SELECT * FROM log_date2 WHERe userid=1001
AND time > 32671010-f588-11e4-ade7-21b264d4c94d
AND time < a3e1f750-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4068 microseconds.
SELECT * FROM log_date2 WHERe userid=1001
AND time=74ad4f70-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4001 microseconds.

cassandra error when using select and where in cql

I have a cassandra table defined like this:
CREATE TABLE test.test(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (id, time, tag)
)
And I want to select one column using select.
I tried:
select * from test where lonumb = 4231;
It gives:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Also I cannot do
select * from test where mstatus = true;
Doesn't cassandra support where as a part of CQL? How to correct this?
You can only use WHERE on the indexed or primary key columns. To correct your issue you will need to create an index.
CREATE INDEX iname
ON keyspacename.tablename(columname)
You can see more info here.
But you have to keep in mind that this query will have to run against all nodes in the cluster.
Alternatively you might rethink your table structure if the lonumb is something you'll do the most queries on.
Jny is correct in that WHERE is only valid on columns in the PRIMARY KEY, or those where a secondary index has been created for. One way to solve this issue is to create a specific query table for lonumb queries.
CREATE TABLE test.testbylonumb(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (lonumb, time, id)
)
Now, this query will work:
select * from testbylonumb where lonumb = 4231;
It will return all CQL rows where lonumb = 4231, sorted by time. I put id on the PRIMARY KEY to ensure uniqueness.
select * from test where mstatus = true;
This one is trickier. Indexes and keys on low-cardinality columns (like booleans) are generally considered a bad idea. See if there's another way you could model that. Otherwise, you could experiment with a secondary index on mstatus, but only use it when you specify a partition key (lonumb in this case), like this:
select * from testbylonumb where lonumb = 4231 AND mstatus = true;
Maybe that wouldn't perform too badly, as you are restricting it to a specific partition. But I definitely wouldn't ever do a SELECT * on mstatus.

CQL query on 'validFrom/validTo timestamps'

I'm currently trying to model a column family that has two timestamps specifying whether an entry is valid (or 'active') at a given date (typically execution time).
No big issue with traditional SQL, 64 gigs of RAM and some indices, we're doing that quite often with our SQL server.
However, in CQL I haven't managed to model this scenario and write valid queries for it.
My basic model is (I skipped the PK definition!)
create table myTable(
id uuid,
validFrom timeuuid,
validTo timeuuid,
someInformationalData varChar
);
Some explanations:
due to the fact, that a validity date is not unique, I need a combined key in my final application this is going to be a usergroup reference (would be an ideal partition key)
validFrom/To are designed to be optional, but I could deal with by using boundary values (1970, 2038) for 'null' values passed through the persistence layer
I tried various combinations of partitioning/clustering keys, however neither of them resulted in valid CQL
-- only active results
select *
from
myTable
where
validFrom < now()
and
validTo > now()
I'm quite new to the NoSQL/CQL world and am struggling a bit with converting some of our applications. I could do it in memory, but I'm afraid, this could get a bottleneck at some point...
No sure if this kind of 'I have no idea what I'm doing' yell is appropriate, but any kind of help would be appreciated. :)
edit Here's one of the approaches I've been messing around with
drop table if exists myTable;
create table myTable(
id int,
datefrom timeuuid,
dateto timeuuid,
someColumns varChar,
primary key((id,datefrom),dateto)
);
create index if not exists my_idx on myTable(datefrom);
insert into myTable(id, datefrom,dateto,somecolumns)
values(0,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2020-01-01 00:00:00'),'test');
insert into myTable(id,datefrom,dateto,somecolumns)
values(1,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2012-01-01 00:00:00'),'test2');
select * from myTable where dateto > now() allow filtering;
-- invalid ("A column of a partition key can be restricted only if the preceding one is restricted by an Equal relation.")
select * from myTable where datefrom < now() and dateto > now() allow filtering;
The first query is limiting my result, the row with 'validTo=2012-01-01' is filtered, but I wasn't able to work out a scheme that worked on both limitations in the where clause.
If I understand your problem, what you are looking for is a way to run a range query based on the timestamp. Basically to be able to do this, your model will have to have the timestamp component as part of the clustering key:
create table myTable(
eventType uuid,
ts timestamp,
val text,
PRIMARY KEY (eventType, ts)
);
The above will allow you to run a query like: SELECT eventType, val from myTable where eventType = 'your_event' and ts >= 'start_ts' and ts < 'end_ts'.
What you need to remember is that the clustering keys are dictating the order on disk, thus making it possible to run efficiently queries like above. You can read more details about this in the CQL spec SELECT section.
Their is no such thing as Now() in cassandra like any other sql databases. you have to clearly mention today's date instead of Now() ..
You can use columns in which you defined as primary key or secondary index in where clause.

Resources