Clustering column non-EQ relation - Cassandra

I have a table:
CREATE TABLE room (
uuidhotel text,
startreservetime double,
endreservetime double,
uuid text,
uuidguest text,
uuidroom text,
PRIMARY KEY (uuidhotel, startreservetime, endreservetime)
);
A query like this works:
select * from room WHERE uuidhotel = 'testUUIDHotel' and startreservetime > 1;
but when I try to use:
cqlsh:hotelier> select * from room WHERE uuidhotel = 'testUUIDHotel' and startreservetime > 1 and endreservetime < 3;
I get an error:
InvalidRequest: code=2200 [Invalid query] message="Clustering column "endreservetime" cannot be restricted (preceding column "startreservetime" is restricted by a non-EQ relation)"
How can I execute a query with 3 parameters?
Any alternatives?

Unfortunately, if you want to use a greater/less-than operator on a PRIMARY KEY component in Cassandra, all of the preceding PRIMARY KEY components must be restricted with an equals operator.
So how can you query for a date range? Well, you can specify the same PRIMARY KEY component twice. With your current model that doesn't help you, but with a small modeling change (coupled with storing each row twice... once for the start and again for the end) it does:
aploetz@cqlsh:stackoverflow> SELECT * FROM room WHERE uuidhotel = 'testUUIDHotel'
AND reservetime > 1 AND reservetime < 3;
uuidhotel | reservetime | startend | uuid | uuidguest | uuidroom
---------------+-------------+----------+--------------------------------------+--------------------------------------+--------------------------------------
testUUIDHotel | 2.1 | S | 49c441cd-a6cd-4638-85b3-fdc3405779f4 | cd3ad747-42a3-4d31-b02a-8190dd8559d8 | daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f
testUUIDHotel | 2.2 | E | 49c441cd-a6cd-4638-85b3-fdc3405779f4 | cd3ad747-42a3-4d31-b02a-8190dd8559d8 | daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f
(2 rows)
Basically, if you store an entry for each of the start and the end (and use startend in your key for uniqueness), you'll be able to effectively query with greater/less-than on the time. Just make sure you query with a wide enough window that you don't fall entirely in-between the start and end reservetimes of a reservation you care about.
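For reference, the remodeled table might look like this (a minimal sketch, reconstructed from the result set above; the exact column types are an assumption):
CREATE TABLE room (
uuidhotel text,
reservetime double,
startend text,
uuid text,
uuidguest text,
uuidroom text,
PRIMARY KEY (uuidhotel, reservetime, startend)
);
-- one reservation is written twice: once for its start ('S'), once for its end ('E')
INSERT INTO room (uuidhotel, reservetime, startend, uuid, uuidguest, uuidroom)
VALUES ('testUUIDHotel', 2.1, 'S', '49c441cd-a6cd-4638-85b3-fdc3405779f4', 'cd3ad747-42a3-4d31-b02a-8190dd8559d8', 'daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f');
INSERT INTO room (uuidhotel, reservetime, startend, uuid, uuidguest, uuidroom)
VALUES ('testUUIDHotel', 2.2, 'E', '49c441cd-a6cd-4638-85b3-fdc3405779f4', 'cd3ad747-42a3-4d31-b02a-8190dd8559d8', 'daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f');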

EDIT: Apparently tuple comparisons don't work the way I thought they did... The answer below doesn't work: it compares the entire tuple lexicographically, NOT the individual elements, i.e. (1, 2) < (3, 1) returns true. I'm leaving it in case it inspires a better method...
Another approach: you can query with non-EQ conditions on multiple clustering columns using tuples.
However, it would require you to have a single non-EQ operator, i.e. your query would have to look like:
SELECT * FROM table WHERE (c1, c2) > (1, 3)
So you have to transform your values such that you can use a single operator. You can do this by negating both sides!
Remember from algebra:
5 < 10
If you negate both sides, you have to switch the operator:
-5 > -10
So create a new column negative_c2 with the negated values from c2 and do the query:
SELECT * FROM table WHERE (c1, negative_c2) > (1, -3)
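To make the EDIT above concrete, here is a minimal sketch (a hypothetical table t with partition key pk) of why the lexicographic comparison defeats the trick:
CREATE TABLE t (pk int, c1 int, negative_c2 int, PRIMARY KEY (pk, c1, negative_c2));
INSERT INTO t (pk, c1, negative_c2) VALUES (0, 2, -10);  -- i.e. c2 = 10
-- Intended meaning: c1 > 1 AND c2 < 3. Actual tuple semantics are lexicographic:
SELECT * FROM t WHERE pk = 0 AND (c1, negative_c2) > (1, -3);
-- The row comes back, because (2, -10) > (1, -3) is decided by 2 > 1 alone,
-- even though c2 = 10 violates the intended c2 < 3.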

Related

Cassandra Range queries on Map values using timestamp

I have below Cassandra table.
create table person(
id int PRIMARY KEY,
name text,
imp_dates map<text,timestamp>
);
Data is inserted as below:
insert into person(id,name,imp_dates) values(1,'one',{'birth':'1982-04-01','marriage':'2018-04-01'});
insert into person(id,name,imp_dates) values(2,'two',{'birth':'1980-04-01','marriage':'2010-04-01'});
insert into person(id,name,imp_dates) values(3,'three',{'birth':'1980-04-01','graduation':'2012-04-01'});
id | name | imp_dates
----+-------+-----------------------------------------------------------------------------------------------
1 | one | {'birth': '1982-03-31 18:30:00.000000+0000', 'marriage': '2018-03-31 18:30:00.000000+0000'}
2 | two | {'birth': '1980-03-31 18:30:00.000000+0000', 'marriage': '2010-03-31 18:30:00.000000+0000'}
3 | three | {'birth': '1980-03-31 18:30:00.000000+0000', 'graduation': '2012-03-31 18:30:00.000000+0000'}
I have a requirement to write a query like the one below. This requires a range on a map value column.
select id,name,imp_dates from person where id =1 and imp_dates['birth'] < '2000-04-01';
I get the following error:
Error from server: code=2200 [Invalid query] message="Only EQ relations are supported on map entries"
The possible solutions I can think of are:
1) Flatten the map into multiple columns and make them part of the primary key. This will work, but it's not flexible, since I may have to alter the schema.
2) Create another table person_id_by_important_dates to replace the map, but then I lose read consistency, as I have to read from two tables and join them myself.
I do not wish to include imp_dates (the map) as part of the primary key, as it would create a new row every time I insert with new values.
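To illustrate, option 2 would look something like this (a sketch; the exact layout is undecided, clustering by date type and then date so the range restriction is legal):
CREATE TABLE person_id_by_important_dates (
id int,
date_type text,
date timestamp,
PRIMARY KEY (id, date_type, date)
);
-- equality on the leading clustering column, range on the last one:
SELECT * FROM person_id_by_important_dates
WHERE id = 1 AND date_type = 'birth' AND date < '2000-04-01';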
Appreciate help with this.
Thanks

CQL query in Cassandra with composite partition key

My main problem is with paginating a Cassandra result set on a table with a composite partition key. However, I am trying to narrow it down to a simple scenario. Say I have a table:
CREATE TABLE numberofrequests (
cluster text,
date text,
time text,
numberofrequests int,
PRIMARY KEY ((cluster, date), time)
) WITH CLUSTERING ORDER BY (time ASC);
And I have data like:
cluster | date | time | numberofrequests
---------+------------+------+------------------
c2 | 01/04/2015 | t1 | 1
c2 | d1 | t1 | 1
c2 | 02/04/2015 | t1 | 1
c1 | d1 | t1 | 1
c1 | d1 | t2 | 2
Question: Is there any way I can query data for cluster=c2? I don't care about the 'date'; honestly speaking, I keep it only for partitioning purposes, to avoid hot-spots. I tried the following:
select * from numberofrequests where token(cluster,date)>=token('c2','00/00/0000');
select * from numberofrequests where token(cluster,date)>=token('c2','1');
select * from numberofrequests where token(cluster,date)>=token('c2','a');
select * from numberofrequests where token(cluster,date)>=token('c2','');
My schema uses the default partitioner (Murmur3Partitioner). Is this achievable at all?
Cassandra needs the partition key to locate the queried row. Any query based on only part of the partition key will not work, since its murmur3 hash won't match the hash of the complete partition key as initially computed by the partitioner. What you could do instead is use the ByteOrderedPartitioner. That would allow you to use the token() function as in your examples, since it keeps the byte order of the partition key instead of hashing it. But in most cases that's a bad idea: data will not be distributed evenly across the cluster, and you'll end up with the hotspots you tried to avoid in the first place.
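If you stay on the default Murmur3Partitioner, the practical alternative is to fan out one query per complete partition key from the client (a sketch, assuming the application can enumerate the date values it has written):
SELECT * FROM numberofrequests WHERE cluster = 'c2' AND date = '01/04/2015';
SELECT * FROM numberofrequests WHERE cluster = 'c2' AND date = '02/04/2015';
-- ...one query per known date bucket, results merged client-side.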

Cassandra DB: Why does a less-than query fail?

I have created a KEYSPACE and a TABLE with a uuid column as the primary key and a timestamp column with an index on it. All of this succeeded, as shown below:
cassandra@cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:10:30', '111' );
cassandra@cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:20:30', '222' );
cassandra@cqlsh:my_keyspace> select * from my_test;
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
69579f6f-bf88-493b-a1d6-2f89fac25650 | 2015-03-12 09:10:30+0000 | 111
(2 rows)
And now query:
cassandra@cqlsh:my_keyspace> select * from my_test where insert_time = '2015-03-12 09:20:30';
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
(1 rows)
And now query with less-than:
cassandra@cqlsh:my_keyspace> select * from my_test where insert_time < '2015-03-12 09:20:30';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_time < <value>'"
While the first query succeeds, why did this happen? How can I make the second query work, since that's exactly what I want?
You can test all this on your own machine with the schema below. Thanks
CREATE TABLE my_test (
id uuid PRIMARY KEY,
insert_time timestamp,
value text
) ;
CREATE INDEX my_test_insert_time_idx ON my_keyspace.my_test (insert_time);
Cassandra range queries are quite limited. This comes down to performance and data storage mechanics. A range query must do the following:
Hit one partition key (or a few with IN), and include exact matches on all preceding clustering keys; only the last clustering key in the query may be restricted by a range.
Say your PK is (a, b, c, d), then the following are allowed:
where a=a1 and b < b1
where a=a1 and b=b1 and c < c1
The following is not:
where a=a1 and c < 1
[I won't go into ALLOW FILTERING here... avoid it.]
Secondary indexes must be exact matches. You can't have range queries on them.
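If the less-than query on insert_time is what you need, the usual fix is a dedicated query table clustered by insert_time rather than a secondary index. A minimal sketch (the bucket column is an assumption, there to bound partition size):
CREATE TABLE my_test_by_time (
bucket text,            -- e.g. one partition per month: '2015-03'
insert_time timestamp,
id uuid,
value text,
PRIMARY KEY (bucket, insert_time, id)
);
SELECT * FROM my_test_by_time
WHERE bucket = '2015-03' AND insert_time < '2015-03-12 09:20:30';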

Can an index be created on a UUID Column?

Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-----------------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz@cqlsh> use stackoverflow ;
aploetz@cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz@cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz@cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz@cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz@cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
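Such a query table might look like this (a minimal sketch; the table name is illustrative, with the old partition key columns demoted to clustering keys for uniqueness):
CREATE TABLE some_data_by_run (
run_id uuid,
site_id int,
user_id int,
value int,
PRIMARY KEY (run_id, site_id, user_id)
);
SELECT * FROM some_data_by_run WHERE run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;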
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But the nature of UUIDs, and their underlying purpose of being unique, sets off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.

Cassandra: Query with where clause containing greater- or less-than (< and >)

I'm using Cassandra 1.1.2. I'm trying to convert an RDBMS application to Cassandra. In my RDBMS application I have the following table, called table1:
| Col1 | Col2 | Col3 | Col4 |
Col1: String (primary key)
Col2: String (primary key)
Col3: Bigint (index)
Col4: Bigint
This table contains over 200 million records. The most-used query is something like:
Select * from table where col3 < 100 and col3 > 50;
In Cassandra I used the following statements to create the table:
create table table1 (primary_key varchar, col1 varchar,
col2 varchar, col3 bigint, col4 bigint, primary key (primary_key));
create index on table1(col3);
I changed the primary key to an extra column (I calculate the key inside my application).
After importing a few records I tried to execute the following CQL:
select * from table1 where col3 < 100 and col3 > 50;
The result is:
Bad Request: No indexed columns present in by-columns clause with Equal operator
The query select col1,col2,col3,col4 from table1 where col3 = 67 works.
Google says there is no way to execute that kind of query. Is that right? Any advice on how to build such a query?
Cassandra indexes don't actually support sequential access; see http://www.datastax.com/docs/1.1/ddl/indexes for a good quick explanation of where they are useful. But don't despair; the more classical way of using Cassandra (and many other NoSQL systems) is to denormalize, denormalize, denormalize.
It may be a good idea in your case to use the classic bucket-range pattern, which lets you use the recommended RandomPartitioner and keep your rows well distributed around your cluster, while still allowing sequential access to your values. The idea in this case is that you would make a second dynamic columnfamily mapping (bucketed and ordered) col3 values back to the related primary_key values. As an example, if your col3 values range from 0 to 10^9 and are fairly evenly distributed, you might want to put them in 1000 buckets of range 10^6 each (the best level of granularity will depend on the sort of queries you need, the sort of data you have, query round-trip time, etc). Example schema for cql3:
CREATE TABLE indexotron (
rangestart int,
col3val int,
table1key varchar,
PRIMARY KEY (rangestart, col3val, table1key)
);
When inserting into table1, you should insert a corresponding row in indexotron, with rangestart set to the start of the bucket (for 10^6-wide buckets, rangestart = int(col3val / 10^6) * 10^6; the example below uses 1000-wide buckets so the numbers stay readable). Then when you need to enumerate all rows in table1 with col3 > X, you need to query up to 1000 buckets of indexotron, but all the col3vals within each bucket will be sorted. Example query to find all table1.primary_key values for which table1.col3 < 4021:
SELECT * FROM indexotron WHERE rangestart = 0 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 1000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 2000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 3000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 4000 AND col3val < 4021 ORDER BY col3val;
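For completeness, the corresponding double-write at insert time might look like this (a sketch using the 1000-wide buckets from the queries above; the key and values are made up):
INSERT INTO table1 (primary_key, col1, col2, col3, col4) VALUES ('k1', 'a', 'b', 4015, 9);
-- 4000 = int(4015 / 1000) * 1000
INSERT INTO indexotron (rangestart, col3val, table1key) VALUES (4000, 4015, 'k1');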
If col3 always takes known, small values/ranges, you may be able to get away with a simpler table that also maps back to the initial table, e.g.:
create table table2 (col3val int, table1key varchar,
primary key (col3val, table1key));
and use
insert into table2 (col3val, table1key) values (55, 'foreign_key');
insert into table2 (col3val, table1key) values (55, 'foreign_key3');
select * from table2 where col3val = 51;
select * from table2 where col3val = 52;
...
Or
select * from table2 where col3val in (51, 52, ...);
This may be OK if your ranges aren't too large. (You could get the same effect with your secondary index as well, but secondary indexes aren't highly recommended.) You could theoretically parallelize it "locally on the client side" as well.
It seems the "Cassandra way" is to have some key like userid as the first part of all your queries, so you may need to rethink your data model. Then you can have queries like select * from table1 where userid='X' and col3val > 3, and it will work (assuming col3val is a clustering key).
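A sketch of that shape (table and column names illustrative):
CREATE TABLE table1_by_user (
userid varchar,
col3val bigint,
table1key varchar,
PRIMARY KEY (userid, col3val, table1key)
);
SELECT * FROM table1_by_user WHERE userid = 'X' AND col3val > 3;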
