How to construct a range query in Cassandra? (CQL)

CREATE TABLE users (
userID uuid,
firstname text,
lastname text,
state text,
zip int,
age int,
PRIMARY KEY (userID)
);
I want to construct the following queries:
select * from users where age between 30 and 40
select * from users where state in ('AZ', 'WA')
I know I need two more tables to run these queries, but I don't know how they should be defined.
EDIT
From Carlo's comments, I see this is the only possibility:
CREATE TABLE users (
userID uuid,
firstname text,
lastname text,
state text,
zip int,
age int,
PRIMARY KEY (age,zip,userID)
);
Now, to select users with age between 15 and 30, this is the only possibility:
select * from users where age IN (15,16,17,....30)
However, using the IN operator here is not recommended and is an anti-pattern.
How about creating a secondary index on age instead?
CREATE INDEX users_age ON users (age);
Will this help?
Thanks

Range queries are a prickly subject.
The way to perform a real range query is to use a compound primary key, putting the range on the clustering part. Since the range applies to the clustering columns, you can't perform the queries you wrote: you need at least an equality condition on the whole partition key.
Let's see an example:
CREATE TABLE users (
mainland text,
state text,
uid int,
name text,
zip int,
PRIMARY KEY ((mainland), state, uid)
)
The uid is now an int just to make tests easier
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'washington', 1, 'john', 98100);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75000);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19904);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 19910);
insert into users (mainland, state, uid, name, zip) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 20150);
insert into users (mainland, state, uid, name, zip) VALUES ( 'southamerica', 'argentina', 6, 'alex', 10840);
Now the query can perform what you need:
select * from users where mainland = 'northamerica' and state > 'ca' and state < 'ny';
Output
mainland | state | uid | name | zip
-------------+----------+-----+--------+-------
northamerica | delaware | 3 | henry | 19904
northamerica | delaware | 4 | dawson | 19910
If you put an int (age, zip code) as the first column of the clustering key, you can perform the same kind of queries comparing integers.
TAKE CARE: most people looking at this situation start thinking "OK, I can put in a fake partition key that is always the same and then perform range queries". This is a huge mistake: the partition key is responsible for data distribution across nodes. Setting a fixed partition key means that all data will end up on the same node (and on its replicas).
Dividing the world into 15-20 zones (in order to have 15-20 partition keys) is something, but it is not enough; it is done here just to create a valid example.
EDIT (in response to the question's edit):
I did not say that this is the only possibility; if you can't find a valid way to partition your users and need to perform this kind of query, this is one possibility, not the only one. Range queries should be performed on the clustering-key portion. A weak point of AGE as the partition key is that you can't perform an UPDATE on it: any time you need to update a user's age, you have to perform a delete and an insert. (An alternative would be to store the birth_year/birth_date instead of the age, and calculate the age client side.)
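The birth-year alternative mentioned above can be sketched client side like this (a minimal Python sketch; the function name and the fixed reference date are my own assumptions):

```python
from datetime import date

def age_from_birth_year(birth_year, today=None):
    """Approximate age derived client side from a stored birth year.

    Storing birth_year instead of age means the stored value never
    needs an UPDATE; the age is computed at read time.
    """
    today = today or date.today()
    return today.year - birth_year

# Example with a fixed date so the result is reproducible:
print(age_from_birth_year(1990, date(2015, 6, 1)))  # → 25
```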
To answer your question about adding a secondary index: queries on a secondary index do not currently support the IN operator. From the CQL error message, it looks like support is planned:
Bad Request: IN predicates on non-primary-key columns (xxx) is not yet
supported
However, even if secondary indexes did support the IN operator, your query would still be
select * from users where age IN (15,16,17,....30)
Just to clarify my point: anything that does not have a "clean", ready-made solution requires effort from the user to model the data in a way that satisfies their needs. To give an example (I'm not saying this is a good solution; I would not use it):
CREATE TABLE users (
years_range text,
age int,
uid int,
PRIMARY KEY ((years_range), age, uid)
)
put some data
insert into users (years_range, age , uid) VALUES ( '11_15', 14, 1);
insert into users (years_range, age , uid) VALUES ( '26_30', 28, 3);
insert into users (years_range, age , uid) VALUES ( '16_20', 16, 2);
insert into users (years_range, age , uid) VALUES ( '26_30', 29, 4);
insert into users (years_range, age , uid) VALUES ( '41_45', 41, 5);
insert into users (years_range, age , uid) VALUES ( '21_25', 23, 5);
query data
select * from users where years_range in('11_15', '16_20', '21_25', '26_30') and age > 14 and age < 29;
output
years_range | age | uid
-------------+-----+-----
16_20 | 16 | 2
21_25 | 23 | 5
26_30 | 28 | 3
This solution might solve your problem and could be used in a small cluster, where about 20 keys (0_5 ... 106_110) might give a good distribution. But this solution, like the one before, does not allow an UPDATE and reduces the distribution of keys. The advantage is that you keep the IN sets small.
In a perfect world where secondary indexes already allowed the IN clause, I'd use the UUID as the partition key, the years_range (stored as birth_year_range) as a secondary index, and "filter" my data client side (if interested in 10 < age < 22, I would ask for IN('1991_1995', '1996_2000', '2001_2005', '2006_2010', '2011_2015'), calculating and discarding the unneeded years in my application).
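Computing which birth-year buckets to put in the IN clause can be done client side. Here is a minimal Python sketch; the bucket alignment, base year, and fixed current year are my own assumptions, chosen to match labels like '1991_1995':

```python
def year_buckets(min_age, max_age, current_year=2015, bucket_size=5, base_year=1901):
    """Return the 'YYYY_YYYY' bucket labels covering an age interval."""
    first_year = current_year - max_age   # oldest person -> earliest birth year
    last_year = current_year - min_age
    # Align the earliest birth year to the start of its 5-year bucket.
    start = base_year + ((first_year - base_year) // bucket_size) * bucket_size
    buckets = []
    while start <= last_year:
        buckets.append("%d_%d" % (start, start + bucket_size - 1))
        start += bucket_size
    return buckets

# Ages 11-21 in 2015 correspond to birth years 1994-2004:
print(year_buckets(11, 21))  # → ['1991_1995', '1996_2000', '2001_2005']
```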
HTH,
Carlo

I found that using ALLOW FILTERING, I can query for a range. Here is an example:
CREATE TABLE users2 (
mainland text,
state text,
uid int,
name text,
age int,
PRIMARY KEY (uid, age, state)
) ;
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'washington', 1, 'john', 81);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'texas', 1, 'lukas', 75);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 1, 'henry', 19);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 90);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 50);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'southamerica', 'argentina', 6, 'alex', 40);
select * from users2 where age>50 and age<=100 allow filtering;
uid | age | state | mainland | name
-----+-----+------------+--------------+--------
1 | 75 | texas | northamerica | lukas
1 | 81 | washington | northamerica | john
4 | 90 | delaware | northamerica | dawson
(3 rows)
I am not sure whether this is a performance killer (ALLOW FILTERING makes Cassandra scan every partition), but it seems to work. In fact, I don't even have to supply the partition key (uid in this case) during query execution.

Related

Reference to a field of a row object

I'm having trouble accessing the fields of row objects which I have created in Presto. The Presto documentation claims that "fields ... are accessed with the field reference operator." However, that doesn't seem to work. This code reproduces the problem:
CREATE TABLE IF NOT EXISTS data AS
SELECT * FROM (VALUES
(1, 'Adam', 17),
(2, 'Bill', 42)
) AS x (id, name, age);
CREATE TABLE IF NOT EXISTS ungrouped_data AS
WITH grouped_data AS (
SELECT
id,
ROW(name, age) AS name_age
FROM data
)
SELECT
id,
name_age.1 AS name,
name_age.2 AS age
FROM grouped_data;
This returns an "extraneous input '.1'" error.
Starting with Trino (formerly known as Presto) 314, it is now possible to reference ROW fields using the [] operator.
WITH grouped_data AS (
SELECT
id,
ROW(name, age) AS name_age
FROM data
)
SELECT
id,
name_age[1] AS name,
name_age[2] AS age
FROM grouped_data;
ROW(name, age) will create a row without field names. Today, to access the fields in such a row, you need to cast it to a row type with field names. Try this:
WITH grouped_data AS (
SELECT
id,
CAST(ROW(name, age) AS ROW(col1 VARCHAR, col2 INTEGER)) AS name_age
FROM data
)
SELECT
id,
name_age.col1 AS name,
name_age.col2 AS age
FROM grouped_data;
Result:
id | name | age
----+------+-----
1 | Adam | 17
2 | Bill | 42
See https://github.com/prestodb/presto/issues/7640 for discussions on this.

limit the size of data type LIST of cassandra

Friends,
I am designing a message-history table:
CREATE TABLE message_history (
user_name text PRIMARY KEY,
time timestamp,
message_details list<text>
);
so that I can query a user's messages via the primary key user_name in one request.
But the message_details list may grow very long, so I want to limit its size: I only care about the latest, say, 1000 messages for a user.
Can I achieve this?
Thanks!
I'm afraid collections are not designed for this use case. Is there any reason you can't use a clustering key instead of a list?
CREATE TABLE message_history (
user_name text,
time timestamp,
message_details text,
PRIMARY KEY(user_name, time)
) WITH CLUSTERING ORDER BY (time DESC);
insert into message_history (user_name, time, message_details) values ('user1', dateOf(now()), 'message text');
insert into message_history (user_name, time, message_details) values ('user1', dateOf(now()), 'message text2');
insert into message_history (user_name, time, message_details) values ('user1', dateOf(now()), 'message text3');
select * from message_history where user_name = 'user1' limit 1;
user_name | time | message_details
-----------+--------------------------+-----------------
user1 | 2015-08-13 15:44:45+0000 | message text3
You could do that using a map instead of a list. Keep a message id number and increment it each time the user has a new message. Then do a modulo on the message id by 1000 and use that as the map key.
By doing a modulo, the value will wrap around every 1000 and overwrite the oldest message to replace it with the most recent one.
So your table could look like this:
CREATE TABLE message_history (
user_name text PRIMARY KEY,
time timestamp,
last_msg_id int static,
message_details map<int, text>
);
Before you save a new message, read the current value of last_msg_id and increment it, calculate the modulo 1000, and then update the map using the modulo result as the key and the new message as the text, and also update last_msg_id.
Or you could keep last_msg_id as a counter column in a separate table.
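The modulo scheme above can be sketched in Python with an in-memory stand-in for the map<int, text> column (the helper name and the tiny capacity are my own, used only to show the wrap-around):

```python
def message_slot(last_msg_id, capacity=1000):
    """Return the next message id and the map key it lands on.

    The key wraps around every `capacity` messages, so a write
    overwrites the oldest entry, keeping only the latest `capacity`.
    """
    next_id = last_msg_id + 1
    return next_id, next_id % capacity

# Simulate the map<int, text> column with a capacity of 2:
messages = {}
last_id = 0
for text in ("m1", "m2", "m3"):
    last_id, slot = message_slot(last_id, capacity=2)
    messages[slot] = text

print(messages)  # → {1: 'm3', 0: 'm2'}  ("m3" overwrote "m1" in slot 1)
```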

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in a table. I want to retrieve the posts from all the users that a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
) WITH CLUSTERING ORDER BY (time DESC);
I have the data about who all user follows in another table
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am making a query like this:
select * from posts where userid in(1,2,3,4....n);
Two questions:
Why do I still get data in random order, even though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have any number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That returns your results ordered by time, as all of the rows within the userid=1 partitioning key are clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
) WITH CLUSTERING ORDER BY (time DESC);
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
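Evenly assigning users to the eight example groups can be done deterministically on the client. A minimal Python sketch follows; the md5-based bucketing is my own choice, made so that the assignment is stable across processes (unlike Python's built-in hash()):

```python
import hashlib

GROUPS = "ABCDEFGH"  # the eight example groups

def group_for(userid):
    """Deterministically map a userid to one of the groups."""
    digest = hashlib.md5(str(userid).encode("utf-8")).digest()
    return GROUPS[digest[0] % len(GROUPS)]
```

The same userid always lands in the same group, so both the writer (inserting a post) and the reader (querying a partition) agree on the partition without any coordination.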
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
However, if you query by just userid, you can see posts from everyone you follow. In that case, though, the posts will only be ordered within each follows value, like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema:
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it, and follows represents the userid of one of that user's followers. Say user x follows 10 other people; then I make 10+1 inserts. There is definitely a lot of data duplication. However, it is now easier to get the timeline for one of the users with the following query:
select * from posts where follows=?

Mixing column types in Cassandra / wide rows

I am trying to learn how to implement a feed in cassandra (think twitter). I want to use wide rows to store all the posts made by a user.
I am thinking about adding user information or statistical information in the same row (num of posts, last post date, user name, etc.).
My question is: are the field names (name, age, etc.) stored in each column? Or do these wide rows only store the column names and values actually specified? Am I wasting disk space? Am I compromising performance somehow?
Thanks!
-- TABLE CREATION
CREATE TABLE user_feed (
owner_id int,
name text,
age int,
posted_at timestamp,
post_text text,
PRIMARY KEY (owner_id, posted_at)
);
-- INSERTING THE USER
insert into user_feed (owner_id, name, age, posted_at) values (1, 'marc', 36, 0);
-- INSERTING USER POSTS
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'first post!');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'hello there');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'i am kind of happy');
-- GETTING THE FEED
select * from user_feed where owner_id=1 and posted_at>0;
-- RESULT
owner_id | posted_at | age | name | post_text
----------+--------------------------+------+------+--------------------
1 | 2014-07-04 12:01:23+0000 | null | null | first post!
1 | 2014-07-04 12:01:23+0000 | null | null | hello there
1 | 2014-07-04 12:01:23+0000 | null | null | i am kind of happy
-- GETTING USER INFO - ONLY USER INFO IS POSTED_AT=0
select * from user_feed where owner_id=1 and posted_at=0;
-- RESULT
owner_id | posted_at | age | name | post_text
----------+--------------------------+------+------+--------------------
1 | 1970-01-01 00:00:00+0000 | 36 | marc | null
What about making them static?
A static column has the same value for every row in a partition, and since your partition key is the owner's id, you avoid wasting space and can retrieve the user information in any query.
CREATE TABLE user_feed (
owner_id int,
name text static,
age int static,
posted_at timestamp,
post_text text,
PRIMARY KEY (owner_id, posted_at)
);
Cheers,
Carlo

Primary key in cassandra is unique?

This might be kind of a lame question, but does the primary key in Cassandra have to be unique?
For example in the following table:
CREATE TABLE users (
name text,
surname text,
age int,
address text,
PRIMARY KEY(name, surname)
);
So is it possible to have two people in my database with the same name and surname but different ages? That would mean the same primary key.
Yes, the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key.
In your case you can have two rows with the same name or with the same surname, but not with both the same.
By definition, the primary key has to be unique. But that doesn't mean you can't accomplish your goals. You just need to change your approach/terminology.
First of all, if you relax your goal of having the name+surname be a primary key, you can do the following:
CREATE TABLE users ( name text, surname text, age int, address text, PRIMARY KEY((name, surname),age) );
insert into users (name,surname,age,address) values ('name1','surname1',10,'address1');
insert into users (name,surname,age,address) values ('name1','surname1',30,'address2');
select * from users where name='name1' and surname='surname1';
name | surname | age | address
-------+----------+-----+----------
name1 | surname1 | 10 | address1
name1 | surname1 | 30 | address2
If, on the other hand, you wanted to ensure that the address is shared as well, then you probably just want to store a collection of ages in the user record. That could be achieved by:
CREATE TABLE users2 ( name text, surname text, age set<int>, address text, PRIMARY KEY(name, surname) );
insert into users2 (name,surname,age,address) values ('name1','surname1',{10,30},'address2');
select * from users2 where name='name1' and surname='surname1';
name | surname | address | age
-------+----------+----------+----------
name1 | surname1 | address2 | {10, 30}
So it comes back to what you actually need to accomplish. Hopefully the above examples give you some ideas.
The primary key is unique. With your data model, you can only have one age per (name, surname) combination.
Yes, as mentioned in the comments above, you can create a composite key with name, surname, and age to achieve your goal, but that still won't solve the problem. Instead, consider adding a new column, userID, and making it the primary key. Then, even if name, surname, and age are duplicated, you don't have to revisit your data model.
CREATE TABLE users (
userId int,
name text,
surname text,
age int,
address text,
PRIMARY KEY(userid)
);
I would state specifically that the partition key should be unique. I could not find it stated in one place, but it follows from these statements:
"Cassandra needs all the partition key columns to be able to compute the hash that will allow it to locate the nodes containing the partition."
"The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database."
"Please note that there will not be any error if you insert the same partition key again and again, as there is no constraint check."
"Queries that you'll run equality searches on should be in a partition key."
References
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
how Cassandra chooses the coordinator node and the replication nodes?
Insert query replaces rows having same data field in Cassandra clustering column
