Cassandra - alternate way for clustering key with ORDER BY and UPDATE

My schema is:
CREATE TABLE friends (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY (userId,friendId)
);
CREATE TABLE friends_by_status (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY ((userId,status), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Here, whenever a friend request is made, I'll insert a record into both tables.
When I want to check the one-to-one status between two users, I'll use this query:
SELECT status FROM friends WHERE userId=xxx AND friendId=xxx;
When I need to query all the records with pending status, I'll use:
SELECT * FROM friends_by_status WHERE userId=xxx AND status='pending';
But when there is a status change, I can update 'status' and 'ts' in the 'friends' table, but not in the 'friends_by_status' table, as both are part of the PRIMARY KEY.
You can see that even though I have denormalised, I still need to update 'status' and 'ts' in the 'friends_by_status' table to maintain consistency.
The only way I can maintain consistency is to delete the record and insert it again.
But frequent deletes are also not recommended in the Cassandra model, as mentioned in Spotify's Cassandra Summit talk.
I find this to be the biggest limitation in Cassandra.
Is there any other way to solve this issue?
Any solution is appreciated.
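A minimal sketch of that delete-and-reinsert dual write, kept atomic with a logged batch (the bind markers and the 'pending'/'accepted' values are only placeholders):
BEGIN BATCH
UPDATE friends SET status = 'accepted', ts = ? WHERE userId = ? AND friendId = ?;
DELETE FROM friends_by_status WHERE userId = ? AND status = 'pending' AND ts = ?; -- old ts
INSERT INTO friends_by_status (userId, friendId, status, ts) VALUES (?, ?, 'accepted', ?); -- same new ts as the UPDATE above
APPLY BATCH;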

I don't know how soon you need to deploy this, but in Cassandra 3.0 you could handle this with a materialized view. Your friends table would be the base table, and friends_by_status would be a view of that base table. Cassandra would take care of updating the view when you change the base table.
For example:
CREATE TABLE friends (
userid int,
friendid int,
status varchar,
ts timeuuid,
PRIMARY KEY (userid, friendid)
);
CREATE MATERIALIZED VIEW friends_by_status AS
SELECT userid FROM friends
WHERE userid IS NOT NULL AND friendid IS NOT NULL AND status IS NOT NULL AND ts IS NOT NULL
PRIMARY KEY ((userid, status), friendid);
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 500, 'pending', now());
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 501, 'accepted', now());
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 502, 'pending', now());
SELECT * FROM friends;
userid | friendid | status | ts
--------+----------+----------+--------------------------------------
1 | 500 | pending | a02f7fe0-49f9-11e5-9e3c-ab179e6a6326
1 | 501 | accepted | a6c80980-49f9-11e5-9e3c-ab179e6a6326
1 | 502 | pending | add10830-49f9-11e5-9e3c-ab179e6a6326
So now in the view you can select rows by the status:
SELECT * FROM friends_by_status WHERE userid=1 AND status='pending';
userid | status | friendid
--------+---------+----------
1 | pending | 500
1 | pending | 502
(2 rows)
And then when you update the status in the base table, it automatically updates in the view:
UPDATE friends SET status='pending' WHERE userid=1 AND friendid=501;
SELECT * FROM friends_by_status WHERE userid=1 AND status='pending';
userid | status | friendid
--------+---------+----------
1 | pending | 500
1 | pending | 501
1 | pending | 502
(3 rows)
But note that in the view you couldn't have ts as part of the key, since you can only add one non-key field from the base table as part of the key in the view, which in your case would be adding 'status' to the key.
I think the first beta release for 3.0 is coming out tomorrow if you want to try this out.

Why do you need status to be in the primary key for your second table? If this was your schema:
CREATE TABLE friends_by_status (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY ((userId), status, ts)
) WITH CLUSTERING ORDER BY (status ASC, ts DESC);
you can update the status as needed and still filter by it. You will be storing more data under one partition but it seems like you are storing one row for each friend a user has. This will be the same as in the first table, so I don't see partition size being a problem.
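With that layout, the pending-requests lookup still works as an equality restriction on the first clustering column; a minimal sketch (the bind marker is a placeholder):
SELECT * FROM friends_by_status WHERE userId = ? AND status = 'pending';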

Related

Group by on Primary Partition

I am not able to perform a GROUP BY on a partition key column. I am using Cassandra 3.10. When I GROUP BY, I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Group by currently only support groups of columns following their declared order in the Primary Key". My column is part of the primary key, yet I still face the problem.
My schema is
Table trends{
name text,
price int,
quantity int,
code text,
code_name text,
cluster_id text
uitime timeuuid,
primary key((name,price),code,uitime))
with clustering order by (code DESC, uitime DESC)
And the command that I run is: select sum(quantity) from trends group by code;
For starters your schema is invalid. You cannot set clustering order on code because it is the partition key. The order is going to be determined by the hash of it (unless using byte order partitioner - but don't do that).
The query and the kind of GROUP BY you're talking about do work, though. For example, you can run:
> SELECT keyspace_name, sum(partitions_count) AS approx_partitions FROM system.size_estimates GROUP BY keyspace_name;
keyspace_name | approx_partitions
--------------------+-------------------
system_auth | 128
basic | 4936508
keyspace1 | 870
system_distributed | 0
system_traces | 0
where the schema is:
CREATE TABLE system.size_estimates (
keyspace_name text,
table_name text,
range_start text,
range_end text,
mean_partition_size bigint,
partitions_count bigint,
PRIMARY KEY ((keyspace_name), table_name, range_start, range_end)
) WITH CLUSTERING ORDER BY (table_name ASC, range_start ASC, range_end ASC)
Perhaps the pseudo-schema you provided differs from the actual one. Can you provide the output of DESCRIBE TABLE xxxxx in your question?
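For reference, a GROUP BY that follows the declared key order is accepted. Assuming the intended primary key really is ((name, price), code, uitime), something like this sketch should work:
SELECT name, price, code, sum(quantity)
FROM trends
GROUP BY name, price, code;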

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((product_id ), updated_at , user_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out the way to only get the last 2 rows of each user in a given time window.
Any advice on query or changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table will serve one query. So to serve the query you are talking about (the last two updated rows per user), you should build a table specifically designed to serve it:
CREATE TABLE ratings_by_user_by_time (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((user_id ), updated_at, product_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC );
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 7 LIMIT 2;
Note that you'll need to keep the two ratings tables in sync yourself, and using a batch statement is a good way to accomplish that, as sketched below.
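A minimal sketch of that batched dual write, with placeholder values:
BEGIN BATCH
INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating)
VALUES (42, '2020-01-01 00:00:00+0000', 7, 5);
INSERT INTO ratings_by_user_by_time (product_id, updated_at, user_id, rating)
VALUES (42, '2020-01-01 00:00:00+0000', 7, 5);
APPLY BATCH;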

Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails all together.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also no option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach, storing only the latest value.
I used this schema:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
The problem is, I don't get only the latest results, but all the old values too.
Since you are storing with a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records; all you need to do is add LIMIT to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
Would return you at most 10 records with the highest changedate. Even though you are using limit, you are still guaranteed to get the latest records since your data is ordered that way.
If I remove the changedate from the primary key, the select fails all together.
This is because you cannot order on a column that is not the clustering key(s) (the secondary part of the primary key) except maybe with a secondary index, which I would not recommend.
Updating the sensor values is also no option
Your update query is failing because it is not legal to include part of the primary key in the SET clause. To make this work, all you need to do is update your query to include changedate in the WHERE clause and move the non-key columns into SET, i.e.:
update overview set value = '5', sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now());
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition, with the exception of the primary key. The primary key will now be 'customerid, sensorid', so you can only have 1 record per sensor. The process of creating separate tables is called denormalization and is a common pattern, particularly in Cassandra data modeling. When you insert sensor data, you would now insert into both 'sensors' and 'latest_sensor_data', as sketched after the table definition below.
CREATE TABLE latest_sensor_data (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid)
);
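A minimal sketch of that dual write, here wrapped in a logged batch so the two tables stay consistent (the values are placeholders):
BEGIN BATCH
INSERT INTO sensors (customerid, sensorid, changedate, value)
VALUES (0, 1, '2015-07-10 12:50:00+0000', '3');
INSERT INTO latest_sensor_data (customerid, sensorid, changedate, value)
VALUES (0, 1, '2015-07-10 12:50:00+0000', '3');
APPLY BATCH;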
In Cassandra 3.0, 'materialized views' will be introduced, which will make this unnecessary, as a materialized view can accomplish this for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it clearer what the data is. Additionally, you should change its primary key to 'customerid, changedate, sensorid', as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
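For example, a hypothetical insert that expires after 10 days, and the table-level alternative:
INSERT INTO sensors (customerid, sensorid, changedate, value)
VALUES (0, 1, '2015-07-20 09:00:00+0000', '7')
USING TTL 864000; -- this row disappears after 10 days
ALTER TABLE sensors WITH default_time_to_live = 864000; -- or apply a default TTL to every new write to the table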

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in a table. I want to retrieve the posts from all of the users that a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
) WITH CLUSTERING ORDER BY (time DESC);
I have the data about whom each user follows in another table:
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am making a query like:
select * from posts where userid in(1,2,3,4....n);
2 questions:
Why do I still get data in random order, even though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have any number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That returns your results ordered by time, as all of the rows within the userid=1 partition are clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
) WITH CLUSTERING ORDER BY (time DESC);
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
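For reference, the altered table definition being tested above would look something like this sketch:
CREATE TABLE posts (
id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows, created)
) WITH CLUSTERING ORDER BY (follows ASC, created DESC);
-- note: two posts with an identical created value for the same (userid, follows) would overwrite each other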
Although, if you query by just userid, you can see posts from all of your followers. But in that case, the posts will only be ordered within each follows value, like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema,
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it, and follows represents the userid of one of their followers. Say user x follows 10 other people; I am making 10+1 inserts. There is definitely too much data duplication. However, it is now easier to get the timeline for one of the users with the following query:
select * from posts where follows=?

IN operator in Cassandra doesn't work for a table having a column of collection type (map or list)

I'm working with Cassandra, trying to get to know how it works. I encountered something strange while using the IN operator. Example:
Table:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
I have inserted a few dummy rows and used the IN operator as follows:
SELECT * from test_time
where name="9" and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000, c7c88000-190e-11e4-7000-000000000000);
It worked properly.
Then I added a column of type map. The table now looks like:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
name_age map<text, int>,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
On executing the same query, I got the following error:
Bad Request: Cannot restrict PRIMARY KEY part time by IN relation as a collection is selected by the query
From the above examples, it appears that the IN operator doesn't work if there is any column of a collection type (map or list) in the table.
I don't understand why it behaves like this. Please let me know if I'm missing anything here. Thanks in advance.
Yup...that is a limitation. You can do the following:
select * from ...where name='9' and age=81 and time > x and time < y
select [columns except collection] from ...where name='9' and age=81 and time in (...)
You can then filter client side, or do another query.
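Concrete versions of those two workarounds, reusing the values from the question (a sketch; the bind markers are placeholder bounds):
-- 1) Replace IN with a range restriction on time, then filter client side:
SELECT * FROM test_time
WHERE name = '9' AND age = 81
AND time > ? AND time < ?;
-- 2) Keep IN, but leave the collection column out of the select list:
SELECT name, age, time, "timestamp" FROM test_time
WHERE name = '9' AND age = 81
AND time IN (c7c88000-190e-11e4-8000-000000000000,
c7c88000-190e-11e4-7000-000000000000);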
You can either include the column as part of the partition key expression in the primary key:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, time), age)
);
or create a separate Materialized View to satisfy your query requirements:
CREATE MATERIALIZED VIEW test_time_mv AS
SELECT * FROM test_time
WHERE name IS NOT NULL AND time IS NOT NULL AND age IS NOT NULL
PRIMARY KEY ((name, time), age);
Now use the Materialized View in your query instead of the base table:
SELECT * from test_time_mv
where name='9'
and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000,
c7c88000-190e-11e4-7000-000000000000);
