Using secondary indexes to update rows in Cassandra 2.1 - cassandra

I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works ok. events contains around 100 million rows, while only 5 million of these rows have around 50,000 distinct tickets. So a ticket - on average - has 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation. As the bucket column is sometimes hard to determine beforehand (i.e. you should know the date of the events, bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
-----------+--------+-----------+-----+-----------
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?

First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this, is to query the keys of each row in events (with a particular ticket_id) and UPSERT ticket_id by those keys. The nice thing, is that you don't have to first DELETE them, because ticket_id is not part of the PRIMARY KEY.
How do I solve the problem when all events of a ticket should be moved to another ticket?
I think your best plan here would be to forego a secondary index all together, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp. This would give you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra does allow a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id). So removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket='2015-04-22 14:53' AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket='2015-04-22 14:53' WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d
APPLY BATCH;
Which is actually the same as performing:
BEGIN BATCH
INSERT INTO events (client_id,bucket,timestamp,ticketid) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticketid) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
APPLY BATCH;
Side note: timestamp is actually a (reserved word) data type in Cassandra. This makes it a pretty lousy name for a timeuuid column.

You can use the secondary index to query the events for the old ticket, and then use the primary key from those retrieved events to update the events.
I'm not sure why you need to do this manually, seems like something Cassandra should be able to do under the hood.

Related

Cassandra CLUSTERING ORDER with updates [performance]

With Cassandra it is possible to specify the cluster ordering on a table with a particular column.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
Note: In this example, there is one message per user_id (intended)
Given this table my understanding is that the query's performance will be better in cases where recent data is queried.
However, if one where to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?
In other words given this table would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?
UPDATE: Updated the invalid CQL by adding modified to primary key, however, the original questions still stand.
In order to make modified a clustering column, it needs to be defined in the primary key.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
This way, your data will be sorted primarily by the hashed value of the user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)"
Moving the comment as an answer, as reply of the updated question:
if one where to make updates to the "modified" column does it add
extra overhead on the server to "re-order" and is that overhead vs
query performance significant?
If modified is defined as part of the clustering key, you won't be able to update that record, but you will be able to add as many records as needed, each time with a different modified date.
Cassandra is an append-only database engine: this means that any update to the records will add a new record with a different timestamp, a select will consider the records with the latest timestamp. This means that there is no "re-order" operation.
Dropping or creating the clustering order should be defined in base of the query of how the information will be retrieved, if you are going to use only the latest records of that user_id, it makes sense to have the clustering order as you defined it.
in your data model user_id is a rowkey/shardkey/partition key (userid) that is important for data locality and the clustering column (modified) specifies the order that the data is arranged inside the partition. combination of these two keys makes the primary key.
Even in RDBS world, updating PK is avoidble for sake of data integrity.
however in cassandra there is no constraints/relation between column families/tables.
Assigning exact same values to Pk fields(userid,modified) will result in update the existing record else it will add set of fields.
refence:
https://www.datastax.com/dev/blog/we-shall-have-order

Delete whole row based on one of clusturing column value in cassandra

Schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records by following command :
DELETE FROM mytable WHERE name = 'Jhon';
But gived error
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that only delete in not possible only with clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried with setting TTL to 0 for deleting row. But TTL can be set only for particular column, not the entire row.
What are feasible alternates to perform this operation?
In Cassandra, you need to design your data model to support your query. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind or request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
name varchar,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to leverage materialized view (instead of maintaining yourself id_by_name) but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model, in particular, how do you handle multiple user with the same name etc...
You cannot delete primary key if not complete. Primary key decisions are for sharding and load balancing. Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer, which though is good, complicates your solution. If you are thinking relational but getting lost in Cassandra I suggest using something that simplifies and maps your thinking to relational views.

If not MaterializedViews and not secondary indices then what else is the recommended way to query data in cassandra

I have some data in Cassandra. Say
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp
}
My application in addition to querying this data by primary key id, needs to query it by updated_on timestamp as well. To fulfil the query by time use case I have tried the following.
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
}
Secondary index on the updated_on_minute field. As I understand, secondary indexes are not recommended for high cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover I have data that gets frequently updated, which means the updated_on_minute will keep revving.
MaterializedView with updated_on_minute as the partition key and a id as the clustering key. I am on version 3.9 of cassandra and had just begun using these, but alas I find these release notes for 3.11x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As always have been the case, create additional table to query by a different partition key.
In your case the table would be
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
Primary key(updated_on, id)
}
Write to both tables mytable_by_timetamp and mytable_by_id. Use the corresponding table to READ from based on the partition key either updated_on or id.
It’s absolutely fine to duplicate data based on the use case (query) it’s trying solve.
Edited:
In case there is a fear about huge partition, you can always bucket into smaller partitions. For example the table above could be broken down into
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
updated_min timestamp,
Primary key(updated_min, id)
}
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my tab :
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example I can make without any problems :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last request I made :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error :
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new with Cassandra, but correct me if I'm wrong, in a composite primary key, the first part is the PARTITION KEY which is mandatory to allow Cassandra to know where to look for data.
Then the others parts are CLUSTERING KEY to sort data.
But I still don't get why my first request is working and not the second one ?
If anyone could help it will be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarilly with user_id), is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
You can see your primary key as a file path doodle_id#123/user_id#456/schedule_id#789 where all data is stored in the deepest folder (ie schedule_id#789). When you're querying you have to indicate the subfolder/subtree from where you start searching.
Your 2nd query doesn't work because of how columns are organized within partition. Cassandra can not get a continuous slice of columns in the partition because they are interleaved.
You should invert the primary key order (doodle_id, schedule_id, user_id) to be able to run your query.

Resources