Delete whole row based on a clustering column value in Cassandra

The schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'John';
But it gives the error:
[Invalid query] message="Some partition key parts are missing: name"
As I looked into the reason, I came to know that a delete is not possible using only clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='John') AND name = 'John';
But obviously it did not work, since CQL does not support subqueries.
I then tried setting a TTL of 0 to delete the row, but a TTL can be set only on particular columns, not on the entire row.
What are feasible alternatives for performing this operation?

In Cassandra, you need to design your data model to support your queries. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
Let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to leverage a materialized view (instead of maintaining id_by_name yourself), but materialized views are currently marked as unstable.
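For reference, a minimal sketch of such a view (Cassandra 3.0+ syntax, replacing the manually maintained id_by_name table; every primary key column of the view must be filtered with IS NOT NULL):
CREATE MATERIALIZED VIEW id_by_name AS
SELECT name, id FROM mytable
WHERE name IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (name, id)
WITH CLUSTERING ORDER BY (id desc);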
Now, there are still a few issues you need to address in your data model; in particular, how you handle multiple users with the same name, etc.

You cannot delete by a primary key that is not complete. Primary key decisions are made for sharding and load balancing. Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer, which, though good, complicates your solution. If you are thinking relationally but getting lost in Cassandra, I suggest using something that simplifies and maps your thinking onto relational views.

Related

Creating secondary index on table in Cassandra

I have just started working on Cassandra.
I am a bit confused by the concept of a secondary key.
From the definition, what I understood is that a secondary index is an index on a non-key attribute of a table, which is not sorted.
So I have this table:
CREATE TABLE IF NOT EXISTS userschema.user (id int,name text, address text, company text, PRIMARY KEY (id, name))
So if I create an index like this:
CREATE INDEX IF NOT EXISTS user_name_index ON userschema.user (name)
this should be a secondary index.
But my requirement is to create an index containing the columns name, id, and company.
How can I create a secondary index like this in Cassandra?
I found this link which describes something of this sort, but how are these secondary indexes? Aren't they just tables?
The user table above is just an example, not the actual one.
I am using Cassandra 3.0.9
id and name are already part of the primary key.
So the following queries will work:
SELECT * FROM table WHERE id=1
SELECT * FROM table WHERE id=1 and name='some value'
SELECT * FROM table WHERE name='some value' ALLOW FILTERING (this is inefficient)
You can create a secondary index on the company column:
CREATE INDEX IF NOT EXISTS company_index ON userschema.user (company)
Once the secondary index is defined, it can be used in the WHERE clause along with the primary key:
SELECT * FROM table WHERE id=1 and name='some value' and company='some value'
Though SELECT * FROM table WHERE company='some value' ALLOW FILTERING works, it will be highly inefficient.
Before creating a secondary index, have a look at When to use secondary index in cassandra.
The link you have referred to mainly focuses on materialized views, in which we create virtual tables to execute queries with non-primary keys.
Moreover, it seems you are creating a secondary index on a primary key column, which you have already defined in the creation of the table. Always remember that a secondary index should be on a non-primary-key column.
To get a clear idea about secondary indexes, refer to this: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
Now, the pros and cons of the alternative methods for a secondary index:
1. Materialized views:
It creates a new virtual table; you query the original table with its old primary key and the materialized view with its new primary key. Any data modification in the original table is reflected in the materialized view. If you drop the materialized view, its data is removed as tombstones, whose gc_grace_seconds defaults to 864000 (10 days). Dropping the materialized view has no effect on the original table.
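For reference, a sketch of such a view for the user table above, assuming you want to query by company (the view name is illustrative, and every primary key column of the view must be filtered with IS NOT NULL):
CREATE MATERIALIZED VIEW userschema.user_by_company AS
SELECT company, id, name FROM userschema.user
WHERE company IS NOT NULL AND id IS NOT NULL AND name IS NOT NULL
PRIMARY KEY (company, id, name);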
2. ALLOW FILTERING:
It is highly inefficient and not at all advised, as latencies will be high and performance will be degraded.
If you want much more information, refer to this link too: How do secondary indexes work in Cassandra?
Correct me if I am wrong

Filter on the partition and the clustering key with an additional criterion

I want to filter on a table that has a partition key and a clustering key, with another criterion on a regular column. I got the following warning:
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean,
PRIMARY KEY((name),id)
)
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra, you can't filter data on a non-primary-key column unless you create an index on it.
In Cassandra 3.0 and up, you are allowed to filter on non-primary-key columns, but with unpredictable performance.
In Cassandra 3.0 and up, if you provide the full primary key (as in your query), you can run the query with ALLOW FILTERING and ignore the warning.
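For example, a sketch of the query from the question with ALLOW FILTERING appended (the uuid value is a made-up placeholder):
SELECT id FROM mytable WHERE name = 'myname' AND id = 123e4567-e89b-12d3-a456-426655440000 AND deleted = false ALLOW FILTERING;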
Otherwise, filter on the client side, or remove the field deleted and create another table:
Instead of updating the field deleted to true, move your data to another table, let's say mytable_deleted:
CREATE TABLE mytable_deleted (
name text,
id uuid,
PRIMARY KEY (name, id)
);
Now you only have non-deleted data in the mytable table and deleted data in the mytable_deleted table.
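A minimal sketch of that move, using a logged batch so the insert and the delete go together (the values are made-up placeholders):
BEGIN BATCH
INSERT INTO mytable_deleted (name, id) VALUES ('myname', 123e4567-e89b-12d3-a456-426655440000);
DELETE FROM mytable WHERE name = 'myname' AND id = 123e4567-e89b-12d3-a456-426655440000;
APPLY BATCH;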
or
Create an index on it:
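For example (the index name is illustrative):
CREATE INDEX IF NOT EXISTS deleted_index ON mytable (deleted);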
But the column deleted is a low-cardinality column, so remember:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read more: When not to use an index

How to choose a proper table structure in Cassandra?

Suppose I have a table with the following structure:
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for a user, sorted by name ascending. I also added task_id to the primary key to avoid upserts. The following query works:
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get a task with a specific task_id. For example, the following query fails:
select * from tasks where user_id = ? and task_id = ?
with this error:
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires the name column to be specified, but at the moment I have only task_id (from the URL, for example) and user_id (from the session).
How should I create this table to perform both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
You could simply add one more redundant column, taskId, with the same value as task_id, and create a secondary index on taskId.
Then you can query with user_id = ? and taskId = ? (see the sketch below).
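A sketch of that approach (the duplicate column and index names are illustrative; note that unquoted CQL identifiers are folded to lowercase):
ALTER TABLE tasks ADD taskid uuid;
CREATE INDEX IF NOT EXISTS tasks_taskid_index ON tasks (taskid);
SELECT * FROM tasks WHERE user_id = ? AND taskid = ?;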
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform OK. The reason is that you would still be restricting your query by partition key, which eliminates the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary-key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables (a batch sketch follows the table definitions below), but Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
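A sketch of that dual write, using a logged batch so the two tables stay in step:
BEGIN BATCH
INSERT INTO tasks_by_name (user_id, name, task_id, description) VALUES (?, ?, ?, ?);
INSERT INTO tasks_by_id (user_id, name, task_id, description) VALUES (?, ?, ?, ?);
APPLY BATCH;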

Using secondary indexes to update rows in Cassandra 2.1

I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works OK. events contains around 100 million rows; only 5 million of these rows have a ticket, spread across around 50,000 distinct tickets. So a ticket, on average, has 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation, since the bucket column is sometimes hard to determine beforehand (i.e. you would need to know the date of the events; bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
-----------+--------+-----------+-----+-----------
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?
First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this is to query the keys of each row in events (with a particular ticket_id) and UPSERT ticket_id by those keys. The nice thing is that you don't have to DELETE them first, because ticket_id is not part of the PRIMARY KEY.
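A sketch of that two-step process (the placeholders stand for the key values read back in the first step):
SELECT client_id, bucket, timestamp FROM events WHERE ticket_id = 123;
-- then, for each row returned:
UPDATE events SET ticket_id = 321 WHERE client_id = ? AND bucket = ? AND timestamp = ?;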
How do I solve the problem when all events of a ticket should be moved to another ticket?
I think your best plan here would be to forego a secondary index altogether, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp). That would give you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra allows a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id), so removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket=20150422 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket=20150422 WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
APPLY BATCH;
Which is actually the same as performing:
BEGIN BATCH
INSERT INTO events (client_id,bucket,timestamp,ticket_id) VALUES (2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticket_id) VALUES (2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
APPLY BATCH;
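For the cleanup mentioned above, removing the old ticket_id from the query table needs only its partition key:
DELETE FROM eventsbyticketid WHERE ticket_id = 123;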
Side note: timestamp is actually a CQL data type name in Cassandra. This makes it a pretty lousy name for a timeuuid column.
You can use the secondary index to query the events for the old ticket, and then use the primary keys from those retrieved events to update them.
I'm not sure why you need to do this manually; it seems like something Cassandra should be able to do under the hood.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app, and I ran into a query problem.
Here is my table:
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example, I can run this without any problems:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last query I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error :
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so that Cassandra knows where to look for the data.
Then the other parts are the CLUSTERING KEY, used to sort data.
But I still don't get why my first query works and the second one doesn't.
If anyone could help, it would be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore, the proper way to support your second query (querying by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except that the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and it is certainly not something that you should do in a production cluster.
The clustering key is also used to find columns within a given partition. With your model, you'll be able to query by the following (examples after the list):
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
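For instance, the first and the last patterns look like this (uuid values taken from the question):
SELECT * FROM vote WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7;
SELECT * FROM vote WHERE user_id = 97a7378a-e1bb-4586-ada1-177016405142 AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633 ALLOW FILTERING;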
You can see your primary key as a file path, doodle_id#123/user_id#456/schedule_id#789, where all data is stored in the deepest folder (i.e. schedule_id#789). When querying, you have to indicate the subfolder/subtree from which you start searching.
Your second query doesn't work because of how columns are organized within the partition. Cassandra cannot get a continuous slice of the columns in the partition because they are interleaved.
You should invert the primary key order to (doodle_id, schedule_id, user_id) to be able to run this query.
