SELECT with yb_hash_code() and DELETE in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
We have the below schema in YugabyteDB 2.8.3 using YSQL:
CREATE TABLE IF NOT EXISTS public.table1
(
customer_id uuid NOT NULL ,
item_id uuid NOT NULL ,
kind character varying(100) NOT NULL ,
details character varying(100) NOT NULL ,
created_date timestamp without time zone NOT NULL,
modified_date timestamp without time zone NOT NULL,
CONSTRAINT table1_pkey PRIMARY KEY (customer_id, kind, item_id)
);
CREATE UNIQUE INDEX IF NOT EXISTS unique_item_id ON table1(item_id);
CREATE UNIQUE INDEX IF NOT EXISTS unique_item ON table1(customer_id, kind) WHERE kind='NEW' OR kind='BACKUP';
CREATE TABLE IF NOT EXISTS public.item_data
(
item_id uuid NOT NULL,
id2 integer NOT NULL,
create_date timestamp without time zone NOT NULL,
modified_date timestamp without time zone NOT NULL,
CONSTRAINT item_data_pkey PRIMARY KEY (item_id, id2)
);
Goal:
Step 1) Select item_ids from table1 WHERE modified_date < someDate
Step 2) DELETE FROM item_data WHERE item_id is any of the item_ids from Step 1
Currently we use query
SELECT item_id FROM table1 WHERE modified_date < $1
Can we apply yb_hash_code(item_id) in the SELECT query to enhance its performance, given that table1 is indexed on item_id?
Currently we perform:
DELETE FROM item_data x WHERE x.item_id IN (listOfItemIds provided in Step 1 above).
With the given listOfItemIds, can we use yb_hash_code(item_id) to enhance the performance of the DELETE operation?

Yes, it should work out. Something like:
SELECT item_id FROM table1 WHERE yb_hash_code(customer_id, kind, item_id) >= 0 AND yb_hash_code(customer_id, kind, item_id) <= 128 AND modified_date < x;
While you can combine the SELECT + DELETE into one query (like a subselect), keeping them separate is probably better because it results in smaller transactions.
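For reference, a minimal sketch of the single-statement variant (one larger transaction):
-- Hedged sketch: subselect form of the same cleanup.
DELETE FROM item_data WHERE item_id IN (SELECT item_id FROM table1 WHERE modified_date < $1);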
Also, there is no need to use yb_hash_code for the DELETE itself. The database should be able to find the correct rows, since you're sending the columns that are used for partitioning.
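If you do want to split the SELECT into parallel slices, a minimal sketch of bucketing by yb_hash_code follows, assuming the default hash space of 0..65535 and that the yb_hash_code arguments match the hash columns of the index being scanned:
-- Hedged sketch: each worker scans a disjoint slice of the hash space,
-- e.g. 8 workers covering 8192 hash values each (bucket bounds are illustrative).
SELECT item_id FROM table1
WHERE yb_hash_code(customer_id, kind, item_id) >= 0
AND yb_hash_code(customer_id, kind, item_id) < 8192
AND modified_date < $1;
-- next worker: >= 8192 AND < 16384, and so on up to 65535.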

Related

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((product_id ), updated_at , user_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out the way to only get the last 2 rows of each user in a given time window.
Any advice on query or changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table serves one query. So to serve the query you are talking about (the last two updated rows per user), you should build a table specifically designed to serve it:
CREATE TABLE ratings_by_user_by_time (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((user_id ), updated_at, product_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC );
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 101 LIMIT 2;
Note that you'll need to keep the two ratings tables in-sync yourself, and using a batch statement is a good way to accomplish that.
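For example, a logged batch that applies the same rating write to both tables (the values are made-up placeholders):
BEGIN BATCH
INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating) VALUES (1, '2024-01-15 10:00:00', 101, 4);
INSERT INTO ratings_by_user_by_time (product_id, updated_at, user_id, rating) VALUES (1, '2024-01-15 10:00:00', 101, 4);
APPLY BATCH;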

Cassandra does not support DELETE on indexed columns

Say I have a cassandra table xyz with the following schema :
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid));
I create indexes on columns fileid and sid:
CREATE INDEX file_index ON xyz (fileid);
CREATE INDEX sid_index ON xyz (sid);
I insert data :
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p120', 1, 100);
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p120', 1, 101);
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p122', 2, 101);
I want to delete data using the indexed columns :
DELETE from xyz WHERE fileid=1 and sid=101;
Why do I get this error ?
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY fileid found in where clause"
Is it mandatory to specify the primary key in the WHERE clause for delete queries?
Does Cassandra support deletes using secondary indexes?
What has to be done to delete data using secondary indexes?
Any suggestions would help.
I am using DataStax Community Cassandra 2.1.8, but I also want to know whether delete using indexed columns is supported in DataStax Community Cassandra 3.2.1.
Thanks
Let me try to answer your questions in order:
1) Yes, if you are going to use a WHERE clause in a CQL statement, then the partition key must be restricted with an equality operator. Beyond that, you are only allowed to filter on the clustering columns specified in your primary key (unless you have a secondary index).
2) No, it does not. See this post for more information, as it is essentially the same problem:
Why can cassandra "select" on secondary key, but not update using secondary key? (1.2.8+)
3) Why not add sid as a clustering column in your primary key? This would allow you to do the delete or the query using both, as you have shown:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid, sid));
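With that schema, a delete scoped by both columns would look something like this (the uuid is a made-up placeholder for a real xyzid):
DELETE FROM xyz WHERE xyzid = 5f4d9a1e-0000-4000-8000-000000000000 AND sid = 101;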
4) In general, using secondary indexes is considered an anti-pattern (a bit less so with SASI indexes in C* 3.4), so my question is: can you add these fields as clustering columns to your primary key? How are you querying these secondary indexes?
I suppose you can perform the delete in two steps:
1. Select data by the secondary index and get the primary key column values (xyzid) from the query result.
2. Perform the delete by those primary key values.
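A sketch of those two steps against the original schema (the uuid in step 2 stands in for whatever step 1 returns):
-- Step 1: look up the primary key via the indexed columns.
-- Restricting on two indexed columns at once requires ALLOW FILTERING.
SELECT xyzid FROM xyz WHERE fileid = 1 AND sid = 101 ALLOW FILTERING;
-- Step 2: delete by the primary key values returned above.
DELETE FROM xyz WHERE xyzid = 5f4d9a1e-0000-4000-8000-000000000000;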

CQLSH - Check for null in where clause for MAP Data type

Cassandra version: 2.1.10
CREATE TABLE customer_raw_data (
id uuid,
hash_prefix bigint,
profile_data map<varchar,varchar>,
PRIMARY KEY (hash_prefix,id));
I have an index on profile_data, and I have rows where profile_data is null.
How do I write a SELECT query to retrieve the rows where profile_data is null?
I tried the following
select count(*) from customer_raw_data where profile_data=null;
select count(*) from customer_raw_data where profile_data CONTAINS KEY null;
With reference to https://issues.apache.org/jira/browse/CASSANDRA-3783:
There is currently no SELECT support for indexed nulls and, given the design of Cassandra, this is considered a difficult/prohibitive problem.
The basic problem: a column in the WHERE clause has to be either a primary key or a secondary index, so make your column whichever is suitable and then try the query below:
select count(*) from customer_raw_data where profile_data='';
SELECT * FROM TableName WHERE colName > 5000 ALLOW FILTERING; // works fine
SELECT * FROM TableName WHERE colName > 5000 LIMIT 10 ALLOW FILTERING;
https://cassandra.apache.org/doc/old/CQL-3.0.html
Check the "ALLOW FILTERING" part.

How to retrieve a date range from cassandra

I have a very simple table to store a collection of IDs by a date range:
CREATE TABLE schedule_range (
start_date timestamp,
end_date timestamp,
schedules set<text>,
PRIMARY KEY ((start_date, end_date)));
I was hoping to be able to query it by a date range:
SELECT *
FROM schedule_range
WHERE start_date >= 'xxx'
AND end_date < 'yyy'
Unfortunately, it doesn't work this way. I've tried a few different approaches and it always fails for a different reason.
How should I store IDs to be able to get them all by a date range?
In Cassandra you can only use the >, < operators on the last field of the primary key involved in the query, in your case end_date. For the preceding fields you must use the equality operator. Given that constraint on your schema, you may want to consider other options.
One approach is to use Apache Spark. There are projects that build an abstraction layer over Cassandra in Spark and let you perform operations on Cassandra such as joins, arbitrary filters, and group-bys.
Check this projects:
Stratio Deep
Datastax Connector
Using the table below with a query that somewhat resembles yours works because: 1) it doesn't use a range condition on the partition key start_date (only the EQ and IN relations are supported on the partition key), and 2) greater-than and less-than comparisons on a clustering column are restricted to filters that select a contiguous ordering of rows, which is satisfied here by fixing the preceding clustering column, id (the 2nd component of the compound key), with an equality.
create table schedule_range2 (
start_date timestamp,
end_date timestamp,
id int,
schedules set<text>,
primary key (start_date, id, end_date));
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-03 04:05', 1, '2014-02-04 04:00', {'event1', 'event2'});
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-05 04:05', 1, '2014-02-06 04:00', {'event3', 'event4'});
select * from schedule_range2 where id=1 and end_date >='2014-02-04 04:00' and end_date < '2014-02-06 04:00' ALLOW FILTERING;

IN operator in Cassandra doesn't work for a table having a column of collection type (Map or List)

I'm working with Cassandra, trying to get to know how it works, and I encountered something strange while using the IN operator. Example:
Table:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
I inserted a few dummy rows and used the IN operator as follows:
SELECT * from test_time
where name='9' and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000, c7c88000-190e-11e4-7000-000000000000);
It worked properly.
Then I added a column of type map. The table now looks like:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
name_age map<text, int>,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
On executing the same query, I got the following error:
Bad Request: Cannot restrict PRIMARY KEY part time by IN relation as a collection is selected by the query
From the above examples, we can say the IN operator doesn't work if the table has a column of a collection type (Map or List).
I don't understand why it behaves like this. Please let me know if I'm missing anything here. Thanks in advance.
Yup...that is a limitation. You can do the following:
select * from ...where name='9' and age=81 and time > x and time < y
select [columns except collection] from ...where name='9' and age=81 and time in (...)
You can then filter client side, or do another query.
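A hypothetical concrete form of that second option, listing every column except the collection so the IN restriction is accepted again:
SELECT name, age, time, "timestamp" FROM test_time
WHERE name = '9' AND age = 81
AND time IN (c7c88000-190e-11e4-8000-000000000000, c7c88000-190e-11e4-7000-000000000000);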
You can either include your column as a part of the partitioning expression in the primary key:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, time), age)
);
or create a separate Materialized View to satisfy your query requirements:
CREATE MATERIALIZED VIEW test_time_mv AS
SELECT * FROM test_time
WHERE name IS NOT NULL AND time IS NOT NULL AND age IS NOT NULL
PRIMARY KEY ((name, time), age);
Now use the Materialized View in your query instead of the base table:
SELECT * from test_time_mv
where name='9'
and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000,
c7c88000-190e-11e4-7000-000000000000);
