nested map in cassandra data modelling - cassandra

I have the following requirement for my dataset and need to understand which data type to use and how to save my data accordingly:
CREATE TABLE events (
id text,
evntoverlap map<text, map<timestamp,int>>,
PRIMARY KEY (id)
)
evntoverlap = {
'Dig1': {{'2017-10-09 04:10:05', 0}},
'Dig2': {{'2017-10-09 04:11:05', 0},{'2017-10-09 04:15:05', 0}},
'Dig3': {{'2017-10-09 04:11:05', 0},{'2017-10-09 04:15:05', 0},{'2017-10-09 04:11:05', 0}}
}
This gives an error :-
Error from server: code=2200 [Invalid query] message="Non-frozen collections are not allowed inside collections: map<text, map<timestamp, int>>"
How should I store this type of data in a single column? Please suggest a data type and the matching insert command.
Thanks,

This is a limitation of Cassandra: you can't nest a collection (or UDT) inside a collection without making it frozen. So you need to freeze one of the collections, either the nested one:
CREATE TABLE events (
id text,
evntoverlap map<text, frozen<map<timestamp,int>>>,
PRIMARY KEY (id)
);
or top-level:
CREATE TABLE events (
id text,
evntoverlap frozen<map<text, map<timestamp,int>>>,
PRIMARY KEY (id)
);
See documentation for more details.
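Since the question also asks for an insert command, here is a minimal sketch against the first schema above (outer map with a frozen inner map). The id value 'evt1' is made up for illustration. Note that map literals use key: value pairs rather than the {{key, value}} notation from the question, and that a map cannot hold the same timestamp key twice.

```cql
-- Assumes: evntoverlap map<text, frozen<map<timestamp,int>>>
-- 'evt1' is a hypothetical id used only for this example.
INSERT INTO events (id, evntoverlap)
VALUES ('evt1', {
  'Dig1': {'2017-10-09 04:10:05': 0},
  'Dig2': {'2017-10-09 04:11:05': 0, '2017-10-09 04:15:05': 0}
});

-- Because the inner map is frozen, it can only be replaced as a whole;
-- individual timestamps inside it cannot be updated in place.
UPDATE events
SET evntoverlap['Dig1'] = {'2017-10-09 04:10:05': 1}
WHERE id = 'evt1';
```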

CQL collections are limited to 64 KB, so nesting maps inside maps might push that limit. With frozen maps in particular, you deserialize the entire map, modify it, and re-insert it. You might be better off with a
CREATE TABLE events (
id text,
evnt_key text,
value map<timestamp, int>,
PRIMARY KEY ((id), evnt_key)
);
Or even a
CREATE TABLE events (
id text,
evnt_key text,
evnt_time timestamp,
value int,
PRIMARY KEY ((id), evnt_key, evnt_time)
);
It would be more efficient and safer, while giving additional benefits such as being able to order the evnt_time values in ascending or descending order.
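To make the ordering benefit concrete, here is a rough sketch of the fully denormalized layout with an explicit clustering order. The table name events_by_time and the id 'evt1' are placeholders chosen for this example, not names from the question.

```cql
-- Hypothetical table name; same columns as the last design above.
CREATE TABLE events_by_time (
  id text,
  evnt_key text,
  evnt_time timestamp,
  value int,
  PRIMARY KEY ((id), evnt_key, evnt_time)
) WITH CLUSTERING ORDER BY (evnt_key ASC, evnt_time DESC);

-- One row per (evnt_key, evnt_time) pair instead of one nested map.
INSERT INTO events_by_time (id, evnt_key, evnt_time, value)
VALUES ('evt1', 'Dig2', '2017-10-09 04:11:05', 0);
INSERT INTO events_by_time (id, evnt_key, evnt_time, value)
VALUES ('evt1', 'Dig2', '2017-10-09 04:15:05', 0);

-- Latest entry for one key: rows come back newest first.
SELECT evnt_time, value
FROM events_by_time
WHERE id = 'evt1' AND evnt_key = 'Dig2'
LIMIT 1;
```

Each insert touches a single small row, so nothing has to be read back and rewritten as with a frozen map.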

Related

SyntaxException: line 2:10 no viable alternative at input 'UNIQUE' > (...NOT EXISTS books ( id [UUID] UNIQUE...)

I am trying the following code to create a keyspace and a table inside it:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy',
'replication_factor': 3 };
CREATE TABLE IF NOT EXISTS books (
id UUID PRIMARY KEY,
user_id TEXT UNIQUE NOT NULL,
scale TEXT NOT NULL,
title TEXT NOT NULL,
description TEXT NOT NULL,
reward map<INT,TEXT> NOT NULL,
image_url TEXT NOT NULL,
video_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
But I do get:
SyntaxException: line 2:10 no viable alternative at input 'UNIQUE'
(...NOT EXISTS books ( id [UUID] UNIQUE...)
What is the problem and how can I fix it?
I see three syntax issues. They mainly come down to CQL != SQL.
The first is that NOT NULL is not valid at column definition time. Cassandra doesn't enforce constraints like that at all, so in this case just get rid of all of them.
Next, Cassandra CQL does not allow default values, so this won't work:
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
Providing the current timestamp for created_at is something that will need to be done at write time. Fortunately, CQL has a few built-in functions to make this easier:
INSERT INTO books (id, user_id, created_at)
VALUES (uuid(), 'userOne', toTimestamp(now()));
In this case, I've invoked the uuid() function to generate a Type-4 UUID. I've also invoked now() for the current time. However now() returns a TimeUUID (Type-1 UUID) so I've nested it inside of the toTimestamp function to convert it to a TIMESTAMP.
Finally, UNIQUE is not valid.
user_id TEXT UNIQUE NOT NULL,
It looks like you're trying to make sure that duplicate user_ids are not stored with each id. You can help to ensure uniqueness of the data in each partition by adding user_id to the end of the primary key definition as a clustering key:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
...
PRIMARY KEY (id, user_id));
This PK definition will ensure that data for books will be partitioned by id, containing multiple user_id rows.
I'm not sure what the relationship between books and users is, though. If one book can have many users, then this will work. If one user can have many books, then you'll want to switch the order of the keys to this:
PRIMARY KEY (user_id, id));
In summary, a working table definition for this problem looks like this:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
scale TEXT,
title TEXT,
description TEXT,
reward map<INT,TEXT>,
image_url TEXT,
video_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY (id, user_id));
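One caveat worth sketching: a primary key does not reject duplicates the way a SQL UNIQUE constraint does. A second INSERT with the same (id, user_id) silently overwrites the first. If the write should fail instead, a lightweight transaction with IF NOT EXISTS is one option, at some extra latency cost. The UUID and values below are made up for illustration.

```cql
-- Hypothetical values; the UUID literal is fixed so the same key is reused.
INSERT INTO books (id, user_id, title, created_at)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'userOne', 'First title', toTimestamp(now()))
IF NOT EXISTS;

-- Same primary key again: with IF NOT EXISTS this write is not applied,
-- and the result contains an [applied] column set to False.
-- Without IF NOT EXISTS it would overwrite the title instead.
INSERT INTO books (id, user_id, title, created_at)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'userOne', 'Second title', toTimestamp(now()))
IF NOT EXISTS;
```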

Yugabyte YCQL check if a set contain a value?

Is there any way to query a SET type (or MAP/LIST) to find out whether it contains a value?
Something like this:
CREATE TABLE test.table_name(
id text,
ckk SET<INT>,
PRIMARY KEY((id))
);
Select * FROM table_name WHERE id = 1 AND ckk CONTAINS 4;
Is there any way to run this query with the YCQL API?
And can we use a SET type in a SECONDARY INDEX?
Is there any way to run this query with the YCQL API?
YCQL does not support the CONTAINS keyword yet (feel free to open an issue for this on the YugabyteDB GitHub).
One workaround can be to use MAP<INT, BOOLEAN> instead of SET<INT> and the [] operator.
For instance:
CREATE TABLE test.table_name(
id text,
ckk MAP<int, boolean>,
PRIMARY KEY((id))
);
SELECT * FROM table_name WHERE id = 'foo' AND ckk[4] = true;
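As a small sketch of how the map-based workaround might be populated (the row 'foo' and the member values are invented for this example), each former set member becomes a map key with the value true:

```cql
-- Assumes the MAP<int, boolean> version of test.table_name shown above.
-- Hypothetical data: the "set" {4, 9} is stored as keys mapped to true.
INSERT INTO test.table_name (id, ckk) VALUES ('foo', {4: true, 9: true});

-- Adding another member is a single-key update.
UPDATE test.table_name SET ckk[12] = true WHERE id = 'foo';

-- Membership test, equivalent in spirit to "ckk CONTAINS 12".
SELECT * FROM test.table_name WHERE id = 'foo' AND ckk[12] = true;
```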
And can we use a SET type in a SECONDARY INDEX?
Generally, collection types cannot be part of the primary key, or an index key.
However, "frozen" collections (i.e. collections serialized into a single value internally) can actually be part of either primary key or index key.
For instance:
CREATE TABLE table2(
id TEXT,
ckk FROZEN<SET<INT>>,
PRIMARY KEY((id))
) WITH transactions = {'enabled' : true};
CREATE INDEX table2_idx on table2(ckk);
Another option is to use a compound primary key and define ckk as a clustering key:
cqlsh> CREATE TABLE ybdemo.tt(id TEXT, ckk INT, PRIMARY KEY ((id), ckk)) WITH CLUSTERING ORDER BY (ckk DESC);
cqlsh> SELECT * FROM ybdemo.tt WHERE id='foo' AND ckk=4;
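With the clustering-key layout, each set member is stored as its own row. A minimal sketch with invented values:

```cql
-- Assumes the ybdemo.tt table defined above; values are hypothetical.
-- Each INSERT adds one member to the "set" belonging to id 'foo'.
INSERT INTO ybdemo.tt (id, ckk) VALUES ('foo', 4);
INSERT INTO ybdemo.tt (id, ckk) VALUES ('foo', 9);

-- Re-inserting an existing (id, ckk) pair is an upsert, so members stay unique.
INSERT INTO ybdemo.tt (id, ckk) VALUES ('foo', 4);

-- List all members for 'foo' (returned in descending ckk order).
SELECT ckk FROM ybdemo.tt WHERE id = 'foo';
```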

Cassandra - how to update a record with a compound key

In the process of learning Cassandra and using it on a small pilot project at work. I've got one table that is filtered by 3 fields:
CREATE TABLE webhook (
event_id text,
entity_type text,
entity_operation text,
callback_url text,
create_timestamp timestamp,
webhook_id text,
last_mod_timestamp timestamp,
app_key text,
status_flag int,
PRIMARY KEY ((event_id, entity_type, entity_operation))
);
Then I can pull records like so, which is exactly the query I need for this:
select * from webhook
where event_id = '11E7DEB1B162E780AD3894B2C0AB197A'
and entity_type = 'user'
and entity_operation = 'insert';
However, I have an update query to set the record inactive (soft delete), which would be most convenient by partition key in the same table. Of course, this isn't possible:
update webhook
set status_flag = 0
where webhook_id = '11e8765068f50730ac964b31be21d64e'
An example of why I'd want to do this, is a simple DELETE from an API endpoint:
http://myapi.com/webhooks/11e8765068f50730ac964b31be21d64e
Naturally, if I update based on the composite key, I'd potentially inactivate more records than I intend to.
Seems like my only choice, doing it the "Cassandra Way", is to use two tables; the one I already have and one to track status_flag by webhook_id, so I can update based on that id. I'd then have to select by webhook_id in the first table and disable it there as well? Otherwise, I'd have to force users to pass all the compound key values in the URL of the API's DELETE request.
Simple things you take for granted in relational data, seem to get complex very quickly in Cassandraland. Is this the case or am I making it more complicated than it really is?
You can add webhook_id to your primary key as a clustering column.
Your table definition then becomes something like this:
CREATE TABLE webhook (
event_id text,
entity_type text,
entity_operation text,
callback_url text,
create_timestamp timestamp,
webhook_id text,
last_mod_timestamp timestamp,
app_key text,
status_flag int,
PRIMARY KEY ((event_id, entity_type, entity_operation), webhook_id)
);
Now let's say you insert 2 records:
INSERT INTO dev_cybs_rtd_search.webhook(event_id,entity_type,entity_operation,status_flag,webhook_id) VALUES('11E7DEB1B162E780AD3894B2C0AB197A','user','insert',1,'web_id');
INSERT INTO dev_cybs_rtd_search.webhook(event_id,entity_type,entity_operation,status_flag,webhook_id) VALUES('12313131312313','user','insert',1,'web_id_1');
You can then update like this:
update webhook
set status_flag = 0
where webhook_id = 'web_id' AND event_id = '11E7DEB1B162E780AD3894B2C0AB197A' AND entity_type = 'user'
AND entity_operation = 'insert';
This updates only 1 record.
However, you have to supply every column defined in your primary key.
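If the API should accept only the webhook_id in the URL, the two-table approach mentioned in the question can be sketched as follows. The lookup table name webhook_by_id is hypothetical, and the sketch assumes the main table has webhook_id as a clustering column as described above.

```cql
-- Hypothetical lookup table: maps a webhook_id back to the full partition key.
CREATE TABLE webhook_by_id (
  webhook_id text PRIMARY KEY,
  event_id text,
  entity_type text,
  entity_operation text
);

-- Step 1: resolve the key columns from the id in the DELETE URL.
SELECT event_id, entity_type, entity_operation
FROM webhook_by_id
WHERE webhook_id = 'web_id';

-- Step 2: soft-delete exactly one row in the main table
-- using the values returned by step 1.
UPDATE webhook
SET status_flag = 0
WHERE event_id = '11E7DEB1B162E780AD3894B2C0AB197A'
  AND entity_type = 'user'
  AND entity_operation = 'insert'
  AND webhook_id = 'web_id';
```

The application has to write to both tables whenever a webhook is created, which is the usual trade-off of query-driven modelling in Cassandra.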

Is using all fields as partition keys in a table a drawback in Cassandra?

My aim is to get the msgAddDate based on the query below:
select max(msgAddDate)
from sampletable
where reportid = 1 and objectType = 'loan' and msgProcessed = 1;
Design 1 :
Here reportid, objectType and msgProcessed may not be unique. To make rows unique I have added msgAddDate and msgProcessedDate (an additional unique value).
I use this design because I don't perform range queries.
Create table sampletable ( reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate)));
Design 2 :
create table sampletable (
reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
);
Please advise which one to use and what the performance pros and cons of each are.
Design 2 is the one you want.
In Design 1, the whole primary key is the partition key. That means you need to provide all the attributes (reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate) to be able to query your data with a SELECT statement, which wouldn't be useful, as you would not retrieve any attributes beyond the ones you already provided in the WHERE clause.
In Design 2, your partition key is reportid, msgProcessed, objectType, which are the three attributes you want to query by. Great. msgAddDate is the first clustering column, so it is automatically sorted for you. You don't even need to run a max since the rows are already sorted. All you need to do is use LIMIT 1:
SELECT msgAddDate FROM sampletable WHERE reportid = 1 and objectType = 'loan' and msgProcessed = 1 LIMIT 1;
Of course, make sure to define a DESC sort order on msgAddDate (the default clustering order is ascending).
Hope it helps!
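Putting the advice above together, Design 2 with a descending clustering order might look like the following sketch. Only the WITH CLUSTERING ORDER BY clause is new relative to the question's definition.

```cql
CREATE TABLE sampletable (
  reportid INT,
  objectType TEXT,
  msgAddDate TIMESTAMP,
  msgProcessed INT,
  msgProcessedDate TIMESTAMP,
  PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
) WITH CLUSTERING ORDER BY (msgAddDate DESC, msgProcessedDate DESC);

-- Rows in each partition are stored newest first,
-- so the first row holds the maximum msgAddDate.
SELECT msgAddDate
FROM sampletable
WHERE reportid = 1 AND objectType = 'loan' AND msgProcessed = 1
LIMIT 1;
```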

Using MATERIALIZED VIEW in Cassandra gives error

I have the following table.
CREATE TABLE test_x (id text PRIMARY KEY, type frozen<mycustomtype>);
mycustomtype is defined as follows:
CREATE TYPE mycustomtype (
id uuid,
name text
);
I have created the following materialized view for queries based on the mycustomtype field.
CREATE MATERIALIZED VIEW test_x_by_mycustomtype_name AS
SELECT id, type
FROM test_x
WHERE type IS NOT NULL
PRIMARY KEY (id, type)
WITH CLUSTERING ORDER BY (type ASC)
With the above view I hope to execute the following query.
select id from test_x_by_mycustomtype_name where type =
{id: a3e64f8f-bd44-4f28-b8d9-6938726e34d4, name: 'Sample'};
But the query fails, saying I need to use 'ALLOW FILTERING'. I created the view precisely to avoid ALLOW FILTERING. Why does this error occur when I am filtering on a column that is part of the view's primary key?
In your view, the type column is still only a clustering key, and id remains the partition key. A query that restricts type without the partition key therefore requires ALLOW FILTERING. You can change the view as shown below and retry:
CREATE MATERIALIZED VIEW test_x_by_mycustomtype_name_2 AS
SELECT id, type
FROM test_x
WHERE type IS NOT NULL
PRIMARY KEY (type, id)
WITH CLUSTERING ORDER BY (id ASC);
cqlsh:test> select id from test_x_by_mycustomtype_name_2 where type = {id: a3e64f8f-bd44-4f28-b8d9-6938726e34d4, name: 'Sample'};
id
----
Change the order of the primary key columns in the materialized view so that type becomes the partition key:
CREATE MATERIALIZED VIEW test_x_by_mycustomtype_name AS
SELECT id, type
FROM test_x
WHERE type IS NOT NULL
PRIMARY KEY (type, id)
WITH CLUSTERING ORDER BY (id ASC);
