Search For Multiple Properties by Value in Cassandra

How can we design a Cassandra data model for storing a group, say 'Item', that has n properties P1, P2, ..., PN, and retrieve items by searching on property values?
For example:
Item   Item_Type  State   Country
Item1  Solid      State1  Country1
In a traditional RDBMS we can issue a select query:
select Item from table where Item_Type='Solid' and Country='Country1'
How can we achieve such a model in Cassandra? We have tried Cassandra secondary indexes, but they do not seem to be applicable.

For properties P1..PN you will have to ALTER the table, as with an RDBMS, or use an outdated Thrift-based API (I'd suggest Astyanax for this), which can add columns on the fly (but this is considered bad practice). Another possibility is to use a collection of properties, where one of your columns is a collection of values:
CREATE TABLE item (
item_id text PRIMARY KEY,
property set<text>
);
To SELECT with multiple conditions in the WHERE clause you can use secondary indexes or, if you know which columns are going to be required in the WHERE clause, a composite key. I would recommend secondary indexes if a lot of columns need to appear in the WHERE clause.

The answer to many Cassandra data modelling questions is: denormalize.
You can solve your problem by building indexes yourself. For each property have a row with the property name as key and the values and item ID as columns:
CREATE TABLE item_index (
property TEXT,
value TEXT,
item_id TEXT,
PRIMARY KEY (property, value, item_id)
);
You also need a table for the items:
CREATE TABLE items (
item_id TEXT,
property TEXT,
value TEXT,
PRIMARY KEY (item_id, property)
);
(Notice that in the item_index table all three columns are in the primary key, because I assume that multiple items can have the same value for the same property. The items table has only item_id and property in the primary key, because I assume that an item can have only one value for a property. You can solve this for multi-valued properties too, but you have to do a few more things and it would complicate the example.)
Every time you insert an item you also insert a row in the item_index table for each property of the item:
INSERT INTO items (item_id, property, value) VALUES ('thing1', 'color', 'blue');
INSERT INTO items (item_id, property, value) VALUES ('thing1', 'shoe_size', '8');
INSERT INTO item_index (property, value, item_id) VALUES ('color', 'blue', 'thing1');
INSERT INTO item_index (property, value, item_id) VALUES ('shoe_size', '8', 'thing1');
(You might want to insert the item as a single BATCH command too.)
To find items by shoe size you need to do two queries (sorry, but that's the price you pay for the flexibility; maybe someone else can come up with a solution that does not require two queries):
SELECT item_id FROM item_index WHERE property = 'shoe_size' AND value = '8';
SELECT * FROM items WHERE item_id = ?;
where the ? is one of the item_ids returned from the first query (because more than one can match, remember).
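The original question filters on two properties at once (Item_Type and Country). With the index table above, one way to do that is to run one index lookup per property and intersect the returned item_ids on the client before fetching the items. The sketch below only simulates this in memory: the dictionaries stand in for the items and item_index tables, and the helper names insert_item and find_items, as well as the Item2 and Item3 rows, are made up for illustration. It is not driver code.

```python
from collections import defaultdict

# In-memory stand-ins for the two tables (hypothetical, for illustration only).
items = defaultdict(dict)       # item_id -> {property: value}
item_index = defaultdict(set)   # (property, value) -> {item_id, ...}


def insert_item(item_id, properties):
    """Write the item and one index entry per property, like the INSERTs above."""
    for prop, value in properties.items():
        items[item_id][prop] = value
        item_index[(prop, value)].add(item_id)


def find_items(**criteria):
    """One index lookup per criterion, then intersect the item_id sets."""
    id_sets = [item_index.get((prop, value), set())
               for prop, value in criteria.items()]
    if not id_sets:
        return {}
    matching = set.intersection(*id_sets)
    # Second step: fetch the full item for every matching id.
    return {item_id: dict(items[item_id]) for item_id in sorted(matching)}


# Item1 comes from the question; Item2 and Item3 are invented sample rows.
insert_item("Item1", {"Item_Type": "Solid", "State": "State1", "Country": "Country1"})
insert_item("Item2", {"Item_Type": "Solid", "State": "State2", "Country": "Country2"})
insert_item("Item3", {"Item_Type": "Liquid", "State": "State1", "Country": "Country1"})

result = find_items(Item_Type="Solid", Country="Country1")
print(sorted(result))
```

Against a real cluster each index lookup would be a SELECT on item_index, so the cost grows with the number of properties in the filter and with the size of the least selective match.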

Related

Query by Interleaved table fields using Spring Data Spanner

I'm trying to query by a field of an interleaved table using Spring Data Spanner. The id comparison is done automatically by Spring Data Spanner when it does the ARRAY STRUCT inner join, but I'm unable to add a WHERE clause to the interleaved table query.
Considering the example below:
CREATE TABLE Singers (
Id INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (Id);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
Id INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, Id),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
Let's suppose I want to query all Singers where the AlbumTitle is "Fear of the Dark", how can I write a repository method to achieve that using Spring Data Spanner?
Your example seems to either contain a couple of typos, or it is otherwise not completely correct:
The Singers table has a column Id which is the primary key. That is in itself fine, but when creating a hierarchy of interleaved tables, it is recommended to prefix the primary key column with the table name. So it would be better to name it SingerId.
The Albums table has a SingerId column and an Id column. These two columns form the primary key of the Albums table. This is technically incorrect (and confusing), and also the reason that I think that your example is not completely correct. Because Albums is interleaved in Singers, Albums must contain the same primary key columns as the Singers table, in addition to any additional columns that form the primary key of Albums. In this case Id references the Singers table, and the SingerId is an additional column in the Albums table that has nothing to do with the Singers table. The primary key columns of the parent table must also appear in the same order as in the parent table.
The example data model should therefore be changed to:
CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
From this point on you can consider the SingerId column in the Albums table as a foreign key relationship to a Singer and treat it as you would in any other database system. Note also that there can be multiple albums for each singer, so a query for "all Singers where the AlbumTitle is 'Fear of the Dark'" is slightly ambiguous. I would rather say:
Give me all singers that have at least one album with the title "Fear of the Dark"
A valid query for that would be:
SELECT *
FROM Singers
WHERE SingerId IN (
SELECT SingerId
FROM Albums
WHERE AlbumTitle='Fear of the Dark'
)

Cassandra: Is there a limit to amount of data that a collection column can hold?

In the table below, what is the maximum size that the phone_numbers column can accommodate?
Is it 2GB, like normal columns?
Or is it 64K*64K, as mentioned here?
CREATE TABLE d2.employee (
id int PRIMARY KEY,
doj timestamp,
name text,
phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you will have a cell for each key of your phone_numbers column. Therefore they are not normal columns, but a set of columns. You can verify this by executing the following command in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
The correct answer is your second option (64K*64K).
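To make the "one cell per map key" point concrete, here is a rough sketch of how a row with a map column could be pictured as flat cells. It is a simplified mental model, not Cassandra's actual storage format; the function name flatten_row and the sample name and phone numbers are made up for illustration.

```python
def flatten_row(row_key, columns):
    """Picture a row as flat cells: (row key, cell name, value).

    A regular column yields one cell; a map column yields one cell per key,
    named by (column name, map key). Simplified model, for illustration only.
    """
    cells = []
    for name, value in columns.items():
        if isinstance(value, dict):
            for map_key in sorted(value):
                cells.append((row_key, (name, map_key), value[map_key]))
        else:
            cells.append((row_key, (name,), value))
    return cells


# Hypothetical employee row; the values are invented sample data.
cells = flatten_row(1001, {
    "name": "Alice",
    "phone_numbers": {"home": "555-0100", "work": "555-0101"},
})
for cell in cells:
    print(cell)
```

Seen this way, the limits that matter are the number of entries in the collection and the size of each entry, rather than a single per-column size limit.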

How to only return some map keys (aka, slice a range of map/set elements) in CQL 3?

I'm trying to do my own CF reverse index in Cassandra right now, for a geohash lookup implementation.
In CQL 2, I could do this:
CREATE COLUMNFAMILY song_tags (id uuid PRIMARY KEY) WITH comparator=text;
insert into song_tags ('id', 'blues', '1973') values ('a3e64f8f-bd44-4f28-b8d9-6938726e34d4', '', '');
insert into song_tags ('id', 'covers', '2007') values ('8a172618-b121-4136-bb10-f665cfc469eb', '', '');
SELECT * FROM song_tags;
Which resulted in:
id,8a172618-b121-4136-bb10-f665cfc469eb | 2007, | covers,
id,a3e64f8f-bd44-4f28-b8d9-6938726e34d4 | 1973, | blues,
And allowed me to return 'covers' and 'blues' via:
SELECT 'a'..'f' FROM song_tags
Now, I'm trying to use CQL 3, which has gotten rid of dynamic columns and suggests using a set or map column type instead. Sets and maps have their values/keys ordered alphabetically, and under the hood (iirc) they are columns, so they should support the same type of range slicing... but how?
I suggest forgetting what you know about 'under the hood' implementation details and focusing on what the query language lets you do.
The longer reason is that in CQL3, multiple rows map to a single column family, though the query language presents them as different rows. It's just a different way of querying the same data.
Range slicing does not exist; the query language is flexible enough to support its use cases.
To do what you want, create an index on the genres so they can be queried without using the primary key, and then select the genre value itself.
The 'gotcha' is that some functions, like distinct, can only be performed on partition keys. You will have to do the distinct on the client side in that case.
For example:
CREATE TABLE song_tags (
id uuid PRIMARY KEY,
year text,
genre list<text>
);
CREATE INDEX ON song_tags(genre);
INSERT INTO song_tags (id, year, genre)
VALUES(8a172618-b121-4136-bb10-f665cfc469eb, '2007', ['covers']);
INSERT INTO song_tags (id, year, genre)
VALUES(a3e64f8f-bd44-4f28-b8d9-6938726e34d4, '1973', ['blues']);
You can then query as:
SELECT genre from song_tags;
genre
------------
['blues']
['covers']

Secondary index in cassandra

In my application I have lists which have items in them. They look like this:
1. list uuid: b1d19224-ebcc-4f69-a98e-4096a4b28121
   1. item
   2. item
   3. item
2. list uuid: 54b17b3a-5d83-4aec-9e7e-16bff1ba336b
   1. item
Those items are indexed by their numbers. What I would like to do is add items to those lists, not just at the end of the list but sometimes also after a specific item, for example after the first item.
The way I thought of doing that is by giving those items a unique id of the form (uuid of list).(number of item), for example b1d19224-ebcc-4f69-a98e-4096a4b28121.1. So every time I add a new item, I would either add it to the end of the list or insert it after some item and give each of the items after the new one an index+1, for example (uuid of list).(number+1).
Is there another way of accomplishing that, or should I do it like that?
If you want to keep the items in your lists sorted on the unique item number, you should use a CQL3 column family with a composite primary key.
create table list (
partkey varchar,
item_num int,
id varchar,
data varchar,
PRIMARY KEY (partkey, item_num)
) with clustering order by (item_num desc);
Here the first part of the primary key serves as the partition key and the second serves as the sorting value. Have a look at the following link:
http://rollerweblogger.org/roller/entry/composite_keys_in_cassandra

Query using composite keys, other than Row Key in Cassandra

I want to query data filtering by composite keys other than Row Key in CQL3.
These are my queries:
CREATE TABLE grades (
id int,
date timestamp,
subject text,
status text,
PRIMARY KEY (id, subject, status, date)
);
When I try and access the data,
SELECT * FROM grades where id = 1098; //works fine
SELECT * FROM grades where subject = 'English' ALLOW FILTERING; //works fine
SELECT * FROM grades where status = 'Active' ALLOW FILTERING; //gives an error
Bad Request: PRIMARY KEY part status cannot be restricted (preceding part subject is either not restricted or by a non-EQ relation)
Just to experiment, I shuffled the keys around, always keeping 'id' as my primary row key. I am ONLY ever able to query using either the primary row key or the second key. In the example above, if I swap subject and status in the primary key list, I can then query by status, but I get a similar error if I try to query by subject or by date.
Am I doing something wrong? Can I not query data using any other composite key in CQL3?
I'm using Cassandra 1.2.6 and CQL3.
That all looks like normal behavior according to the Cassandra composite key model (http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT). The Cassandra data model aims (and this is a general NoSQL way of thinking) at guaranteeing that queries are performant. That comes at the expense of "restrictions" on the way you store and index your data, and then on how you query it; namely, you always need to restrict the preceding part of the primary key.
You cannot swap the elements of the primary key list in your queries (that is more of an SQL way of thinking). You always need to constrain (restrict) the previous element of the primary key if you are to use multiple elements of the composite key. This means that if you have the composite key (id, subject, status, date) and want to query on "status", you will need to restrict "id" and/or "subject" ("or" is possible if you use "allow filtering", i.e., you can restrict only "subject" and do not need to restrict "id"). So, if you want to query on "status", you will be able to do so in two different ways:
select * from grades where id = 1093 and subject = 'English' and status = 'Active';
Or
select * from grades where subject = 'English' and status = 'Active' allow filtering;
The first is for a specific "student", the second for all the "students" on the subject in status = "Active".
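The rule described above can be written down as a small check. This is only a sketch of the behavior discussed in this thread (equality restrictions, Cassandra 1.2 era), not the server's actual validation logic; the function name can_query and its parameters are made up for illustration.

```python
def can_query(primary_key, restricted, allow_filtering=False):
    """Return True if an equality-only WHERE clause fits the rule above.

    primary_key: column names in declaration order; the first one is the
                 partition key, the rest are clustering columns.
    restricted:  set of primary key columns restricted with '='.
    Simplified model of the behavior described in this thread.
    """
    partition_key, clustering = primary_key[0], primary_key[1:]
    # Skipping the partition key is only possible with ALLOW FILTERING.
    if partition_key not in restricted and not allow_filtering:
        return False
    # Restricted clustering columns must form an unbroken prefix:
    # once one is skipped, no later one may be restricted.
    seen_gap = False
    for column in clustering:
        if column in restricted:
            if seen_gap:
                return False
        else:
            seen_gap = True
    return True


GRADES_PK = ["id", "subject", "status", "date"]

print(can_query(GRADES_PK, {"id"}))                                        # query by id
print(can_query(GRADES_PK, {"subject"}, allow_filtering=True))             # query by subject
print(can_query(GRADES_PK, {"status"}, allow_filtering=True))              # the failing query
print(can_query(GRADES_PK, {"subject", "status"}, allow_filtering=True))   # second suggested query
```

The third call corresponds to the query that produced the "cannot be restricted" error: status is restricted while the preceding column subject is not.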

Resources