Secondary index in Cassandra

In my application I have lists that contain items. They look like this:
1. list uuid: b1d19224-ebcc-4f69-a98e-4096a4b28121
   1. item
   2. item
   3. item
2. list uuid: 54b17b3a-5d83-4aec-9e7e-16bff1ba336b
   1. item
The items are indexed by their numbers. I would like to add items to these lists, not only at the end of a list but sometimes after a specific item, for example after the first item.
My idea is to give each item a unique id of the form (uuid of list).(number of item), for example b1d19224-ebcc-4f69-a98e-4096a4b28121.1. A new item is then either appended to the end of the list or inserted after some item, in which case every item after the new one gets its index increased by one, i.e. (uuid of list).(number+1).
Is there another way of accomplishing this, or should I do it like that?

If you want the items in each list stored sorted by their item number, use a CQL3 table (column family) with a composite primary key.
create table list (
partkey varchar,
item_num int,
id varchar,
data varchar,
PRIMARY KEY (partkey, item_num)
) with clustering order by (item_num desc);
Here the first part of the primary key serves as the partition key and the second part as the clustering column that determines the sort order. Have a look at the following link:
http://rollerweblogger.org/roller/entry/composite_keys_in_cassandra
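The renumbering described in the question can be sketched in Python. This is only an in-memory model under stated assumptions: one list (one partition of the table above) is represented as a dict from item_num to data, and the function name insert_after is hypothetical. In Cassandra each shifted item would be a separate write to its (partkey, item_num) row.

```python
def insert_after(items, position, data):
    """Insert data after item number `position`, shifting later items up by one.

    `items` maps item_num -> data for one list (an assumed stand-in for one
    partition). position=0 inserts at the front; position=len(items) appends.
    """
    # Walk from the highest number downwards so no entry is overwritten.
    for num in sorted((n for n in items if n > position), reverse=True):
        items[num + 1] = items.pop(num)
    items[position + 1] = data
    return items


shopping = {1: "milk", 2: "eggs", 3: "bread"}
insert_after(shopping, 1, "butter")
ordered = [shopping[n] for n in sorted(shopping)]
```

Every item after the insertion point is rewritten, so inserting near the front of a long list costs one write per following item.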

Related

Cassandra cql - sorting unique IDs by time

I am writing a messaging chat system, similar to FB messaging. I have not found a way to efficiently store the conversation list (one row per partner user, with the most recently sent message on top). Suppose I list conversations from this table:
CREATE TABLE "conversation_list" (
"user_id" int,
"partner_user_id" int,
"last_message_time" time,
"last_message_text" text,
PRIMARY KEY ("user_id", "partner_user_id")
)
From this table I can select the conversations for any user_id. When a new message is sent, we can simply update the row:
UPDATE conversation_list SET last_message_time = '...', last_message_text='...' WHERE user_id = '...' AND partner_user_id = '...'
But it is sorted by the clustering key, of course. My question: how do I create a list of conversations that is sorted by last_message_time while keeping partner_user_id unique for a given user_id?
If last_message_time is the clustering key and we delete the row and insert a new one (to keep partner_user_id unique), I will end up with a great many tombstones in the table.
Thank you.
A slight change to your original model should do what you want:
CREATE TABLE conversation_list (
user_id int,
partner_user_id int,
last_message_time timestamp,
last_message_text text,
PRIMARY KEY ((user_id, partner_user_id), last_message_time)
) WITH CLUSTERING ORDER BY (last_message_time DESC);
I combined "user_id" and "partner_user_id" into one partition key. "last_message_time" is the single clustering column and provides the sorting. I reversed the default sort order with CLUSTERING ORDER BY to make the timestamps descending. Now you can simply insert a row any time there is a message from a user to a partner id.
The select then lets you look up the last message sent, like this:
SELECT last_message_time, last_message_text
FROM conversation_list
WHERE user_id= ? AND partner_user_id = ?
LIMIT 1
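How this table behaves can be modelled roughly in Python. The dict-of-lists below is an assumed in-memory stand-in for the table, and insert_message and last_message are hypothetical helper names, not driver calls.

```python
from collections import defaultdict
from datetime import datetime

# Partition key (user_id, partner_user_id) -> list of (last_message_time, text) rows.
conversation_list = defaultdict(list)


def insert_message(user_id, partner_user_id, sent_at, text):
    rows = conversation_list[(user_id, partner_user_id)]
    rows.append((sent_at, text))
    # Mimic CLUSTERING ORDER BY (last_message_time DESC): newest row first.
    rows.sort(key=lambda row: row[0], reverse=True)


def last_message(user_id, partner_user_id):
    """Rough equivalent of the SELECT ... LIMIT 1 above; None if no rows exist."""
    rows = conversation_list.get((user_id, partner_user_id), [])
    return rows[0] if rows else None


insert_message(1, 2, datetime(2015, 3, 1, 9, 0), "hi")
insert_message(1, 2, datetime(2015, 3, 1, 9, 5), "are you there?")
latest = last_message(1, 2)
```

Each message adds a row instead of deleting and re-inserting one, so no tombstones are produced. The read needs both ids, since together they form the partition key.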

Cassandra order by on combination of composite keys

I originally wrote a table that tracks feeds that have been assigned to a user for review.
create table user_feed
(
userid uuid,
languageid uuid,
topicid uuid,
dateinserted timeuuid,
primary key (userid, languageid, topicid, dateinserted)
);
Soon after creating this table I realized that I could not sort it by dateinserted (order by DESC), because in Cassandra I seemed to be able to order only by the second (and last) column of a composite key (that is, the key has to have two columns and order by can only be applied to the second one). So I changed my table to this:
create table user_feed
(
userid uuid,
languageid uuid,
topicid uuid,
dateinserted timeuuid,
primary key (userid, dateinserted)
);
and now I was able to run a query to get the latest feeds for the user, using order by.
However, I have a new requirement that requires me to sort the feeds by a combination of (languageid + userid) or (topicid + userid) or (languageid + topicid + userid).
I had an idea to create three new tables and have the keys combined into one key column. For example, for userid + topic query, I would use:
create table user_feed_by_topic
(
usertopicidkey text,
dateinserted timeuuid,
primary key (usertopicidkey, dateinserted)
);
where usertopicidkey = userid.toString() + topicid.toString().
Of course, this solution requires 4 separate inserts whenever I need to insert a new feed row, since I have 4 tables tracking identical data but partitioned differently to allow sorting.
My question is, is there a better way to do this? Is there any way to achieve what I want (query by a combination of columns and order by another column) or am I stuck with my 4 table design approach?
Many thanks,
Cassandra orders all rows within a partition by the primary key's clustering columns. If your primary key is (userid, languageid, topicid, dateinserted), all rows for a user are sorted by languageid, then topicid, then dateinserted, in ascending order. This means rows are sorted by date only within a specific language and topic. You would have to make the date the first clustering column to change this behaviour.
It's common practice to denormalize your data across multiple tables to implement different ordering strategies.
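The denormalized write path could look roughly like the following Python model. Everything here is an assumption for illustration: the four table names, the use of plain integers in place of timeuuid values, and the helpers insert_feed and latest. Each dict stands in for one table, partitioned by a different key combination and sorted by dateinserted descending.

```python
from collections import defaultdict
import uuid

# One dict per (hypothetical) table: partition key -> list of (dateinserted, feed).
tables = {
    "by_user": defaultdict(list),
    "by_user_language": defaultdict(list),
    "by_user_topic": defaultdict(list),
    "by_user_language_topic": defaultdict(list),
}


def insert_feed(userid, languageid, topicid, dateinserted):
    """Write the same feed once per table, each under its own partition key."""
    feed = {"userid": userid, "languageid": languageid, "topicid": topicid}
    keys = {
        "by_user": (userid,),
        "by_user_language": (userid, languageid),
        "by_user_topic": (userid, topicid),
        "by_user_language_topic": (userid, languageid, topicid),
    }
    for name, key in keys.items():
        rows = tables[name][key]
        rows.append((dateinserted, feed))
        rows.sort(key=lambda row: row[0], reverse=True)  # newest first


def latest(table, key, limit=10):
    """Return up to `limit` newest feeds from one partition of one table."""
    return [feed for _, feed in tables[table].get(key, [])[:limit]]


user, lang, topic_a, topic_b = (uuid.uuid4() for _ in range(4))
insert_feed(user, lang, topic_a, dateinserted=1)
insert_feed(user, lang, topic_b, dateinserted=2)
newest_for_user = latest("by_user", (user,))
only_topic_a = latest("by_user_topic", (user, topic_a))
```

Using a tuple as the partition key mirrors a composite partition key and avoids concatenating ids into a single text column.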

Query using composite keys, other than Row Key in Cassandra

I want to query data filtering by composite keys other than Row Key in CQL3.
These are my queries:
CREATE TABLE grades (id int,
date timestamp,
subject text,
status text,
PRIMARY KEY (id, subject, status, date)
);
When I try and access the data,
SELECT * FROM grades where id = 1098; //works fine
SELECT * FROM grades where subject = 'English' ALLOW FILTERING; //works fine
SELECT * FROM grades where status = 'Active' ALLOW FILTERING; //gives an error
Bad Request: PRIMARY KEY part status cannot be restricted (preceding part subject is either not restricted or by a non-EQ relation)
Just to experiment, I shuffled the keys around, always keeping 'id' as my primary row key. I am ONLY ever able to query using either the primary row key or the second key. In the example above, if I swap subject and status in the primary key list, I can then query by status, but I get a similar error if I try to query by subject or by date.
Am I doing something wrong? Can I not query data using any other composite key in CQL3?
I'm using Cassandra 1.2.6 and CQL3.
That all looks like normal behavior according to the Cassandra composite key model (http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT). The Cassandra data model aims (and this is a general NoSQL way of thinking) at guaranteeing that queries are performant. That comes at the expense of restrictions on how you store and index your data and therefore on how you query it: you always need to restrict the preceding part of the primary key.
You cannot use the elements of the primary key in arbitrary order in your queries (that is more of a SQL way of thinking). You always need to restrict the previous element of the primary key if you want to use several elements of the composite key. This means that if your composite key is (id, subject, status, date) and you want to query on status, you need to restrict id and/or subject ("or" is possible if you use ALLOW FILTERING, i.e. you can restrict only subject and leave id unrestricted). So, if you want to query on status, you can do it in two ways:
select * from grades where id = 1093 and subject = 'English' and status = 'Active';
Or
select * from grades where subject = 'English' and status = 'Active' allow filtering;
The first is for a specific student, the second for all students in the subject with status = 'Active'.
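The "restrict the preceding part" rule can be expressed as a small Python check. This is a simplified model, not Cassandra's actual validation code: it looks only at equality restrictions on the clustering columns and ignores the partition key and ALLOW FILTERING. The function name first_invalid_restriction is hypothetical.

```python
def first_invalid_restriction(clustering_columns, restricted):
    """Return the first restricted clustering column whose predecessor is
    unrestricted, or None when the restrictions form a valid prefix."""
    previous_restricted = True  # nothing precedes the first clustering column
    for column in clustering_columns:
        is_restricted = column in restricted
        if is_restricted and not previous_restricted:
            return column
        previous_restricted = is_restricted
    return None


# Clustering columns of the grades table, in primary key order.
clustering = ["subject", "status", "date"]

ok_prefix = first_invalid_restriction(clustering, {"subject", "status"})
skipped_subject = first_invalid_restriction(clustering, {"status"})
```

Here skipped_subject names the column that the error message in the question complains about, while ok_prefix is None because subject and status form an unbroken prefix.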

Search For Multiple Properties by Value Cassandra

How can we design a Cassandra model for storing a group, say 'Item', that has n properties P1, P2, ..., PN, and retrieve items by searching on property values?
For example:
Item   Item_Type  State   Country
Item1  Solid      State1  Country1
In a traditional RDBMS we can issue a select query:
select Item from table where Item_Type='Solid' and Country='Country1'
How can we achieve such a model in NoSQL Cassandra? We have tried Cassandra secondary indexes, but they do not seem to be applicable.
For properties P1..PN you will have to ALTER the table, as with an RDBMS, or use an outdated Thrift-based API (I'd suggest Astyanax for this), which can add columns on the fly (but this is considered bad practice). Another possibility is a collection of properties, where one of your columns holds a collection of values:
CREATE TABLE item (
item_id text PRIMARY KEY,
property set<text>
);
For SELECTs with multiple conditions in the WHERE clause you can use secondary indexes or, if you know which columns will be required in the WHERE clause, a composite key. I would recommend secondary indexes if a lot of columns need to appear in the WHERE clause.
The answer to many Cassandra data modelling questions is: denormalize.
You can solve your problem by building indexes yourself. For each property have a row with the property name as key and the values and item ID as columns:
CREATE TABLE item_index (
property TEXT,
value TEXT,
item_id TEXT,
PRIMARY KEY (property, value, item_id)
)
You also need a table for the items:
CREATE TABLE items (
item_id TEXT,
property TEXT,
value TEXT,
PRIMARY KEY (item_id, property)
)
(Notice that in the item_index table all three columns are in the primary key, because I assume that multiple items can have the same value for the same property. The items table has only item_id and property in the primary key, because I assume that an item can have only one value per property. You can solve this for multi-valued properties too, but it takes a few more steps and would complicate the example.)
Every time you insert an item you also insert a row in the item_index table for each property of the item:
INSERT INTO items (item_id, property, value) VALUES ('thing1', 'color', 'blue');
INSERT INTO items (item_id, property, value) VALUES ('thing1', 'shoe_size', '8');
INSERT INTO item_index (property, value, item_id) VALUES ('color', 'blue', 'thing1');
INSERT INTO item_index (property, value, item_id) VALUES ('shoe_size', '8', 'thing1');
(You might want to insert the item in a single BATCH command too.)
To find items by shoe size you need two queries (sorry, but that's the price you pay for the flexibility; maybe someone else can come up with a solution that does not require two queries):
SELECT item_id FROM item_index WHERE property = 'shoe_size' AND value = '8';
SELECT * FROM items WHERE item_id = ?;
Here the ? is one of the item_ids returned by the first query (remember that more than one can match).
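The two-table scheme can be mimicked in Python to show how a search on several properties, as in the original question, might work. The dicts below are assumed in-memory stand-ins for the items and item_index tables, and insert_item and find_items are hypothetical helpers. Combining several properties by intersecting the id sets on the client side is my own addition, not part of the answer above.

```python
from collections import defaultdict

items = defaultdict(dict)      # item_id -> {property: value}
item_index = defaultdict(set)  # (property, value) -> {item_id, ...}


def insert_item(item_id, properties):
    """Write each property to both the item row and the index row."""
    for prop, value in properties.items():
        items[item_id][prop] = value
        item_index[(prop, value)].add(item_id)


def find_items(**criteria):
    """Return ids of items matching every property=value pair in criteria."""
    matches = None
    for prop, value in criteria.items():
        ids = item_index.get((prop, value), set())  # one index query per property
        matches = set(ids) if matches is None else matches & ids
    return matches or set()


insert_item("Item1", {"Item_Type": "Solid", "State": "State1", "Country": "Country1"})
insert_item("Item2", {"Item_Type": "Solid", "State": "State2", "Country": "Country2"})
solid_in_country1 = find_items(Item_Type="Solid", Country="Country1")
```

Note that this sketch never removes index entries: changing a property value would also require deleting the old (property, value, item_id) row.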

How does a CQL3 composite index with 3 fields map in the thrift column family world?

After reading this blog at planetcassandra, I'm wondering how a CQL3 composite index with 3 fields maps to the Thrift column family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
)
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design will be
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
)
Will the internal row key be mapped to the first two fields of the composite index, with article_id mapped to the cell name, essentially allowing a slice of up to 2 billion entries, so that any query on an author_id and posted_at combination is one seek on disk?
Is the behavior the same for any number of fields in a composite key?
Your answers are much appreciated.
The above observation is incorrect and the correct one is here
I've personally verified:
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id = partition key, posted_at:article_id = cluster key
The first part of the composite key (author_id) is called the "partition key"; the rest (posted_at, article_id) are the remaining keys.
Cassandra stores columns differently when composite keys are used. The partition key becomes the row key. The remaining keys are concatenated with each column name (":" as separator) to form column names. Column values remain unchanged.
The remaining keys (other than the partition key) are ordered, and you are not allowed to search on an arbitrary column: you have to start with the first one, then you can move on to the second one, and so on. This is evident from the "Bad Request" error.
There's an excellent explanation by Aaron Morton at his site, thelastpickle.
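The mapping described above can be illustrated with a short Python function. This is only a sketch of the naming scheme: the real storage engine encodes composite names in a binary format, and the ":"-joined strings here simply mirror the description. The function name to_storage_cells and the sample values are hypothetical.

```python
def to_storage_cells(row, partition_key, clustering_keys):
    """Return (row_key, {cell_name: value}) for one CQL row given as a dict."""
    row_key = row[partition_key]
    # Clustering values, in key order, form the prefix of every cell name.
    prefix = ":".join(str(row[key]) for key in clustering_keys)
    cells = {
        f"{prefix}:{name}": value
        for name, value in row.items()
        if name != partition_key and name not in clustering_keys
    }
    return row_key, cells


comment = {
    "author_id": "alice",
    "posted_at": "2013-07-01",
    "article_id": "a1",
    "author": "Alice",
    "karma": 5,
    "content": "Nice post",
}
row_key, cells = to_storage_cells(comment, "author_id", ["posted_at", "article_id"])
```

For the second table in the question, only author_id ends up in the row key; posted_at and article_id both become part of each cell name.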
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id + posted_at = partition key, article_id = cluster key
Hence be mindful of the disk seeks if you go with the second method, and check that the row does not get too wide and that it gives a real benefit compared to the first case.
If you are well within the 2 billion limit, don't overdo it by adopting the second method, as the records are dispersed on the combined key.

Resources