Cassandra: best usage of partition and row key - cassandra

I have the following data structure:
{
ClientId: string,
ItemId: string,
Item : string
}
I want to store this data in a Cassandra cluster. I know that some clients have much more items than others, yet I want to store data evenly on every node of my cluster since I have only one single query by ClientId and Item id together.
As far as I get I need to specify partition key like to distribute data evenly, so in CQL it would look like:
CREATE TABLE IF NOT EXISTS mykeyspace.mytable
(
ClientId text,
ItemId text,
Item text,
PRIMARY KEY((ClientId, Id))
);
Do I need to specify anything as a row key? ClientId+ItemId uniquely identifies any row, so should I put anything after the first closing ")"?

The one way is to make a hash of your partition keys and then use the hash as the partition key.
Also you could also add the time of the last purchase in there ((ClientId, ItemId, lastPurchaseTime))

Do I need to specify anything as a row key? ClientId+ItemId uniquely identifies any row, so should I put anything after the first closing ")"?
Your example schema will do exactly what you want and work well. There's no need to add anything else to the primary key.
(If you added more columns to the primary key, they would serve as clustering columns, which control the ordering of the rows on disk for a single partition.)

Related

Regarding suggestion of best schema for a cassandra table?

I want to have a table in Cassandra that has a partition key say column 'A', and a column say 'B' which is of 'set' type and can have up to 10000 elements in the set.
But when i retrieve a row from this table then the whole set is retrieved at once and because of that the JVM heap increases rapidly. So should i stick to this schema or go with other schema where 'A' is partition key and i make dynamic columns for each element in the set in my other schema say 'B1', 'B2' ..... 'B10,000'where each of this column is a clustering key.
Which schema is suited best and will give the optimal performance please recommend.
NOTE: cqlsh 5.0.1v
Based off of what you've described, and the documentation I've read, I would not create a collection with 10k elements. Instead I would have two tables, one with everything but the collection, and then use the primary key values of the first table, as the partition key columns of the second table; adding the element name (or whatever you can use to identify an individual element) as a clustering column.
So for a given query, if you wanted everything for a particular primary key value (including all elements), you'd query the first table with the primary key, grab whatever you need, then hit the second table as well, looping/fetching through all elements.
If the query only provides a filter on the partition key (not the primary key - i.e. retrieving multiple rows) , the first query would have to retrieve all columns that make up the primary key for each row, and then query the second table looping for all elements - nested loop here - one loop for each primary key record retrieved from the first table, and a second loop to grab all elements for each pk record.
Probably the best way to go with this. That's how I would probably tackle this.
Does that make sense?
-Jim

Cassandra sort not by primary key

I'm trying to model a table in Cassandra, I'm quite new and stumbled upon one problem. I've got the following:
CREATE TABLE content_registry (
service text,
file text,
type_id tinyint,
container text,
status_id tinyint,
source_location text,
expiry_date timestamp,
modify_date timestamp,
create_date timestamp,
to_overwrite boolean,
PRIMARY KEY ((service), file, type_id)
);
So as I understand:
service is my partition key and based on this value hashes will be generated and values will be split in cluster
file is clustering key
type_id is clustering key
These three bodies combine a composite (compound) primary key
What I've figured out is that whenever I'll insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)
Now what I'm struggling is, that I want my data to come back sorted by create_date in descending order, however create_date is not part of primary key.
If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.
What are the other options? Order in application? That doesn't seem very efficient.
What I've figured out is that whenever I'll insert new data, Cassandra
will upsert (either insert or update if the value with that compound
primary key exists)
Totally right.
Now what I'm struggling is, that I want my data to come back sorted by
create_date in descending order, however create_date is not part of
primary key.
If I add create_date to my primary key, I won't be able to upsert
data, because create_date means timestamp when record was inserted, so
if I add it to primary key every time there's an insert, I'll end up
with multiple records.
With these sentences you are actually contradicting.
If create_date isn't part of your key but a property and the data is upserted, it means that the records are always the same. Therefore when querying by the key and fetching create_date you always have the latest. If you actually want to have the date when the record got created you should just not override the data anymore after the first time you inserted that record.
If it's the case you want to represent a series of data, you indeed need to avoid upserting, this is could be done by using create_date as additional partition key. I'd rather prefeer using time_uuid which comes with quite handy functions.
Last but not least, the most interesting question is, what actually the usecase is that you want to reflect. When modelling data in cassandra you always should know your queries you need to run in advance.
The key concept in Cassandra is that you have to decide what's your PRIMARY KEY, that is what in your rows can be unique and known at query times. This is a very basic requirement, since failing at recognizing this will lead to a bad model.
From what I can see, you identified service as your PARTITION KEY, so I'm thinking that this field is what "rules" your data. This is something you must really know to perform even a single query (ignoring the inefficient table scan SELECT * FROM content_registry;). Within each service, you currently have your rows ordered by file and then by type_id. I don't know the exact meaning of the latter field, but you can currently have two rows identified by ('service1', 'a.jpg', 1) and ('service1', 'a.jpg', 2). So if type_id is somehow related to the file, the model is a bit incorrect.
Now, assuming you want to fetch the same records for each service in another order, what you really need to do is create another table that will include the create_date as the first clustering column, eg (service, create_date, file, type_id). This will allow you to fetch records ordered by creation date, and when two records are created in the same date, they will be further ordered by file, and then by type_id.
A second approach is to attach a secondary index to the create_date field of your original table. This will allow to query by creation date.
A third approach, probably better than the second, is the use of a Materialized View. It will hide a lot of burdens for you and will probably scale better than secondary indexes.
Please note that having secondary indexes or materialized views usually don't scale well. Check if these approaches are enough for your use case.
If I add create_date to my primary key, I won't be able to upsert data.
Why not? Suppose your key was PRIMAY KEY (service, create_date, file, type_id)? That will let you sort by create_date for each service but not globally.
If you want to do it globally (that is, you want all services and all files sorted by create date) then things are probably more complex if you still want to be able to shard your data. One option would be to make the primary key PRIMARY KEY (create_date, service, file, type_id) and use one of the order preserving partitioners.
Also, a bit more information here: http://www.datastax.com/dev/blog/we-shall-have-order

Primary Key related CQL3 Queries cases & errors when sorting

I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device) and those rows will be automatically ordered by tweet_id because it is the second column listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key determines the sort order of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id) using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what is the real "first" result row until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry' which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by a single column, generally the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where that is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.

Duplicate partitioning key performance impact in Cassandra

I've read in some posts that having duplicate partitioning key can have a performance impact. I've two tables like:
CREATE TABLE "Test1" ( CREATE TABLE "Test2" (
key text, key text,
column1 text, name text,
value text, age text,
PRIMARY KEY (key, column1) ...
) PRIMARY KEY (key, name,age)
)
In Test1 column1 will contain column name and value will contain its corresponding value.The main advantage of Test1 is that I can add any number of column/value pairs to it without altering the table by just providing same partitioning key each time.
Now my question is how will each of these table schema's impact the read/write performance if I've millions of rows and number of columns can be upto 50 in each row. How will it impact the compaction/repair time if I'm writing duplicate entries frequently?
For efficient queries, you want to hit a parition (i.e. have the first key of your primary key in your query). Inside of your partition, each column is stored in sorted form by the respective clustering keys. Cassandra stores data as "map of sorted maps".
Your Test1 schema will allow you to fetch all columns for a key, or a specific column for a key. Each "entry" will be on a separate parition.
For Test2, you can query by key, (key and name), or (key, name and age). But you won't be able to get to the age for a key without also specifying the name (w/o adding a secondary index). For this schema too, each "entry" will be in its own partition.
Cross partition queries are more expensive than those that hit a single partition. If you're looking for simply key-value lookups, then either schema will suffice. I wouldn't be worried using either for 50 columns. The first will give you direct access to a particular column. The latter will give you access to the whole data for an entry.
What you should focus more on is which structure allows you to do the queries you want. The first won't be very useful for secondary indexes, but the second will, for example.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in cassandra for a given row key. Is that true? And if its true, how do you allow for varying row keys.
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces before hand and it seems pretty inflexible on how to make it dynamic, i.e. allow it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name before hand anyway?Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is the name of a variable. It acts as a placeholder for the variable column key. Note in the inserts statements below, "tid", "result", and "data" are never mentioned in the statement.
CREATE TABLE IF NOT EXISTS results (
data blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
Test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_1');
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591712', 'blob_2');
#notice that the column_key changed but the row_key remained the same
INSERT your_keyspace.results('row_key_2', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_3');
See here
Did you thought of exploring collection support in cassandra for handling such relations in colocated way{e.g. on same data node}.
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel

Resources