How does a CQL3 composite index with 3 fields map in the thrift column family world? - cassandra

After reading this blog at planetcassandra, I'm wondering how a CQL3 composite index with 3 fields maps to the thrift column family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
)
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design is:
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
)
Will the internal row key be mapped to the first two fields of the composite index, with article_id mapped to the cell name, so that a row can be sliced across up to 2 billion entries and any query on an author_id and posted_at combination is a single seek on the disk?
Is the behavior the same for any number of fields in a composite key?
Your answers are much appreciated.

The above observation is incorrect. The correct mapping, which I've personally verified, is:
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id = partition key, posted_at:article_id = cluster key
The first part of the composite key (author_id) is called the "partition key";
the rest (posted_at, article_id) are the remaining (clustering) keys.
Cassandra stores columns differently when composite keys are used. The partition key
becomes the row key. The values of the remaining keys are concatenated with each column
name (":" as separator) to form the cell names. Column values remain
unchanged.
The remaining keys (other than the partition key) are ordered,
and you can't restrict an arbitrary one of them in a query: you have to
start with the first one, then you can move on to the second one, and
so on. Skipping one is rejected with a "Bad Request" error.
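The mapping described above can be sketched in plain Python. This is only an illustrative model of the older (Thrift-era) storage layout, not Cassandra code; the function name to_thrift_cells and the sample values are made up for the example.

```python
def to_thrift_cells(row, partition_key, clustering_keys):
    """Model how one CQL3 row is laid out as a Thrift-style wide row.

    Returns (row_key, cells), where cells maps a composite cell name
    (clustering values followed by the CQL column name) to the value.
    """
    row_key = row[partition_key]
    prefix = tuple(row[k] for k in clustering_keys)
    key_columns = {partition_key, *clustering_keys}
    cells = {
        prefix + (name,): value
        for name, value in row.items()
        if name not in key_columns
    }
    return row_key, cells


# Second table from the question: PRIMARY KEY (author_id, posted_at, article_id)
row = {
    "author_id": "alice",
    "posted_at": "2013-01-01 10:00",
    "article_id": "a1",
    "author": "Alice",
    "karma": 5,
    "content": "Nice post",
}
row_key, cells = to_thrift_cells(row, "author_id", ["posted_at", "article_id"])

print(row_key)
for name, value in cells.items():
    # Join the components with ":" the way the answer describes cell names.
    print(":".join(name), "=", value)
```

Each non-key column (author, karma, content) ends up as its own cell whose name starts with the posted_at and article_id values, which is why restrictions have to follow the clustering key order.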

There's an excellent explanation by Aaron Morton at his site, thelastpickle.
In the first case:
article_id = partition key, posted_at = cluster key
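In the second case, with PRIMARY KEY (author_id, posted_at, article_id) as written:
author_id = partition key, posted_at + article_id = cluster key
To get author_id + posted_at = partition key and article_id = cluster key, the key has to be declared as PRIMARY KEY ((author_id, posted_at), article_id).
Be mindful of the disk seeks if you go with that combined partition key: adopt it only if the row really is getting too wide and the split gives a real benefit compared to the first case.
If you are well within the 2 billion limit, don't overdo it by adopting the combined key, as the records get dispersed on the combo key.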

Related

How to model for word search in cassandra

My model stores word searches chosen from checkboxes. It must support updating the word search and its status, and soft ("fake") deletes. My old model uses a uuid (the id of the word search) as the primary key and has an index on status (enabled, disabled, deleted).
But I don't want an index on the status column (I think it's very bad to index a column that gets updated), and I don't want to change database.
Is there a better way to model this?
sorry for my english grammar
You should not create an index on a very low cardinality column such as status:
Avoid very low cardinality index e.g. index where the number of distinct values is very low. A good example is an index on the gender of an user. On each node, the whole user population will be distributed on only 2 different partitions for the index: MALE & FEMALE. If the number of users per node is very dense (e.g. millions) we’ll have very wide partitions for MALE & FEMALE index, which is bad
Source : https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
Best ways to handle this type of case:
- Create a separate table for each type of status
- Or use status together with a known parameter (year, month, etc.) as the partition key
Example of the 2nd option:
CREATE TABLE save_search (
year int,
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY((year, status), uuid)
);
Here you can see that I have made a composite partition key from year and status, because of the low cardinality issue. If you think a single status will hold a huge amount of data, you should also add month as part of the composite partition key.
If your dataset is small, you can just remove the year field:
CREATE TABLE save_search (
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY(status, uuid)
);
Or, if you are using Cassandra version 3.x or above, you can use a materialized view:
CREATE MATERIALIZED VIEW search_by_status AS
SELECT *
FROM your_main_table
WHERE uuid IS NOT NULL AND status IS NOT NULL
PRIMARY KEY (status, uuid);
You can then query by status like this:
SELECT * FROM search_by_status WHERE status = 0;
Cassandra will keep the materialized view in sync with every delete, update, and insert you make on your main table.
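To get a feel for why the (year, status) partition key in the save_search example helps, here is a small Python sketch that counts how many rows land in each partition under the two keys. The sample rows and the helper partition_sizes are hypothetical, and real partition placement also involves the partitioner's hashing, which is ignored here.

```python
from collections import Counter


def partition_sizes(rows, key_columns):
    """Count rows per partition for the given partition key columns."""
    return Counter(tuple(r[c] for c in key_columns) for r in rows)


# Hypothetical data: 3 statuses spread evenly over 4 years.
rows = [
    {"year": 2014 + i % 4, "status": i % 3, "uuid": i}
    for i in range(1200)
]

by_status = partition_sizes(rows, ["status"])
by_year_status = partition_sizes(rows, ["year", "status"])

# Number of partitions and size of the widest one for each design.
print(len(by_status), max(by_status.values()))
print(len(by_year_status), max(by_year_status.values()))
```

With status alone, every row falls into one of three partitions; adding year splits the same rows over more, smaller partitions, at the cost of having to supply the year in every query.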

Cassandra: Is there a limit to amount of data that a collection column can hold?

In the table below, what is the maximum size the phone_numbers column can accommodate?
Like normal columns, is it 2GB?
Or is it 64K*64K, as mentioned here?
CREATE TABLE d2.employee (
id int PRIMARY KEY,
doj timestamp,
name text,
phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you will have a cell for each key of your phone_numbers column. Therefore they are not normal columns, but a set of columns. You can verify this by executing the following command in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
The correct answer is your second point.
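As a rough illustration of "one cell per key", here is a tiny Python model of how a map column could be flattened into cells. It is only a sketch of the idea, not the actual on-disk format; the helper collection_cells and the phone numbers are invented for the example.

```python
def collection_cells(column_name, mapping):
    """Model a CQL map column as one internal cell per map key.

    Each cell name combines the column name with the map key.
    """
    return {(column_name, key): value for key, value in sorted(mapping.items())}


phone_numbers = {"home": "555-0100", "work": "555-0101", "mobile": "555-0102"}
cells = collection_cells("phone_numbers", phone_numbers)

for name, value in cells.items():
    print(":".join(name), "=", value)
```

So any size limit applies per entry and to the number of entries, rather than to the map as one single value.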

Create a super column using CQL3

I am upgrading my thrift api to cql3. My data contains SuperColumns as follows:
- User //column family
- Division/name //my row key
-DivHead //SuperColumn
- name //Columns
- address //Columns
I understand that all column families are to be changed to tables, that the primary key becomes the row key, and that the rest are columns.
But my data has SuperColumns. How do I create SuperColumns using CQL3?
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
head_address text,
PRIMARY KEY (rowkey, division)
)
OR
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
head_address text,
PRIMARY KEY ((rowkey, division))
)
Under the covers, the first example will have all rows with the same rowkey assigned to the same partition. Each rowkey will have a set of logical rows, one for each division. Those rows will contain two columns: head_name and head_address. You can query based on the rowkey and get all divisions (sorted!). Or you can query a rowkey with a range of divisions or a single division and get a subset of the divisions with their division head and address.
The second example will have one partition for each rowkey and division combination. Each such partition will be one logical row as well. The single row for each composite key will have two columns: head_name and head_address. To make a query, you must provide BOTH the rowkey and the division.
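The difference between the two designs can be modelled with nested dictionaries in Python. This is just a sketch: the company and division names, the addresses, and the divisions helper are all made up, and the range filter only imitates what a CQL range restriction on division would return.

```python
# First design: PRIMARY KEY (rowkey, division)
# One partition per rowkey, holding one logical row per division.
first_design = {
    "acme": {
        "sales": {"head_name": "Ann", "head_address": "1 Main St"},
        "hr": {"head_name": "Bob", "head_address": "2 Side St"},
        "it": {"head_name": "Cy", "head_address": "3 Back St"},
    },
}


def divisions(table, rowkey, start=None, end=None):
    """Return (division, columns) pairs for one rowkey, sorted by division.

    start and end are optional inclusive bounds on the division name.
    """
    rows = sorted(table.get(rowkey, {}).items())
    return [
        (div, cols)
        for div, cols in rows
        if (start is None or div >= start) and (end is None or div <= end)
    ]


# Second design: PRIMARY KEY ((rowkey, division))
# One partition (and one logical row) per (rowkey, division) pair.
second_design = {
    (rowkey, div): cols
    for rowkey, divs in first_design.items()
    for div, cols in divs.items()
}

all_divs = [d for d, _ in divisions(first_design, "acme")]
some_divs = [d for d, _ in divisions(first_design, "acme", start="hr", end="it")]
# In the second design a lookup needs BOTH parts of the key.
one_head = second_design[("acme", "sales")]["head_name"]

print(all_divs, some_divs, one_head)
```

The first design answers "all divisions of a rowkey" and range questions in one partition read; the second only supports exact (rowkey, division) lookups.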

How is Cassandra sorting static column families

As far as I know, a comparator is specified at the column family level. So far I have used it with dynamic columns (wide rows). Which type of comparator does Cassandra use when you create a static column family using CQL?
CREATE TABLE songs (
id uuid PRIMARY KEY,
title text,
album text,
artist text,
data blob
);
And what happens if you throw a composite key into the mix?
CREATE TABLE songs (
id uuid,
title text,
album text,
artist text,
data blob,
PRIMARY KEY ((id, title), album)
);
http://cassandra.apache.org/doc/cql3/CQL.html#createTablepartitionClustering
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_compound_keys_c.html
On a given physical node, rows for a given partition key are stored in the order induced by the clustering columns.
So in the 2nd case your partition key is (id, title) and the clustering key is album, meaning all the rows for a given partition key will be stored ordered by album.
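A minimal Python sketch of that ordering, assuming made-up song data: rows are grouped by the (id, title) partition key and kept sorted by album as they are inserted. The insert helper is hypothetical and ignores details such as upserts overwriting an existing row with the same key.

```python
import bisect
from collections import defaultdict

# (id, title) partition key -> rows kept sorted by the clustering column album
partitions = defaultdict(list)


def insert(song_id, title, album, artist):
    """Add a row to its partition, keeping the partition ordered by album."""
    bisect.insort(partitions[(song_id, title)], (album, artist))


insert(1, "Intro", "Zebra", "A")
insert(1, "Intro", "Alpha", "B")
insert(1, "Intro", "Mango", "C")
insert(2, "Outro", "Beta", "D")

# Reading a partition returns its rows in album order, not insertion order.
albums = [album for album, _ in partitions[(1, "Intro")]]
print(albums)
```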

Cassandra Composite Columns - How are CompositeTypes chosen?

I'm trying to understand the type used when I create composite columns.
I'm using CQL3 (via cqlsh) to create the CF and then the CLI to issue a describe command.
The Types in the Columns sorted by: ...CompositeType(Type1,Type2,...) are not the ones I'm expecting.
I'm using Cassandra 1.1.6.
CREATE TABLE CompKeyTest1 (
KeyA int,
KeyB int,
KeyC int,
MyData varchar,
PRIMARY KEY (KeyA, KeyB, KeyC)
);
The returned CompositeType is
CompositeType(Int32,Int32,UTF8)
Shouldn't it be (Int32,Int32,Int32)?
CREATE TABLE CompKeyTest2 (
KeyA int,
KeyB varchar,
KeyC int,
MyData varchar,
PRIMARY KEY (KeyA, KeyB, KeyC)
);
The returned CompositeType is
CompositeType(UTF8,Int32,UTF8)
Why isn't it the same as the types used when I define the table? I'm probably missing something basic in the type assignment...
Thanks!
The composite column name is composed of the values of primary keys 2...n and the name of the non-primary key column being saved.
(So if you have 5 non-key fields then you'll have five such columns and their column names will differ only in the last composed value which would be the non-key field name.)
So in both examples the composite column is made up of the values of KeyB, KeyC and the name of the column being stored ("MyData", in both cases). That's why you're seeing those CompositeTypes being returned.
(btw, the first key in the primary key is the partitioning key and its value is only used as the row key (if you're familiar with Cassandra under the covers). It is not used as part of any of the composite column names.)
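The rule above can be written down as a few lines of Python that rebuild the comparator shown by the CLI from a table definition. This is a sketch of the rule as described in this answer, not Cassandra's own code; the function composite_comparator and the CQL_TO_MARSHAL table are hypothetical, and the type names are abbreviated the same way as in the question.

```python
# Abbreviated marshal type names, as they appear in the question's output.
CQL_TO_MARSHAL = {"int": "Int32", "varchar": "UTF8", "text": "UTF8"}


def composite_comparator(primary_key, column_types):
    """Build the comparator for a table whose primary key has clustering columns.

    primary_key: list of column names; the first one is the partition key.
    column_types: dict mapping column name to CQL type.
    """
    clustering = primary_key[1:]  # the partition key is not part of cell names
    parts = [CQL_TO_MARSHAL[column_types[c]] for c in clustering]
    parts.append("UTF8")  # last component holds the name of the stored column
    return "CompositeType(" + ",".join(parts) + ")"


test1 = composite_comparator(
    ["KeyA", "KeyB", "KeyC"],
    {"KeyA": "int", "KeyB": "int", "KeyC": "int", "MyData": "varchar"},
)
test2 = composite_comparator(
    ["KeyA", "KeyB", "KeyC"],
    {"KeyA": "int", "KeyB": "varchar", "KeyC": "int", "MyData": "varchar"},
)
print(test1)
print(test2)
```

The trailing UTF8 is the type of the column name ("MyData"), not of its value, so it would stay UTF8 even if MyData were declared as an int.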
