Why use a compound clustering key in Cassandra tables?

Why might one want to use a clustering key in a Cassandra table?
For example, in a table like this:
CREATE TABLE blah (
    key text,
    a text,
    b timestamp,
    c double,
    PRIMARY KEY ((key), a, b, c)
);
The clustering key is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?

Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
Let's say that I run an ordering system and want to store product data on my website. Additionally, I have several distribution centers, as well as customer-contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
CREATE TABLE customerDCProducts (
    customerid text,
    dcid text,
    productid text,
    productname text,
    productPrice int,
    PRIMARY KEY (customerid, dcid, productid)
);
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
Maybe I want to see products available in DC 1138 for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138';
And maybe I just want to see all products in all DCs for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354';
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
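If that second access pattern is important, the usual approach is to build another denormalized query table keyed by DC. A minimal sketch (the table name dcProducts and its keying are my own illustration, not part of the original answer):

CREATE TABLE dcProducts (
    dcid text,
    productid text,
    productname text,
    productPrice int,
    PRIMARY KEY (dcid, productid)
);

-- all products in DC 1138, regardless of customer
SELECT * FROM dcProducts WHERE dcid='1138';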
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
CREATE TABLE latestTemperatures (
    weatherstationid text,
    eventtime timestamp,
    temperature text,
    PRIMARY KEY (weatherstationid, eventtime)
) WITH CLUSTERING ORDER BY (eventtime DESC);
By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD';
When those values are returned, they will be in DESCending order by eventtime.
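A side benefit: since rows come back newest-first, fetching only the most recent reading for a station is cheap. A minimal sketch using the same table:

-- latest temperature only; the DESC clustering order makes this the first row
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD'
LIMIT 1;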
Of course, the one question that everyone (with an RDBMS background...so yes, everyone) wants to know is how to query all results ordered by eventtime. And again, you cannot. You can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition. If you don't specify a partition key, your data will not be ordered (at least, not in the way that you want it to be).
Let me know if you have any additional questions, and I'll be happy to explain.

Related

Cassandra partition key

Currently I am exploring Cassandra and have a special use case: designing a support view for an application.
My access patterns:
To fetch a specific transaction:
select * from purchase_by_user where user_id='Tom' and transaction_date='1/20/22'
select * from purchase_by_user where user_id='Jerry' and transaction_date <= '1/21/22' and transaction_date >= '1/16/22'
select * from purchase_by_user where user_id='Tom' and amount=100
select * from purchase_by_user where user_id='Jerry' and amount >= 50
CREATE TABLE purchase_by_user (
    order_id uuid,
    amount decimal,
    transaction_ts timestamp,
    user_id text,
    PRIMARY KEY ((user_id), order_id)
);
Let's say Tom is making millions of orders. With the above partition key, the data will not be evenly spread across the cluster, and searches will be expensive. Can anyone suggest a better partition key here?
I'd go with a PRIMARY KEY definition like this:
PRIMARY KEY((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC)
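Spelled out as a complete definition, it might look like this (a sketch: transaction_year int is an added bucketing column, and transaction_ts is renamed to transaction_date to match the queries above):

CREATE TABLE purchase_by_user (
    user_id text,
    transaction_year int,
    transaction_date timestamp,
    order_id uuid,
    amount decimal,
    PRIMARY KEY ((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC);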
This makes use of the "bucketing" concept that Manish mentions in the answer below. In this case, if Tom is creating an order every single day, there will only be 365 rows in each year's partition.
Let's say Tom is making millions of orders
In fact, even if Tom placed two orders per day, that's still only 730 rows per partition. So while thinking about throughput extremes is a good exercise, a single user placing even one million orders is probably not realistic.
Also, some of the queries above use transaction_date in a range query. I've added transaction_date as the first clustering key to support those queries. And with transaction_date in DESCending order, the most recent transactions will be at the "top" of the partition (they'll be read first), which is how most date/time-driven applications tend to function.
You can use the concept of bucketing to reduce the number of rows in a single partition. For example, you can create a partition key like (user_id, bucket_number). You can work out the maximum value of bucket_number from your expected data size: if you expect a user to make millions of orders, you might have bucket values up to 1000. The main idea is to make sure you don't end up creating partitions with a very large number of rows.
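A minimal sketch of that bucketed design (the table name and the bucket formula are illustrative):

CREATE TABLE purchase_by_user_bucketed (
    user_id text,
    bucket_number int,   -- e.g. computed at write time as a hash of order_id modulo 1000
    order_id uuid,
    amount decimal,
    transaction_ts timestamp,
    PRIMARY KEY ((user_id, bucket_number), transaction_ts, order_id)
);

Note that reading all of a user's orders then requires querying each bucket, so pick the bucket count to balance partition size against read fan-out.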

Cassandra order by timestamp desc

I have just begun studying Cassandra.
I had this table and query:
CREATE TABLE finance.tickdata (
    id_symbol int,
    ts timestamp,
    bid double,
    ask double,
    PRIMARY KEY (id_symbol, ts)
);
And this query works:
select ts, ask, bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next, I decided to move id_symbol into the table name; the new table script is:
CREATE TABLE IF NOT EXISTS mts_src.ticks_3 (
    ts timestamp PRIMARY KEY,
    bid double,
    ask double
);
And now this query fails:
select * from mts_src.ticks_3 order by ts desc
I read in the docs that I need to filter (WHERE) on the primary key (partition key), but technically both of my examples are the same. Why is Cassandra so restrictive in this respect?
And one more question: is it a good idea in general to move id_symbol into the table name? There could potentially be 1000 unique id_symbol values, with a lot of data for each. Does separating this data into individual tables look like a good idea? But then I lose the ORDER BY possibility, which I need in order to fetch the freshest data for each id_symbol.
Thanks.
You can't sort on the partition key; you can sort only on clustering columns inside a single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using id_symbol alone as the partition key, for example). In that case you may need to create a composite key, like id_symbol + year, or month, depending on how often you're inserting the data.
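For example, a sketch of such a composite partition key, assuming a year column is populated at write time:

CREATE TABLE mts_src.ticks (
    id_symbol int,
    year int,
    ts timestamp,
    bid double,
    ask double,
    PRIMARY KEY ((id_symbol, year), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- freshest data first for one symbol, one year's partition at a time
SELECT ts, ask, bid FROM mts_src.ticks WHERE id_symbol=3 AND year=2023;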
Regarding a table per ticker: that's not a very good idea, because every table has overhead and it will lead to increased resource consumption. 200 tables is already a high number, and 500 is close to a hard limit.

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without the primary key defined, as this is exactly what I don't understand):
CREATE TABLE documents (
    uid text,
    created text,
    data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit:
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
    uid text,
    created text,
    data text,
    PRIMARY KEY (uid)
);
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'. Of course, uid must be unique; you might want to consider the uuid data type instead of text.
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
    year int,
    month int,
    day int,
    uid text,
    data text,
    PRIMARY KEY ((year, month), day, uid)
);
This works fine as long as you don't have too many documents within a day (so your partitions don't grow too much), and it allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
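The lookup then becomes two queries, sketched here on the assumption that data has been dropped from documents_by_date:

-- step 1: find the uids in the date range
SELECT uid FROM documents_by_date
WHERE year=2018 AND month=12 AND day>=6 AND day<=24;

-- step 2: fetch each document by its uid
SELECT data FROM documents WHERE uid='xxxx-yyyy-zzzzz';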
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify what your data looks like with respect to user and creation time. But since it's a document, I am assuming that one user will create one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
    uid text,
    created text,
    data text,
    PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) helps you get the data ordered by created for a given user.
For your first requirement, you can query like this:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query like this:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we had to create a separate table with the date as the primary key, it would get tricky when many documents are inserted at the very same second. Clustering order works best for requirements where documents for a given user need to be sorted by time.

How to model repeated information across many records in Cassandra

I have a massively huge table with hundreds of millions of records, and I mean to add a field to this table in which the same value would be repeated for millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    PRIMARY KEY ((key, key2), time)
);
This table has 700,000,000+ records.
I want to add a field to this table, named source. This field indicates where the record came from (since the software has many ways of receiving the information in the readings table). Possible values for this field are "XML: path\to\file.xml", "Direct import from the X database", or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance of the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records on the readings table were gotten from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source (
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
);
which would allow me to execute the first query, but would also mean creating 700,000,000+ new records in my database with a lot of duplicated information, taking up a lot of unnecessary storage space, since tens of millions of these records would have the same value for source.
If this was a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields, that would mean storing only an additional integer for each row on the readings table and a new table with as many records as different sources there was.
How does one go about modelling this in cassandra?
Your schema
CREATE TABLE readings_per_source (
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
);
is a very bad idea, because source is the partition key and you can have millions of records sharing the same source, i.e. a very, very wide partition --> hot spots.
For your second query, What is the source of a given record?, it is quite trivial if you access the data using the record's primary key ((key, key2), time). The source column can be added as just a regular column in the table.
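A minimal sketch (the ALTER TABLE statement and literal key values are illustrative):

ALTER TABLE readings ADD source text;

-- the source of one record, addressed by its full primary key
SELECT source FROM readings
WHERE key=1 AND key2=2 AND time='2016-06-05 09:00:00';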
For the first query, Which records in the readings table came from a given source?, it is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records?
If that's what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all the details) and create an index on the source column:
CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    source text,
    PRIMARY KEY ((key, key2), time)
);

CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'PREFIX',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};
Then, to fetch all records having the same source, use the paging feature of the Java driver (or any other DataStax driver):
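With the index in place, the query itself is plain CQL (the literal values are illustrative):

-- exact match on the indexed column
SELECT * FROM readings WHERE source = 'Manually added';

-- prefix match, enabled by the PREFIX mode above
SELECT * FROM readings WHERE source LIKE 'XML:%';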
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
Normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key used to join the tables together), but it requires joins, which take more horsepower to compute during queries.
There's always a trade-off. There's also a trade-off concerning state with fully normalized data; one example is a customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for that customer, past and present, show the new address. This isn't always desirable.
Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price/tax stored with the invoice so you can show what the customer paid at the time the invoice was created, so that when accounting runs monthly, yearly, and longer-range numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed since. Otherwise you have an accounting nightmare!
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition on a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add birthday to the primary key (say, (ID, birthday)), I cannot perform this query, because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident, but because this feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be very, very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
    id timeuuid PRIMARY KEY,
    dob timestamp,
    status text
);
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside the IN clause may cause performance issues in Cassandra.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and both of them are not really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
    year int,
    dob timestamp,
    ids list<timeuuid>,
    PRIMARY KEY (year, dob)
);
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get a reasonably even partition distribution across the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic in a way that avoids updating/deleting data at all:
Instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
    id timeuuid,
    updated_at timestamp,
    status text,
    PRIMARY KEY (id, updated_at)
);
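Reading a user's current status then just means taking the newest row of their partition; a sketch (the uuid literal is illustrative):

SELECT status FROM user_statuses
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
ORDER BY updated_at DESC
LIMIT 1;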
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.
