Cassandra column family design - cassandra

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add the 'birthday' to the primary key (lets say 'ID', 'birthday') I cannot perform this query because part of the primary key is missing.
How can i approach this by designing my column family differently ?
Thanks.

According to cassandra docs, there is no way to update rows without explicitly defining their partition key. This was done not by an accident, but because this feature (e.g. update users set status=1 where id>10) can allow user to update all data in table at once, which can be very-very-very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans within multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing user's primary key, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside IN statement may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and both of them are not really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with compound partition+clustering primary key and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to have some kind of even partition distribution in cluster. But the general idea is that you really shouldn't have a small amount of very large partitions. Prefer a large amount of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you to revisit your application logic in a way to avoid updating/deleting data at all:
instead of updating users table directly, you can have a separate table user_status you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

Related

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primayr key, as this is my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
}
Now my goal is to have to different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = ‘xxxx-yyyyy-zzzz’
3) Select by a date limit
SELECT * FROM documents
WHERE created >= ‘2015-06-05’
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz' Of course, uid must be unique. You might want to consider the uuid data type (instead of text)
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB,
otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid);
This works fine if within a day, you don't have too many documents (so your partition don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date. And use documents table to retrieve the data, given the uid you retreived from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the stratio lucene cassandra plugin, and index your date.
Question does not specify how your data is going to be with respect user and create time. But since its a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
Use of Allow Filtering should be used diligently as it scans all partitions. If we have to create a separate table with date as primary key, it becomes tricky if there are many documents being inserted at very same second. Clustering order works best for the requirements where documents for a given user need to be sorted by time.

How to model for repeated information on many records on cassandra

I have a massively huge table with hundreds of billions of records and I mean to add a field in this table of which the same value would be repeated for millions of records. I don't know how to efficiently model this in cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
PRIMARY KEY ((key, key2) time)
)
This table has 700.000.000+ records.
I want to create a field in this table, named source. This field indicates where the record was gotten from (since the software has many ways of receiving the information on the reading table). One possible value for this field is "XML: path\to\file.xml" or "Direct import from the X database" or even "Manually added", I want this to be a descriptive field, used exclusively to allow later maintenance in the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records on the readings table were gotten from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
which would allow me to execute the first query, but would also mean that I would create 700.000.000+ new records on my database with a lot of information, which would take a lot of unnecessary storage space since tens of millions of these records would have the same value for source.
If this was a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields, that would mean storing only an additional integer for each row on the readings table and a new table with as many records as different sources there was.
How does one go about modelling this in cassandra?
Your schema
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
is a very bad idea because source is the partition key and you can have millions of records sharing the same source e.g. having a very very wide partition --> hot spots
For you second query, What is the source of a given record? is it quite trivial if you access the data using the record primary keys (key, key2). The source column can be added as just a regular column into the table
For the first query Which records on the readings table were gotten from a given source? it is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records ?
If it's what you want to do, there is a solution, use the new SASI secondary index (read my blog post for all details) and create an index on the source column
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
source text,
PRIMARY KEY ((key, key2), time)
)
CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'PREFIX',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};
Then to fetch all records having the same source, use server-side paging feature of the Java driver (or any other Datastax driver)
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key being used to join the tables together) but requires joins which take more horsepower to compute during queries.
There's always a trade-off. There's also a tradeoff concerning state with fully normalized data, one example being the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present show the new address. This isn't always desirable.
Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price/tax stored with the invoice so you can show what the customer paid at the time the invoice was created, so when accounting runs monthly, yearly and beyond numbers that the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed. Otherwise you have an accounting nightmare!
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...

What are the pros and cons of grouping multiple table together in Cassandra?

The problem is Cassandra cannot handle a lot of tables per cluster (> 1000). I was looking for any means to reduce the number of tables, and one of them was grouping multiple tables that share the same structure to gether.
Let say if we have two table A and B
create table A (
key text,
value text,
primary key(key)
)
and
create table B (
key text,
value text,
primary key(key)
)
We can group them together by adding one more partition key
create table Shared (
original_table_name text, // either 'A' or 'B'
key text,
value text,
primary key(original_table_name, key)
)
My question is, is it a good pattern and what are the consequences of modelling data this way?
Please elaborate what you mean by alot of tables, because our production is running with 50+ tables, and I don't see any issue with it.
Anyways, if your application is using atlot of tables, then most probable cause of it it, normalized table. In cassandra you should always create denormalized tables, because of no join facility. Cassandra is built for very fast writes, so, you can count on it and not worry about that.
Now regarding the new design, I don't see any problem with that, only thing is your partition key should be combination of (table_name, key) and not just table_name so that it will be evenly distributed across nodes.
And ofcourse to query each time, you will have to specify table_name + key.

Why use a compound clustered key in Cassandra tables?

Why might one want to use a clustered index in a cassandra table?
For example; in a table like this:
CREATE TABLE blah (
key text,
a text,
b timestamp,
c double,
PRIMARY KEY ((key), a, b, c)
)
The clustered part is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?
Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
Let's say that I run an ordering system and want to store product data on my website. Additionally I have several distribution centers, as well as customer contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
CREATE TABLE customerDCProducts (
customerid text,
dcid text,
productid text,
productname text,
productPrice int,
PRIMARY KEY (customerid, dcid, productid));
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
Maybe I want to see products available in DC 1138 for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138';
And maybe I just want to see all products in all DCs for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354';
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
CREATE TABLE latestTemperatures (
weatherstationid text,
eventtime timestamp,
temperature text,
PRIMARY KEY (weatherstationid,eventtime),
) WITH CLUSTERING ORDER BY (eventtime DESC);
By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD';
When those values are returned, they will be in DESCending order by eventtime.
Of course, the one question that everyone (with a RDBMS background...so yes, everyone) wants to know, is how to query all results ordered by eventtime? And again, you cannot. Of course, you can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition key. If you don't specify one, your data will not be ordered (at least, not in the way that you want it to be).
Let me know if you have any additional questions, and I'll be happy to explain.

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to preform. The reason that Cassandra is trying to keep you from performing queries on the date column, is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most( 99%)
of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately Cassandra isn't able to determine automatically what the best way to partition data. The end user must determine a schema which fits into the C* datamodel and allows the queries they want at a high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT timestamp FROM active_accounts where customer_name = ? and account_name = ?
This is because on the the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of CustomerName and AccountName. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999,Tom,32)]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
Allow filtering was added to the cql syntax to prevent users from accidentally designing queries that will not scale. The secondary indexes are really only for use by those do analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index necessarily will require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.

Resources