Specify clustering key while creating a secondary index in Cassandra

Amazon's DynamoDB supports Global Secondary Indices where one can specify a different Partition key and Sort key to index the data.
Does Cassandra offer functionality to create secondary indices using a partition key and clustering key?

Cassandra's analogue of DynamoDB's GSI (Global Secondary Index) feature is Materialized Views. It is almost identical to the DynamoDB feature and is probably what you are looking for. Don't be confused by Cassandra's "Secondary Index" feature, which is a different feature from DynamoDB's secondary indexes.
There is just one limitation of materialized views, which may or may not matter for your use case (see https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlCreateMaterializedView.html):
You can add a single non-primary key column from the base table.
In other words, if you have a base table with partition key p and clustering key c, and also two regular columns x and y, Cassandra does not allow you to create a materialized view (i.e., a GSI) whose partition key is x and clustering key is y (and p and c). The problem is that you're trying to add both x and y to the primary key of the view, and that is not currently supported. If you want to add just one (just x or just y), it will work.
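To make the limitation concrete, here is a minimal CQL sketch (table and column names are hypothetical, matching the p/c/x/y example above); the first view is accepted because its key adds only one regular column, while the second would add two and is rejected:

```sql
-- Hypothetical base table: partition key p, clustering key c, regular columns x, y
CREATE TABLE base (
    p int,
    c int,
    x int,
    y int,
    PRIMARY KEY (p, c)
);

-- Allowed: the view's primary key adds only ONE non-primary-key column (x)
CREATE MATERIALIZED VIEW base_by_x AS
    SELECT * FROM base
    WHERE x IS NOT NULL AND p IS NOT NULL AND c IS NOT NULL
    PRIMARY KEY (x, p, c);

-- Rejected: the view's primary key would add TWO non-primary-key columns (x and y)
-- CREATE MATERIALIZED VIEW base_by_xy AS
--     SELECT * FROM base
--     WHERE x IS NOT NULL AND y IS NOT NULL AND p IS NOT NULL AND c IS NOT NULL
--     PRIMARY KEY (x, y, p, c);
```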
ScyllaDB, which implements both CQL (the Cassandra API) and the DynamoDB API, supports this use case because it is needed for DynamoDB compatibility.

Should I set foreign keys in Cassandra tables?

I am new to Cassandra, coming from a relational background. I learned that Cassandra does not support JOINs, hence there is no concept of foreign keys. Suppose I have two tables:
Users
id
name
Cities
id
name
In the RDBMS world I would put a city_id into the users table. Since there is no concept of joins and you are allowed to duplicate data, does it still make sense to put city_id into the users table, when I could instead create a table users_by_cities?
The main Cassandra concept is that you design tables based on your queries (writes to a table have no restrictions). The design is driven by the query filters. An application that queries a table by some ID is somewhat unnatural, as the CITY_ID could be any value and is typically unknown (unless you ran a prior query to get it). Something more natural may be CITY_NAME. Anyway, assuming there are no indexes on the table (which are mere tables themselves), there are rules in Cassandra regarding the filters you provide and the table design: at a minimum, one of the filters MUST be the partition key. The partition key directs Cassandra to the correct node for the data (which is how reads are optimized). If none of your filters is the partition key, you'll get an error (unless you use ALLOW FILTERING, which is a no-no). Any other filters must be on the clustering columns; you can't filter on a column that is neither the partition key nor a clustering column (again, unless you use ALLOW FILTERING).
These restrictions, coming from the RDBMS world, are unnatural and hard to adjust to, and because of them, you may have to duplicate data into very similar structures (maybe the only difference is the partition keys and clustering columns). For the most part, it is up to the application to manipulate each structure when changes occur, and the application must know which table to query based off of the filters provided. All of these are considered painful coming from a relational world (where you can do whatever you want to one structure). These "constraints" need to be weighed against the reasons why you chose Cassandra for your storage engine.
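The duplication described above can be sketched in CQL like this (table and column names are illustrative, not from the question); each table is shaped for exactly one query pattern and the application writes to both:

```sql
-- Lookup by user: partition key is the user's id
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name text,
    city_name text
);

-- Same data duplicated, shaped for lookup by city:
-- supports SELECT * FROM users_by_city WHERE city_name = 'Berlin';
CREATE TABLE users_by_city (
    city_name text,
    user_id uuid,
    name text,
    PRIMARY KEY (city_name, user_id)
);
```

Note there is no foreign key anywhere: city_name is simply copied into both tables, and it is the application's job to keep the copies in sync.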
Hope this helps.
-Jim

Composite partition key (Cassandra) vs. interleaved indexes (Accumulo, BigTable) for time-spatial series

I'm working on a project in which we import 50k - 100k datapoints every day, located both temporally (YYYYMMDDHHmm) and spatially (lon, lat), which we then dynamically render onto maps according to the query parameters set by our users. We do use pre-computed clusters below a given zoom level.
Within this context and given the fact that we're in the process of selecting a database engine for our storage layer, I'm currently evaluating Cassandra and BigTable's variants.
Specifically, I'm trying to understand the difference between using composite partition keys in Cassandra vs. interleaved index keys in BigTable, such as the one GeoMesa uses.
As far as I understand, both these approaches can leverage COTS hardware and can be tuned to reduce hotspotting and maximize space-filling.
What are the logical steps I should follow in order to discriminate between the two? Even though I am planning on testing both approaches in the near future, I'd like to hear a more reasoned and educated approach.
GeoMesa actually supports both BigTable clones like Accumulo and Cassandra. The Cassandra support is, at the time of writing, in an early phase. The README has a description of the indexing scheme.
Both implementations utilize Z2 or Z3 (depending on whether the index is just spatial or spatio-temporal) interleaved indexes. The BigTable-clone indexing puts the full-resolution Z3 into the primary key; queries are just range scans on the sorted keys. Cassandra, on the other hand, requires that partition keys be explicitly enumerated (unless you're doing full table scans). Because of that fact, GeoMesa's Cassandra indexing uses composite keys to spread the information across both the partition key and the range key. The partition key is a coarse spatio-temporal key that buckets the world into NxN cells. The range key is then the full-resolution Z3 interleaved index. Queries are decomposed into an enumeration of the overlapping buckets (partition key) and Z3 ranges within each bucket (range key). Having to enumerate the partition keys can cause a lot of network chattiness to satisfy a query, so tuning the bucket resolution is key to reducing this chattiness.
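The composite-key scheme described above can be sketched in CQL roughly as follows (this is an illustrative sketch under my own naming, not GeoMesa's actual schema):

```sql
-- Coarse spatio-temporal bucket as the partition key,
-- full-resolution Z3 value as the clustering ("range") key.
CREATE TABLE points (
    epoch_week int,    -- coarse time bucket
    xy_bucket int,     -- coarse NxN spatial cell
    z3 bigint,         -- full-resolution interleaved spatio-temporal index
    feature_id uuid,
    payload blob,
    PRIMARY KEY ((epoch_week, xy_bucket), z3, feature_id)
);

-- A spatio-temporal query enumerates every overlapping bucket,
-- then range-scans the z3 clustering key within each one:
SELECT * FROM points
WHERE epoch_week = 2401 AND xy_bucket = 17
  AND z3 >= 123456 AND z3 < 234567;
```

The chattiness mentioned above comes from having to issue one such range scan per overlapping (epoch_week, xy_bucket) pair; coarser buckets mean fewer partitions to enumerate but wider z3 ranges to scan inside each.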

Cassandra Data Modelling and designing the Clustering

I am a little confused about designing the data model for Cassandra, coming from a SQL background! I have gone through the Datastax documentation several times to understand many things about Cassandra, but I'm still not sure how to overcome this problem or which type of data model I should opt for!
The primary key along with clustering is explained really well there!
The documentation says that the Primary Key (partition key plus clustering keys) is the most important thing in the data model.
My use-case is pretty simple:
ITEM_ID CREATED_ON MOVED_FROM MOVED_TO COMMENT
ITEM_ID will be unique (partition key) and each item might have 10-20 movement records! I want to get the movement records of an item sorted by the time they were created, so I decided to go with CREATED_ON as the clustering key.
According to the documentation, the clustering key comes under secondary indexes and should hold values that repeat as much as possible, unlike the partition key. My data model fails exactly here! How do I preserve order using clustering to achieve this?
Obviously I can't create some ID-generation logic in the application, since it runs on many instances; and if I have to rely on such logic, the purpose of using Cassandra is eventually defeated.
You actually do not need a secondary index for this particular example, and secondary indexes are not created by default. Your clustering key all by itself will allow you to do queries that look like
SELECT * from TABLE where ITEM_ID = SOMETHING;
which will automatically give you back results sorted on your clustering key CREATED_ON.
The reason for this is that your key will internally create partitions that look like
ITEM_ID => [Row with first CREATED_ON], [Row with second CREATED_ON] ...
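Put as a schema, the table from the question could look like this (a minimal sketch; I've guessed reasonable column types):

```sql
-- ITEM_ID is the partition key, CREATED_ON the clustering key.
CREATE TABLE item_movements (
    item_id text,
    created_on timestamp,
    moved_from text,
    moved_to text,
    comment text,
    PRIMARY KEY (item_id, created_on)
) WITH CLUSTERING ORDER BY (created_on ASC);

-- Rows within a partition are stored (and returned) in created_on order,
-- with no secondary index involved:
SELECT * FROM item_movements WHERE item_id = 'item-42';
```

Use `CLUSTERING ORDER BY (created_on DESC)` instead if you usually want the most recent movements first.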

Why do we need secondary indexes in cassandra and how do they really work?

I was trying to understand why secondary indexes were even necessary on Cassandra.
I know that secondary indexes are used because:
"Secondary indexes allow for efficient querying by specific values using equality predicates (where column x = value y). Also, queries on indexed values can apply additional filters to perform operations such as range queries."
from: http://www.datastax.com/docs/0.7/data_model/secondary_indexes
But what I did not understand is why a query like:
get users where birth_date = 1973;
required that birth_date have a secondary index. Why is it necessary for secondary indexes to even exist? Can't Cassandra just go through the table and return the values when the constraint is matched? Why do we need to treat columns that we might want to query in any special way?
I am assuming that, because Cassandra is distributed, going through the whole table might not be easy, since each row key is allocated to a different node, which makes it a little complicated. But I didn't really understand how being distributed complicates the problem and how secondary indexes resolve it (i.e., how does Cassandra resolve this issue?).
Related to this question: is it true that secondary indexes and primary keys are the only things that can be queried in the form of SELECT * FROM column_family_table WHERE col_x = constraint? Why is the primary key special?
Given the amount of data these NoSQL databases are meant to deal with, a full table scan or region scan is not an option. That's why Cassandra restricts such queries and allows queries over non-row-key columns only if secondary indexes are enabled. That way the index and its data are co-located on the same data node.
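For the birth_date example from the question, a minimal sketch looks like this (table and index names are illustrative):

```sql
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name text,
    birth_date int
);

-- Without this index, the query below is rejected unless you
-- add ALLOW FILTERING, which forces a cluster-wide scan.
CREATE INDEX users_birth_date_idx ON users (birth_date);

-- With the index, each node answers from a local index over its own
-- partitions instead of scanning every row in the cluster:
SELECT * FROM users WHERE birth_date = 1973;
```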
Hope it helps.
-Vivek

Primary key in an Azure SQL database

I'm working on a distributed system that uses CQRS and DDD principles. Based on that I decided that the primary keys of my entities should be guids, which are generated by my domain (and not by the database).
I have been reading about guids as primary keys. However, it seems that some of the best practices are not valid anymore if applied to Azure SQL database.
Sequential guids are nice if you use an on-premises SQL Server machine: the sequential guids that are generated will always be unique. However, on Azure, this is no longer the case. As discussed in this thread, it's not even supported anymore; and generating them yourself is also a bad idea, as the generator becomes a single point of failure and uniqueness is no longer guaranteed across servers. I guess sequential guids don't make sense on Azure, so I should stick to regular guids. Is this correct?
Columns of type Guid are bad candidates for clustering. But this article states that this is not the case on Azure, while this one suggests the opposite! Which should I believe? Should I just make my primary key a guid and leave it clustered (the default for primary keys), or should I make it nonclustered and choose another column for clustering?
Thanks for any insight!
the sequential guids that are generated will always be unique.
However, on Azure, this is not the case anymore.
Have a look at the bottom of this post here - http://blogs.msdn.com/b/sqlazure/archive/2010/05/05/10007304.aspx
The issue with GUIDs (generated with NEWID()) is that they are randomly distributed, which causes performance problems when a clustered index is applied to them.
What I'd suggest is that you use a GUID for your primary key. Then remove the default clustered index from that column, apply the clustered index to some other field on your table (e.g. the created date) so that records are indexed sequentially/contiguously as they are created, and then apply a non-clustered index to your PK GUID column.
Chances are that will be fine from a `SELECT * FROM TABLE WHERE Id = ...` point of view for returning single instances.
Similarly, if you're returning lists or ranges of records for display, and you specify the default ORDER BY CreatedDate, your clustered index will serve that too.
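A sketch of that layout in T-SQL (table and column names are illustrative, not from the question):

```sql
-- Nonclustered PK on the GUID; clustered index on the creation date,
-- so rows are stored in insert order and inserts append at the end.
CREATE TABLE dbo.Orders (
    Id uniqueidentifier NOT NULL,
    CreatedDate datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    -- ...other columns...
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (Id)
);

CREATE CLUSTERED INDEX IX_Orders_CreatedDate
    ON dbo.Orders (CreatedDate);
```

Single-row lookups by Id go through the nonclustered PK index, while date-range queries and CreatedDate-ordered lists scan the clustered index contiguously.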
Considering the following:
- SQL Azure requires a clustered index to perform replication. (Note: the index does not have to be unique.) http://blogs.msdn.com/b/sqlazure/archive/2010/05/12/10011257.aspx
- The advantage of a clustered index is that range queries on the index are performed optimally, with minimum seeks.
- The disadvantage of a clustered index is that, if data is added out of sequence order, page splits may occur and inserts may be relatively slower.
Referencing the above, I suggest the following:
- If you have a real key range you need to query upon (for example a date, a sequential number, etc.), create a (unique or non-unique) clustered index for that key, and create an additional unique index on the domain-generated GUIDs.
- If no real key range exists, just create a clustered unique index on the domain-generated GUIDs. (The overhead of adding a fake, unneeded clustered index would be more of a hindrance than a help.)