Efficient modeling of versioned hierarchies in Cassandra

Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.
My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...
My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).
Each SuperDocument structure can change (renaming of headings mostly) over time, thus giving me multiple versions of the structure as shown below.
What I'm looking for:
For each SuperDocument I need to timestamp those structures by date as above and I'd like, for a given date, to find the closest earlier version of the SuperDocument structure (i.e. the most recent version for which version_date < given_date).
These considerations might help solve the problem more easily:
Versions are immutable: changes are rare enough, I can create a new representation of the whole structure each time it changes.
I do not need to access a subtree of the structure.
I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.
OK let's do it
Please keep in mind that I am really just starting out with Cassandra. I've read and watched a lot of resources about data modeling, but I haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!
My first attempt at solving this was to create the following table:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (pre_pos ASC);
That would give me the following structure:
I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.
I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
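Since the whole tree is worked out in client code, here is a minimal Python sketch of how rows from this table could be turned back into a tree. It assumes the rows arrive ordered by pre_pos ASC as (pre_pos, post_pos, title) tuples; the Node class, the build_tree helper and the sample titles are hypothetical and only illustrate the Nested Sets idea.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    pre_pos: int
    post_pos: int
    children: list = field(default_factory=list)


def build_tree(rows):
    """Rebuild a forest from (pre_pos, post_pos, title) rows sorted by pre_pos."""
    roots = []
    stack = []  # chain of ancestors whose interval is still open
    for pre_pos, post_pos, title in rows:
        node = Node(title, pre_pos, post_pos)
        # Drop every ancestor whose interval ended before this node starts.
        while stack and stack[-1].post_pos < pre_pos:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


# Hypothetical sample data for one (doc_id, version_date) partition.
rows = [
    (1, 8, "Heading 1"),
    (2, 5, "Subheading 1.1"),
    (3, 4, "Section 1.1.1"),
    (6, 7, "Subheading 1.2"),
    (9, 10, "Heading 2"),
]
tree = build_tree(rows)
```

One pass over the ordered rows is enough, which is why keeping pre_pos as the first clustering column is convenient here.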
The problem though is that I (candidly) intended to query my data as follows:
SELECT * FROM superdoc_structures
WHERE doc_id = '3399c35...14e1' AND version_date < '2014-03-11' LIMIT 1;
Cassandra quickly reminded me I was not allowed to do that! (because the partitioner does not preserve row order on the cluster nodes, so it is not possible to scan through partition keys)
What then...?
Well, if Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date a clustering key and all my problems will be gone. Yeah, not really...
First try:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC);
I find this one less elegant: all versions and structure levels are made into columns of a now very wide row (compared to my previous solution):
Problem: with the same request, using LIMIT 1 will only return the first heading. And using no LIMIT would return the structure levels of all versions, which I would have to filter to keep only the most recent ones.
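For what it's worth, the client-side filtering mentioned above is small. A minimal Python sketch, assuming the rows come back as (version_date, pre_pos, post_pos, title) tuples with the newest version first (as the clustering order would give); latest_version_rows and the sample rows are hypothetical:

```python
from datetime import datetime
from itertools import takewhile


def latest_version_rows(rows):
    """Keep only the rows of the first (newest) version in the result."""
    rows = iter(rows)
    first = next(rows, None)
    if first is None:
        return []
    newest = first[0]
    # Stop consuming as soon as an older version_date shows up.
    return [first] + list(takewhile(lambda r: r[0] == newest, rows))


# Hypothetical result of the query without LIMIT.
rows = [
    (datetime(2014, 3, 1), 1, 4, "Intro"),
    (datetime(2014, 3, 1), 2, 3, "Background"),
    (datetime(2013, 12, 22), 1, 2, "Introduction"),
]
latest = latest_version_rows(rows)
```

Because iteration stops at the first older row, a driver that pages results lazily would not need to fetch every version, though that depends on the driver and page size.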
Second try:
there's no second try yet... I have an idea though, but I feel it's not using Cassandra wisely.
The idea would be to cluster by version_date only, and somehow store whole hierarchies in each column value. Sounds bad, doesn't it?
I would do something like this:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
nested_sets map<int, int>,
titles list<text>,
PRIMARY KEY (doc_id, version_date)
) WITH CLUSTERING ORDER BY (version_date DESC);
The resulting row structure would then be:
It looks kind of all right to me in fact, but I will probably have more data than the level title to de-normalize into my columns. If it's only two attributes, I could go with another map (associating titles with ids for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all lists together in my client app when the data comes in!
ALTERNATIVE & BEST GUESS
After giving it some more thought, there's a "hybrid" solution that might work and may be efficient and elegant:
I could use another table that would list only the version dates of a SuperDocument & cache these dates into a Memcache instance (or Redis or whatever) for real quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.
That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?
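To make the two-step idea concrete, here is a minimal Python sketch of the first step: picking the closest earlier version from a list of version dates. It assumes the dates are already loaded (from the extra table or a cache) and sorted ascending; closest_earlier_version and the sample dates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def closest_earlier_version(version_dates, given_date):
    """Return the latest date strictly before given_date, or None."""
    # bisect_left finds the first index whose date is >= given_date,
    # so everything before that index is strictly earlier.
    i = bisect_left(version_dates, given_date)
    return version_dates[i - 1] if i > 0 else None


# Hypothetical version dates for one SuperDocument, sorted ascending.
version_dates = [
    datetime(2013, 6, 1),
    datetime(2013, 12, 22),
    datetime(2014, 3, 11),
]
picked = closest_earlier_version(version_dates, datetime(2014, 3, 11))
```

The returned date would then be used, together with doc_id, as the full partition key for the second query against the first table.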
All in all, I really feel the first solution is the most elegant one to model my data. What about you?!

First, you don't need to use memcache or redis. Cassandra will give you very fast access to that information. You could certainly have a table that was something like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
/* stuff */
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
which would give you a quick way to access a given version (this query may look familiar ;-):
select * from superdoc_structures
where doc_id = '3399c35...14e1' and
version_date < '2014-03-11'
order by version_date desc
limit 1;
Since nothing about the document tree structure seems to be relevant from the schema's point of view, and you are happy as a clam to create the document in its entirety every time there is a new version, I don't see why you'd even bother breaking out the tree in to separate rows. Why not just have the entire document in the table as a text or blob field?
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
contents text,
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
So to get the contents of the document as it existed at the new year, you'd do:
select contents from superdoc_structures
where doc_id = '....' and
version_date < '2014-01-01'
order by version_date desc
limit 1;
Now, if you did want to maintain some kind of hierarchy of the document components, I'd recommend something like a closure table to represent it. Alternatively, since you are willing to copy the entire document on each write anyway, why not copy the entire section info on each write as well and have a schema like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
section_path varchar,
contents text,
primary key (doc_id, version_date, section_path)
) with clustering order by (version_date desc, section_path asc);
Then give section_path a syntax like "first_level next_level sub_level leaf_name". As a side benefit, when you know the version_date of the document (or if you create a secondary index on section_path), you can grab a subsection very cleanly, because a space sorts lexically "lower" than any other valid character:
select section_path, contents from superdoc_structures
where doc_id = '....' and
version_date = '2013-12-22' and
section_path >= 'chapter4 subsection2' and
section_path < 'chapter4 subsection2!';
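The range trick can be checked outside Cassandra. Below is a minimal Python sketch that applies the same bounds (prefix and prefix + "!") to a sorted list standing in for the clustering order. The paths are made up, and it assumes plain ASCII paths, where Python's string ordering matches byte ordering.

```python
from bisect import bisect_left

# Hypothetical section paths, sorted the way a clustering column would be.
paths = sorted([
    "chapter3 subsection9",
    "chapter4 subsection1",
    "chapter4 subsection2",
    "chapter4 subsection2 intro",
    "chapter4 subsection2 part1 leaf",
    "chapter4 subsection20",
    "chapter4 subsection3",
    "chapter5",
])


def subsection(sorted_paths, prefix):
    """Select prefix <= path < prefix + '!' from a sorted list."""
    lo = bisect_left(sorted_paths, prefix)
    # '!' is the character right after space, so this upper bound keeps
    # "prefix" and "prefix <anything>" but excludes e.g. "prefix0".
    hi = bisect_left(sorted_paths, prefix + "!")
    return sorted_paths[lo:hi]


selected = subsection(paths, "chapter4 subsection2")
```

Note that "chapter4 subsection20" falls outside the range, which is the point of using a space as the separator.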
Alternatively, you can store the sections using Cassandra's support for collections, but again... I'm not sure why you'd even bother breaking them out as doing them as one big chunk works just great.

Related

How can you query all the data with an empty set in cassandra db?

As the title says, I'm trying to query all the data I got with no value stored in it. I've been searching for a while, and the only operation allowed that I've found is CONTAINS, which doesn't fit my need.
Consider the following table:
CREATE TABLE environment(
id uuid,
name varchar,
message text,
public Boolean,
participants set<varchar>,
PRIMARY KEY (id)
)
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because they can't be answered without a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways a table will be queried, and building it to support those queries.
The other issue you'll have to deal with is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice: it's much easier to query for data that exists than for data that does not. When writing with lightweight transactions, though, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT each one, and check for null on the app side. That approach will be slow, but it won't stress the cluster. Probably the better approach is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but it is probably the best way to handle a situation like this.
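As a rough illustration of the iterate-and-check approach, here is a minimal Python sketch. The fetch_participants callable stands in for a per-id SELECT and is hypothetical, as are the sample ids; the check treats both a null value and an empty set as "empty".

```python
def ids_with_no_participants(ids, fetch_participants):
    """Return the ids whose participants value is missing or empty.

    fetch_participants(id) is a hypothetical stand-in for something like
    SELECT participants FROM environment WHERE id = ?
    """
    return [i for i in ids if not fetch_participants(i)]


# In-memory stand-in for the table; real ids would be uuids.
fake_rows = {
    "env-1": {"alice", "bob"},
    "env-2": None,
    "env-3": set(),
}
empty_ids = ids_with_no_participants(sorted(fake_rows), fake_rows.get)
```

Each lookup is a single-partition read, which is why this is slow in aggregate but gentle on the cluster.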

Cassandra map collection support in thrift

Is it possible to define map collection in CQL3 which would then be readable as group of columns in Thrift?
Based on https://www.datastax.com/dev/blog/thrift-to-cql3 when mixing static and dynamic columns, the dynamic ones are not visible from CQL3. The recommended way is to use collections. However, when defining collections in non-compact-storage table, it is not visible in thrift at all. When defining it as frozen and compact-storage, it's visible as single column.
create table main.exp2 (
article_id blob primary key,
text text,
scantime int,
info text,
rest frozen<map<text, blob>>) WITH COMPACT STORAGE;
Did I miss some, possibly undocumented, option? Is my version of the library too old?
Alternatively, where would be a good place to ask for this as a new feature? Obviously, with Thrift being deprecated, I can't expect any change in the storage itself. However, based on the explanation on the thrift-to-cql3 page, I would expect there to be some way to provide the data to Thrift with composite column names.
The motivation is a sort of forward compatibility: preparing the table so it can be used by current applications with Thrift, but also by new applications with CQL3. Currently it seems that such a combination is only possible with completely dynamic tables.
Although if there is 64KB limit for length of value in map collection, as mentioned on https://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html , we can't use it anyway.
So maybe asking for the feature of reading dynamic columns of mixed table from CQL3 makes more sense ...
EDIT: Note that I am aware that I can define the table without any static columns, like this:
create table main.exp1 (
article_id blob,
column_key text,
value blob,
PRIMARY KEY (article_id, column_key)
) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (column_key ASC);
but that means all my values are blob, including the ones I know about in advance.
I would simply roll out the map into separate rows by adding a new clustering column to distinguish the values. For your design you can also move the common parts into static columns, so they will always be returned. The table structure could look something like this:
create table main.exp2 (
article_id blob,
text text static,
scantime int static,
info text static,
rest_key text,
rest_value blob,
primary key (article_id, rest_key)
);
P.S. Thrift is going away in the next version, so it's better to convert everything to CQL.

Search for more than one element in a list in Cassandra

I'm learning how the data model works in Cassandra, what things you can do and what not, etc.
I've seen you can have collections and I'm wondering if you can search for the elements inside the collection. I've seen that you can look for one element with contains, but if you want to look for more than one you need to add more filters. Is there a better way to do this? Is it bad practice?
This is my table definition:
CREATE TABLE data (
group_id int,
user timeuuid,
friends LIST<VARCHAR>,
PRIMARY KEY (group_id, user)
);
And this is what I know I can use to look for more than one item in the list:
SELECT * FROM data WHERE friends CONTAINS 'bob' AND friends CONTAINS 'Pete' ALLOW FILTERING;
Thank you
Secondary indexes are generally not recommended for performance reasons.
Generally, in Cassandra, query-based modelling should be followed.
So that would mean another table:
CREATE TABLE friend_group_relation (
friend VARCHAR,
group_id int,
<user if needed>
PRIMARY KEY ((friend), group_id)
);
Now you can use either IN query (not recommended) or async queries (strongly recommended, very fast response) on this table.
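A minimal Python sketch of the "one query per friend, run concurrently, then intersect" idea follows. It uses a thread pool and an in-memory dict in place of a real driver, so groups_for_friend, FAKE_TABLE and the sample data are all hypothetical; with a real driver you would issue its asynchronous queries instead.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for friend_group_relation: friend -> group_ids.
FAKE_TABLE = {
    "bob": {1, 2, 3},
    "Pete": {2, 3, 4},
    "alice": {3},
}


def groups_for_friend(friend):
    """Hypothetical single-partition lookup, i.e. something like
    SELECT group_id FROM friend_group_relation WHERE friend = ?"""
    return set(FAKE_TABLE.get(friend, ()))


def groups_with_all_friends(friends):
    """Run one lookup per friend concurrently and intersect the results."""
    if not friends:
        return set()
    with ThreadPoolExecutor(max_workers=len(friends)) as pool:
        results = list(pool.map(groups_for_friend, friends))
    return set.intersection(*results)


common = groups_with_all_friends(["bob", "Pete"])
```

Each lookup hits exactly one partition, which is what makes this cheaper than an IN query or ALLOW FILTERING over the original table.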
You can follow two different approaches:
Pure Cassandra: use a secondary index on your collection type, as described in the documentation.
Solr: you may also be able to use Solr and create a query against it to retrieve your entries. Although this may look like a more complicated solution because it requires an extra tool, it avoids using secondary indexes on Cassandra. Secondary indexes on Cassandra are really expensive and, depending on your schema definition, may impact your performance.

Cassandra Data Modelling and designing the Clustering

I am a little confused about designing the data model for Cassandra, coming from a SQL background! I have gone through the Datastax documentation several times to understand many things about Cassandra! This seems to be a problem, and I am not sure how to overcome it or which type of data model I should opt for!
Primary Key along with Clustering is something really explained well here!
The documentation says that, Primary Key (Partition key, Clustering keys) is the most important thing in data model.
My use-case is pretty simple:
ITEM_ID CREATED_ON MOVED_FROM MOVED_TO COMMENT
ITEM_ID will be unique (partition_key) and each item might have 10-20 movement records! I want to get the movement records of an item sorted by the time they were created. So I decided to go with CREATED_ON as the clustering key.
According to the documentation, clustering_key comes under secondary index, which should have values that repeat as much as possible, unlike the partition key. My data model fails exactly here! How do I preserve order using clustering to achieve the same?
Obviously I can't create some ID generation logic in the application, since it runs on many instances, and if I have to rely on such logic, the purpose of using Cassandra is eventually defeated.
You actually do not need a secondary index for this particular example, and secondary indexes are not created by default. Your clustering key all by itself will allow you to do queries that look like
SELECT * from TABLE where ITEM_ID = SOMETHING;
Which will automatically give you back results sorted on your clustering key CREATED_ON.
The reason for this is that your key will basically make partitions internally that look like
ITEM_ID => [Row with first Created_ON], [Row with second Created_ON] ...
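As a mental model only (this is not how Cassandra's storage engine is implemented), the layout above can be sketched in Python as a dict of partitions, each holding rows kept sorted by the clustering key. All names and sample values here are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime

# partition key (item_id) -> rows kept sorted by clustering key (created_on)
partitions = defaultdict(list)


def insert(item_id, created_on, moved_from, moved_to, comment=""):
    """Place the row at its sorted position inside the item's partition."""
    insort(partitions[item_id], (created_on, moved_from, moved_to, comment))


def select_by_item(item_id):
    """Rough equivalent of SELECT * FROM table WHERE item_id = ?"""
    return list(partitions.get(item_id, []))


# Rows are written out of order on purpose.
insert("item-1", datetime(2014, 3, 5), "B", "C")
insert("item-1", datetime(2014, 1, 2), "A", "B")
insert("item-2", datetime(2014, 2, 1), "X", "Y")
movements = select_by_item("item-1")
```

The rows come back ordered by created_on regardless of insertion order, and no application-side ID generation is involved.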

Collection of embedded objects using Cassandra CQL

I am trying to put my domain model into Cassandra using CQL. Let's say I have a USER_FAVOURITES table. Each favourite has an ID as its PRIMARY KEY. I want to store an ordered list of up to 10 records with multiple fields (field_name, field_location and so on).
Is it a good idea to model the table like this?
CREATE TABLE USER_FAVOURITES (
fav_id text PRIMARY KEY,
field_name_list list<text>,
field_location_list list<text>
);
An object is then constructed from the list items at matching indices, e.g. Object(field_name_list[3], field_location_list[3]).
I always query favourites together. I may want to add an item at some position: start, end or middle.
Is this good practice? It doesn't look like it, but I am just not sure how to group objects in this case, especially when I want to keep them ordered by, for example, field_location or a more complex ordering rule.
I'd suggest the following structure:
CREATE TABLE USER_FAVOURITES (
fav_id text PRIMARY KEY,
favs map<int, blob>
);
This would allow you to get access to any item via its index. The value part of the map is a blob, since one can easily serialize a whole object into binary and deserialize it later.
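A minimal Python sketch of that serialize/deserialize round trip, assuming JSON as the binary encoding (any serializer would do). The to_blob and from_blob helpers and the sample favourites are hypothetical; a plain dict stands in for the favs map<int, blob> column.

```python
import json


def to_blob(favourite):
    """Serialize one favourite (a dict of fields) to bytes."""
    return json.dumps(favourite, sort_keys=True).encode("utf-8")


def from_blob(blob):
    """Deserialize bytes back into a dict of fields."""
    return json.loads(blob.decode("utf-8"))


# Position -> serialized object, mirroring map<int, blob>.
favs = {
    1: to_blob({"field_name": "Office", "field_location": "Lyon"}),
    0: to_blob({"field_name": "Home", "field_location": "Paris"}),
}

# Rebuild the ordered list of objects from the map.
ordered = [from_blob(favs[i]) for i in sorted(favs)]
```

Keeping all fields of one favourite in a single value avoids having to line up several parallel lists by index.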
My personal suggestion would be not to rely too heavily on Cassandra collections, as the feature is likely to change further in the future. That said, the scenario above is entirely possible, and there is no harm in doing it.

Resources