Cassandra map collection support in Thrift

Is it possible to define a map collection in CQL3 that would then be readable as a group of columns in Thrift?
Based on https://www.datastax.com/dev/blog/thrift-to-cql3 , when mixing static and dynamic columns, the dynamic ones are not visible from CQL3. The recommended way is to use collections. However, when a collection is defined in a non-compact-storage table, it is not visible in Thrift at all. When it is defined as frozen in a compact-storage table, it is visible as a single column:
create table main.exp2 (
article_id blob primary key,
text text,
scantime int,
info text,
rest frozen<map<text, blob>>) WITH COMPACT STORAGE;
Did I miss some possibly undocumented option? Is my version of the library too old?
Alternatively, where would be a good place to request this as a new feature? Obviously, with Thrift being deprecated, I can't expect any change in the storage itself; however, based on the explanation on the thrift-to-cql3 page, I would expect there to be some way to expose the data to Thrift with composite column names.
The motivation is a sort of forward compatibility: preparing the table so it can be used by current applications over Thrift, but also by new applications over CQL3. Currently it seems that such a combination is only possible with completely dynamic tables.
Although, if there is a 64KB limit on the length of a value in a map collection, as mentioned on https://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html , we can't use it anyway.
So maybe asking for the feature of reading the dynamic columns of a mixed table from CQL3 makes more sense...
EDIT: Note that I am aware that I can define the table without any static columns, like this:
create table main.exp1 (
article_id blob,
column_key text,
value blob,
PRIMARY KEY (article_id, column_key)
) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (column_key ASC);
but that means all my values are blobs, including the ones I know about in advance.

I would simply roll out the map into separate rows by adding a new clustering column to distinguish the values. For your design, you can also move the common parts into static columns, so they will always be returned. The table structure could look something like this:
create table main.exp2 (
article_id blob,
text text static,
scantime int static,
info text static,
rest_key text,
rest_value blob,
primary key (article_id, rest_key)
);
P.S. Thrift is going away in the next version, so it's better to convert everything to CQL.

Related

Is it possible to search for a tag in cql using a set?

Is it possible to query items in a CQL table that exactly match a specific element in a set? For example, can we search for rows where tag = 'music' on the following table?
create table if not exists example (
website text,
uuid text,
message text,
tags set<text>,
created timestamp,
primary key ((website), uuid))
It's possible to search for elements in a collection using the CONTAINS operator (see documentation). But it will be very slow, so an index on the collection may help (see the docs for syntax), although it's also not necessarily optimal. A better solution would be something like Storage Attached Indexes from DataStax, but they aren't in OSS Cassandra yet. Another option could be integrating a real search engine, such as Solr (known as DSE Search) or Elasticsearch (as in Elassandra).
The main thing to take into account is your latency requirements. In some cases it would be better to have a separate table with the tag as the partition key, but then you may get data skew, as there are not many distinct tags and the distribution of data is not uniform.
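To make the cost concrete: without an index, CONTAINS amounts to scanning every row and testing set membership on the client-visible data. A rough client-side sketch (illustrative rows and names, not driver code):

```python
# What an unindexed CONTAINS effectively does: a full scan keeping
# only the rows whose tag set includes the requested value.
rows = [
    {"uuid": "a1", "tags": {"music", "news"}},
    {"uuid": "b2", "tags": {"sports"}},
    {"uuid": "c3", "tags": {"music"}},
]

def rows_with_tag(rows, tag):
    """Linear scan over all rows; cost grows with table size, not result size."""
    return [r["uuid"] for r in rows if tag in r["tags"]]

print(rows_with_tag(rows, "music"))  # -> ['a1', 'c3']
```

A secondary index or a search engine replaces this linear scan with a lookup, which is why they are suggested above.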

Should I set foreign keys in Cassandra tables?

I am new to Cassandra, coming from a relational background. I learned that Cassandra does not support JOINs, hence there is no concept of foreign keys. Suppose I have two tables:
Users
id
name
Cities
id
name
In the RDBMS world I would put a city_id into the users table. Since there is no concept of joins and you are allowed to duplicate data, does it still make sense to put city_id into the users table, when I could instead create a table users_by_cities?
The main Cassandra concept is that you design tables based on your queries (writes to a table have no restrictions). The design is driven by the query filters. An application that queries a table by some ID is somewhat unnatural, as the CITY_ID could be any value and is typically unknown (unless you ran a prior query to get it). Something more natural might be CITY_NAME. In any case, assuming there are no indexes on the table (indexes are mere tables themselves), there are rules in Cassandra relating the filters you provide to the table design; mainly that, at a minimum, one of the filters MUST be the partition key. The partition key directs Cassandra to the correct node for the data (which is how reads are optimized). If none of your filters is the partition key, you'll get an error (unless you use ALLOW FILTERING, which is a no-no). The other filters, if there are any, must be clustering columns (you can't filter on a column that is neither the partition key nor a clustering column; again, unless you use ALLOW FILTERING).
These restrictions, coming from the RDBMS world, are unnatural and hard to adjust to, and because of them, you may have to duplicate data into very similar structures (maybe the only difference is the partition keys and clustering columns). For the most part, it is up to the application to manipulate each structure when changes occur, and the application must know which table to query based off of the filters provided. All of these are considered painful coming from a relational world (where you can do whatever you want to one structure). These "constraints" need to be weighed against the reasons why you chose Cassandra for your storage engine.
Hope this helps.
-Jim

Efficient modeling of versioned hierarchies in Cassandra

Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.
My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...
My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).
Each SuperDocument structure can change (renaming of headings mostly) over time, thus giving me multiple versions of the structure as shown below.
What I'm looking for:
For each SuperDocument I need to timestamp those structures by date as above, and I'd like, for a given date, to find the closest earlier version of the SuperDocument structure (i.e. the most recent version for which version_date < given_date).
These considerations might help solving the problem more easily:
Versions are immutable: changes are rare enough, I can create a new representation of the whole structure each time it changes.
I do not need to access a subtree of the structure.
I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.
OK let's do it
Please keep in mind I am really just starting using Cassandra. I've read/watched a lot of resources about data modeling, but haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!
My first attempt at solving this was to create the following table:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (pre_pos ASC);
That would give me the following structure:
I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.
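For readers unfamiliar with the model, a minimal sketch of the Nested Sets numbering assumed above (Python, illustrative names): each node gets a pre_pos on entry and a post_pos on exit of a depth-first walk, so a parent's interval encloses all of its descendants, and ordering by pre_pos yields document order.

```python
# Compute (pre_pos, post_pos, title) rows for a tree of headings.
# node = (title, [children]); counter carries the running position.
def number_tree(node, counter=None):
    if counter is None:
        counter = [0]
    title, children = node
    counter[0] += 1
    pre = counter[0]          # position on entering the node
    rows = []
    for child in children:
        rows.extend(number_tree(child, counter))
    counter[0] += 1           # position on leaving the node
    rows.insert(0, (pre, counter[0], title))
    return rows

doc = ("Heading 1", [("Section 1.1", []), ("Section 1.2", [])])
for pre, post, title in sorted(number_tree(doc)):
    print(pre, post, title)
```

These (pre, post) pairs are what would be stored in the pre_pos and post_pos columns of the table above.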
I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
The problem though is that I (candidly) intended to query my data as follows:
SELECT * FROM superdoc_structures
WHERE doc_id = '3399c35...14e1' AND version_date < '2014-03-11' LIMIT 1
Cassandra quickly reminded me I was not allowed to do that! (Because the partitioner does not preserve row order across cluster nodes, it is not possible to range-scan over partition keys.)
What then...?
Well, because Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date a clustering key and all my problems will be gone. Yeah, not really...
First try:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC);
I find this one less elegant: all versions and structure levels are made into columns of a now very wide row (compared to my previous solution):
Problem: with the same request, using LIMIT 1 will only return the first heading. And using no LIMIT would return every version's structure levels, which I would have to filter to keep only the most recent ones.
Second try:
There's no second try yet... I have an idea, though, but I feel it's not using Cassandra wisely.
The idea would be to cluster by version_date only, and somehow store whole hierarchies in the column values. Sounds bad, doesn't it?
I would do something like this:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
nested_sets map<int, int>,
titles list<text>,
PRIMARY KEY (doc_id, version_date)
) WITH CLUSTERING ORDER BY (version_date DESC);
The resulting row structure would then be:
It actually looks kind of all right to me, but I will probably have more data than just the level title to denormalize into my columns. If it's only two attributes, I could go with another map (associating titles with ids, for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all the lists together in my client app when the data comes in!
ALTERNATIVE & BEST GUESS
After giving it some more thought, there's a "hybrid" solution that might work and may be efficient and elegant:
I could use another table that lists only the version dates of a SuperDocument, and cache these dates in a Memcache instance (or Redis or whatever) for really quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.
That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?
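Wherever the dates live, the lookup step itself is just a binary search over the sorted version dates. A sketch of the client side (Python, illustrative names):

```python
import bisect
from datetime import date

# Given the cached, ascending list of version dates, find the most recent
# one strictly earlier than a given date (version_date < given_date).
def closest_earlier_version(version_dates, given_date):
    """Returns None when no version predates given_date."""
    i = bisect.bisect_left(version_dates, given_date)
    return version_dates[i - 1] if i > 0 else None

versions = [date(2014, 1, 10), date(2014, 2, 5), date(2014, 3, 20)]
print(closest_earlier_version(versions, date(2014, 3, 11)))  # -> 2014-02-05
```

The date this returns would then be plugged into the composite partition key (doc_id, version_date) of the first solution for the second query.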
All in all, I really feel the first solution is the most elegant one to model my data. What about you?!
First, you don't need memcache or redis; Cassandra will give you very fast access to that information. You could certainly have a table that was something like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
/* stuff */
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
which would give you a quick way to access a given version (this query may look familiar ;-):
select * from superdoc_structures
where doc_id = '3399c35...14e1' and
version_date < '2014-03-11'
order by version_date desc
limit 1;
Since nothing about the document tree structure seems to be relevant from the schema's point of view, and you are happy as a clam to create the document in its entirety every time there is a new version, I don't see why you'd even bother breaking the tree out into separate rows. Why not just have the entire document in the table as a text or blob field?
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
contents text,
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
So to get the contents of the document as it existed at the new year, you'd do:
select contents from superdoc_structures
where doc_id = '....' and
version_date < '2014-01-01'
order by version_date desc
limit 1;
Now, if you did want to maintain some kind of hierarchy of the document components, I'd recommend something like a closure table to represent it. Alternatively, since you are willing to copy the entire document on each write anyway, why not copy the entire section info on each write as well, with a schema like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
section_path varchar,
contents text,
primary key (doc_id, version_date, section_path)
) with clustering order by (version_date desc, section_path asc);
Then have section_path use a syntax like "first_level next_level sub_level leaf_name". As a side benefit, because a space is lexically "lower" than any other valid character, when you have the version_date of the document (or if you create a secondary index on section_path) you can grab a subsection very cleanly:
select section_path, contents from superdoc_structures
where doc_id = '....' and
version_date = '2013-12-22' and
section_path >= 'chapter4 subsection2' and
section_path < 'chapter4 subsection2!';
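The range trick above can be checked with plain string comparison. A small client-side sketch (Python, illustrative paths), relying only on the fact that a space (0x20) sorts before '!' (0x21):

```python
# The half-open range [prefix, prefix + '!') captures a node and all of
# its space-delimited descendants, but not its siblings.
def subtree(paths, prefix):
    """Return every path equal to `prefix` or nested beneath it."""
    lower, upper = prefix, prefix + "!"
    return [p for p in sorted(paths) if lower <= p < upper]

paths = [
    "chapter4",
    "chapter4 subsection1",
    "chapter4 subsection2",
    "chapter4 subsection2 leafA",
    "chapter4 subsection2 leafB",
    "chapter4 subsection3",
]
print(subtree(paths, "chapter4 subsection2"))
```

"chapter4 subsection2 leafA" falls inside the range because its next character after the prefix is a space, while "chapter4 subsection3" falls outside it.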
Alternatively, you could store the sections using Cassandra's support for collections, but again... I'm not sure why you'd bother breaking them out when storing them as one big chunk works just fine.

Using CF's as an extra key level? Is there a limit on tables/column families in Cassandra?

As usual, I don't know if this is a good idea, so that's why I'm asking StackOverflow!
I'm toying with the idea of using CF's as an extra layer of partitioning data. For example, (and using the sensor example which seems to be pretty common) a traditional schema would be something like:
CREATE TABLE data (
area_id int,
sensor varchar,
date ascii,
event_time timeuuid,
some_property1 varchar,
some_property2 varchar,
some_property3 varchar,
PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
This is a bit problematic if some_property1, 2, 3, etc. are not known at design time and can change over the life of the platform. One possibility is to just declare more properties as needed, but then I think it makes more sense to move the sensors into their own CFs, as each will have a different schema. You could do this just by giving each CF a composite name (managed outside Cassandra), e.g. {area_id}_{sensor_name}, and then altering the schema as needed when new properties are requested for insert.
My question is two-fold: a) Is this a reasonable idea? and b) Are there any limitations in Cassandra (such as a cap on the number of CFs) that this might fall foul of?
For reference, this is a possible design for a previous question, but I think the question stands on its own.
Andy,
Adding an excessively large number of column families will create maintainability issues for you down the road. I'd advise against it.
Consider using CQL3 collections to address the unknown property issue - these will allow your objects in this column family to have a variable number of properties that may not be known at design-time. You can use the Map type to give each of your dynamic properties a strong name and a correlated value (we do this.)
However, if you need wildly different data types for each property, or more than 10-15 properties per sensor, then CQL3 collections might not be the right tool for the job. You can technically store up to 65,000 objects in a CQL3 collection, but the truth is that they should never approach that size. CQL3 collections aren't indexed, and working with really large CQL3 collections will incur performance penalties.

Collection of embedded objects using Cassandra CQL

I am trying to put my domain model into Cassandra using CQL. Let's say I have a USER_FAVOURITES table. Each favourite has an ID as its PRIMARY KEY. I want to store a list of up to 10 records with multiple fields (field_name, field_location, and so on) in order.
Is this a good idea to model a table like this
CREATE TABLE USER_FAVOURITES (
fav_id text PRIMARY KEY,
field_name_list list<text>,
field_location_list list<text>
);
And an object is going to be constructed from list items at matching indices, e.g.
Object(field_name_list[3], field_location_list[3])
I always query favourites together. I may want to add an item at some position: start, end, or middle.
Is this good practice? It doesn't look like it, but I am just not sure how to group objects in this case, especially when I want to keep them ordered by, for example, field_location or a more complex ordering rule.
I'd suggest the following structure:
CREATE TABLE USER_FAVOURITES (
fav_id text PRIMARY KEY,
favs map<int, blob>
);
This would allow you to access any item by its index. The value part of the map is a blob, since one can easily serialize the whole object to binary and deserialize it later.
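A minimal client-side sketch of that serialization step (Python with JSON for readability; any binary format would do, and the field names are illustrative):

```python
import json

# Pack a list of favourites into the map<int, blob> shape: each object
# is serialized to bytes (the blob) keyed by its position in the list.
def to_favs_map(favourites):
    return {i: json.dumps(fav).encode("utf-8") for i, fav in enumerate(favourites)}

# Unpack in index order to rebuild the original ordered list.
def from_favs_map(favs):
    return [json.loads(favs[i].decode("utf-8")) for i in sorted(favs)]

favs = to_favs_map([
    {"name": "home", "location": "main page"},
    {"name": "news", "location": "sidebar"},
])
print(from_favs_map(favs)[1]["name"])  # -> news
```

Inserting at the start or middle would mean rewriting the affected map keys, which is the trade-off of the index-keyed approach.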
My personal suggestion would be not to lean too heavily on Cassandra collections, as they are still maturing and may change in future versions. That said, the scenario specified above is very much possible, and there is no harm in doing so.